Jump to content

Recommended Posts

1 minute ago, Stefan Glienke said:

@Mahdi Safsafi I would not use that code as evidence because there actually the for loop produces better code than the other one. I was talking about while x do loop vs if x then repeat until not x conversion.

Yes I know that. The code was just a snippet from the paper I read to point out about emitting if branch.

Quote

Also it makes a difference if you have a static count or not

Event if the variable is not static ... Some language takes contract as seriously : assert(i > 0); 😉

Share this post


Link to post

By the way:

LoopStuff.dpr.26: while i < count do
0041BDB0 3BFB             cmp edi,ebx
0041BDB2 7E0C             jle $0041bdc0
LoopStuff.dpr.28: dosomething(i);
0041BDB4 8BC3             mov eax,ebx
0041BDB6 E8D5FFFFFF       call dosomething
LoopStuff.dpr.29: inc(i);
0041BDBB 43               inc ebx
LoopStuff.dpr.26: while i < count do
0041BDBC 3BFB             cmp edi,ebx
0041BDBE 7FF4             jnle $0041bdb4
LoopStuff.dpr.26: if (i < count) then // this branch can be emited if i holds at compile time.
0041BDB0 3BFB             cmp edi,ebx
0041BDB2 7E0C             jle $0041bdc0
LoopStuff.dpr.29: dosomething(i);
0041BDB4 8BC3             mov eax,ebx
0041BDB6 E8D5FFFFFF       call dosomething
LoopStuff.dpr.30: inc(i);
0041BDBB 43               inc ebx
LoopStuff.dpr.31: until (i = count);
0041BDBC 3BFB             cmp edi,ebx
0041BDBE 75F4             jnz $0041bdb4

As I said: sometimes the compiler does the check at the top for the while loop. 

Share this post


Link to post
10 minutes ago, Stefan Glienke said:

As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer).

As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background.

Share this post


Link to post
10 minutes ago, Stefan Glienke said:

As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer).

As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background.

Share this post


Link to post
Posted (edited)
20 hours ago, Stefan Glienke said:

@Mahdi Safsafi I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer).

It depends on the complexity of the loop.
For instance, about how variables are assigned to local stack variables or registers. If the use of the for-to index is not mapped into a register, then thatpointer[ i ] will be slower...
Sometimes, just using a small local functions help the Delphi compiler make better register allocation, and may end up be faster...

 

Remember 

 🙂

Edited by Arnaud Bouchez
  • Thanks 1

Share this post


Link to post
Posted (edited)
20 hours ago, Mahdi Safsafi said:

As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background.

This is not as simple as this. IIRC prefecthing has no difference between indexed (mov ecx, byte ptr [eax+edx]) or direct access (mov ecx, byte ptr[eax]).

The reference material about instruction latency and execution pipelinging is https://www.agner.org/optimize/instruction_tables.pdf and it is highly depending on the CPU.

https://www.agner.org/optimize is worth reading and understanding.

 

https://lemire.me/blog/ is another very interesting blog about performance of text processing, from a lot of experience and actual knowledge.

Branchless SSE code will be the faster in all circumstances...


Then use a wall clock measure of full realistic process - not just measuring a loop.

 

Edited by Arnaud Bouchez
  • Like 2

Share this post


Link to post
53 minutes ago, Arnaud Bouchez said:

This is not as simple as this. IIRC prefecthing has no difference between indexed (mov ecx, byte ptr [eax+edx]) or direct access (mov ecx, byte ptr[eax]).

 

I'm sure you miss understood my comment ! I quote from my comment where I answered Anders

On 7/17/2020 at 11:13 PM, Mahdi Safsafi said:

No


mov         cl,byte ptr [eax]  

At CPU level indexed addressing = indirect addressing. The only difference is that an indexed addressing can have an expression inside []. Sometime this could be just good for alignment purpose. [eax] = [eax+0x00].
But Stefan was referring to something else register usage for each load instruction.

What I was referring to is a cause consequence:

Compiler understood better code that does not use pointer ... its able to detects some patterns and optimize them like what gcc does. Pointers on the other side are hard to understand(volatility, nullability, mutability, ...) ... yes could be optimized but the chance of optimizing a code that does not use pointer is much higher. 

See  Gather-scatter (vector addressing) to understand clearly what are you missing when using pointer instead of integer access. 

CPU also detects patterns such memory access patterns and optimize them.

 

Another important thing : yes there is no difference for mov ecx, [eax+edx]/mov ecx, [eax]. But keep in mind that this is not applicable for all instructions ! An example : MPX extension.
 

Just to make thing clear, I'm not talking about optimization on Delphi. Rather ...  a general concept for modern compiler.

Share this post


Link to post

@Mahdi Safsafi

My experiment with the Delphi compiler and pointers is not the same as yours. In mORMot, I ended up with using pointers (PUTF8Char=^AnsiChar) for the raw UTF-8 processing, which generated faster code.

Of course, pointers are difficult to deal with - you need to know what you are doing.

I always consider pointers by looking at the generated ASM. If the code is better and runs faster with pointer, then I use them. Otherwise I don't.

 

I am biased for sure: I like pascal because it can be both high-level and modern (like Java/C# - with high-level structures like classes, strings or dynamic arrays), and also system-level (like C). Pointers are clearly for the 2nd level.
Anyone don't like pointers? Don't use them. But they are pretty convenient and efficient for low-level processing, if the goal is performance.

 

I like very much when you paternize me with the Gather-scatter pattern. You can perfectly do Gather-scatter with two pointers. This is called pointer arithmetic - which I use extensively in mORMot. So I don't understand what you are talking about.

  • Like 1

Share this post


Link to post
22 minutes ago, Arnaud Bouchez said:

@Mahdi Safsafi

I am biased for sure: I like pascal because it can be both high-level and modern (like Java/C# - with high-level structures like classes, strings or dynamic arrays), and also system-level (like C). Pointers are clearly for the 2nd level.

Same here and that's a very logical rationale! Similarly, I like the fact that it supports both strong typed data such as String/Integer and dynamically typed data type such as TValue/Variant

Share this post


Link to post

by the way, if you look at the last code pasted by me, if you remove 

 

  if len = 0 then
    exit(S);
 

at the beginning, it's the same codegen and will be ~4% slower. Of course testing with non-empty input.

Early triggering of OoOE?

Share this post


Link to post
Posted (edited)

No, it's not the same codegen, its a test and jnz less. You don't want forward jumps taken in the common case.

 

Also a very important fact (I think it was mentioned before) - depending on where the compiler puts the code the hot loop code might cross cache lines - that slows things down significantly and can happen when you add or change unrelated code in other routines or in this routine - such as a check for Length or similar. When I let the routine start at some 64byte aligned address the loop body will cross cache lines and be slower.

 

To my knowledge there is no way to control the alignment of generated executable code so that you can force hot code into a single or as few cache lines as possible.

Edited by Stefan Glienke
  • Thanks 1

Share this post


Link to post
Posted (edited)

yes, less, and slower, the rest is the same codegen -> less asm, slower exec, how's that? OoOE? Aligning?

Edited by Attila Kovacs

Share this post


Link to post
2 hours ago, Arnaud Bouchez said:

@Mahdi Safsafi

My experiment with the Delphi compiler and pointers is not the same as yours. In mORMot, I ended up with using pointers (PUTF8Char=^AnsiChar) for the raw UTF-8 processing, which generated faster code.

Yes, I use several compilers (not limited to Delphi) and I can sadly tell you that Pascal compilers didn't evolved in a good way like other compilers.

Quote

Anyone don't like pointers? Don't use them. But they are pretty convenient and efficient for low-level processing, if the goal is performance.

Its not about whether you like them or not. In fact for me, I don't find any issue using low level stuff pointer or assembly. But today, I try to be much wise and choose the best friendly way for the compiler. What are you missing is that today sophisticated compiler in many case can do better job than you can do by using low level stuff like assembly or pointer. gcc for example is able to recognize a pattern (while do) that calculates bits count and replace this loop by a single instruction. It can reorder things for you when it detects that you're misusing some functionality:  see this SO. A compiler could generate much better code only when it can understood your code. If you manage to write a hard code that your compiler can't understand it wan't be optimized correctly.

Here is a simple example to show what would happen when the compiler does not understand your code: both swap/swap2 function do the same job but one is decrementing i and the other one is incrementing i. and both log the result using a for loop. The interesting part is that unlike swap, the two loop of swap2 were merged together

__declspec(noinline)

void swap(short int* src, int* dst,  int count) {
	short int* p = src;
	int* r = dst;
	int i = count;
	// loop 1
	while (i--) {
		*r++ = swapint(*p++);
	}
    // loop 2
	for (int i = 0; i < count; i++) printf("%d = %d\n", i, dst[i]);
	return ;
}
__declspec(noinline)
void swap2(short int* src, int* dst, int count) {
	short int* p = src;
	int* r = dst;
	int i = count;
	// compiler merged/combined the two loop !                         
	while (i++ != count) {
		*r++ = swapint(*p++);
	}
	for (int i = 0; i < count; i++) printf("%d = %d\n", i, dst[i]);
	return;
}

int main()
{
	short int src[10] = { 1,2,3,4,5 };
	int dst[10] = { 0 };
	
	swap(&src[0], &dst[0], 10);
	swap2(&src[0], &dst[0], 10);
	return 0;
}

 

 

ptrdoc.png

Share this post


Link to post

 

3 hours ago, Mahdi Safsafi said:

Yes, I use several compilers (not limited to Delphi) and I can sadly tell you that Pascal compilers didn't evolved in a good way like other compilers.

 

3 hours ago, Mahdi Safsafi said:

If you manage to write a hard code that your compiler can't understand it wan't be optimized correctly.

 

Does that mean you need a trial and error to get to know all these little 'secret' tips how to write best possible optimized code, for each compiler?

Share this post


Link to post
1 hour ago, Mike Torrettinni said:

Does that mean you need a trial and error to get to know all these little 'secret' tips how to write best possible optimized code, for each compiler?

Secret ? no, they're publicly published (agner, intel/amd doc, optimization guide, paper, LLVM, ...). I read them and experiment things my self and when I have no things to do I compare results on different compilers. Sometime, when I'm lucky, I find an already benchmark that lists the results.

Error happens a lot specially when porting code from/to another platform ... but I deeply inspects the error and try to learn from it why it happened ? a workaround ? how to avoid it?.

  • Like 1

Share this post


Link to post
Posted (edited)

In the C++ community people usually tell you the compilers are incredibly smart (mostly referring to gcc, clang, msvc or icc) and you should not bother but just know some C++ tricks (constexpr wherever possible and alike).

When talking about Delphi it would be an insult to most developers calling the compiler smarter when it comes to detecting patterns and optimization.

 

However - unless you write core libraries like Arnaud or me you usually don't care.

Edited by Stefan Glienke
  • Thanks 1

Share this post


Link to post
Posted (edited)
17 hours ago, Stefan Glienke said:

However - unless you write core libraries like Arnaud or me you usually don't care.

This is the main point.

Premature optimization is the root of all evil! 🙂

 

@Mahdi Safsafi  Such abstract/broad input about non Delphi compilers has nothing to do with the initial question of this thread, which is performance of string process with Delphi. It is clearly out of scope and subject.
What does a DSP compiler has in common with the Delphi compiler? Even the MAC units of DSPs have very little in common with regular CPUs.

Edited by Arnaud Bouchez

Share this post


Link to post
53 minutes ago, Arnaud Bouchez said:

Premature optimization is the root of all evil! 🙂

Let's claim that our code is the critical 3% 😉

Share this post


Link to post
3 hours ago, Arnaud Bouchez said:

Such abstract/broad input about non Delphi compilers has nothing to do with the initial question of this thread, which is performance of string process with Delphi. It is clearly out of scope and subject.

Yes you're absolutely right ... I don't really know how the post ended like this. Anyway, it was really nice to have this little discussion with you. Thanks for your time and the valuable information/material you provide.

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×