Mahdi Safsafi 225 Posted July 23, 2020 1 minute ago, Stefan Glienke said: @Mahdi Safsafi I would not use that code as evidence because there actually the for loop produces better code than the other one. I was talking about while x do loop vs if x then repeat until not x conversion. Yes I know that. The code was just a snippet from the paper I read to point out about emitting if branch. Quote Also it makes a difference if you have a static count or not Event if the variable is not static ... Some language takes contract as seriously : assert(i > 0); 😉 Share this post Link to post
Stefan Glienke 2002 Posted July 23, 2020 By the way: LoopStuff.dpr.26: while i < count do 0041BDB0 3BFB cmp edi,ebx 0041BDB2 7E0C jle $0041bdc0 LoopStuff.dpr.28: dosomething(i); 0041BDB4 8BC3 mov eax,ebx 0041BDB6 E8D5FFFFFF call dosomething LoopStuff.dpr.29: inc(i); 0041BDBB 43 inc ebx LoopStuff.dpr.26: while i < count do 0041BDBC 3BFB cmp edi,ebx 0041BDBE 7FF4 jnle $0041bdb4 LoopStuff.dpr.26: if (i < count) then // this branch can be emited if i holds at compile time. 0041BDB0 3BFB cmp edi,ebx 0041BDB2 7E0C jle $0041bdc0 LoopStuff.dpr.29: dosomething(i); 0041BDB4 8BC3 mov eax,ebx 0041BDB6 E8D5FFFFFF call dosomething LoopStuff.dpr.30: inc(i); 0041BDBB 43 inc ebx LoopStuff.dpr.31: until (i = count); 0041BDBC 3BFB cmp edi,ebx 0041BDBE 75F4 jnz $0041bdb4 As I said: sometimes the compiler does the check at the top for the while loop. Share this post Link to post
Mahdi Safsafi 225 Posted July 23, 2020 10 minutes ago, Stefan Glienke said: As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer). As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background. Share this post Link to post
Mahdi Safsafi 225 Posted July 23, 2020 10 minutes ago, Stefan Glienke said: As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer). As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background. Share this post Link to post
Arnaud Bouchez 407 Posted July 24, 2020 (edited) 20 hours ago, Stefan Glienke said: @Mahdi Safsafi I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer). It depends on the complexity of the loop. For instance, about how variables are assigned to local stack variables or registers. If the use of the for-to index is not mapped into a register, then thatpointer[ i ] will be slower... Sometimes, just using a small local functions help the Delphi compiler make better register allocation, and may end up be faster... Remember 🙂 Edited July 24, 2020 by Arnaud Bouchez 1 Share this post Link to post
Arnaud Bouchez 407 Posted July 24, 2020 (edited) 20 hours ago, Mahdi Safsafi said: As I mentioned before, Compiler is much able to understand integer behavior over pointer behavior. CPU can also be smart to prefetch next data for you on the background. This is not as simple as this. IIRC prefecthing has no difference between indexed (mov ecx, byte ptr [eax+edx]) or direct access (mov ecx, byte ptr[eax]). The reference material about instruction latency and execution pipelinging is https://www.agner.org/optimize/instruction_tables.pdf and it is highly depending on the CPU. https://www.agner.org/optimize is worth reading and understanding. https://lemire.me/blog/ is another very interesting blog about performance of text processing, from a lot of experience and actual knowledge. Branchless SSE code will be the faster in all circumstances... Then use a wall clock measure of full realistic process - not just measuring a loop. Edited July 24, 2020 by Arnaud Bouchez 1 Share this post Link to post
Mahdi Safsafi 225 Posted July 24, 2020 53 minutes ago, Arnaud Bouchez said: This is not as simple as this. IIRC prefecthing has no difference between indexed (mov ecx, byte ptr [eax+edx]) or direct access (mov ecx, byte ptr[eax]). I'm sure you miss understood my comment ! I quote from my comment where I answered Anders On 7/17/2020 at 11:13 PM, Mahdi Safsafi said: No mov cl,byte ptr [eax] At CPU level indexed addressing = indirect addressing. The only difference is that an indexed addressing can have an expression inside []. Sometime this could be just good for alignment purpose. [eax] = [eax+0x00]. But Stefan was referring to something else : register usage for each load instruction. What I was referring to is a cause consequence: Compiler understood better code that does not use pointer ... its able to detects some patterns and optimize them like what gcc does. Pointers on the other side are hard to understand(volatility, nullability, mutability, ...) ... yes could be optimized but the chance of optimizing a code that does not use pointer is much higher. See Gather-scatter (vector addressing) to understand clearly what are you missing when using pointer instead of integer access. CPU also detects patterns such memory access patterns and optimize them. Another important thing : yes there is no difference for mov ecx, [eax+edx]/mov ecx, [eax]. But keep in mind that this is not applicable for all instructions ! An example : MPX extension. Just to make thing clear, I'm not talking about optimization on Delphi. Rather ... a general concept for modern compiler. Share this post Link to post
Arnaud Bouchez 407 Posted July 27, 2020 @Mahdi Safsafi My experiment with the Delphi compiler and pointers is not the same as yours. In mORMot, I ended up with using pointers (PUTF8Char=^AnsiChar) for the raw UTF-8 processing, which generated faster code. Of course, pointers are difficult to deal with - you need to know what you are doing. I always consider pointers by looking at the generated ASM. If the code is better and runs faster with pointer, then I use them. Otherwise I don't. I am biased for sure: I like pascal because it can be both high-level and modern (like Java/C# - with high-level structures like classes, strings or dynamic arrays), and also system-level (like C). Pointers are clearly for the 2nd level. Anyone don't like pointers? Don't use them. But they are pretty convenient and efficient for low-level processing, if the goal is performance. I like very much when you paternize me with the Gather-scatter pattern. You can perfectly do Gather-scatter with two pointers. This is called pointer arithmetic - which I use extensively in mORMot. So I don't understand what you are talking about. 1 Share this post Link to post
Edwin Yip 154 Posted July 27, 2020 22 minutes ago, Arnaud Bouchez said: @Mahdi Safsafi I am biased for sure: I like pascal because it can be both high-level and modern (like Java/C# - with high-level structures like classes, strings or dynamic arrays), and also system-level (like C). Pointers are clearly for the 2nd level. Same here and that's a very logical rationale! Similarly, I like the fact that it supports both strong typed data such as String/Integer and dynamically typed data type such as TValue/Variant - Share this post Link to post
Attila Kovacs 629 Posted July 27, 2020 by the way, if you look at the last code pasted by me, if you remove if len = 0 then exit(S); at the beginning, it's the same codegen and will be ~4% slower. Of course testing with non-empty input. Early triggering of OoOE? Share this post Link to post
Stefan Glienke 2002 Posted July 27, 2020 (edited) No, it's not the same codegen, its a test and jnz less. You don't want forward jumps taken in the common case. Also a very important fact (I think it was mentioned before) - depending on where the compiler puts the code the hot loop code might cross cache lines - that slows things down significantly and can happen when you add or change unrelated code in other routines or in this routine - such as a check for Length or similar. When I let the routine start at some 64byte aligned address the loop body will cross cache lines and be slower. To my knowledge there is no way to control the alignment of generated executable code so that you can force hot code into a single or as few cache lines as possible. Edited July 27, 2020 by Stefan Glienke 1 Share this post Link to post
Attila Kovacs 629 Posted July 27, 2020 (edited) yes, less, and slower, the rest is the same codegen -> less asm, slower exec, how's that? OoOE? Aligning? Edited July 27, 2020 by Attila Kovacs Share this post Link to post
Mahdi Safsafi 225 Posted July 27, 2020 2 hours ago, Arnaud Bouchez said: @Mahdi Safsafi My experiment with the Delphi compiler and pointers is not the same as yours. In mORMot, I ended up with using pointers (PUTF8Char=^AnsiChar) for the raw UTF-8 processing, which generated faster code. Yes, I use several compilers (not limited to Delphi) and I can sadly tell you that Pascal compilers didn't evolved in a good way like other compilers. Quote Anyone don't like pointers? Don't use them. But they are pretty convenient and efficient for low-level processing, if the goal is performance. Its not about whether you like them or not. In fact for me, I don't find any issue using low level stuff pointer or assembly. But today, I try to be much wise and choose the best friendly way for the compiler. What are you missing is that today sophisticated compiler in many case can do better job than you can do by using low level stuff like assembly or pointer. gcc for example is able to recognize a pattern (while do) that calculates bits count and replace this loop by a single instruction. It can reorder things for you when it detects that you're misusing some functionality: see this SO. A compiler could generate much better code only when it can understood your code. If you manage to write a hard code that your compiler can't understand it wan't be optimized correctly. Here is a simple example to show what would happen when the compiler does not understand your code: both swap/swap2 function do the same job but one is decrementing i and the other one is incrementing i. and both log the result using a for loop. The interesting part is that unlike swap, the two loop of swap2 were merged together __declspec(noinline) void swap(short int* src, int* dst, int count) { short int* p = src; int* r = dst; int i = count; // loop 1 while (i--) { *r++ = swapint(*p++); } // loop 2 for (int i = 0; i < count; i++) printf("%d = %d\n", i, dst[i]); return ; } __declspec(noinline) void swap2(short int* src, int* dst, int count) { short int* p = src; int* r = dst; int i = count; // compiler merged/combined the two loop ! while (i++ != count) { *r++ = swapint(*p++); } for (int i = 0; i < count; i++) printf("%d = %d\n", i, dst[i]); return; } int main() { short int src[10] = { 1,2,3,4,5 }; int dst[10] = { 0 }; swap(&src[0], &dst[0], 10); swap2(&src[0], &dst[0], 10); return 0; } Share this post Link to post
Mike Torrettinni 198 Posted July 27, 2020 3 hours ago, Mahdi Safsafi said: Yes, I use several compilers (not limited to Delphi) and I can sadly tell you that Pascal compilers didn't evolved in a good way like other compilers. 3 hours ago, Mahdi Safsafi said: If you manage to write a hard code that your compiler can't understand it wan't be optimized correctly. Does that mean you need a trial and error to get to know all these little 'secret' tips how to write best possible optimized code, for each compiler? Share this post Link to post
Mahdi Safsafi 225 Posted July 27, 2020 1 hour ago, Mike Torrettinni said: Does that mean you need a trial and error to get to know all these little 'secret' tips how to write best possible optimized code, for each compiler? Secret ? no, they're publicly published (agner, intel/amd doc, optimization guide, paper, LLVM, ...). I read them and experiment things my self and when I have no things to do I compare results on different compilers. Sometime, when I'm lucky, I find an already benchmark that lists the results. Error happens a lot specially when porting code from/to another platform ... but I deeply inspects the error and try to learn from it : why it happened ? a workaround ? how to avoid it?. 1 Share this post Link to post
Stefan Glienke 2002 Posted July 27, 2020 (edited) In the C++ community people usually tell you the compilers are incredibly smart (mostly referring to gcc, clang, msvc or icc) and you should not bother but just know some C++ tricks (constexpr wherever possible and alike). When talking about Delphi it would be an insult to most developers calling the compiler smarter when it comes to detecting patterns and optimization. However - unless you write core libraries like Arnaud or me you usually don't care. Edited July 27, 2020 by Stefan Glienke 1 Share this post Link to post
Mike Torrettinni 198 Posted July 27, 2020 Pretty impressive level of knowledge! Share this post Link to post
Arnaud Bouchez 407 Posted July 28, 2020 (edited) 17 hours ago, Stefan Glienke said: However - unless you write core libraries like Arnaud or me you usually don't care. This is the main point. Premature optimization is the root of all evil! 🙂 @Mahdi Safsafi Such abstract/broad input about non Delphi compilers has nothing to do with the initial question of this thread, which is performance of string process with Delphi. It is clearly out of scope and subject. What does a DSP compiler has in common with the Delphi compiler? Even the MAC units of DSPs have very little in common with regular CPUs. Edited July 28, 2020 by Arnaud Bouchez Share this post Link to post
Stefan Glienke 2002 Posted July 28, 2020 53 minutes ago, Arnaud Bouchez said: Premature optimization is the root of all evil! 🙂 Let's claim that our code is the critical 3% 😉 Share this post Link to post
Mahdi Safsafi 225 Posted July 28, 2020 3 hours ago, Arnaud Bouchez said: Such abstract/broad input about non Delphi compilers has nothing to do with the initial question of this thread, which is performance of string process with Delphi. It is clearly out of scope and subject. Yes you're absolutely right ... I don't really know how the post ended like this. Anyway, it was really nice to have this little discussion with you. Thanks for your time and the valuable information/material you provide. Share this post Link to post