Guest Posted July 17, 2020 No misunderstanding at all, we were discussing the speed of a function even just David came to help and asked if you have a bottleneck, your answer was like, i don't know yet ! i might with this function but not sure. I just laughed loud, remembered the shepherd and the wolf tale, though this is not the case, but how many here are bored looking for something exciting to do ! or just a reason to not do what must be done, valuing helping others. Share this post Link to post
David Heffernan 2345 Posted July 17, 2020 28 minutes ago, Mike Torrettinni said: No, not using any of it, yet. Planning to do some profiling and benchmarking on real data and will see if there are any bottlenecks in this area. If you don't have a performance bottleneck, make sure that you use a version of the code that is easy to read and maintain. 1 Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 14 minutes ago, Kas Ob. said: valuing helping others. I thought my first post shows my level of Delphi expertise. If it was expected from me to implement all the suggested new functions, benchmark on real data and report if any of them are bottlenecks... in less than 24h, then I'm clearly way over my head in this topic. It started such a nice topic, practice for me, discovering new approaches, so I feel bad that I mislead you on the purpose. Perhaps admin should delete this topic so nobody else gets mislead. Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 4 minutes ago, David Heffernan said: If you don't have a performance bottleneck, make sure that you use a version of the code that is easy to read and maintain. Yes, I agree. Or, this would be most commented function, as I would probably need to comment in details every other line 🙂 Share this post Link to post
Guest Posted July 17, 2020 18 minutes ago, Mike Torrettinni said: It started such a nice topic, practice for me, discovering new approaches, so I feel bad that I mislead you on the purpose. Perhaps admin should delete this topic so nobody else gets mislead. What are you talking about !, it is nice and still nice, don't overthink it, It was the way the question was from David and your answer, look at it this way and you might understand the funny side, David came to help why group of people discussing and what re they struggling with, where is the bottleneck, like we all did, we all came to beat it into big open gate !, and you were like : what bottleneck ?! got it now ? Anyway, things is great here and the result function is great even to be pushed into the RTL. Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 1 minute ago, Kas Ob. said: What are you talking about !, it is nice and still nice, don't overthink it, It was the way the question was from David and your answer, look at it this way and you might understand the funny side, David came to help why group of people discussing and what re they struggling with, where is the bottleneck, like we all did, we all came to beat it into big open gate !, and you were like : what bottleneck ?! got it now ? Anyway, things is great here and the result function is great even to be pushed into the RTL. Ok, I misunderstood your reaction. All good now 🙂 Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 3 hours ago, Stefan Glienke said: No, because if you are doing something with whatever is at x[ i ] its faster to use a pointer that starts at @x[ 0 ] and just inc that rather than having a loop which then has 2 counter variables (they like to count down to zero while maintaining the upwards counting i). And on x86 that is a precious register that can be used for something else. Yes, you're absolutely right. Share this post Link to post
Anders Melander 1782 Posted July 17, 2020 5 hours ago, Stefan Glienke said: No, because if you are doing something with whatever is at x[ i ] its faster to use a pointer that starts at @x[ 0 ] and just inc that rather than having a loop which then has 2 counter variables (they like to count down to zero while maintaining the upwards counting i). And on x86 that is a precious register that can be used for something else. My experience is that in some cases indexed access can be much faster than incrementing a pointer, so I guess "it depends". I'm afraid I can't remember any specific cases. Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 1 hour ago, Anders Melander said: My experience is that in some cases indexed access can be much faster than incrementing a pointer, so I guess "it depends". I'm afraid I can't remember any specific cases. It even depends on your compiler. If your compiler is smart enough, it will generate better code sometime better than using pointer because compiler can understand better integer behavior than pointer. Using MSVC (which is not really smart enough) to compile those two function, generates EXACTLY the SAME opcodes for both ! __declspec(noinline) bool test1(const char* str) { int i = 0; while (str[i]) if ('0' == str[i++]) return true; return false; } __declspec(noinline) bool test2(const char* str) { char* p = (char*)&str[0]; while (*p) if ('0' == *p++ ) return true; return false; } Share this post Link to post
Anders Melander 1782 Posted July 17, 2020 26 minutes ago, Mahdi Safsafi said: generates EXACTLY the SAME opcodes for both So I'm guessing the asm it generates uses indexed addressing rather that indirect addressing. Right? Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 18 minutes ago, Anders Melander said: So I'm guessing the asm it generates uses indexed addressing rather that indirect addressing. Right? No mov cl,byte ptr [eax] At CPU level indexed addressing = indirect addressing. The only difference is that an indexed addressing can have an expression inside []. Sometime this could be just good for alignment purpose. [eax] = [eax+0x00]. But Stefan was referring to something else : register usage for each load instruction. 1 Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) 2 hours ago, Mahdi Safsafi said: If your compiler is smart enough, ... I stopped reading after that because we are in a Delphi forum... Well - I did not - lets see - gcc and clang both generate this (so does vc++ I guess): test1(char const*): .L14: movzx edx, BYTE PTR [eax] test dl, dl je .L15 add eax, 1 cmp dl, 48 jne .L14 mov eax, 1 ret .L15: xor eax, eax ret and this is what Delphi makes out of this function: function Test(str: PAnsiChar): Boolean; var i: Integer; begin i := 0; while str[i] <> #0 do begin if '0' = str[i] then Exit(True); Inc(i); end; Result := False; end; Project25.dpr.13: i := 0; 0041BD90 33D2 xor edx,edx 0041BD92 EB09 jmp $0041bd9d Project25.dpr.16: if '0' = str[i] then Exit(True); 0041BD94 80F930 cmp cl,$30 0041BD97 7503 jnz $0041bd9c 0041BD99 B001 mov al,$01 0041BD9B C3 ret Project25.dpr.17: Inc(i); 0041BD9C 42 inc edx Project25.dpr.14: while str[i] <> #0 do 0041BD9D 0FB60C10 movzx ecx,[eax+edx] 0041BDA1 84C9 test cl,cl 0041BDA3 75EF jnz $0041bd94 Project25.dpr.19: Result := False; 0041BDA5 33C0 xor eax,eax Project25.dpr.20: end; 0041BDA7 C3 ret And the other one: function Test2(str: PAnsiChar): Boolean; var p: PAnsiChar; begin p := @str[0]; while p^ <> #0 do begin if '0' = p^ then Exit(True); Inc(p); end; Result := False; end; Project25.dpr.26: p := @str[0]; 0041BD90 EB09 jmp $0041bd9b Project25.dpr.29: if '0' = p^ then Exit(True); 0041BD92 80FA30 cmp dl,$30 0041BD95 7503 jnz $0041bd9a 0041BD97 B001 mov al,$01 0041BD99 C3 ret Project25.dpr.30: Inc(p); 0041BD9A 40 inc eax Project25.dpr.27: while p^ <> #0 do 0041BD9B 0FB610 movzx edx,[eax] 0041BD9E 84D2 test dl,dl 0041BDA0 75F0 jnz $0041bd92 Project25.dpr.32: Result := False; 0041BDA2 33C0 xor eax,eax Project25.dpr.33: end; 0041BDA4 C3 ret What we can see is that the Delphi compiler is clearly not clever but rather sticks to some hardcoded behavior such as that it is addicted to put loop conditions after the body (which usually is a good pattern) even if it makes little sense to do so. Another thing has to do with the post increment operator that we don't have in Delphi which causes the instruction after the check inside the loop which causes the additional jump because it cannot just restart the loop after it did not find the '0'. Things like that require writing loops in a very ass backwards way in Delphi to get the same code generated. In order to get the same code that the C++ compilers emit you have to write unreadable shit like this: function Test3(str: PAnsiChar): Boolean; var c: AnsiChar; begin repeat c := str^; if c = #0 then Break; Inc(str); if c = '0' then Exit(True); until False; Result := False; end; When I used the local variable p here that I assigned str to it did not even keep using eax but insisted in putting a useless mov into a new register. Edited July 17, 2020 by Stefan Glienke Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 I did some testing on largest real data set: strings: 1,203,265 avg length: 25 standard deviation: 41 longest string: 2,243 All functions take between 100ms-171ms, so all are very fast. Most common real data set is 10%-25% of this largest data, so these functions will perform under 50ms in those cases. My current function that I use does the same things with: if Copy(s, 1,1) = '{'... and result := result + Copy(s, 1,1), so very very inefficient, but it takes about 1,8s on same data. So this is not a bottleneck in my project, but still almost 20x faster is definitely something i will use - I just have to decide which one. Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 24 minutes ago, Stefan Glienke said: I stopped reading after that because we are in a Delphi forum... Keep some faith Stefan ... remember a tortoise wins a hare once. Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 5 minutes ago, Mahdi Safsafi said: Keep some faith Stefan ... remember a tortoise wins a hare once. At least regarding life expectancy 😉 1 Share this post Link to post
Mahdi Safsafi 225 Posted July 18, 2020 (edited) @Stefan Glienke Quote Well - I did not - lets see - gcc and clang both generate this (so does vc++ I guess): As I said before, MSVC is somehow far behind gcc and clang. In fact, it used 8-bit registers heavily which are known to be slow as they require internal manipulation by the CPU: 00241000 8A 0D 10 21 24 00 mov cl,byte ptr [string "test" (0242110h)] 00241006 B8 10 21 24 00 mov eax,offset string "test" (0242110h) 0024100B 0F 1F 44 00 00 nop dword ptr [eax+eax] 00241010 40 inc eax 00241011 80 F9 30 cmp cl,30h 00241014 74 09 je test1+1Fh (024101Fh) 00241016 8A 08 mov cl,byte ptr [eax] 00241018 84 C9 test cl,cl 0024101A 75 F4 jne test1+10h (0241010h) 0024101C 32 C0 xor al,al 0024101E C3 ret 0024101F B0 01 mov al,1 00241021 C3 ret For Delphi, output of Test1 and Test2 is very close. In fact if you change to PChar, you may find that unlike Test1, Test2 is using add reg, imm instruction. Quote What we can see is that the Delphi compiler is clearly not clever but rather sticks to some hardcoded behavior such as that it is addicted to put loop conditions after the body (which usually is a good pattern) No its not a good pattern : 0041BD92 EB09 jmp $0041bd9d // unecessary relative branch. 0041BD94 80F930 cmp cl,$30 // suppose this was a load op => bad for OoOE. 0041BD97 7503 jnz $0041bd9c 0041BD99 B001 mov al,$01 0041BD9B C3 ret Quote At least regarding life expectancy 😉 Just hope for better. Edited July 18, 2020 by Mahdi Safsafi Share this post Link to post
Stefan Glienke 2002 Posted July 18, 2020 (edited) Here it is clearly not - but I was referring to the fact that mostly it is the way to do - see this exhaustive answer. But what you typically don't do is to statically jump over the loop body to the condition at the bottom to then conditionally jump back up because the loop very likely runs at least once. Either the compiler restructures the loop or simply puts the loop condition as if ontop of the loop as well (sometimes dcc even does it) Edited July 18, 2020 by Stefan Glienke Share this post Link to post
Mahdi Safsafi 225 Posted July 18, 2020 7 minutes ago, Stefan Glienke said: Here it is clearly not - but I was referring to the fact that mostly it is the way to do - see this exhaustive answer. But what you typically don't do is to statically jump over the loop body to the condition at the bottom to then conditionally jump back up because the loop runs at least once. Either the compiler restructures the loop or simply puts the loop condition as if ontop of the loop as well (sometimes dcc even does it) Today no high end compiler is doing that because it affects an important optimization part OoOE. Delphi uses a very old code generator that is not up to date. Share this post Link to post
Mahdi Safsafi 225 Posted July 18, 2020 (edited) Moreover saying that "Fewer instructions / uops inside the loop = better" is somehow wrong. I can certainly prove that its wrong. using SIMD instructions where using n instruction to reduce loop count is better than having a single/few instruction inside that loop. Edited July 18, 2020 by Mahdi Safsafi Share this post Link to post
Stefan Glienke 2002 Posted July 18, 2020 (edited) 26 minutes ago, Mahdi Safsafi said: Today no high end compiler is doing that because it affects an important optimization part OoOE. Delphi uses a very old code generator that is not up to date. Just to make sure you don't misunderstand: The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said). The code you posted could get rid of the check at the bottom because there was nothing else to do after the check for '0' which results in a result true so that basically serves as this loop breaking check. The first check makes sure the loop is even entered at all, otherwise it jumps over it (0 iterations) - the "real" loop condition is at the bottom of the loop body. What is usually not good is to only put the loop condition at the bottom and statically jump over the loop body (because the duplicated condition at the top is missing). Thats two unnecessary jumps for the common case of "at least 1 iteration". Edited July 18, 2020 by Stefan Glienke Share this post Link to post
Mahdi Safsafi 225 Posted July 18, 2020 10 minutes ago, Stefan Glienke said: Just to make sure you don't misunderstand: The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said). The code you posted could get rid of the check at the bottom because there was nothing else to do after the check for '0' which results in a result true so that basically serves as this loop breaking check. The first check makes sure the loop is even entered at all, otherwise it jumps over it (0 iterations) - the "real" loop condition is at the bottom of the loop body. What is usually not good is to only put the loop condition at the bottom and statically jump over the loop body (because the duplicated condition at the top is missing). Thats two unnecessary jumps for the common case of "at least 1 iteration". Yeah Stefan, I clearly understand you. But the same thing you described can be achieved by branch prediction. So compiler should not generate such code unless you are targeting an old cpu. Share this post Link to post
Arnaud Bouchez 407 Posted July 23, 2020 On 7/18/2020 at 2:48 AM, Stefan Glienke said: The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said). Indeed. This is why I use the "if x then repeat until not x" explicit pattern in time-critical loops of my framework - e.g. when processing JSON. As a nice side effect, the variables uses in "x" are more likely to be assigned to registers, since they will be used more often. Sometimes an explicit temporary local variable is needed. Or use of the 'result' variable if it is a PChar. Both Delphi and FPC require this, unless "x" is a single simple test, where I have seen FPC able to optimize it IIRC. 1 Share this post Link to post
Arnaud Bouchez 407 Posted July 23, 2020 (edited) On 7/18/2020 at 3:18 AM, Mahdi Safsafi said: Yeah Stefan, I clearly understand you. But the same thing you described can be achieved by branch prediction. So compiler should not generate such code unless you are targeting an old cpu. Branch prediction can do wonders in some micro-benchmarks, but they always have a cost, until the CPU logic actually "warmed up". No branch is always better, since the real problem is branch misprediction. Using the asm generated by a recent GCC with tuned optimization flags as a reference is a good hint of what is likely to be more efficient. Intel and AMD are major GCC contributors. And a wall clock of a realistic process (not micro benchmark) is the ultimate reference. We use some part of our automated tests as performance benchmark, running some close-to-the reality scenarios. Edited July 23, 2020 by Arnaud Bouchez 1 Share this post Link to post
Mahdi Safsafi 225 Posted July 23, 2020 @Stefan Glienke @Arnaud Bouchez Ok, I did some research on that and found that I was wrong. In fact there is more benefits about using inversion loop(Sadly it does not apply to Delphi): var i: Integer; begin // legacy loop: for i := 0 to 10 - 1 do begin dosomething(i); end; // loop inversion: i := 0; if (i < 10) then // this branch can be emited if i holds at compile time. begin repeat dosomething(i); inc(i); until (i = 10); end; end; There is a good paper : Generalizing loop-invariant code motion in a real-world compiler by Paul Colea that describes many loop optimization. @Kas Ob you may be interested on reading this paper. Share this post Link to post
Stefan Glienke 2002 Posted July 23, 2020 (edited) @Mahdi Safsafi I would not use that code as evidence because there actually the for loop produces better code than the other one. I was talking about while x do loop vs if x then repeat until not x conversion. Also it makes a difference if you have a static count or not - the benefit of the for to loop is that the compiler does no additional cmp after it decrements the internal "counting down to zero" variable but does the conditional jmp depending on the outcome of the dec. An until loop produces an additional cmp instruction (reported as https://quality.embarcadero.com/browse/RSP-22124) As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer). Edited July 23, 2020 by Stefan Glienke Share this post Link to post