Mike Torrettinni 198 Posted October 11, 2020 (edited)
Using Delphi 10.2.3: I have a very simple example of an inlined function that is faster than the non-inlined function, but not as fast as when no function call is executed at all.

Goal with the inlined function: the loop condition is used in numerous places, so I want to shorten the code and have one call to an inlined function that replaces all these (same) conditions.

I was hoping I could shorten the code by using a function, and was assuming that inline would just 'replace' the call to the function with the exact same condition as in the function. So, in theory, the execution time should be the same. But the loop without the inlined function is 50% faster than with the inlined function, approx. 1570 ms vs 2175 ms.

    uses System.Diagnostics;

    function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): boolean; inline;
    begin
      Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue );
    end;

    procedure TForm2.FormCreate(Sender: TObject);
    const
      loop: integer = 1000000000;
    var
      vSearchValue, vItemValue: string;
      i: integer;
      sw: TStopWatch;
    begin
      memo1.Clear;

      sw := TStopWatch.StartNew;
      for i := 1 to loop do
        if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue ) then ;
      sw.Stop;
      memo1.Lines.Add(sw.ElapsedMilliseconds.ToString);

      sw := TStopWatch.StartNew;
      for i := 1 to loop do
        if IsSearchByValueFound_Inlined(vSearchValue, vItemValue) then ;
      sw.Stop;
      memo1.Lines.Add(sw.ElapsedMilliseconds.ToString);
    end;

Is that just how it is, or am I missing something, and should the loop using the inlined function run in exactly the same time as the loop without a function call?

Edited October 12, 2020 by Mike Torrettinni (Goal details + Delphi version)
Attila Kovacs 629 Posted October 11, 2020
change the order of the 2 loops and give us the new results
Mike Torrettinni 198 Posted October 11, 2020
4 minutes ago, Attila Kovacs said:
change the order of the 2 loops and give us the new results

Same, loop without call to inlined function is 50% faster. Did you expect different results depending on the order of loops?
Attila Kovacs 629 Posted October 12, 2020
Register optimization. I never put more than one test into one procedure, but the problem is that the inlining introduces an extra boolean evaluation instead of just the 3 in the "if". 🤷
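The "one test per procedure" habit mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names TimeDirect and TimeInlined are hypothetical, the loop count is reduced from the original 1e9, and a hit counter is added so each loop has an observable result.

```pascal
program SeparateRoutinesBench;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics;

const
  LoopCount = 100000000; // assumption: smaller than the original 1e9 to keep runs short

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): Boolean; inline;
begin
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

// Hypothetical helper: times the hand-written condition in its own routine,
// so its register allocation is not shared with the other test.
function TimeDirect(const aSearchValue, aItemValue: string; out aHits: Integer): Int64;
var
  i: Integer;
  sw: TStopwatch;
begin
  aHits := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to LoopCount do
    if (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue) then
      Inc(aHits);
  sw.Stop;
  Result := sw.ElapsedMilliseconds;
end;

// Hypothetical helper: same loop, but the condition goes through the inlined function.
function TimeInlined(const aSearchValue, aItemValue: string; out aHits: Integer): Int64;
var
  i: Integer;
  sw: TStopwatch;
begin
  aHits := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to LoopCount do
    if IsSearchByValueFound_Inlined(aSearchValue, aItemValue) then
      Inc(aHits);
  sw.Stop;
  Result := sw.ElapsedMilliseconds;
end;

var
  hits: Integer;
  ms: Int64;
begin
  ms := TimeDirect('', 'abc', hits);
  Writeln('direct:  ', ms, ' ms (', hits, ' hits)');
  ms := TimeInlined('', 'abc', hits);
  Writeln('inlined: ', ms, ' ms (', hits, ' hits)');
end.
```

Both loops should report the same hit count; only the timings are of interest, and they will vary by CPU, compiler version and target bitness.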
Mike Torrettinni 198 Posted October 12, 2020
1 minute ago, Attila Kovacs said:
the problem is the inlining introduce the extra boolean evaluation instead of just the 3 in the "if".

I see. Do you think there is a way to avoid this extra boolean evaluation, perhaps by restructuring the function? I'm not familiar with asm, but I see a difference in the generated asm code between the loops. I have no idea how to interpret the difference; now I assume it could be the extra boolean evaluation you are referring to.
Attila Kovacs 629 Posted October 12, 2020
move the loop into the inlined proc
Mike Torrettinni 198 Posted October 12, 2020
5 minutes ago, Attila Kovacs said:
move the loop into the inlinded proc

Oh, sorry, I guess I wasn't clear enough about my goal: the loop condition is used in numerous places, so I want to shorten the code and have 1 call to an inlined function that replaces all these (same) conditions. I updated the first post.
Attila Kovacs 629 Posted October 12, 2020
Well, maybe you could write a generic search class having the whole code only once in your app, but hard to give any advice without knowing the details. Just leave it as it is, it always works.
Mike Torrettinni 198 Posted October 12, 2020
1 minute ago, Attila Kovacs said:
Well, maybe you could write a generic search class having the whole code only once in your app, but hard to give any advice without knowing the details. Just leave it as it is, it always works.

Thanks, but in this case I was just looking for some understanding of why there is such a difference. This is a generic condition used to search the data, repeated across multiple record types. For example searching Project names, Item names, Group names...
Guest Posted October 12, 2020
I am not saying to go and do the following everywhere; it is somewhat ugly! Only when performance is the target and you have already done the measuring and pinpointed the culprit can you use some trick to strip the compiler of its stupidity and inefficiency. There are many tricks to do so, and here is one of them.

Replace that inlined function with this to recover the wasted time (or some of it, depending on your CPU and build bitness):

    function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): boolean; inline;
    begin
      //Result := ((aSearchValue = '') and (aItemValue <> '')) or (aItemValue = aSearchValue);
      Result := Boolean(Ord(aSearchValue = '') and Ord(aItemValue <> '') or Ord(aItemValue = aSearchValue));
    end;

On my CPU the time is identical for 32-bit; on 64-bit the inlined version is still slower, by only 7%.
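Before swapping the Ord-based form in, it may be worth confirming that it returns the same results as the plain boolean expression. The sketch below compares the two over a few sample values; the function names Plain and ViaOrd and the sample strings are assumptions made for illustration. One difference to keep in mind: the Ord form uses integer and/or, so all three comparisons are always evaluated, whereas the boolean form can short-circuit under the default {$B-} setting.

```pascal
program OrdTrickCheck;

{$APPTYPE CONSOLE}

// The original boolean form of the condition.
function Plain(const aSearchValue, aItemValue: string): Boolean;
begin
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

// The Ord-based form from the post above: comparisons become 0/1 integers,
// are combined with bitwise and/or, and are cast back to Boolean.
function ViaOrd(const aSearchValue, aItemValue: string): Boolean;
begin
  Result := Boolean(Ord(aSearchValue = '') and Ord(aItemValue <> '') or Ord(aItemValue = aSearchValue));
end;

const
  // Assumed sample values: empty, and two different non-empty strings.
  Samples: array[0..2] of string = ('', 'abc', 'xyz');

var
  s, v: Integer;
  allEqual: Boolean;
begin
  allEqual := True;
  // Try every (search value, item value) pair from the samples.
  for s := Low(Samples) to High(Samples) do
    for v := Low(Samples) to High(Samples) do
      if Plain(Samples[s], Samples[v]) <> ViaOrd(Samples[s], Samples[v]) then
        allEqual := False;
  Writeln('Both forms agree on all samples: ', allEqual);
end.
```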
David Heffernan 2345 Posted October 12, 2020
This test seems pointless because the two strings are always empty. You never read the strings from a collection. You never compare two strings. When you put this code into a realistic setting you'll likely find that it makes no difference to performance which version you use. This smells of premature optimisation. And that results in hard-to-maintain code.
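A more realistic measurement, in the spirit of the objection above, would read non-empty strings from a collection and actually compare them. The following is a minimal sketch under assumptions: the CountMatches helper, the array size and the generated item names are all hypothetical and stand in for real project data.

```pascal
program SearchBench;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): Boolean; inline;
begin
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

// Hypothetical helper: counts the items that match the search value.
function CountMatches(const aItems: TArray<string>; const aSearchValue: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(aItems) do
    if IsSearchByValueFound_Inlined(aSearchValue, aItems[i]) then
      Inc(Result);
end;

var
  items: TArray<string>;
  i, matches: Integer;
  sw: TStopwatch;
begin
  // Assumed test data: one million names with 1000 distinct values.
  SetLength(items, 1000000);
  for i := 0 to High(items) do
    items[i] := 'Item ' + IntToStr(i mod 1000);

  sw := TStopwatch.StartNew;
  matches := CountMatches(items, 'Item 42');
  sw.Stop;

  Writeln('matches: ', matches);
  Writeln('elapsed: ', sw.ElapsedMilliseconds, ' ms');
end.
```

With real string reads and comparisons in the loop, the cost of the extra boolean temp is likely to be a much smaller share of the total, which is the point being made; whether it is measurable at all has to be checked in the actual program.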
Mike Torrettinni 198 Posted October 12, 2020
1 hour ago, David Heffernan said:
This test seems pointless because the two strings are always empty. You never read the strings from a collection. You never compare two strings. When you put this code into a realistic setting you'll likely find that it makes no difference to performance which versions you use. The smells of premature optimisation. And that results in hard to maintain code.

Yes, I admit I have a tendency to try to optimize sometimes without measuring first.

2 hours ago, Kas Ob. said:
the compiler from its stupidity and inefficiency

I assume this is what is going on here, and of course it is also about my expectation of how the compiler should work. If I look at a simpler example, I can see the compiler generates the same code with or without the function call, as expected. In this simple example of Inc(), the asm code generated is the same:

    procedure Inc_Inlined(var aValue: integer); inline;
    begin
      Inc(aValue);
    end;

    procedure TForm2.FormCreate(Sender: TObject);
    var
      a: integer;
    begin
      a := 0;
      Inc(a);
      Inc_Inlined(a);
    end;

And now also with the documentation on the inline directive:

"The inline directive is a suggestion to the compiler. There is no guarantee the compiler will inline a particular routine, as there are a number of circumstances where inlining cannot be done. The following list shows the conditions under which inlining does or does not occur:"

Perhaps the documentation could mention that it is also not guaranteed that the generated code will be as efficient as if the code was entered directly instead of the function call. Sometimes it will be the same code, sometimes not.
David Heffernan 2345 Posted October 12, 2020
13 minutes ago, Mike Torrettinni said:
Yes, I admit I have a tendency to try optimize sometimes without measuring first.

Why are you doing this? Have you timed your actual program yet? Do the two options perform measurably differently? Is that code even a bottleneck?
Kryvich 165 Posted October 12, 2020
@Mike Torrettinni What version of Delphi are you using and what compiler directives are set? I ran your test in Delphi 10.3 CE and I got the same assembler code for both cases.

    TestStringHelper.dpr.23: if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue )
    004E9AA1 837DFC00         cmp dword ptr [ebp-$04],$00
    004E9AA5 7506             jnz $004e9aad
    004E9AA7 837DF800         cmp dword ptr [ebp-$08],$00
    004E9AAB 750B             jnz $004e9ab8
    004E9AAD 8B45F8           mov eax,[ebp-$08]
    004E9AB0 8B55FC           mov edx,[ebp-$04]
    004E9AB3 E8301CF2FF       call @UStrEqual
    ...
    TestStringHelper.dpr.29: if IsSearchByValueFound_Inlined(vSearchValue, vItemValue)
    004E9B49 837DFC00         cmp dword ptr [ebp-$04],$00
    004E9B4D 7506             jnz $004e9b55
    004E9B4F 837DF800         cmp dword ptr [ebp-$08],$00
    004E9B53 750B             jnz $004e9b60
    004E9B55 8B45F8           mov eax,[ebp-$08]
    004E9B58 8B55FC           mov edx,[ebp-$04]
    004E9B5B E8881BF2FF       call @UStrEqual
Stefan Glienke 2002 Posted October 12, 2020
11 minutes ago, Mike Torrettinni said:
Perhaps the documentation could mention that it is also not guaranteed that generated code will be as efficient as if the code was entered instead of function call. Sometimes it will be the same code, sometimes not.

Unfortunately that is true for almost all code in Delphi - and the reason why there are so many "do I better write the code like this or that" discussions - because we constantly have to help the compiler by writing code in certain ways when we want to get the optimum.

The inliner is not as effective as it could be - I would guess the reason is that the Delphi compiler is mostly a single-pass compiler, so it does not run another optimization step after inlining the code. That means that often there is register or stack juggling happening after the inlining took place that would not have been there if the code had been written there directly.

But again: measure and evaluate whether it matters. And be careful when measuring, because simply taking the two pieces of code and timing them won't be enough.
Mike Torrettinni 198 Posted October 12, 2020 (edited)
5 minutes ago, David Heffernan said:
Why are you doing this? Have you timed your actual program yet? Do the two options perform measurably differently? Is that code even a bottleneck?

I'm not sure what more I can write that I didn't put into the first post.

3 minutes ago, Kryvich said:
What version of Delphi are you using and what compiler directives are set? I ran your test in Delphi 10.3 CE and I got the same assembler code for both cases.

Aha, I didn't think about the version differences. I use Delphi 10.2.3. Not sure which compiler directives you are referring to; I don't set anything specific, just whatever is default in debug, as this is tested in a new empty project.

Edited October 12, 2020 by Mike Torrettinni
Stefan Glienke 2002 Posted October 12, 2020 (edited)
Classic measuring issues. I ran the same code in 10.4.1 and while it produces the same asm code, the first loop ran slower for me (461 vs 232). I have seen this before and I guess it's because the TStopwatch code is not yet in the cache for the first run - the same is true for the code to be measured. That is why, for running good benchmarks, you either run both in their own binary and not back to back in the same one, so that both are equally affected by being a cold run, or you simply run the benchmark once to warm up and then start measuring. There are more things to consider though, but I won't go into detail here.

Edit: compiled in 10.1 the second loop indeed runs slower for me as well.

    Unit1.pas.46: for i := 1 to loop do
    005CE6D2 8B1D54C95D00     mov ebx,[$005dc954]
    005CE6D8 85DB             test ebx,ebx
    005CE6DA 7E1A             jle $005ce6f6
    Unit1.pas.47: if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue )
    005CE6DC 837DFC00         cmp dword ptr [ebp-$04],$00
    005CE6E0 7506             jnz $005ce6e8
    005CE6E2 837DF800         cmp dword ptr [ebp-$08],$00
    005CE6E6 750B             jnz $005ce6f3
    005CE6E8 8B45F8           mov eax,[ebp-$08]
    005CE6EB 8B55FC           mov edx,[ebp-$04]
    005CE6EE E8DDC2E3FF       call @UStrEqual
    Unit1.pas.46: for i := 1 to loop do
    005CE6F3 4B               dec ebx
    005CE6F4 75E6             jnz $005ce6dc

vs

    Unit1.pas.53: for i := 1 to loop do
    005CE74B 8B1D54C95D00     mov ebx,[$005dc954]
    005CE751 85DB             test ebx,ebx
    005CE753 7E24             jle $005ce779
    Unit1.pas.54: if IsSearchByValueFound_Inlined(vSearchValue, vItemValue)
    005CE755 837DFC00         cmp dword ptr [ebp-$04],$00
    005CE759 7506             jnz $005ce761
    005CE75B 837DF800         cmp dword ptr [ebp-$08],$00
    005CE75F 7511             jnz $005ce772
    005CE761 8B45F8           mov eax,[ebp-$08]
    005CE764 8B55FC           mov edx,[ebp-$04]
    005CE767 E864C2E3FF       call @UStrEqual
    005CE76C 7404             jz $005ce772
    005CE76E 33C0             xor eax,eax
    005CE770 EB02             jmp $005ce774
    005CE772 B001             mov al,$01
    005CE774 84C0             test al,al
    Unit1.pas.53: for i := 1 to loop do
    005CE776 4B               dec ebx
    005CE777 75DC             jnz $005ce755

So there is indeed a difference in the code which affects the performance, which goes back to what I said before - the inliner not doing its best job. What you see here is that the compiler still generates that result variable, sets it to either true or false, and then checks that one. But the first loop is only faster because you don't do anything after the check.

Edit: One more thing that is important when directly comparing measurements like this - cache lines. In my case the second loop is always faster in 10.4 even though both of them have the same code generated - and that is simply because the first loop spans two cache lines and the second one only one. That is something you cannot influence easily and should not bother with, but it needs to be kept in mind when measuring code like this.

Edited October 12, 2020 by Stefan Glienke
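At source level, the difference between the two 10.1 listings can be pictured roughly as follows. This is only an approximation written for illustration, not compiler output: the routine names CountDirect and CountViaTemp are hypothetical, and an Inc(Result) is added after the check so the loops have an observable effect, which the original empty "then ;" did not.

```pascal
program TempBooleanSketch;

{$APPTYPE CONSOLE}

// Shape of the first loop: each comparison branches straight to the outcome.
function CountDirect(const aSearch, aItem: string; aLoops: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to aLoops do
    if (aSearch = '') and (aItem <> '') or (aItem = aSearch) then
      Inc(Result);
end;

// Approximate shape of the inlined loop in the 10.1 listing: the function
// result is first materialised as a Boolean temp (mov al,$01 / xor eax,eax)
// and only then tested (test al,al).
function CountViaTemp(const aSearch, aItem: string; aLoops: Integer): Integer;
var
  i: Integer;
  tmp: Boolean;
begin
  Result := 0;
  for i := 1 to aLoops do
  begin
    tmp := (aSearch = '') and (aItem <> '') or (aItem = aSearch);
    if tmp then
      Inc(Result);
  end;
end;

begin
  // Both shapes compute the same thing; only the generated code differs.
  Writeln('empty search, non-empty item: ', CountDirect('', 'abc', 10) = CountViaTemp('', 'abc', 10));
  Writeln('equal values:                 ', CountDirect('abc', 'abc', 10) = CountViaTemp('abc', 'abc', 10));
  Writeln('different values:             ', CountDirect('abc', 'xyz', 10) = CountViaTemp('abc', 'xyz', 10));
end.
```

Whether a given compiler version actually emits different code for these two routines has to be checked in the CPU view; newer versions may well generate the same instructions for both.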
Mike Torrettinni 198 Posted October 12, 2020
6 minutes ago, Stefan Glienke said:
Unfortunately that is true for almost all code in Delphi - and the reason why there are so many "do I better write the code like this or that" discussions - because we constantly have to help the compiler writing code in certain ways when we want to get the optimum. The inliner is not effective as it could be - I would guess the reason being the Delphi compiler is mostly a single pass compiler - so it does not run another optimization step after the inlined code. That means that often there is register or stack juggling happening after the inlining took place that would not have been there if the code would have written there directly. But again: measure and evaluate if it matters. And be careful when measuring it because simply taking both different codes and timing it won't be enough.

OK, makes sense. I would not expect some complex code to be inlined as if it were written outside the function, but for just simple string comparisons I assumed it would work as expected.

1 minute ago, Stefan Glienke said:
Classic measuring issues. I ran the same code in 10.4.1 and while it produces the same asm code the first loop ran slower for me (461 vs 232). I have seen this before and I guess its because the TStopwatch code is not yet in the cache for the first run - same is true for code to be measured. That is why for running good benchmarks you either run both in their own binary and not back to back in the same one - in order for them to be both affected by being a cold run or you simply run the benchmark once to warm up and then start measuring. There are more things to consider though but I won't go into detail here.

OK, thanks for this. I had no idea of such details.

I believe we have now confirmed that my version, Delphi 10.2.3, is less efficient than 10.3 and up regarding the inline directive, so this topic becomes irrelevant when I move to a newer version.
David Heffernan 2345 Posted October 12, 2020
38 minutes ago, Mike Torrettinni said:
I believe now we confirmed that my version Delphi 10.2.3 is less efficient compared to 10.3 and up, regarding inline directive, so this topic is irrelevant when I move to newer version.

That's not the conclusion of this topic.
Stefan Glienke 2002 Posted October 12, 2020 (edited)
Without going into a more detailed analysis for now (maybe that's a good topic for a future blog post), I would say that even though the older versions produced that extra temp variable and checked against it in the inlined code, the main reason why the two loops take different durations is that one sits in one cache line and the other in two.

We had the same situation a while ago in another thread when we measured different string handling routines. Some performed better or worse, and then suddenly a small change altered the result significantly, simply because the instructions emitted for the loop were located differently.

Some stuff to read about microbenchmarks: https://engineering.appfolio.com/appfolio-engineering/2019/1/7/microbenchmarks-vs-macrobenchmarks-ie-whats-a-microbenchmark

Edited October 12, 2020 by Stefan Glienke
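The warm-up advice given earlier in the thread can be sketched as a small harness that runs each candidate once untimed and then reports the best of several timed runs. Everything here is an assumption made for illustration: the names TBenchProc, BestOf, RunDirect and RunInlined, the run count, and the use of global variables (the original test used locals). It reduces cold-start noise only; it does nothing about where the loop instructions land relative to cache lines.

```pascal
program WarmupBench;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics;

type
  TBenchProc = procedure; // hypothetical: a parameterless routine to time

const
  LoopCount = 50000000; // assumption: reduced from the original 1e9
  Runs = 5;             // assumption: number of timed repetitions

var
  GSearchValue, GItemValue: string;
  GHits: Integer;

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): Boolean; inline;
begin
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

procedure RunDirect;
var
  i: Integer;
begin
  for i := 1 to LoopCount do
    if (GSearchValue = '') and (GItemValue <> '') or (GItemValue = GSearchValue) then
      Inc(GHits);
end;

procedure RunInlined;
var
  i: Integer;
begin
  for i := 1 to LoopCount do
    if IsSearchByValueFound_Inlined(GSearchValue, GItemValue) then
      Inc(GHits);
end;

// Runs aProc once untimed (warm-up), then returns the fastest of Runs timed runs.
function BestOf(aProc: TBenchProc): Int64;
var
  r: Integer;
  sw: TStopwatch;
begin
  aProc(); // warm-up run, not measured
  Result := High(Int64);
  for r := 1 to Runs do
  begin
    sw := TStopwatch.StartNew;
    aProc();
    sw.Stop;
    if sw.ElapsedMilliseconds < Result then
      Result := sw.ElapsedMilliseconds;
  end;
end;

begin
  GSearchValue := '';
  GItemValue := 'abc';
  Writeln('direct:  ', BestOf(RunDirect), ' ms');
  Writeln('inlined: ', BestOf(RunInlined), ' ms');
end.
```

Taking the minimum rather than the average is one common choice for microbenchmarks, since interference from the rest of the system can only make a run slower.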
Mike Torrettinni 198 Posted October 12, 2020
11 minutes ago, David Heffernan said:
That's not the conclusion of this topic.

I guess you are right, since I can't move to a newer version yet. Perhaps before or about the time of the 10.5 release. Until then, I will have to accept that this example of the inline directive in my code is not as efficient as it will be in the future 🙂

1 minute ago, Stefan Glienke said:
Without going into more detailed analysis for now (maybe that's a good topic for a future blog post) I would say even though the older versions produced that extra temp variable and checking against that with the inlined code the main reason why both loops take different durations is due to being in one or two cachelines. We had the same situation some while ago in another thread when we measured different string handling routines. Some performed better or worse and then suddenly a small change changed the result significantly simply because the instructions emitted for the loop were located differently.

Cache lines? New terminology to me, in Delphi.

Perhaps at that time (moving to 10.4 or 10.5) I will also have more experience in benchmarking, because some details pointed out in this topic were completely new to me (like how registers, order of execution, TStopWatch init/reset, cache lines and other details can affect the results).
David Heffernan 2345 Posted October 12, 2020
Can I ask again why you are timing something that has no relevance to the performance of the code that you care about? Once you time these variants in a real program, you will find that you won't be able to detect any difference in runtime. And then you will draw the conclusion that the best way to write the code is in a manner which avoids duplication of code, and which makes it easy to maintain. You'll likely also decide that there is no real gain in explicit inlining of the function here, and will stop doing that.
Kryvich 165 Posted October 12, 2020
29 minutes ago, Mike Torrettinni said:
Perhaps before or about the time of 10.5 release.

Embarcadero announced DelphiCon 2020 for November 17th. So I think a new version (10.5 or 10.4.x) is coming soon.
Dalija Prasnikar 1396 Posted October 12, 2020
11 minutes ago, Kryvich said:
Embarcadero announced DelphiCon 2020 for November 17th. So I think a new version (10.5 or 10.4.x) is coming soon.

The DelphiCon schedule is totally unrelated to any release. It is just an online Delphi conference. Previously we had CodeRage in about the same timeframe.
Stano 143 Posted October 12, 2020
From the wise book: optimize the application when it is really needed. A maximum of 10% of "bottlenecks" need to be optimized to increase performance sufficiently.