Mike Torrettinni

Simple inlined function question


Using Delphi 10.2.3 version:

I have a very simple example of an inlined function that is faster than the non-inlined function, but not as fast as the same code with no function call at all:

 

Goal with the inlined function: the loop condition is used in numerous places, so I want to shorten the code and have one call to an inlined function that replaces all of these (identical) conditions.

 

I was hoping to shorten the code by using a function, and assumed that inline would simply 'replace' the call with the exact same condition as in the function body. So, in theory, the execution time should be the same.

But the loop without the inlined function is clearly faster than the loop with it, approx. 1570 ms vs 2175 ms (the inlined loop takes almost 40% longer).

uses System.Diagnostics;

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): boolean; inline;
begin
  // "and" binds tighter than "or": (search empty and item not empty) or (item = search)
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

procedure TForm2.FormCreate(Sender: TObject);
const loop: integer = 1000000000;
var vSearchValue, vItemValue: string; // never assigned, so both stay empty
    i: integer;
    sw: TStopWatch;
begin
  memo1.Clear;

  // Loop 1: condition written out directly
  sw := TStopWatch.StartNew;
  for i := 1 to loop do
    if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue)
    then ;
  sw.Stop;
  memo1.Lines.Add(sw.ElapsedMilliseconds.ToString);

  // Loop 2: same condition through the inlined function
  sw := TStopWatch.StartNew;
  for i := 1 to loop do
    if IsSearchByValueFound_Inlined(vSearchValue, vItemValue)
    then ;
  sw.Stop;
  memo1.Lines.Add(sw.ElapsedMilliseconds.ToString);

end;

Is that just how it is, or am I missing something, and should the loop using the inlined function run in exactly the same time as the loop without the function call?

 

 

Edited by Mike Torrettinni
Goal details + Delphi version

4 minutes ago, Attila Kovacs said:

change the order of the 2 loops and give us the new results

Same, loop without call to inlined function is 50% faster. Did you expect different results depending on the order of loops?


Register optimization; I never put more than one test into one procedure. But the problem is that the inlining introduces an extra boolean evaluation instead of just the 3 comparisons in the "if". 🤷‍♂️

1 minute ago, Attila Kovacs said:

the problem is that the inlining introduces an extra boolean evaluation instead of just the 3 comparisons in the "if".

I see. Do you think there is a way to avoid this extra boolean evaluation, perhaps by restructuring the function?

I'm not familiar with asm, but I see a difference in the generated asm code between the loops. I have no idea how to interpret the difference; I now assume it could be the extra boolean evaluation you are referring to.

5 minutes ago, Attila Kovacs said:

move the loop into the inlined proc

Oh, sorry, I guess I wasn't clear enough about my goal: the loop condition is used in numerous places, so I want to shorten the code and have one call to an inlined function that replaces all of these (identical) conditions.

I updated the first post.


Well, maybe you could write a generic search class having the whole code only once in your app, but hard to give any advice without knowing the details.

Just leave it as it is, it always works.

1 minute ago, Attila Kovacs said:

Well, maybe you could write a generic search class having the whole code only once in your app, but hard to give any advice without knowing the details.

Just leave it as it is, it always works.

Thanks, but in this case I was just looking for some understanding of why there is such a difference. This is a generic condition used to search the data, repeated across multiple record types. For example, searching Project names, Item names, Group names...

 

Guest

I am not saying you should go and do the following everywhere; it is somewhat ugly!

Only when performance is the target, and you have already done the measuring and pinpointed the culprit, can you use some trick to work around the compiler's stupidity and inefficiency. There are many tricks to do so, and here is one of them.

Replace that inlined function with this to recover the wasted time (or some of it, depending on your CPU and build bitness):

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): boolean; inline;
begin
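  // Original boolean version, for comparison:
  //Result := ((aSearchValue = '') and (aItemValue <> '')) or (aItemValue = aSearchValue);
  // Ord() turns each comparison into 0/1, so "and"/"or" become bitwise integer
  // operations and all three comparisons are always evaluated (no short-circuit).
  Result := Boolean(Ord(aSearchValue = '') and Ord(aItemValue <> '') or Ord(aItemValue = aSearchValue));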
end;

On my CPU the timings are identical for 32-bit; on 64-bit the inlined version is still slower, but only by 7%.

 


This test seems pointless because the two strings are always empty. You never read the strings from a collection. You never compare two strings. 

 

When you put this code into a realistic setting you'll likely find that it makes no difference to performance which versions you use. 

 

This smells of premature optimisation, and that results in hard-to-maintain code.
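
A minimal sketch of what a somewhat more realistic test could look like: the strings are read from an array, actual string comparisons take place, and the result of the condition is used (counted). The names `CountDirect` and `CountInlined`, the array size and the search value are assumptions made up for illustration; only `IsSearchByValueFound_Inlined` comes from the first post.

```pascal
program SearchBench;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

function IsSearchByValueFound_Inlined(const aSearchValue, aItemValue: string): Boolean; inline;
begin
  Result := (aSearchValue = '') and (aItemValue <> '') or (aItemValue = aSearchValue);
end;

// Hypothetical helper: condition written out at the call site.
function CountDirect(const aItems: TArray<string>; const aSearchValue: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(aItems) do
    if (aSearchValue = '') and (aItems[i] <> '') or (aItems[i] = aSearchValue) then
      Inc(Result);
end;

// Hypothetical helper: same condition through the inlined function.
function CountInlined(const aItems: TArray<string>; const aSearchValue: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(aItems) do
    if IsSearchByValueFound_Inlined(aSearchValue, aItems[i]) then
      Inc(Result);
end;

var
  Items: TArray<string>;
  i, n: Integer;
  sw: TStopwatch;
begin
  // Made-up test data: one million short strings with repeating values.
  SetLength(Items, 1000000);
  for i := 0 to High(Items) do
    Items[i] := 'Item' + IntToStr(i mod 1000);

  sw := TStopwatch.StartNew;
  n := CountDirect(Items, 'Item42');
  sw.Stop;
  Writeln('direct:  ', n, ' matches, ', sw.ElapsedMilliseconds, ' ms');

  sw := TStopwatch.StartNew;
  n := CountInlined(Items, 'Item42');
  sw.Stop;
  Writeln('inlined: ', n, ' matches, ', sw.ElapsedMilliseconds, ' ms');
end.
```

Both functions must report the same number of matches; whether the timings differ measurably is exactly what such a test is meant to find out.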

1 hour ago, David Heffernan said:

This test seems pointless because the two strings are always empty. You never read the strings from a collection. You never compare two strings. 

 

When you put this code into a realistic setting you'll likely find that it makes no difference to performance which versions you use. 

 

This smells of premature optimisation, and that results in hard-to-maintain code.

Yes, I admit I have a tendency to try to optimize sometimes without measuring first.

 

 

2 hours ago, Kas Ob. said:

the compiler from its stupidity and inefficiency

I assume this is what is going on here, and of course it also comes down to my expectation of how the compiler should work. If I look at a simpler example, I can see that the compiler generates the same code with or without the function call, as expected:

 

In this simple example of Inc(), the asm code generated is the same:

 

procedure Inc_Inlined(var aValue: integer); inline;
begin
  Inc(aValue);
end;

procedure TForm2.FormCreate(Sender: TObject);
var a: integer;
begin
  a := 0;
  Inc(a);
  Inc_Inlined(a);
end;

 

[Attached image: image.png]

 

And now also with the documentation on the inline directive:

"The inline directive is a suggestion to the compiler. There is no guarantee the compiler will inline a particular routine, as there are a number of circumstances where inlining cannot be done. The following list shows the conditions under which inlining does or does not occur: [...]"

 

Perhaps the documentation could mention that it is also not guaranteed that the generated code will be as efficient as if the code had been written in place of the function call. Sometimes it will be the same code, sometimes not.

 

13 minutes ago, Mike Torrettinni said:

Yes, I admit I have a tendency to try optimize sometimes without measuring first.

Why are you doing this? Have you timed your actual program yet? Do the two options perform measurably differently? Is that code even a bottleneck?


@Mike Torrettinni What version of Delphi are you using and what compiler directives are set? I ran your test in Delphi 10.3 CE and I got the same assembler code for both cases.

 

TestStringHelper.dpr.23: if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue )
004E9AA1 837DFC00         cmp dword ptr [ebp-$04],$00
004E9AA5 7506             jnz $004e9aad
004E9AA7 837DF800         cmp dword ptr [ebp-$08],$00
004E9AAB 750B             jnz $004e9ab8
004E9AAD 8B45F8           mov eax,[ebp-$08]
004E9AB0 8B55FC           mov edx,[ebp-$04]
004E9AB3 E8301CF2FF       call @UStrEqual
...
TestStringHelper.dpr.29: if IsSearchByValueFound_Inlined(vSearchValue, vItemValue)
004E9B49 837DFC00         cmp dword ptr [ebp-$04],$00
004E9B4D 7506             jnz $004e9b55
004E9B4F 837DF800         cmp dword ptr [ebp-$08],$00
004E9B53 750B             jnz $004e9b60
004E9B55 8B45F8           mov eax,[ebp-$08]
004E9B58 8B55FC           mov edx,[ebp-$04]
004E9B5B E8881BF2FF       call @UStrEqual

 

11 minutes ago, Mike Torrettinni said:

Perhaps the documentation could mention that it is also not guaranteed that generated code will be as efficient as if the code was entered instead of function call. Sometimes it will be the same code, sometimes not.

Unfortunately that is true for almost all code in Delphi, and it is the reason why there are so many "should I write the code like this or like that" discussions: we constantly have to help the compiler by writing code in certain ways when we want to get the optimum.

The inliner is not as effective as it could be - I would guess the reason is that the Delphi compiler is mostly a single-pass compiler, so it does not run another optimization step after inlining the code. That means there is often register or stack juggling after the inlining took place that would not have been there if the code had been written there directly. But again: measure and evaluate whether it matters. And be careful when measuring, because simply taking the two different pieces of code and timing them won't be enough.

5 minutes ago, David Heffernan said:

Why are you doing this? Have you timed your actual program yet? Do the two options perform measurably differently? Is that code even a bottleneck?

I'm not sure what more I can write that I didn't put into first post.

 

3 minutes ago, Kryvich said:

What version of Delphi are you using and what compiler directives are set? I ran your test in Delphi 10.3 CE and I got the same assembler code for both cases.

Aha, I didn't think about the version differences. I use Delphi 10.2.3. I'm not sure which compiler directives you are referring to; I don't set anything specific, just whatever is the default in debug, as this is tested in a new empty project.

Edited by Mike Torrettinni


Classic measuring issues. I ran the same code in 10.4.1, and while it produces the same asm code for both loops, the first loop ran slower for me (461 vs 232).

I have seen this before and I guess it's because the TStopwatch code is not yet in the cache for the first run - the same is true for the code being measured. That is why, for good benchmarks, you either run each test in its own binary rather than back to back in the same one, so that both are equally affected by being a cold run, or you simply run the benchmark once to warm up and then start measuring. There are more things to consider, but I won't go into detail here.
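
As a rough illustration of the warm-up approach, a minimal sketch: the code under test is run once untimed, then timed several times, and the best time is kept. `BestOfRuns`, `TBenchProc`, `Workload` and `Sink` are hypothetical names, and keeping the minimum of several runs is an assumption of this sketch, not something prescribed above.

```pascal
program WarmupBench;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

type
  TBenchProc = procedure;

var
  Sink: Integer; // global, so the loop's result is actually used

// Hypothetical workload: the condition from the first post, with its result counted.
procedure Workload;
var
  i: Integer;
  vSearchValue, vItemValue: string;
begin
  vSearchValue := '';
  vItemValue := 'abc';
  for i := 1 to 10000000 do
    if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue) then
      Inc(Sink);
end;

// Runs aProc once untimed (warm-up), then aRuns timed runs.
// Returns the best time in milliseconds (High(Int64) if aRuns < 1).
function BestOfRuns(aProc: TBenchProc; aRuns: Integer): Int64;
var
  i: Integer;
  sw: TStopwatch;
begin
  aProc(); // warm-up: brings the measured code and TStopwatch into the cache
  Result := High(Int64);
  for i := 1 to aRuns do
  begin
    sw := TStopwatch.StartNew;
    aProc();
    sw.Stop;
    if sw.ElapsedMilliseconds < Result then
      Result := sw.ElapsedMilliseconds;
  end;
end;

begin
  Writeln('best of 5: ', BestOfRuns(Workload, 5), ' ms');
end.
```

Calling through a procedure variable adds a little overhead per run, which is negligible here because each run executes millions of iterations.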

 

Edit: compiled in 10.1 the second loop indeed runs slower for me as well.

 

Unit1.pas.46: for i := 1 to loop do
005CE6D2 8B1D54C95D00     mov ebx,[$005dc954]
005CE6D8 85DB             test ebx,ebx
005CE6DA 7E1A             jle $005ce6f6
Unit1.pas.47: if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue )
005CE6DC 837DFC00         cmp dword ptr [ebp-$04],$00
005CE6E0 7506             jnz $005ce6e8
005CE6E2 837DF800         cmp dword ptr [ebp-$08],$00
005CE6E6 750B             jnz $005ce6f3
005CE6E8 8B45F8           mov eax,[ebp-$08]
005CE6EB 8B55FC           mov edx,[ebp-$04]
005CE6EE E8DDC2E3FF       call @UStrEqual
Unit1.pas.46: for i := 1 to loop do
005CE6F3 4B               dec ebx
005CE6F4 75E6             jnz $005ce6dc

vs

 

Unit1.pas.53: for i := 1 to loop do
005CE74B 8B1D54C95D00     mov ebx,[$005dc954]
005CE751 85DB             test ebx,ebx
005CE753 7E24             jle $005ce779
Unit1.pas.54: if IsSearchByValueFound_Inlined(vSearchValue, vItemValue)
005CE755 837DFC00         cmp dword ptr [ebp-$04],$00
005CE759 7506             jnz $005ce761
005CE75B 837DF800         cmp dword ptr [ebp-$08],$00
005CE75F 7511             jnz $005ce772
005CE761 8B45F8           mov eax,[ebp-$08]
005CE764 8B55FC           mov edx,[ebp-$04]
005CE767 E864C2E3FF       call @UStrEqual
005CE76C 7404             jz $005ce772
005CE76E 33C0             xor eax,eax
005CE770 EB02             jmp $005ce774
005CE772 B001             mov al,$01
005CE774 84C0             test al,al
Unit1.pas.53: for i := 1 to loop do
005CE776 4B               dec ebx
005CE777 75DC             jnz $005ce755

So there is indeed a difference in the code which affects the performance, and that goes back to what I said before - the inliner not doing its best job. What you see here is that the compiler still generates that result variable, sets it to either true or false, and then checks it. But the first loop is only faster because you don't do anything after the check.

 

Edit: One more thing that is important when directly comparing measurements like this - cache lines. In my case the second loop is always faster in 10.4 even though both of them have the same code generated - and that is simply because the first loop spans two cache lines and the second one fits in only one. That is something you cannot influence easily and should not bother with, but it needs to be kept in mind when measuring code like this.
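
To check such a case by hand, here is a small sketch that computes how many cache lines a range of code addresses touches. It assumes a 64-byte cache line, which is typical for current x86 CPUs but is an assumption here; `CacheLinesTouched` is a hypothetical helper. The addresses are the first and last byte of the two loop bodies in the 10.1 listing above and will be different in any other build.

```pascal
program CacheLineCheck;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  CacheLineSize = 64; // assumption: typical x86 cache line size in bytes

// Number of cache lines touched by the bytes aFirst..aLast (inclusive).
function CacheLinesTouched(aFirst, aLast: NativeUInt): NativeUInt;
begin
  Result := (aLast div CacheLineSize) - (aFirst div CacheLineSize) + 1;
end;

begin
  // First loop: cmp at $005CE6DC through the 2-byte jnz at $005CE6F4.
  Writeln('loop 1: ', CacheLinesTouched($005CE6DC, $005CE6F5));
  // Second loop: cmp at $005CE755 through the 2-byte jnz at $005CE777.
  Writeln('loop 2: ', CacheLinesTouched($005CE755, $005CE778));
end.
```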

Edited by Stefan Glienke
6 minutes ago, Stefan Glienke said:

Unfortunately that is true for almost all code in Delphi, and it is the reason why there are so many "should I write the code like this or like that" discussions: we constantly have to help the compiler by writing code in certain ways when we want to get the optimum.

The inliner is not as effective as it could be - I would guess the reason is that the Delphi compiler is mostly a single-pass compiler, so it does not run another optimization step after inlining the code. That means there is often register or stack juggling after the inlining took place that would not have been there if the code had been written there directly. But again: measure and evaluate whether it matters. And be careful when measuring, because simply taking the two different pieces of code and timing them won't be enough.

OK, makes sense. I would not expect complex code to be inlined as if it were written outside a function, but for simple string comparisons I assumed it would work as expected.

 

1 minute ago, Stefan Glienke said:

Classic measuring issues. I ran the same code in 10.4.1, and while it produces the same asm code for both loops, the first loop ran slower for me (461 vs 232).

I have seen this before and I guess it's because the TStopwatch code is not yet in the cache for the first run - the same is true for the code being measured. That is why, for good benchmarks, you either run each test in its own binary rather than back to back in the same one, so that both are equally affected by being a cold run, or you simply run the benchmark once to warm up and then start measuring. There are more things to consider, but I won't go into detail here.

 

OK, thanks for this. I had no idea of such details.

 

I believe we have now confirmed that my version, Delphi 10.2.3, is less efficient than 10.3 and up with regard to the inline directive, so this topic will be irrelevant once I move to a newer version.

38 minutes ago, Mike Torrettinni said:

I believe we have now confirmed that my version, Delphi 10.2.3, is less efficient than 10.3 and up with regard to the inline directive, so this topic will be irrelevant once I move to a newer version.

That's not the conclusion of this topic.


Without going into a more detailed analysis for now (maybe that's a good topic for a future blog post), I would say that even though the older versions produce that extra temp variable and check against it in the inlined code, the main reason the two loops take different amounts of time is whether they sit in one or two cache lines. We had the same situation a while ago in another thread, when we measured different string handling routines. Some performed better or worse, and then a small change suddenly altered the result significantly, simply because the instructions emitted for the loop were located differently.

 

Some stuff to read about Microbenchmarks: https://engineering.appfolio.com/appfolio-engineering/2019/1/7/microbenchmarks-vs-macrobenchmarks-ie-whats-a-microbenchmark

Edited by Stefan Glienke

11 minutes ago, David Heffernan said:

That's not the conclusion of this topic.

I guess you are right, since I can't move to a newer version yet. Perhaps before or around the time of the 10.5 release. Until then, I will have to accept that this example of the inline directive in my code is not as efficient as it will be in the future 🙂

 

1 minute ago, Stefan Glienke said:

Without going into a more detailed analysis for now (maybe that's a good topic for a future blog post), I would say that even though the older versions produce that extra temp variable and check against it in the inlined code, the main reason the two loops take different amounts of time is whether they sit in one or two cache lines. We had the same situation a while ago in another thread, when we measured different string handling routines. Some performed better or worse, and then a small change suddenly altered the result significantly, simply because the instructions emitted for the loop were located differently.

Cache lines? New terminology to me, in Delphi.

 

Perhaps at that time (moving to 10.4 or 10.5) I will also have more experience in benchmarking, because some details pointed out in this topic were completely new to me (like how registers, order of execution, TStopWatch init/reset, cache lines and other details can affect the results).


Can I ask again why you are timing something that has no relevance to the performance of the code that you care about?

 

Once you time these variants in a real program, you will find that you won't be able to detect any difference in runtime.  And then you will draw the conclusion that the best way to write the code is in a manner which avoids duplication of code, and which makes it easy to maintain.  You'll likely also decide that there is no real gain in explicit inlining of the function here, and will stop doing that.

29 minutes ago, Mike Torrettinni said:

Perhaps before or about the time of 10.5 release.

Embarcadero announced DelphiCon 2020 for November 17th. So I think a new version (10.5 or 10.4.x) is coming soon.

11 minutes ago, Kryvich said:

Embarcadero announced DelphiCon 2020 for November 17th. So I think a new version (10.5 or 10.4.x) is coming soon.

The DelphiCon schedule is totally unrelated to any release. It is just an online Delphi conference. Previously we had CodeRage in about the same timeframe.

 


From the wise book: optimize the application when it is really needed. A maximum of 10% of "bottlenecks" need to be optimized to increase performance sufficiently
