Mike Torrettinni

Simple inlined function question


47 minutes ago, David Heffernan said:

Can I ask again why you are timing something that has no relevance to the performance of the code that you care about?

 

Once you time these variants in a real program, you will find that you won't be able to detect any difference in runtime.  And then you will draw the conclusion that the best way to write the code is in a manner which avoids duplication of code, and which makes it easy to maintain.  You'll likely also decide that there is no real gain in explicit inlining of the function here, and will stop doing that.

I'm still not sure what is missing from the original post. There is an example with timing results, focused just on the difference between non-inlined and inlined code. I also had a timing for the non-inlined function, which was the slowest, but the focus of this topic is why simple inlined function is not generating same code as non-inlined code.

 

 

31 minutes ago, Dalija Prasnikar said:
48 minutes ago, Kryvich said:

Embarcadero announced DelphiCon 2020 for November 17th. So I think a new version (10.5 or 10.4.x) is coming soon.

The DelphiCon schedule is totally unrelated to any release. It is just an online Delphi conference. Previously we had CodeRage in about the same timeframe.

Based on past releases: https://delphi.fandom.com/wiki/Delphi_Release_Dates   I'm not expecting 10.5 before fall 2021.

2 hours ago, Stefan Glienke said:

So there is indeed a difference in the code which affects the performance which goes back to what I said before - the inliner not doing its best job - what you see here is that the compiler still generates that result variable and either sets it to true or to false and then checks that one. But the first loop is only faster because you don't do anything after the check.

Perhaps inline directive generates even less performant code in earlier Delphi versions. Not sure when it was released first.

20 minutes ago, Mike Torrettinni said:

Perhaps inline directive generates even less performant code in earlier Delphi versions. Not sure when it was released first.

Directive inline existed already in Borland Pascal.

57 minutes ago, Mike Torrettinni said:

the focus of this topic is why simple inlined function is not generating same code as non-inlined code

Well, it's reasonable to wonder about that, and Stefan talked about that.

 

But you actually spent a lot of time talking not about the codegen, but about performance. And the key point is that there is no performance difference for the two versions of the code that you presented, once you put the code in a context where it actually does something. After all, the code you showed doesn't initialise any variables, and just performs the exact same two comparisons on each iteration of the loop.

 

Here's a question for you, why don't you compare the run time of your code, with code where your for loop is removed? I bet it will be faster with the for loop removed. And the code does the same thing with or without the for loop. Obviously that's silly because the for loop is meant to represent some real world code. But it's only meaningful in the context of actual production code.

 

I've said it many times, but when you put your two variants into your real world program, you won't be able to tell them apart from the perspective of performance.

1 hour ago, Vandrovnik said:

Directive inline existed already in Borland Pascal.

But it could contain only the code that would be compiled 1:1

1 hour ago, Vandrovnik said:

Directive inline existed already in Borland Pascal.

Aha, OK. I was only able to find it in the documentation for Delphi 2010.

 

1 hour ago, A.M. Hoornweg said:

On my Delphi 10.4.1, both loops run equally fast (1214 and 1213 ms).  This is on a core i9 notebook processor.

As @Kryvich posted, it seems to have been improved in 10.3, unfortunately I still use 10.2.3 😞

 

 


Google shows Turbo Pascal 3 already had INLINE directive. Wow, we are talking about a dinosaur here, and I thought it was recently added.

 

Now that we know INLINE is getting improvements decades later, I see no reason to doubt any other features have great future, too! 🙂

2 hours ago, A.M. Hoornweg said:

On my Delphi 10.4.1, both loops run equally fast (1214 and 1213 ms).  This is on a core i9 notebook processor.

Those numbers make me assume that you ran in Debug config (i.e. without $O+).

1 hour ago, Mike Torrettinni said:

Now that we know INLINE is getting improvements decades later

It isn't.

 

1 hour ago, Mike Torrettinni said:

I see no reason to doubt any other features have great future, too!

Hmm .....

1 hour ago, Mike Torrettinni said:

Google shows Turbo Pascal 3 already had INLINE directive.

You are kind of right, but that was a totally different kind of inline:

Quote

INLINE statement
Turbo Pascal's INLINE statement provides a way to insert machine code directly into a Turbo Pascal program:


inline($B4/$02/    { mov ah,2 }
       $CD/$21 );  { int 21 h }
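
For contrast, the `inline` directive of modern Delphi marks an ordinary routine as a candidate for expansion at the call site instead of embedding raw machine code. The following is only a minimal sketch: `IsMatch` and its parameter names are hypothetical, and the condition is borrowed from the one discussed in this thread. The directive is a request, so the compiler may still emit a regular call.

```pascal
program InlineDirectiveDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not code from the original post.
// "inline" asks the compiler to expand the body at each call site.
function IsMatch(const ItemValue, SearchValue: string): Boolean; inline;
begin
  Result := (SearchValue = '') and (ItemValue <> '') or (ItemValue = SearchValue);
end;

begin
  // Two sample calls; whether they are really inlined can be checked
  // in the CPU view of the debugger.
  Writeln(IsMatch('abc', 'abc'));
  Writeln(IsMatch('abc', 'xyz'));
end.
```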

 

5 minutes ago, dummzeuch said:
2 hours ago, Mike Torrettinni said:

Google shows Turbo Pascal 3 already had INLINE directive.

You are kind of right, but that was a totally different kind of inline:

 

Aha, good to know. Now this also makes sense:

 

1 hour ago, Jacek Laskowski said:

Inline was added in D2005:

 

https://edn.embarcadero.com/article/33050

 

16 hours ago, Stefan Glienke said:

Those numbers make me assume that you ran in Debug config (i.e. without $O+).

 

No!  But your reply made me double-check.

 

I  ran the original test in a "tbutton.onclick" event instead of FormCreate, expecting that it would make no difference. It's just my way of doing things.

 

Weird enough, it DOES make a difference. Inside FormCreate() the numbers are 495 and 241 ms. 

Why the heck does stuff run slower in an OnClick() event?

 

 

1 hour ago, A.M. Hoornweg said:

 

No!  But your reply made me double-check.

 

I  ran the original test in a "tbutton.onclick" event instead of FormCreate, expecting that it would make no difference. It's just my way of doing things.

 

Weird enough, it DOES make a difference. Inside FormCreate() the numbers are 495 and 241 ms. 

Why the heck does stuff run slower in an OnClick() event?

 

 

Because timing a for loop doing 1,000,000,000 iterations of nothing meaningful is nothing more than a low grade PRNG.

 

Put something realistic inside the loop, and then time it.

Guest
1 hour ago, A.M. Hoornweg said:

Why the heck does stuff run slower in an OnClick() event?

Maybe we should ask a better question: "How to time it right?"

So here it is:

1) Refactor those loops into functions.

2) Isolate the I/O operations, which have nothing to do with the timing, and delay them until the timing is finished. Those calls to Memo.XX wipe the CPU cache and defeat its other enhancements, such as branch prediction. They go down into the OS kernel, where they might trigger a chain of events, from the OS VMM to the display driver itself. So make no outside calls before the timing (after the warm-up, of course) or during it.

3) After refactoring the loops into their own functions, warm up to give the CPU a chance to perform at its best. That means calling them a few times, at least 3, with a smaller count: if we are going to time 100 million iterations, call them 3 times with 1-5 million first, and only then start timing.

4) Rather than TStopWatch or other heavy classes, go for the smallest mechanism available, such as QueryPerformanceCounter or RDTSC (google them for more info).

5) Now to the real deal: align your functions and loops so that the results are consistent. The Delphi compiler gives no consideration to this, and when the CodeAlign directive was added it was limited to 16 bytes. Alignment is essential for CPU branching operations, though you should not be too concerned about it, as there is little you can do when the compiler is not helping. BUT there is a workaround for testing.

You can do this:

{$Codealign 16}
var
  Temp: Integer;

procedure Align11;
begin
  Inc(Temp);
end;

procedure Align12;
begin
  Inc(Temp);
end;

procedure Align13;
begin
  Inc(Temp);
end;
//  Repeat the above up to 7 so the function address can be moved from 10 to 80 (as an example), or use 4 to align to 64
....
{
// Here goes the rest of the unit
}

initialization  
// Comment/uncomment these calls so that the function(s) in question get an address ending in 00, 80, or even 40.
// To check the current addresses, use the CPU disassembler view in the debugger.
  Align11;
  Align12;
  Align13;
  Align14;
  Align15;
  Align16;
  Align17;

  Align21;
  Align22;
  Align23;
  Align24;
  Align25;
  //Align26;
  //Align27;

After doing all of that you still have a 99% chance of getting consistent results. To be sure, first run the same code in both loops; when the results agree, try the other code, and remember to check the address alignment.

One situation in particular has a heavy effect and should be ruled out: make sure your code does not cross a page boundary. In other words, no address ending in 000 should be crossed inside your functions, because each time EIP crosses one there is a chance of adding anywhere from a few to hundreds of CPU cycles internally. That delay varies greatly with the CPU architecture and even the model number. It is most likely irrelevant on modern CPUs, but a few cycles might still slip away with each iteration.

As for doing something meaningful, that can be relevant or irrelevant. The code in question here could simply find and return an index, in which case being meaningful needs only a "Break;" out of the loop with a simple conditional jump, making the loop and its conditional comparison the most relevant part.
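
Steps 1 to 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions: `CountLoop`, `TimeLoopMs`, and the iteration counts are hypothetical, the loop body is a stand-in rather than the code from the original post, and the alignment work from step 5 is not covered.

```pascal
program TimingSketch;
{$APPTYPE CONSOLE}
{$O+} // optimization on, as in a Release configuration

uses
  Winapi.Windows;

const
  WarmUpCount = 1000000;   // assumed warm-up size (1 million)
  TimedCount = 100000000;  // assumed measured size (100 million)

// Step 1: the loop under test lives in its own function.
// Hypothetical stand-in for the loop being compared.
function CountLoop(Count: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Count do
    if (I and 1) = 0 then
      Inc(Result);
end;

// Step 4: time with QueryPerformanceCounter; no I/O inside the timed region.
function TimeLoopMs(Count: Integer; out LoopResult: Integer): Double;
var
  Freq, StartTicks, StopTicks: Int64;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(StartTicks);
  LoopResult := CountLoop(Count);
  QueryPerformanceCounter(StopTicks);
  Result := (StopTicks - StartTicks) * 1000.0 / Freq;
end;

var
  Pass, LoopResult: Integer;
  ElapsedMs: Double;
begin
  // Step 3: warm up three times with a smaller count; timings are discarded.
  for Pass := 1 to 3 do
    TimeLoopMs(WarmUpCount, LoopResult);

  ElapsedMs := TimeLoopMs(TimedCount, LoopResult);

  // Step 2: output happens only after the timing has finished.
  Writeln('Loop result: ', LoopResult, '  elapsed: ', ElapsedMs:0:3, ' ms');
end.
```

Printing the loop result keeps the measured work observable, and a console program avoids the Memo calls mentioned in step 2.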


This is quite extensive prep work for accurate benchmarking. Do you know of any library that has something similar already implemented and is ready to be used?


The best test bench is your program. Try both variants on real data and compare results.

40 minutes ago, Mike Torrettinni said:

This is quite extensive prep work for accurate benchmarking. Do you know of any library that has something similar already implemented and is ready to be used?

I don't really buy what Kas is saying above. Just time your actual program in usage scenarios that you care about.

After all, why would you care about the performance of code that you never run? You only care about the code that you do run, or your users run.

Guest
15 minutes ago, Mike Torrettinni said:

Do you know of any library that has something similar already implemented and is ready to be used?

Very good question, but there is no good answer to it, as no library will do this for you. Even with the most advanced and powerful compilers, some developers do this manually.

So I will try to explain why it is largely irrelevant in most situations, for now and for the current Delphi compiler. Many things control and affect performance, some by boosting speed and others by not causing slowness or wasting CPU time; I think you understand the difference between the two. We are stuck with the Delphi compiler, and it does what it does. Unlike most (if not all) other compilers, it pays little attention to these low-level optimizations and hardly considers the suggestions from the CPU manufacturers; almost none of them are used. Adding that would be a relatively huge amount of work, though not for specialized professionals, whom I think Embarcadero is not considering bringing in (losing focus again).

Anyway, we are stuck with these compilers and need to cope, so attack and optimize only what you have really measured and where you see a real gain.

 

Now, what types of optimization can you do? There are two.

1) The lowest-level approach: coding in assembly, calling the OS API directly, and ditching OOP and classes. Such code is hard to maintain or update, so go for this only after making sure it will bring you a real advantage.

2) Changing the algorithm of the code in question. To explain this I will take your own code as an example. You have this:

Quote

if (vSearchValue = '') and (vItemValue <> '') or (vItemValue = vSearchValue )

I don't know what the rest of your loop looks like, but merely as an example, let's imagine that vSearchValue is a fixed value (for instance, provided by the caller). You are right to check it against an empty string, but should you be doing this on every iteration?

Of course not. Check it for emptiness once, then either enter the loop or skip it. This makes the loop condition look like this:

Quote

 (vItemValue <> '') or (vItemValue = vSearchValue )

Is this faster? In the best case it is one third faster. Now let's take another pass over the new line: do you really need to check whether vItemValue is empty?

We now know vSearchValue is not empty, so checking vItemValue for emptiness is useless: comparing the two strings already covers the empty case. That means your if condition is now

Quote

(vItemValue = vSearchValue )

Is this faster? Yes, and in the best case it is two thirds faster than your initial code, while we didn't compromise or alter the functionality.

I hope you get the idea of higher-level, algorithmic optimization. As the example above shows, even if you had translated your initial code into assembly and used all the tricks to help the CPU, we managed to get better performance without such low-level optimization and without losing the readability of the code.
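
The hoisting idea can be sketched as below. This is an assumption-laden illustration, not the original loop: `CountMatchesOriginal`, `CountMatchesHoisted`, and the sample data are hypothetical, and counting matches stands in for whatever the real loop does. Because `and` binds tighter than `or` in Delphi, an empty search value makes the original condition true for every item, so that case can be answered once, outside the loop.

```pascal
program HoistInvariantDemo;
{$APPTYPE CONSOLE}

// Original shape: the invariant test on SearchValue runs on every iteration.
function CountMatchesOriginal(const Items: array of string;
  const SearchValue: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Items) do
    if (SearchValue = '') and (Items[I] <> '') or (Items[I] = SearchValue) then
      Inc(Result);
end;

// Hoisted shape: the empty-search case is decided once before the loop,
// leaving a single comparison per iteration.
function CountMatchesHoisted(const Items: array of string;
  const SearchValue: string): Integer;
var
  I: Integer;
begin
  if SearchValue = '' then
    Exit(Length(Items)); // every item satisfies the original condition
  Result := 0;
  for I := 0 to High(Items) do
    if Items[I] = SearchValue then
      Inc(Result);
end;

begin
  // Both variants should report the same counts for the same input.
  Writeln(CountMatchesOriginal(['a', '', 'b', 'a'], 'a'), ' ',
          CountMatchesHoisted(['a', '', 'b', 'a'], 'a'));
  Writeln(CountMatchesOriginal(['a', '', 'b', 'a'], ''), ' ',
          CountMatchesHoisted(['a', '', 'b', 'a'], ''));
end.
```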

 

PS: When I really need to time an algorithm at the CPU level, I use FPC for its code consistency. I tried LLVM a long time ago and it is very powerful, but the version included with Delphi is somehow unfruitful, as if castrated.

Guest
1 minute ago, David Heffernan said:

I don't really buy what Kas is saying above.

I am not selling.

10 minutes ago, Kas Ob. said:

I hope you got the idea of going on higher level optimization for algorithm, as you see from the example above, even if you translate your initial code into assembly and used utilized all the tricks to help the CPU, we managed to get better performance without such low level optimization and without losing the readability of the code.

Thanks, that's the only level I can even try to optimize 🙂 I've only started looking at generated asm code recently (see thread about const records access with TList access) and it was interesting to see the pitfalls of the inline directive.

I will try to look at the generated code more frequently, step by step. But prepping registers is way out of my league, for now.

 

Usually I'm more interested in how I can improve performance expressed in percentages and not what this means in actual time - unless actual time is a big problem. So, if all my benchmarks use same 'not best optimized' library (TStopWatch), I'm happy if percentage of improvement is at accepted level, for the effort needed.

Guest
14 minutes ago, Mike Torrettinni said:

Usually I'm more interested in how I can improve performance expressed in percentages and not what this means in actual time - unless actual time is a big problem. So, if all my benchmarks use same 'not best optimized' library (TStopWatch), I'm happy if percentage of improvement is at accepted level, for the effort needed.

Work on your algorithms, the higher-level approaches; that is what really brings you performance. TStopWatch will work as long as you always use it. Don't concern yourself much with a 50% gain in small loops, because the compiler is not on your side whatever you do, whereas a better algorithm might gain you a many-fold speedup instead of merely doubling the speed of a very small loop. Such a loop takes 1 ms (loops with thousands of entries are usually measured in microseconds, so not even a millisecond), while a repaint of the main form might take 1-10 ms.

