Mike Torrettinni

QueryPerformanceCounter precision


If we can trust the Date/Time functions to consistently return an accurate timestamp, then I think we can simply count how many QPC ticks elapse within 1 s.

 

This function counts the QPC ticks within 999 ms, and if that is acceptably accurate, then we can calculate QPC ticks per ms/µs/ns... on the current PC.


This could probably be optimized to be even more accurate, but my PC shows 9,989,541 (within the accepted accuracy of 1%, assuming the true value is 10,000,000):

uses System.SysUtils, Winapi.Windows, System.DateUtils;

function GetQPCPerSecond: Int64;
var ms1, ms2: integer; // vars for milliseconds
    t1, t2: Int64;     // vars for QPC
begin
  // assuming the loop runs a few times within each 1 ms,
  // so we just need to wait for the first occurrence of the same ms as ms1, after 1 cycle of 1 second
  QueryPerformanceCounter(t1);
  ms1 := MilliSecondOf(System.SysUtils.Time) - 1; // reduce by 1 to skip current millisecond
  while True do
  begin
   QueryPerformanceCounter(t2);                   // acquire QPC
   ms2 := MilliSecondOf(System.SysUtils.Time);    // get milliseconds of current time
   if ms2 = ms1 then                              // check if milliseconds have cycled 1 second
    Break;
  end;

  Result := t2 - t1;
end;

var i: integer;
begin
  for i := 1 to 10 do // run a few  times in case computer is busy in bg
    Writeln('QPC per second: ' + GetQPCPerSecond.ToString);
end;

 

@Kas Ob. Can you test this function and see how its result relates to your frequency of 3331195?

 

 

8 hours ago, Mike Torrettinni said:

Can you test this function and see how its result relates to your frequency of 3331195?

The result after closing everything else on my Windows machine:

[screenshot not preserved]

And with one IDE running plus Chrome with a YouTube video open (yes, once opened, YouTube starts playing around with the timer resolution, so it affects the IDE and everything else; the IDE does the same!):

[screenshot not preserved]

 

Now let's look at your code:

1) MilliSecondOf might return 0, which means ms1 = -1 -> endless loop!

2) You are assuming you will hit that exact millisecond value again one second later. Based on what? An equality test like that invites unpredictable behaviour. To rephrase: let's assume the Windows timer is very accurate and the timer resolution is 2 ms. Then we should only ever see even values or only odd values, and the more accurate the OS is, the more consistently that holds. But nothing stops the clock from landing on the other parity now and then. So what if the start value was one of those rare hits on the other parity? The loop could then run for 1000 cycles (or forever) without seeing it again. The same logic applies to the biggest prime number below 1000: we would hit the same value again only after a number of cycles that depends on that prime, right?

3) There is no point chasing precise time measurement in protected mode on a non-real-time OS like Windows; you cannot achieve it in a well-defined and documented way.
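
Points 1 and 2 can both be sidestepped by testing for "at least one second elapsed" instead of waiting for one exact millisecond value. The following is only a minimal sketch under assumptions: the name MeasureQpcPerSecond and the choice of GetTickCount64 as the reference clock are made up for illustration and were not posted in this thread. Point 3 still applies, so the result is an estimate and nothing more.

```pascal
program QpcPerSecondSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, Winapi.Windows;

// Hypothetical helper: estimates QPC ticks per second against GetTickCount64.
function MeasureQpcPerSecond: Int64;
var
  StartMs, NowMs: UInt64;
  T1, T2: Int64;
begin
  // Wait for the tick count to change so the interval does not start mid-tick.
  StartMs := GetTickCount64;
  repeat
    NowMs := GetTickCount64;
  until NowMs <> StartMs;
  QueryPerformanceCounter(T1);
  StartMs := NowMs;

  // Exit on "at least 1000 ms elapsed", never on an exact millisecond value,
  // so a skipped value cannot cause an endless loop.
  repeat
    QueryPerformanceCounter(T2);
    NowMs := GetTickCount64;
  until NowMs - StartMs >= 1000;

  // Scale by the interval that actually elapsed (it may overshoot 1000 ms).
  Result := (T2 - T1) * 1000 div Int64(NowMs - StartMs);
end;

var
  Freq: Int64;
begin
  QueryPerformanceFrequency(Freq);
  Writeln('QPF reports:        ', Freq);
  Writeln('Measured QPC per s: ', MeasureQpcPerSecond);
end.
```

Because GetTickCount64 typically advances in steps of several milliseconds, the measured value should only be expected to land near the QPF value, not to match it exactly.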

 

Anyway, I respect your curiosity and your persistence to find out, don't lose that!

Just time the time you are spending on timing.

 

21 minutes ago, Kas Ob. said:

The result after closing everything else on my Windows machine:

And with one IDE running plus Chrome with a YouTube video open (yes, once opened, YouTube starts playing around with the timer resolution, so it affects the IDE and everything else; the IDE does the same!):

Not sure if you do the same, but unless I need debugging, I always run performance timings in a Release build with Run Without Debugging. It seems to give more consistent results. Could this also give better results on your PC?

 

26 minutes ago, Kas Ob. said:

1) MilliSecondOf might return 0, which means ms1 = -1 -> endless loop!

Thanks, didn't think of that!

27 minutes ago, Kas Ob. said:

2) You are assuming you will hit that exact millisecond value again one second later. Based on what?

Oh, it was just a simple idea. In my test that specific while loop was iterating around 6 million times within 1 s, so I assumed it would hit each 1 ms quite reliably. But you are right, if the OS skips odd/even milliseconds, then this could loop endlessly.

30 minutes ago, Kas Ob. said:

3) There is no point chasing precise time measurement in protected mode on a non-real-time OS like Windows; you cannot achieve it in a well-defined and documented way.

I hope that the higher up the scale we go, for example > 1 s, > 1 min..., the closer we get to decently consistent, comparable results.

 

For example: comparing the performance of a single call of StringReplace in Delphi 10.2.3 and 10.5 (when released) does not make sense. But preparing a benchmarking process that takes into account all (most/some) of the details pointed out in this thread should give results good enough to draw conclusions about better/worse/same performance.

 

 

 

 

 


This updated function seems to get even closer to 10,000,000 than the previous one:

 

[screenshot not preserved]

 

@Kas Ob. It takes the odd/even issue into account, assuming that if it hits 1 or 2 once, it will hit it again in the next 1 s cycle.

function GetQPCPerSecond: Int64;
var ms1, ms2: integer; // vars for milliseconds
    t1, t2: Int64;     // vars for QPC
begin
  // assuming the loop runs a few times within each 1 ms,
  // so we just need to wait for the first occurrence of the same ms as ms1, after 1 cycle of 1 second

  // get starting point (start with next 1s interval, at ms1 = 1, or ms = 2, in case OS skips 1 or 2)
  repeat
    QueryPerformanceCounter(t1);
    ms1 := MilliSecondOf(System.SysUtils.Time);
  until (ms1 = 1) or (ms1 = 2);

  // wait until next ms
  repeat
  until MilliSecondOf(System.SysUtils.Time) > ms1;

  // get QPC at the next 1s cycle
  repeat
    QueryPerformanceCounter(t2);                   // acquire QPC
    ms2 := MilliSecondOf(System.SysUtils.Time);    // get milliseconds of current time
  until (ms2 = ms1);            // end when 1s has cycled

  Result := t2 - t1;
end;

 

41 minutes ago, Mike Torrettinni said:

Could this also give better results on your PC?

I deliberately left it on Debug, it wasn't a mistake. Your question is a good one, so let's try to put this right once and for all.

 

Starting with Debug vs Release: that was a small portion of code, and the only difference is whether optimization is enabled. The difference in cycles consumed will also be very small, somewhere between 2 and, not sure, let's say 20 cycles. Even at 1000 cycles, how could this affect the result? Remember, your CPU is most likely something >3 GHz, which means 3 billion cycles per second, >3 million cycles per millisecond, 3 thousand cycles per microsecond, and >300 cycles per nanosecond. Now do you think it will be that relevant?

No, it will not. But this raises a very ugly question: why do the results differ that much, even in your last posted result? The answer is very complex, as there are many factors playing a role here.

Remember that we are in a controlled and protected sandbox (the Windows OS emulates a sandbox). It controls some aspects of how your code (software) executes, and it also controls how the CPU cores switch between hardware execution points, emulating threads. This is done to protect integrity and to simulate real-world multitasking.

Cache misses also have a huge impact on the result. One might say that this code never accesses more memory than fits in the CPU's L1 cache. True, but a context switch directs the CPU core to read code from a different place and load that chunk, and that code needs its own stack, so that makes two reads. These two reads most likely reside in very different regions, so the L1 load triggers L2 and L3 loads too. And while L1 works with lines of 64 bytes, L2 and L3 work on larger blocks and will ask for a full page load, meaning 4 KB. For every load, data already in the cache is evicted, and it must be guaranteed to be written back to the memory module.

 

One context switch might take 10k cycles, but it might also take a few billion, and that is something you can't predict or control (there are ways to mitigate or control this to some extent, but I am not posting any of them on this forum, as they are close to writing a rootkit).

Also, to understand this effect, I just ran your latest code twice, and here is the result:

[two screenshots not preserved]

I didn't close my open IDEs or this browser.

 

In other runs I got results that differed even more than in the screenshots. Look at the 10-second run: there were more than 30 context switches, the two runs differed by 5 switches and around 700 million cycles, and my system reports >1180 threads up and running, so there is no way to guarantee that each switch went to the same other thread.

So after all of that, do you think measuring this stuff will be accurate on Windows? What accuracy can you reach?

 

43 minutes ago, Mike Torrettinni said:

It takes the odd/even issue into account, assuming that if it hits 1 or 2 once, it will hit it again in the next 1 s cycle.

Nope, there is zero guarantee that it will ever happen, though statistically it will. Now refer back to the cycles-per-fraction-of-a-second figures I gave above.

 

The only way to measure time with higher precision on Windows is to use averages over a longer running time (it might also help to use other statistical methods, like deviation and excluding ranges, e.g. removing some results as outliers caused by OS interference based on their distance from the median, etc.).

 

Hope that was clear and helpful.

11 minutes ago, Kas Ob. said:

Hope that was clear and helpful.

Thanks! I would love to say I understand everything, but I think I got the main point.

 

22 minutes ago, Kas Ob. said:

The only way to measure time with higher precision on Windows is to use averages over a longer running time (it might also help to use other statistical methods, like deviation and excluding ranges, e.g. removing some results as outliers caused by OS interference based on their distance from the median, etc.).

I think I'm starting to realize this, yes. I already noticed that using a trimmed mean (20%) removes most of the edge cases, warmup, and other odd values. Perhaps more than 100/1000/100000 reps and a trimmed mean (50%) could be even better.
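
For reference, a trimmed mean can be sketched as below. This is an illustration under assumptions, not the code used for the numbers in this thread: the function name TrimmedMean is hypothetical, and "20%" is interpreted here as the total share of samples discarded, split evenly between the lowest and the highest values.

```pascal
program TrimmedMeanSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

// Hypothetical helper. TrimPercent is the total share of samples discarded,
// half from the low end and half from the high end (an assumed convention).
function TrimmedMean(const Samples: array of Double; TrimPercent: Integer): Double;
var
  Sorted: TArray<Double>;
  I, Cut: Integer;
  Sum: Double;
begin
  if Length(Samples) = 0 then
    raise EArgumentException.Create('No samples');
  if (TrimPercent < 0) or (TrimPercent > 99) then
    raise EArgumentException.Create('TrimPercent must be in 0..99');

  // Sort a copy so the caller's data stays untouched.
  SetLength(Sorted, Length(Samples));
  for I := 0 to High(Samples) do
    Sorted[I] := Samples[I];
  TArray.Sort<Double>(Sorted);

  Cut := Length(Sorted) * TrimPercent div 200; // samples dropped at each end
  Sum := 0;
  for I := Cut to High(Sorted) - Cut do
    Sum := Sum + Sorted[I];
  Result := Sum / (Length(Sorted) - 2 * Cut);
end;

begin
  // Ten made-up timings in microseconds; 95.0 stands in for a run
  // that was hit by a context switch.
  Writeln(TrimmedMean([10.1, 10.0, 10.2, 95.0, 10.1, 9.9, 10.0, 10.3, 10.1, 10.0], 20):0:3);
end.
```

With these made-up samples the plain mean is about 18.6, while dropping the single lowest and highest value brings the result to about 10.1, which is the effect described above.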

 

34 minutes ago, Kas Ob. said:

So after all of that, do you think measuring this stuff will be accurate on Windows? What accuracy can you reach?

Yes, well, accurate enough. It's just a matter of finding the right combination of benchmarking details, time, and interpretation of results: reps, mean, min, max...

 

 

 

 


Oh, I hope it is clear that calculating QPC per second was meant to find the 'real' QP frequency.

I assumed that if QPF = 10,000,000 on a computer is a bug, then QPC would not be aligned with it. But as we tested, QPF and QPC per second are the same (or as close as possible): 10,000,000 on my computer, 3,331,195 on yours.

 

34 minutes ago, Mike Torrettinni said:

Oh, I hope it is clear that calculating QPC per second was meant to find the 'real' QP frequency.

I assumed that if QPF = 10,000,000 on a computer is a bug, then QPC would not be aligned with it. But as we tested, QPF and QPC per second are the same (or as close as possible): 10,000,000 on my computer, 3,331,195 on yours.

You called it a bug and I didn't correct you on that. There is no bug at all, because these two values control your OS tempo, like a maestro. The thing is, they hide the real hardware frequency, and this is where Microsoft does things differently from, let's say, Apple. If you overclock your hardware these values might change, but the result of using the pair together will stay consistent. Not so with Apple: if you have ever tried to install a Hackintosh (not illegal to install) on your OEM PC, you might have faced this problem. I have an unlocked i7-2600K CPU, and by default, from the first start, it ran at 3.8 GHz instead of 3.4 GHz; every time I reset the BIOS the motherboard does that. Anyway, installing/running a Hackintosh on an overclocked/downclocked PC will fail unless you specify the timing parameter in the boot settings with 100% accuracy. This is not needed on Apple hardware.

 

Now, to access hardware timing, like the motherboard RTC, you need to be able to execute specific hardware instructions that are prohibited in user mode, plain and simple.

Also, while timing on the OS will not be accurate, you can use a different approach: counting cycles per instruction block. Again, it is not everyone's cup of tea, but it is the most accurate way to compare speed and performance between algorithms/code blocks.

 

QPF and QPC values are worthless on their own, without being used together, but they are always right.
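
To illustrate why the two values only make sense as a pair, here is a small sketch that turns a QPC tick difference into microseconds by dividing by QPF. The helper name TicksToMicroseconds is hypothetical, and Sleep(50) is just a stand-in for the code being measured.

```pascal
program QpcToTimeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, Winapi.Windows;

// Hypothetical helper: converts a QPC tick delta to microseconds using QPF.
function TicksToMicroseconds(Ticks, Frequency: Int64): Double;
begin
  Result := Ticks * 1000000.0 / Frequency;
end;

var
  Freq, T1, T2: Int64;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(T1);
  Sleep(50); // stand-in for the code being measured
  QueryPerformanceCounter(T2);
  Writeln(Format('%d ticks at %d Hz = %.1f us',
    [T2 - T1, Freq, TicksToMicroseconds(T2 - T1, Freq)]));

  // The same tick count means different durations under the two
  // frequencies mentioned in this thread.
  Writeln(TicksToMicroseconds(10000000, 10000000):0:0, ' us at QPF = 10,000,000');
  Writeln(TicksToMicroseconds(10000000, 3331195):0:0, ' us at QPF = 3,331,195');
end.
```

A raw tick count of 10,000,000 is one second on one of the two machines and roughly three seconds on the other, so comparing tick counts across PCs without dividing by the frequency tells you nothing.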

 

ps: the 10M comes from the motherboard bus clock, which in turn is associated with other clock timings, base and effective (usually the same, and usually 100 MHz). Your CPU runs at a multiplier of that clock to reach its frequency, and memory modules run at a completely different clock multiplier with different timings (something like 800, 1333, 1600... MHz). The OS picks a base frequency to run at, and this does not reflect the speed of your device. Sometimes the OS on older devices prefers a lower frequency to stay in control, like on my device, and on newer devices it might prefer a higher frequency.

7 hours ago, Kas Ob. said:

... your CPU is most likely something >3 GHz, which means 3 billion cycles per second, >3 million cycles per millisecond, 3 thousand cycles per microsecond, and >300 cycles per nanosecond. Now do you think it will be that relevant?

 

Just a small correction: 3000 per microsecond ==> 3 per nanosecond, not 300.

Guest

hey people, what about KeQueryPerformanceCounter to acquire high resolution (<1µs) time stamps for time interval measurements... by M$ definition

https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-kequeryperformancecounter

 

The KeQueryInterruptTime routine returns the current value of the system interrupt time count, with accuracy to within system clock tick.

Quote

 


This routine returns the system interrupt time, which is the amount of time since the operating system was last started. The interrupt-time count begins at zero when the operating system starts and is incremented at each clock interrupt by the length of a clock tick. For various reasons, such as hardware differences, the length of a system clock tick can vary between computers. Call the KeQueryTimeIncrement routine to determine the size of a system clock tick.

KeQueryInterruptTime can be used for performance tuning. This routine returns a finer grained measurement than the KeQueryTickCount routine. A call to KeQueryInterruptTime has considerably less overhead than a call to the KeQueryPerformanceCounter routine, as well.

Consequently, interrupt time can be used to measure very fine-grained durations while the system is running because operations that set or reset the system time have no effect on the system interrupt time count.

However, power-management state changes do affect the system interrupt time count. Maintenance of the interrupt time count is suspended during system sleep states. When a subsequent wake state transition occurs, the system adds a "bias" value to the interrupt time count to compensate for the estimated duration of such a sleep state. The interrupt time count that is returned by KeQueryInterruptTime includes this bias value. To obtain an unbiased interrupt time count, use the KeQueryUnbiasedInterruptTime routine instead of KeQueryInterruptTime.
 

 

 

https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-kequeryinterrupttime

 

note: "Ke" probably stands for "Kernel"

 

hug


 

5 hours ago, emailx45 said:

what about KeQueryPerformanceCounter to acquire high resolution (<1µs) time stamps for time interval measurements...

 

I'm OK with what I have now, QPC, I don't see how this could be more useful for my purpose.


If milliseconds are precise enough, can two calls to Now and a subtraction like ms := MillisecondOfTheDay(SecondNow) - MillisecondOfTheDay(FirstNow); do it?

2 minutes ago, KodeZwerg said:

If milliseconds are precise enough, can two calls to Now and a subtraction like ms := MillisecondOfTheDay(SecondNow) - MillisecondOfTheDay(FirstNow); do it?

Thanks, I need a little more precision. When testing StringReplace there could be 1000s of executions within 1 ms; when benchmarking some long sort that takes seconds, it could be enough. I need a single approach that covers most cases, so something more precise. QPC is enough.


For microbenchmarking you don't need that high of a precision - you simply run the benchmarked function thousands of times and then divide by the number of runs and you got your duration.
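
As a rough sketch of that idea (not taken from any particular benchmarking library): time many calls in one go and divide by the number of runs. The helper name AverageNanoseconds, the run count, and the single untimed warm-up call are assumptions made for this example; TStopwatch comes from System.Diagnostics.

```pascal
program AveragedBenchmarkSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

// Hypothetical helper: average duration of one call to Proc, in nanoseconds.
function AverageNanoseconds(const Proc: TProc; Runs: Integer): Double;
var
  SW: TStopwatch;
  I: Integer;
begin
  if Runs <= 0 then
    raise EArgumentException.Create('Runs must be positive');

  Proc(); // one untimed warm-up call (an assumption, not a rule)

  SW := TStopwatch.StartNew;
  for I := 1 to Runs do
    Proc();
  SW.Stop;

  // Total ticks -> nanoseconds, then divide by the number of runs.
  Result := SW.ElapsedTicks * 1.0e9 / TStopwatch.Frequency / Runs;
end;

var
  S: string;
  Avg: Double;
begin
  Avg := AverageNanoseconds(
    procedure
    begin
      S := StringReplace('a,b,c,d', ',', ';', [rfReplaceAll]);
    end,
    100000);
  Writeln(Avg:0:1, ' ns per call (includes loop and call overhead)');
end.
```

The timer is read only twice, so its own resolution and overhead are spread over all runs; the reported figure still includes the loop and the anonymous-method call, which matters for very short functions.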

 

Watch some videos by Chandler Carruth like these two:

 

 

 

 


Am I the only one who finds it odd that we have a two page thread on replacing TStopwatch with QPC when in fact TStopwatch is QPC. Perhaps somebody else could point out to @Mike Torrettinni that when he says

8 hours ago, Mike Torrettinni said:

I'm OK with what I have now, QPC

It's actually what he had originally with TStopwatch. 

16 minutes ago, David Heffernan said:

Am I the only one who finds it odd that we have a two page thread on replacing TStopwatch with QPC when in fact TStopwatch is QPC. Perhaps somebody else could point out to @Mike Torrettinni that when he says

It's actually what he had originally with TStopwatch. 

Yes, I was wondering the same.

There are only two reasons not to use TStopWatch:

  1. You are using a Delphi version that didn't have that yet (e.g. Delphi 2007)
  2. You want to play with various options.
3 hours ago, dummzeuch said:

There are only two reasons not to use TStopWatch:

  1. You are using a Delphi version that didn't have that yet (e.g. Delphi 2007)
  2. You want to play with various options. 

Thanks, I guess I should've included this in first post.

 

9 hours ago, Stefan Glienke said:

Watch some videos by Chandler Carruth like these two:

Thanks, will do.

3 hours ago, dummzeuch said:

You want to play with various options.

There aren't any options with QPC. It just returns a 64 bit int. Which TStopwatch passes on to you. And the performance counter frequency is also available.

 

The only possible explanation for this entire thread is that @Mike Torrettinni has not realised this. I guess he is blocking me because otherwise he would read my posts and realise this.

5 minutes ago, David Heffernan said:

I guess he is blocking me because otherwise he would read my posts and realise this.

@David Heffernan I read all your comments. It's just that I finally figured out that my 'what do you mean? can you give me an example, or details?' questions just annoy you. So, unless I have something smart to respond with, I try not to bother you.

Why would anybody block your comments?

On 3/27/2021 at 6:37 AM, David Heffernan said:

TStopwatch is implemented on Windows using QueryPerformanceCounter......

You said that you were replacing TStopwatch with QPC, but TStopwatch is implemented using QPC. So this entire effort is pointless. That's my point. TStopwatch.Frequency comes from a call to QPF. TStopwatch.GetTimeStamp is QPC. And TStopwatch.ElapsedTicks is the difference between QPC when you started the stopwatch, and QPC when you called ElapsedTicks.

 

I guess you were previously calling TStopwatch.ElapsedMilliseconds and wanted more precision. Which you can get by switching to TStopwatch.ElapsedTicks.
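
A short sketch of the members named above, assuming a Delphi version that ships System.Diagnostics. The workload of 100 StringReplace calls is arbitrary and only there to have something short to time.

```pascal
program StopwatchTicksSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

var
  SW: TStopwatch;
  I: Integer;
  S: string;
begin
  SW := TStopwatch.StartNew;
  for I := 1 to 100 do
    S := StringReplace('a,b,c,d', ',', ';', [rfReplaceAll]);
  SW.Stop;

  Writeln('IsHighResolution:    ', TStopwatch.IsHighResolution);
  Writeln('Frequency:           ', TStopwatch.Frequency);
  // Whole milliseconds only, so a short run is likely to show 0 here.
  Writeln('ElapsedMilliseconds: ', SW.ElapsedMilliseconds);
  // Raw tick difference; divide by Frequency to get a duration.
  Writeln('ElapsedTicks:        ', SW.ElapsedTicks);
  Writeln('Microseconds:        ', SW.ElapsedTicks * 1000000.0 / TStopwatch.Frequency:0:3);
end.
```

On a typical Windows machine the Frequency line should show the same number that QueryPerformanceFrequency returns, which is the point being made in this post.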

51 minutes ago, David Heffernan said:
4 hours ago, dummzeuch said:

You want to play with various options.

There aren't any options with QPC. It just returns a 64 bit int. Which TStopwatch passes on to you. And the performance counter frequency is also available.

There are alternative options to using TStopWatch, that's what I meant, e.g. using GetSystemTimeAsFileTime, GetTickCount, GetTickCount64. I'm not suggesting to use these for high precision timing in general, they are just options for some particular cases, e.g. if you don't want an Int64 but an Int32 for whatever reason.


Google benchmark also uses GetProcessTimes - here is an article about the differences. And there you can also see why benchmark does a certain number of iterations on the profiled code to ensure a certain overall duration (the number of iterations dynamically depends on the single duration of the measured code) - as you can also see in the presentation from Chandler I linked earlier

Another consideration with all these methods is once you benchmark multithreaded code the wall time might not be the information you want.
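
To make the difference between wall time and CPU time concrete, here is a minimal sketch that reads both around the same piece of work. It only illustrates the Windows API call and is not a description of how Google benchmark uses it; the helper name ProcessCpuMilliseconds is hypothetical, and Sleep(200) stands in for code that waits instead of computing.

```pascal
program CpuVsWallSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, Winapi.Windows;

// Hypothetical helper: kernel + user CPU time of the current process, in ms.
// GetProcessTimes sums the time of all threads in the process.
function ProcessCpuMilliseconds: Double;
var
  CreationTime, ExitTime, KernelTime, UserTime: TFileTime;

  function ToInt64(const FT: TFileTime): Int64;
  begin
    Result := (Int64(FT.dwHighDateTime) shl 32) or FT.dwLowDateTime;
  end;

begin
  if not GetProcessTimes(GetCurrentProcess, CreationTime, ExitTime,
    KernelTime, UserTime) then
    RaiseLastOSError;
  // FILETIME counts 100 ns units; 10,000 of them make one millisecond.
  Result := (ToInt64(KernelTime) + ToInt64(UserTime)) / 10000;
end;

var
  SW: TStopwatch;
  CpuBefore: Double;
begin
  CpuBefore := ProcessCpuMilliseconds;
  SW := TStopwatch.StartNew;
  Sleep(200); // waiting consumes wall time but almost no CPU time
  SW.Stop;
  Writeln('Wall time: ', SW.ElapsedMilliseconds, ' ms');
  Writeln('CPU time:  ', ProcessCpuMilliseconds - CpuBefore:0:1, ' ms');
end.
```

With several busy threads the relation flips: the summed CPU time can exceed the wall time, which is why the two numbers answer different questions.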
