Mike Torrettinni

QueryPerformanceCounter precision

Recommended Posts

Posted (edited)

I started using QueryPerformanceCounter to measure execution timings, so I can measure a single method execution instead of having to call it x times for TStopwatch to measure in milliseconds.

But, I'm getting strange timings that I'm not sure how to interpret:

 

Here is an example of timing Sleep(1); across 100 calls, the difference between Min and Max execution time can be 100%:

 

[screenshot: 100 Sleep(1) timings showing the Min/Max spread]

 

This was the biggest difference I got; some runs show less variation, but most of the time the difference is just too big.

 

Does anybody have any idea why the timings of a simple Sleep(1) have such a vast range? Is it simply normal when measuring such small timings, like a single Sleep(1) call?

 

Code:

 

uses
  System.SysUtils, Winapi.Windows, System.Math; 

procedure TestQPerf;
const
  cLoop = 100;
var
  i: Integer;  // Delphi does not allow an Int64 for-loop control variable
  vStart, vStop, vMin, vMax: Int64;
  vIntArr: TArray<Integer>;
begin
  SetLength(vIntArr, cLoop);

  for i := 0 to cLoop - 1 do
  begin
    QueryPerformanceCounter(vStart);
    Sleep(1);
    QueryPerformanceCounter(vStop);
    vIntArr[i] := vStop - vStart;
  end;

  for i := Low(vIntArr) to High(vIntArr) do
    Writeln(vIntArr[i].ToString);

  vMin := System.Math.MinIntValue(vIntArr);
  vMax := System.Math.MaxIntValue(vIntArr);

  Writeln('Min: ' + vMin.ToString);
  Writeln('Max: ' + vMax.ToString);
  Writeln('Diff: ' + (vMax - vMin).ToString);
end;

begin 
  TestQPerf;

  Readln;
end.

 

 

Guest
Posted (edited)

hi @Mike Torrettinni

For each purpose, a necessity!

My suggestion, in the Windows case, is to always read the MS API article first; otherwise you can get side effects like this one. (No comment on your coding style intended.)

Delphi uses the Windows Sleep internally, so read the "Remarks" section of the article:

Quote

This function causes a thread to relinquish the remainder of its time slice and become unrunnable for an interval based on the value of dwMilliseconds. The system clock "ticks" at a constant rate. If dwMilliseconds is less than the resolution of the system clock, the thread may sleep for less than the specified length of time. If dwMilliseconds is greater than one tick but less than two, the wait can be anywhere between one and two ticks, and so on.

To increase the accuracy of the sleep interval, call the timeGetDevCaps function to determine the supported minimum timer resolution and the timeBeginPeriod function to set the timer resolution to its minimum.

Use caution when calling timeBeginPeriod, as frequent calls can significantly affect the system clock, system power usage, and the scheduler.

If you call timeBeginPeriod, call it one time early in the application and be sure to call the timeEndPeriod function at the very end of the application.

...

Be careful when using Sleep in the following scenarios:

  • Code that directly or indirectly creates windows (for example, DDE and COM CoInitialize). If a thread creates any windows, it must process messages. Message broadcasts are sent to all windows in the system. If you have a thread that uses Sleep with infinite delay, the system will deadlock.
  • Threads that are under concurrency control. For example, an I/O completion port or thread pool limits the number of associated threads that can run. If the maximum number of threads is already running, no additional associated thread can run until a running thread finishes. If a thread uses Sleep with an interval of zero to wait for one of the additional associated threads to accomplish some work, the process might deadlock.

source: https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-sleep
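As a rough Delphi sketch of what that Remarks section suggests (timeGetDevCaps, timeBeginPeriod and timeEndPeriod are declared in Winapi.MMSystem; this is an illustration only, given the power-usage caveats quoted above, not a recommendation):

```pascal
uses
  Winapi.Windows, Winapi.MMSystem;

var
  Caps: TTimeCaps;
begin
  // Query the minimum supported timer period, then raise the timer
  // resolution only for the duration of the measurement.
  if timeGetDevCaps(@Caps, SizeOf(Caps)) = TIMERR_NOERROR then
  begin
    timeBeginPeriod(Caps.wPeriodMin);
    try
      Sleep(1);  // now far more likely to wake after ~1-2 ms
    finally
      timeEndPeriod(Caps.wPeriodMin);  // always undo with the same value
    end;
  end;
end.
```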

 

Quote

Acquiring high-resolution time stamps 05/31/2018

Windows provides APIs that you can use to acquire high-resolution time stamps or measure time intervals.

  • The primary API for native code is QueryPerformanceCounter (QPC).
  • For device drivers, the kernel-mode API is KeQueryPerformanceCounter.
  • For managed code, the System.Diagnostics.Stopwatch class uses QPC as its precise time basis.

QPC is independent of and isn't synchronized to any external time reference. To retrieve time stamps that can be synchronized to an external time reference, such as, Coordinated Universal Time (UTC) for use in high-resolution time-of-day measurements, use GetSystemTimePreciseAsFileTime.

 

Time stamps and time-interval measurements are an integral part of computer and network performance measurements. These performance measurement operations include the computation of response time, throughput, and latency as well as profiling code execution. Each of these operations involves a measurement of activities that occur during a time interval that is defined by a start and an end event that can be independent of any external time-of-day reference.

 

QPC is typically the best method to use to time-stamp events and measure small time intervals that occur on the same system or virtual machine. Consider using GetSystemTimePreciseAsFileTime when you want to time-stamp events across multiple machines, provided that each machine is participating in a time synchronization scheme such as Network Time Protocol (NTP). QPC helps you avoid difficulties that can be encountered with other time measurement approaches, such as reading the processor’s time stamp counter (TSC) directly.

source: https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps

 

Results can also vary by CPU depending on the function in use.

With QueryPerformanceCounter on a multi-core CPU, successive calls can land on different cores, and that can show up as "lost time" in the result. You see what I mean?

Here is an article about this:

Quote

On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL).

(My note: this may no longer hold true on newer processors.)

 

Maybe if you add SetThreadAffinityMask, to say on which CPU your software should run, you will get a different result. Try it if possible!
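A minimal sketch of that suggestion in Delphi; the mask value 1 (first CPU) is an arbitrary choice for illustration:

```pascal
uses
  Winapi.Windows;

var
  OldMask: NativeUInt;
begin
  // Pin the current thread to the first CPU (mask 1) so that all
  // QueryPerformanceCounter readings come from the same core.
  OldMask := SetThreadAffinityMask(GetCurrentThread, 1);
  if OldMask <> 0 then
  try
    // ... run the QueryPerformanceCounter measurements here ...
  finally
    SetThreadAffinityMask(GetCurrentThread, OldMask);  // restore old mask
  end;
end.
```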

 

You could also look at KeQueryPerformanceCounter (note: it is a kernel-mode API, for drivers): https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-kequeryperformancecounter

Quote

Use KeQueryPerformanceCounter to acquire high resolution (<1µs) time stamps for time interval measurements

 

hug

4 hours ago, Mike Torrettinni said:

Does anybody have any idea why the timings of a simple Sleep(1) have such a vast range? Is it simply normal when measuring such small timings, like a single Sleep(1) call?

Yes, I have. But first we must understand that Windows is not a real-time OS, meaning there is no guarantee that any timing period will match your expectation; there may and will be some variation, and this happens even where hardware support exists. In other words, timing is emulated on a best-effort basis by the OS. On the other hand, CPU and hardware interrupts can be very accurate; the problem is that exposing these interrupts to arbitrary applications would compromise the integrity and stability of the OS, so the OS isolates these low-level hardware interrupts from software and keeps them for itself to maintain the multithreading functionality.

 

On a side note: I know the Z80 is somewhat beloved here, so as a fun fact, the Z80 has a 3.58 MHz clock and only one timed interrupt, which is guaranteed to call one fixed address 50 times per second. This interrupt is accessible for any software to hook, and it was the only way for games to simulate multitasking (control, draw, update, output some buffers such as music). Also as a fun fact, it is not always 50 times per second; it might be 60, depending on your display!

 

Now back to Sleep(1): there is no guarantee of any sort from Windows to sleep a specific amount. So yes, it is a fact that Sleep(1) can range from less than 1/4 ms to 4 times the timer period (which by default is 15.625 ms). Why 4? It is just a number I have seen, and it will also be highly affected by system load.

 

Also, if you want to measure time, then measure it the right way. It is fine to compare results between calls to QueryPerformanceCounter, but this misses the fact that QueryPerformanceCounter does not have a unit! To convert the numbers it returns, you should use QueryPerformanceFrequency with it, like this:

procedure TestQPerf;
const
  cLoop = 100;
var
  i: Integer;  // Delphi does not allow an Int64 for-loop control variable
  vStart, vStop, vMin, vMax, Freq: Int64;
  vIntArr: TArray<Integer>;
begin
  SetLength(vIntArr, cLoop);
  QueryPerformanceFrequency(Freq);

  for i := 0 to cLoop - 1 do
  begin
    QueryPerformanceCounter(vStart);
    Sleep(1);
    QueryPerformanceCounter(vStop);
    vIntArr[i] := vStop - vStart;
  end;

  for i := Low(vIntArr) to High(vIntArr) do
    Writeln((1000 * vIntArr[i] / Freq).ToString);

  vMin := System.Math.MinIntValue(vIntArr);
  vMax := System.Math.MaxIntValue(vIntArr);

  Writeln('Min: ' + vMin.ToString);
  Writeln('Max: ' + vMax.ToString);
  Writeln('Diff: ' + (vMax - vMin).ToString);

  Writeln('Min: ' + (1000 * vMin / Freq).ToString);
  Writeln('Max: ' + (1000 * vMax / Freq).ToString);
  Writeln('Diff: ' + (1000 * (vMax - vMin) / Freq).ToString);
end;

Now that will return:

[screenshot: raw tick counts alongside millisecond values]

Do you think this result makes sense now?

 

Now let's close the IDE and browsers. Chrome does nasty stuff, and measuring with Chrome running in the background will give you a different result on each system.

[screenshot: the same measurement with the IDE and browsers closed]

 

One last thing: to get a better idea of the expected time, you need more statistics about these timing results, so use standard deviation and/or variance; then you will get a better picture of timing intervals on Windows.
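A sketch of that statistics step using the RTL's MeanAndStdDev from System.Math; the sample values here are hypothetical millisecond timings invented for illustration:

```pascal
uses
  System.SysUtils, System.Math;

var
  Samples: TArray<Double>;
  Mean, StdDev: Extended;
begin
  // Hypothetical data: Sleep(1) timings already converted to ms.
  Samples := [1.12, 1.37, 1.24, 15.81, 1.29];
  MeanAndStdDev(Samples, Mean, StdDev);
  Writeln('Mean:   ', Mean:0:3, ' ms');
  Writeln('StdDev: ', StdDev:0:3, ' ms');  // large vs the mean => noisy run
end.
```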



Thanks! I guess measuring a single execution of a short operation depends on so many OS-level variables that it just doesn't make sense. The only reason for Sleep(1) was that it is the only operation I know where you can define the number; I was sure it was simply: start now, wait 1 ms, continue. I guess not.

 

OK, let's see what today brings with more testing. I'm going to try to find the measurement that has the smallest, but constant, variance, probably <5%, while the timing is still below 1 ms.


And then there is this comment: "Definitely add repeat. At chunks of <10s system handing CPU time to other tasks may affect test performance seriously. A test of ~10 minutes is more reliable. Also, kill all other CPU-intensive and disk-intensive processes."

 

https://stackoverflow.com/questions/2072361/what-is-the-best-way-to-measure-how-long-code-takes-to-execute#comment2002250_2072367

Guest

I would if I could... but I'm still on the letter "B" :classic_biggrin:

 

hug


It seems that 10-100 repetitions are pretty much enough to even out the results; I tried 10 to 10 million and the results are within a few percentage points of each other.

It also helps to warm up with one or more calls beforehand, and a trimmed mean (0.2, i.e. 20%) also gives a better value than a plain average.
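A possible sketch of that trimmed mean, assuming raw Int64 timing samples; the helper name TrimmedMean and the 0.2 split (10% off each end) are my own choices for illustration:

```pascal
uses
  System.SysUtils, System.Generics.Defaults, System.Generics.Collections;

// Hypothetical helper: trimmed mean of raw timing samples. Sorts a
// copy of the samples and drops Fraction/2 of them at each end
// (0.2 drops the lowest 10% and highest 10%) before averaging.
function TrimmedMean(Samples: TArray<Int64>; Fraction: Double): Double;
var
  i, Drop: Integer;
  Sum: Int64;
begin
  Samples := Copy(Samples);  // don't mutate the caller's array
  TArray.Sort<Int64>(Samples);
  Drop := Round(Length(Samples) * Fraction / 2);
  Sum := 0;
  for i := Drop to High(Samples) - Drop do
    Inc(Sum, Samples[i]);
  Result := Sum / (Length(Samples) - 2 * Drop);
end;
```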

 

And the more you split the full test run across multiple methods (organizing it into smaller refactored pieces), the more the results are affected; the warmup seems to get lost 😉

 

@emailx45 I tried it within a separate thread with the affinity mask set, but the results were so odd (much bigger differences than in the main thread) that I assume other factors of thread handling affect the result, so I gave up on this approach. But I'm no expert on threads, anyway.

 

Thank you @Kas Ob., QueryPerformanceFrequency is the key to converting into milliseconds or seconds, but my frequency is always 10000000.

 

7 hours ago, Mike Torrettinni said:

but my frequency is always: 10000000.

While mine is always 3331195.

 

The frequency value depends on your hardware (CPU, motherboard RTC, ...), so to get consistent results across different devices, always convert with the current frequency. You only need to query it once.
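That "query it once" advice might look like the following sketch; the helper name TicksToMs and the global cache are my own choices:

```pascal
uses
  Winapi.Windows;

var
  GFreq: Int64 = 0;  // cached QueryPerformanceFrequency value

// Ticks-to-milliseconds helper: the performance counter frequency is
// fixed at system boot, so it only needs to be queried once and cached.
function TicksToMs(Ticks: Int64): Double;
begin
  if GFreq = 0 then
    QueryPerformanceFrequency(GFreq);
  Result := 1000 * Ticks / GFreq;
end;
```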


Posted (edited)
8 hours ago, Mike Torrettinni said:

 my frequency is always: 10000000.

See https://hero.handmade.network/forums/code-discussion/t/7485-queryperformancefrequency_returning_10mhz_bug

 

Also of interest, see MSDN: https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps

Quote

Cases might exist where QueryPerformanceFrequency doesn't return the actual frequency of the hardware tick generator. For example, in many cases, QueryPerformanceFrequency returns the TSC frequency divided by 1024; and on Hyper-V, the performance counter frequency is always 10 MHz when the guest virtual machine runs under a hypervisor that implements the hypervisor version 1.0 interface. As a result, don't assume that QueryPerformanceFrequency will return the precise TSC frequency.

 

Edited by Remy Lebeau

4 hours ago, Remy Lebeau said:

So, not only do we have all these factors affecting precise measurements, now we have 'bugs/features' that can change with each OS update.

Of course, this only matters when precise timings are wanted. I assume timings based on the System.SysUtils.GetTime timestamp function (or similar date/time functions) would correctly express timings in seconds, minutes, and hours.

 

 

 

19 minutes ago, Mike Torrettinni said:

So, not only do we have all these factors affecting precise measurements, now we have 'bugs/features' that can change with each OS update.

Of course, this only matters when precise timings are wanted. I assume timings based on the System.SysUtils.GetTime timestamp function (or similar date/time functions) would correctly express timings in seconds, minutes, and hours.

None of these sources claim that QPC used together with QPF is imprecise, just that they are based on different time sources depending on hardware and operating system, and that the resolution and overhead of the calls therefore vary. As far as I understand it, there is a guaranteed resolution of at least 1 microsecond in all those cases.

There are other factors that affect timing much more than these, e.g. what else is running in parallel and whether there are context switches.

 

Also, they cannot be relied on for high precision timestamps that must be consistent between multiple CPUs, but that has nothing to do with performance measurements.


Posted (edited)
13 minutes ago, dummzeuch said:

None of these sources claim that QPC used together with QPF is imprecise, just that they are based on different time sources depending on hardware and operating system, and that the resolution and overhead of the calls therefore vary. As far as I understand it, there is a guaranteed resolution of at least 1 microsecond in all those cases.

There are other factors that affect timing much more than these, e.g. what else is running in parallel and whether there are context switches.

 

Also, they cannot be relied on for high precision timestamps that must be consistent between multiple CPUs, but that has nothing to do with performance measurements.

Yes, that makes sense.

 

But, funny enough, I did a very simple timing log, and 1 s, or 1000 ms (14284 - 13283 in the screenshot), comes to almost exactly 10,000,000 ticks, like the 'defined feature/bug' frequency 🙂

(the milliseconds are based on System.SysUtils.Time)

 

[screenshot: timing log comparing milliseconds with QPC ticks]

 

 

Posted (edited)
18 hours ago, Stefan Glienke said:

I wish someone with some C++ knowledge would write a Delphi wrapper for Google Benchmark

It would probably be easier to write the whole thing from scratch in Delphi.

Edited by Dalija Prasnikar

1 minute ago, Dalija Prasnikar said:

It would probably be easier to write the whole thing from scratch in Delphi.

I doubt it - Google Benchmark is a little more than "run this stuff in a loop and time it with a stopwatch" 😉

1 hour ago, Stefan Glienke said:

I doubt it - Google Benchmark is a little more than "run this stuff in a loop and time it with a stopwatch" 😉

Yeah... but making a wrapper around a C++ lib....


To benchmark instructions, you need specific software tooling, and also proper hardware.

The reference is https://www.agner.org/optimize/#testp

 

Benchmarking Sleep() doesn't make any sense, especially on Windows.

On Windows, the Sleep() resolution is tied to the system timer, which typically ticks every 14-20 ms.

Sleep() waits "at least" the number of milliseconds specified.
So in a waiting loop, you should never count the number of Sleep() iterations; instead call GetTickCount64 and check against a timeout.
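A sketch of that GetTickCount64-bounded wait loop; the function name WaitFor and the Condition callback are placeholders for whatever you are actually waiting on:

```pascal
uses
  System.SysUtils, Winapi.Windows;

// Wait loop bounded by GetTickCount64 instead of by counting Sleep()
// iterations; returns True if Condition became True before the timeout.
function WaitFor(const Condition: TFunc<Boolean>; TimeoutMs: UInt64): Boolean;
var
  Deadline: UInt64;
begin
  Deadline := GetTickCount64 + TimeoutMs;
  Result := True;
  while not Condition() do
  begin
    if GetTickCount64 >= Deadline then
      Exit(False);
    Sleep(1);  // sleeps "at least" 1 ms, usually more
  end;
end;
```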


6 minutes ago, Arnaud Bouchez said:

To benchmark instructions, you need specific software tooling, and also proper hardware.

The reference is https://www.agner.org/optimize/#testp

 

Benchmarking Sleep() doesn't make any sense, especially on Windows.

On Windows, the Sleep() resolution is tied to the system timer, which typically ticks every 14-20 ms.

Sleep() waits "at least" the number of milliseconds specified.
So in a waiting loop, you should never count the number of Sleep() iterations; instead call GetTickCount64 and check against a timeout.

Sleep(1) was used to test accuracy after s := FormatDateTime('YYYYMMDD', Date); was giving odd timing results for a single execution. I thought Sleep(1) would always execute in exactly 1 ms; now I know it doesn't, as @Kas Ob. explained above.

 

 

 

Guest
Posted (edited)

I think system "timers" are a subject for specialists in both hardware and software, and since I am not one of them, I'll use a tool well known to many people around the world.

 

The GPU-Z tool can tell you, or at least come close to, what QPC is on your system, so you can compare with many others.

 

With this tool you can see the "real" ACPI, QPC and RTC timer frequencies, to compare against your code. Of course, these timers are always computed by software, in fact between the GPU-Z software and the hardware values; you would probably call that software the BIOS.

 

[screenshots: GPU-Z timer readings]

 

To preempt any criticism: my post is about knowing your "timer reference", not its exact value, because the tool doesn't show that; it compares the three frequencies used by the hardware and, by consequence, by software under multitasking.

 

Naturally, you can calculate the "factor" between the three frequencies or values above.

 

Another option for approximate values is CLOCKRES or CLOCKRES64 from Sysinternals: https://docs.microsoft.com/en-us/sysinternals/downloads/clockres

 

hug

