pyscripter

Revisiting TThreadedQueue and TMonitor


Posted (edited)

I am attaching an updated fix using the corrected InterlockedCompareExchange128 by @Anders Melander.

 

A note about @Darian Miller's stress test from GitHub. If you run the stress test with the fix you will still see failures reported. First, a couple of important pointers:

I have confirmed that in the cases where the stress test reports a failure, the fault is not in the Delphi code but in what WaitForSingleObject returns.

 

To run the stress test in a meaningful way, either:

  • change POP_TIMEOUT to a value greater than 15, or
  • change TTestThread.Execute() so that it just reports cases in which vStopWatch.ElapsedMilliseconds is zero, but does not stop the test:

e.g.

 

procedure TTestThread.Execute();
var
  Item: TItem;
  vWaitResult: TWaitResult;
  vStopWatch: TStopwatch;
begin
  pubSyncrhonizedStart.WaitFor();

  while not Terminated do
  begin
    if FQueue.ShutDown then Break;
    if pubTestCompletionCheck.WaitFor(0) = wrSignaled then Break;

    vStopWatch := TStopwatch.StartNew;
    vWaitResult := FQueue.PopItem(Item);
    vStopWatch.Stop();

    // Note: ACCEPTABLE_TIMEOUTVARIANCE_MS exists because with some low
    // timeout values (like 200ms) the wait occasionally times out a little
    // early (like 180ms).
    if (vWaitResult = wrTimeout) and (vStopWatch.ElapsedMilliseconds > 0) and
       (vStopWatch.ElapsedMilliseconds >= POP_TIMEOUT - ACCEPTABLE_TIMEOUTVARIANCE_MS) then
    begin
      // Successful PopItem operation, as we aren't adding anything into our queue
      Continue;
    end
    else
    begin
      LogIt('TTestThread ERROR: TThreadedQueue.PopItem returned [' +
        TRttiEnumerationType.GetName(vWaitResult) + '] unexpectedly after [' +
        IntToStr(vStopWatch.ElapsedMilliseconds) + 'ms]');
      //pubTestCompletionCheck.SetEvent();
      //Break;
    end;
  end;
end;
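The acceptance check in the loop above boils down to a simple predicate. A C sketch for illustration (the function name is mine, and the tolerance value of 20ms is an assumption; the test source uses the constant ACCEPTABLE_TIMEOUTVARIANCE_MS):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the pass/fail predicate used in the loop above: a wrTimeout
   result is acceptable only when the measured elapsed time is non-zero and
   no more than the allowed variance short of the requested timeout, because
   Windows wait functions may legitimately return a little early. */
#define ACCEPTABLE_TIMEOUT_VARIANCE_MS 20  /* assumed value */

static bool pop_timeout_is_plausible(int64_t elapsed_ms, int64_t timeout_ms)
{
    return elapsed_ms > 0 &&
           elapsed_ms >= timeout_ms - ACCEPTABLE_TIMEOUT_VARIANCE_MS;
}
```

A zero elapsed time is treated separately because it indicates the wait returned without actually waiting at all, which is the symptom under investigation.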

I have tested with a couple of thousand threads in both Win32 and Win64, and it behaves as expected.

MonitorWaitStackFix.pas

Edited by pyscripter

4 minutes ago, pyscripter said:

I have tested with a couple of thousand threads in both win32 and win64 and it behaves as expected.

Thanks for the update.

Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible.

Posted (edited)
6 minutes ago, Anders Melander said:

Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible.

Correct in general, but in this particular test, given the timeout, the greater the number of threads (up to a point), the greater the number of calls to NewWaitObj (which performs the stack push and pop). You can confirm this by running the stress test.

Edited by pyscripter


@pyscripter

Note that in InterlockedCompareExchange128, the last line (MOVZX EAX,AL) is redundant.

The function returns a boolean, one byte in size.

 

And, in InterlockedCompareExchange, shouldn't the 32-bit version be the following code?

 

Result := InterlockedCompareExchange64(Int64(Dest), Int64(NewValue), Int64(CurrentValue)) = Int64(CurrentValue);

 

 

 

 

17 minutes ago, Leif Uneus said:

The function returns a boolean, one byte in size.

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

2 minutes ago, David Heffernan said:

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

Sorry @David Heffernan, but in the file posted by @pyscripter his proposed function InterlockedCompareExchange128 returns a Boolean.

4 minutes ago, David Heffernan said:

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

Correct, but in the fix it is declared as Boolean, so I guess MOVZX EAX,AL is not needed. I am reattaching the amended file.

MonitorWaitStackFix.pas

43 minutes ago, Leif Uneus said:

Sorry @David Heffernan, but in the file posted by @pyscripter his proposed function InterlockedCompareExchange128 returns a Boolean.

OK. I've been working to the definition in the Delphi RTL. 

On 5/18/2020 at 12:37 PM, Anders Melander said:

Anyway, here's my final (hopefully) fixed version of InterlockedCompareExchange128 which conforms to the Win32 API documentation:

Are the MOV instructions after LOCK CMPXCHG16B [R10] needed?

 

The following passes the stress test:

 

{$IFDEF Win64}
function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean;
// The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention:
//   RCX        Destination
//   RDX        ExchangeHigh
//   R8         ExchangeLow
//   R9         ComparandResult
//
// CMPXCHG16B requires the following register setup:
//   RDX:RAX    ComparandResult.High:ComparandResult.Low
//   RCX:RBX    ExchangeHigh:ExchangeLow
// See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
asm
      .PUSHNV RBX
      MOV   R10,Destination             // RCX
      MOV   RBX,ExchangeLow             // R8
      MOV   RCX,ExchangeHigh            // RDX
      MOV   RDX,[ComparandResult+8]     // R9
      MOV   RAX,[ComparandResult]       // R9
 LOCK CMPXCHG16B [R10]
//      MOV   [ComparandResult+8],RDX     // R9
//      MOV   [ComparandResult],RAX       // R9
      SETZ  AL
 end;
{$ENDIF Win64}

 

5 hours ago, pyscripter said:

Are the MOV instructions after LOCK CMPXCHG16B [R10] needed?

Yes, so you can revert your change in RSP-28272.

They copy the result back to the ComparandResult parameter (or rather the 128-bit value it points to). See the InterlockedCompareExchange128 documentation:

Quote

ComparandResult

The value to compare to. This parameter is an array of two 64-bit integers considered as a 128-bit field. On output, this is overwritten with the original value of the destination.

 

5 hours ago, pyscripter said:

The following passes the stress test:

Because the stress test does not use the value returned in the ComparandResult parameter.

My test code (see an earlier post) includes a test of InterlockedCompareExchange128 which does validate that ComparandResult is set correctly.
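The documented contract can be modeled in a few lines. A non-atomic C sketch for illustration only (the function name is mine; a real implementation must use LOCK CMPXCHG16B or a compiler intrinsic, never plain loads and stores):

```c
#include <stdbool.h>
#include <stdint.h>

/* Non-atomic model of the InterlockedCompareExchange128 contract, to show
   what the commented-out MOV instructions are for: ComparandResult is
   ALWAYS overwritten with the original value of Destination, whether or
   not the exchange took place. dest and comparand_result each point to
   two 64-bit halves: [0] = low, [1] = high. */
static bool cas128_model(int64_t *dest,
                         int64_t exchange_high, int64_t exchange_low,
                         int64_t *comparand_result)
{
    int64_t orig_low = dest[0], orig_high = dest[1];
    bool matched = (orig_low == comparand_result[0]) &&
                   (orig_high == comparand_result[1]);

    if (matched) {                    /* CMPXCHG16B stores RCX:RBX on match */
        dest[0] = exchange_low;
        dest[1] = exchange_high;
    }
    comparand_result[0] = orig_low;   /* the write-back the MOVs perform */
    comparand_result[1] = orig_high;
    return matched;
}
```

A caller that retries a failed CAS typically relies on this write-back to pick up the fresh value of the destination without issuing a separate read, which is why dropping the MOVs breaks conforming callers even if one particular test passes.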


Not fixed in 10.4:

  • InterlockedCompareExchange128
  • EventCache Push and Pop ABA problem

I guess we were too late 😞

Just now, Anders Melander said:

I guess we were too late

By magnitudes :classic_biggrin:

9 minutes ago, Anders Melander said:

Not fixed in 10.4:

  • InterlockedCompareExchange128
  • EventCache Push and Pop ABA problem

I guess we were too late 😞

 

There's always hope for 10.4.1!

 

35 minutes ago, Darian Miller said:

 

There's always hope for 10.4.1!

 

Rather hope for 10.5. They are unlikely to break DCU compatibility in 10.4.1.

2 minutes ago, Kryvich said:

They are unlikely to break DCU compatibility in 10.4.1

You are right about that but they can fix TMonitor (the ABA problem) by using a private, correct implementation of InterlockedCompareExchange128 without breaking DCU compatibility.

They could also just implement a temporary fix for InterlockedCompareExchange128 that maintains the incorrect return type (bool) but returns the correct value.
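For readers following along: the reason a plain pointer-sized CAS is not enough for the event cache is the classic ABA hazard, and pairing the head with a modification counter (which on Win64 widens the pointer+counter pair to 128 bits, hence the need for a working InterlockedCompareExchange128) makes a stale snapshot detectable. A scaled-down C sketch of the technique, using a 32-bit "pointer" and a 32-bit counter packed into one 64-bit CAS word (all names are mine, for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Scaled-down model of the versioned-head technique: the head "pointer" and
   a counter share one CAS-able word, and every update bumps the counter, so
   a CAS armed with a stale snapshot fails even if the head value itself has
   meanwhile returned to what it was (the ABA case). */
static _Atomic uint64_t g_head;   /* low 32 bits: head, high 32: counter */

static uint64_t pack(uint32_t head, uint32_t counter)
{
    return ((uint64_t)counter << 32) | head;
}

static bool try_set_head(uint64_t snapshot, uint32_t new_head)
{
    uint32_t counter = (uint32_t)(snapshot >> 32);
    uint64_t desired = pack(new_head, counter + 1);
    /* fails if anyone has updated g_head since the snapshot was taken */
    return atomic_compare_exchange_strong(&g_head, &snapshot, desired);
}
```

With a bare (unversioned) head, the final CAS in an A→B→A sequence would wrongly succeed; with the counter it fails and the caller retries with a fresh snapshot.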

Posted (edited)

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

Edited by pyscripter

1 hour ago, pyscripter said:

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

I updated the GitHub repo with your changes and am running two more day-long tests (Win32 + Win64), both with 1,000 threads and a 20ms pop timeout, on a physical Windows 10 machine. My last 24-hour Win32 stress test completed without failure.

 

https://github.com/ideasawakened/DelphiKB/commits/master/Delphi Tests/Source/rtl/common/System.Generics.Collections/TThreadedQueue/StressTestPopItem

 

On 5/30/2020 at 8:44 PM, pyscripter said:

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64, each running 1,000 threads with a 20ms timeout, with your latest MonitorWaitStackFix.pas.

 

3 hours ago, Darian Miller said:

Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64, each running 1,000 threads with a 20ms timeout, with your latest MonitorWaitStackFix.pas.

Thank you very much!

 

I have two questions here (if you have time):

1) What is your test bed (CPU and system)?

2) What are your observations about the failing point with regard to timeout/threads? E.g. minimum timeout with x threads fails in n seconds; does a 0 or 1ms timeout with fewer threads than available logical cores fail? Etc.

 

I am not asking you to repeat the test for 24 hours, but failing in minutes versus failing in seconds is more of a concern to me.

 

Thank you again.

13 hours ago, Kas Ob. said:

1) What is your test bed (CPU and system)?

2) What are your observations about the failing point with regard to timeout/threads? E.g. minimum timeout with x threads fails in n seconds; does a 0 or 1ms timeout with fewer threads than available logical cores fail? Etc.

 

1. Windows 10 build 18363, i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM.

2. pyscripter explained earlier in this thread why a low pop timeout fails, which is why there's now a minimum of 20ms in the stress tester.

 

The test originally failed less in a virtual environment (VMware Workstation), so I switched to testing on a physical machine.

 

On Win32, you can run into problems if you throw many more threads at it, though likely more with Windows than with Delphi. I tested with 1,000 threads without any issue.

Win64 could handle 50,000 threads without failing.

 

 

39 minutes ago, Darian Miller said:

1. Windows 10 build 18363, i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM.

2. pyscripter explained earlier in this thread why a low pop timeout fails, which is why there's now a minimum of 20ms in the stress tester.

The test originally failed less in a virtual environment (VMware Workstation), so I switched to testing on a physical machine.

On Win32, you can run into problems if you throw many more threads at it, though likely more with Windows than with Delphi. I tested with 1,000 threads without any issue.

Win64 could handle 50,000 threads without failing.

 

 

If it failed in a virtual environment, then the code is presumably defective.


@Darian Miller Thank you.

 

I am trying to make sense of my observations. It is morning now and I turned on my PC less than an hour ago: it worked fine for about 10 minutes with 256 threads and a 2ms timeout, but failed at 1ms immediately. Yesterday, before shutting down and after the PC had been stressed all day, 16ms was failing and 20ms looked stable, while yesterday morning 1ms worked for more than 10 minutes without failing.

 

Windows 10 build 1803 (17134.706), i7-2600K.

5 hours ago, Kas Ob. said:

It worked fine for about 10 minutes with 256 threads and a 2ms timeout, but failed at 1ms immediately. Yesterday, before shutting down and after the PC had been stressed all day, 16ms was failing and 20ms looked stable, while yesterday morning 1ms worked for more than 10 minutes without failing.

It seems to me that these failures have more to do with the test not properly taking the timer resolution into account than with any remaining problem in TMonitor.

As has already been mentioned, it's perfectly normal, and expected, for a wait function to return early: https://docs.microsoft.com/da-dk/windows/win32/sync/wait-functions?redirectedfrom=MSDN#wait-functions-and-time-out-intervals
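As a rule of thumb for such tests, the earliest legitimate timeout return is roughly the requested interval minus one system timer tick. A small C sketch (names are mine; the 16ms figure approximates the default 64 Hz Windows timer tick of 15.625 ms, which applications can lower via timeBeginPeriod):

```c
#include <stdint.h>

/* Rule-of-thumb model, not an API contract: a Windows wait may report a
   timeout up to one timer tick before the requested interval elapses.
   Default tick is 1000/64 = 15.625 ms; rounded up to 16 here. */
#define DEFAULT_TIMER_TICK_MS 16  /* assumed default resolution */

static int64_t earliest_timeout_ms(int64_t requested_ms)
{
    int64_t earliest = requested_ms - DEFAULT_TIMER_TICK_MS;
    return earliest > 0 ? earliest : 0;
}
```

This is also why the advice earlier in the thread is to keep POP_TIMEOUT above 15: with a timeout at or below one tick, a "zero elapsed time" return can be legitimate quantization rather than a bug.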


I did some testing on that recently, and most of the time the wait functions work as expected, while on some virtual machines they return early. The test cases were run on a few Windows boxes and on two VMware Fusion/Mac VMs. They worked OK on all but one: one Fusion/Mac combo consistently returned early from the wait.

