pyscripter

Revisiting TThreadedQueue and TMonitor


Posted (edited)

I am attaching an updated fix using the corrected InterlockedCompareExchange128 by @Anders Melander.

 

A note about @Darian Miller's stress test from GitHub. If you run the stress test with the fix you will still see failures reported. First, a couple of important pointers:

I have confirmed that in the cases where the stress test reports a failure, the fault is not in the Delphi code but in what WaitForSingleObject returns.

 

To run the stress test in a meaningful way, either:

  • change POP_TIMEOUT to a value greater than 15, or
  • change TTestThread.Execute() so that it just reports cases in which vStopWatch.ElapsedMilliseconds is zero, but does not stop the test:

e.g.

 

procedure TTestThread.Execute();
var
  Item: TItem;
  vWaitResult: TWaitResult;
  vStopWatch: TStopwatch;
begin
  pubSyncrhonizedStart.WaitFor();

  while not Terminated do
  begin
    if FQueue.ShutDown then Break;
    if pubTestCompletionCheck.WaitFor(0) = wrSignaled then Break;

    vStopWatch := TStopwatch.StartNew;
    vWaitResult := FQueue.PopItem(Item);
    vStopWatch.Stop();

    // Note: ACCEPTABLE_TIMEOUTVARIANCE_MS exists because with some low
    // timeout values (like 200ms) the wait occasionally times out a little
    // early (like 180ms).
    if (vWaitResult = wrTimeout) and (vStopWatch.ElapsedMilliseconds > 0) and
       (vStopWatch.ElapsedMilliseconds >= POP_TIMEOUT - ACCEPTABLE_TIMEOUTVARIANCE_MS) then
    begin
      // Successful PopItem operation, as we aren't adding anything into our queue
      Continue;
    end
    else
    begin
      LogIt('TTestThread ERROR: TThreadedQueue.PopItem returned [' +
        TRttiEnumerationType.GetName(vWaitResult) + '] unexpectedly after [' +
        IntToStr(vStopWatch.ElapsedMilliseconds) + 'ms]');
      //pubTestCompletionCheck.SetEvent();
      //Break;
    end;
  end;
end;
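The acceptance check in the loop above boils down to a simple predicate. A C sketch for illustration (the function name is mine, and the tolerance value of 20ms is an assumption; the test source uses the constant ACCEPTABLE_TIMEOUTVARIANCE_MS):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the pass/fail predicate used in the loop above: a wrTimeout
   result is acceptable only when the measured elapsed time is non-zero and
   no more than the allowed variance short of the requested timeout, because
   Windows wait functions may legitimately return a little early. */
#define ACCEPTABLE_TIMEOUT_VARIANCE_MS 20  /* assumed value */

static bool pop_timeout_is_plausible(int64_t elapsed_ms, int64_t timeout_ms)
{
    return elapsed_ms > 0 &&
           elapsed_ms >= timeout_ms - ACCEPTABLE_TIMEOUT_VARIANCE_MS;
}
```

A zero elapsed time is treated separately because it indicates the wait returned without actually waiting at all, which is the symptom under investigation.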

I have tested with a couple of thousand threads in both Win32 and Win64, and it behaves as expected.

MonitorWaitStackFix.pas

Edited by pyscripter

4 minutes ago, pyscripter said:

I have tested with a couple of thousand threads in both win32 and win64 and it behaves as expected.

Thanks for the update.

Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible.

Posted (edited)
6 minutes ago, Anders Melander said:

Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible.

Correct in general, but in this particular test, given the timeout, the greater the number of threads (up to a point), the greater the number of calls to NewWaitObj (which performs the stack push and pop). You can confirm this by running the stress test.

Edited by pyscripter


@pyscripter

Note that in InterlockedCompareExchange128, the last line (MOVZX EAX,AL) is redundant.

The function returns a boolean, one byte in size.

 

And, in InterlockedCompareExchange, shouldn't the 32-bit version be the following code?

 

Result := InterlockedCompareExchange64(Int64(Dest), Int64(NewValue), Int64(CurrentValue)) = Int64(CurrentValue);

 

 

 

 

17 minutes ago, Leif Uneus said:

The function returns a boolean, one byte in size.

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

2 minutes ago, David Heffernan said:

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

Sorry @David Heffernan, but in the file posted by @pyscripter his proposed function InterlockedCompareExchange128 returns a Boolean.

4 minutes ago, David Heffernan said:

Wrong. It returns a Bool, which is LongBool, the 4-byte boolean.

 

The right fix is for Emba to change it to Boolean.

Correct, but in the fix it is declared as Boolean, so I guess MOVZX EAX,AL is not needed. I am reattaching the amended file.

MonitorWaitStackFix.pas

43 minutes ago, Leif Uneus said:

Sorry @David Heffernan, but in the file posted by @pyscripter his proposed function InterlockedCompareExchange128 returns a Boolean.

OK. I've been working to the definition in the Delphi RTL. 

On 5/18/2020 at 12:37 PM, Anders Melander said:

Anyway, here's my final (hopefully) fixed version of InterlockedCompareExchange128 which conforms to the Win32 API documentation:

Are the MOV instructions after LOCK CMPXCHG16B [R10] needed?

 

The following passes the stress test:

 

{$IFDEF Win64}
function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean;
// The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention:
//   RCX        Destination
//   RDX        ExchangeHigh
//   R8         ExchangeLow
//   R9         ComparandResult
//
// CMPXCHG16B requires the following register setup:
//   RDX:RAX    ComparandResult.High:ComparandResult.Low
//   RCX:RBX    ExchangeHigh:ExchangeLow
// See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
asm
      .PUSHNV RBX
      MOV   R10,Destination             // RCX
      MOV   RBX,ExchangeLow             // R8
      MOV   RCX,ExchangeHigh            // RDX
      MOV   RDX,[ComparandResult+8]     // R9
      MOV   RAX,[ComparandResult]       // R9
 LOCK CMPXCHG16B [R10]
//      MOV   [ComparandResult+8],RDX     // R9
//      MOV   [ComparandResult],RAX       // R9
      SETZ  AL
 end;
{$ENDIF Win64}

 

5 hours ago, pyscripter said:

Are the MOV instructions after LOCK CMPXCHG16B [R10] needed?

Yes, so you can revert your change in RSP-28272.

They copy the result back to the ComparandResult parameter (or rather the 128-bit value it points to). See the InterlockedCompareExchange128 documentation:

Quote

ComparandResult

The value to compare to. This parameter is an array of two 64-bit integers considered as a 128-bit field. On output, this is overwritten with the original value of the destination.

 

5 hours ago, pyscripter said:

The following passes the stress test:

Because the stress test does not use the value returned in the ComparandResult parameter.

My test code (see an earlier post) includes a test of InterlockedCompareExchange128 which does validate that ComparandResult is set correctly.
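The documented contract can be modeled in a few lines. A non-atomic C sketch for illustration only (the function name is mine; a real implementation must use LOCK CMPXCHG16B or a compiler intrinsic, never plain loads and stores):

```c
#include <stdbool.h>
#include <stdint.h>

/* Non-atomic model of the InterlockedCompareExchange128 contract, to show
   what the commented-out MOV instructions are for: ComparandResult is
   ALWAYS overwritten with the original value of Destination, whether or
   not the exchange took place. dest and comparand_result each point to
   two 64-bit halves: [0] = low, [1] = high. */
static bool cas128_model(int64_t *dest,
                         int64_t exchange_high, int64_t exchange_low,
                         int64_t *comparand_result)
{
    int64_t orig_low = dest[0], orig_high = dest[1];
    bool matched = (orig_low == comparand_result[0]) &&
                   (orig_high == comparand_result[1]);

    if (matched) {                    /* CMPXCHG16B stores RCX:RBX on match */
        dest[0] = exchange_low;
        dest[1] = exchange_high;
    }
    comparand_result[0] = orig_low;   /* the write-back the MOVs perform */
    comparand_result[1] = orig_high;
    return matched;
}
```

A caller that retries a failed CAS typically relies on this write-back to pick up the fresh value of the destination without issuing a separate read, which is why dropping the MOVs breaks conforming callers even if one particular test passes.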


Not fixed in 10.4:

  • InterlockedCompareExchange128
  • EventCache Push and Pop ABA problem

I guess we were too late 😞

Just now, Anders Melander said:

I guess we were too late

By magnitudes :classic_biggrin:

9 minutes ago, Anders Melander said:

Not fixed in 10.4:

  • InterlockedCompareExchange128
  • EventCache Push and Pop ABA problem

I guess we were too late 😞

 

There's always hope for 10.4.1!

 

35 minutes ago, Darian Miller said:

 

There's always hope for 10.4.1!

 

Rather hope for 10.5. They are unlikely to break DCU compatibility in 10.4.1.

2 minutes ago, Kryvich said:

They are unlikely to break DCU compatibility in 10.4.1

You are right about that but they can fix TMonitor (the ABA problem) by using a private, correct implementation of InterlockedCompareExchange128 without breaking DCU compatibility.

They could also just implement a temporary fix for InterlockedCompareExchange128 that maintains the incorrect return type (bool) but returns the correct value.
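For readers following along: the reason a plain pointer-sized CAS is not enough for the event cache is the classic ABA hazard, and pairing the head with a modification counter (which on Win64 widens the pointer+counter pair to 128 bits, hence the need for a working InterlockedCompareExchange128) makes a stale snapshot detectable. A scaled-down C sketch of the technique, using a 32-bit "pointer" and a 32-bit counter packed into one 64-bit CAS word (all names are mine, for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Scaled-down model of the versioned-head technique: the head "pointer" and
   a counter share one CAS-able word, and every update bumps the counter, so
   a CAS armed with a stale snapshot fails even if the head value itself has
   meanwhile returned to what it was (the ABA case). */
static _Atomic uint64_t g_head;   /* low 32 bits: head, high 32: counter */

static uint64_t pack(uint32_t head, uint32_t counter)
{
    return ((uint64_t)counter << 32) | head;
}

static bool try_set_head(uint64_t snapshot, uint32_t new_head)
{
    uint32_t counter = (uint32_t)(snapshot >> 32);
    uint64_t desired = pack(new_head, counter + 1);
    /* fails if anyone has updated g_head since the snapshot was taken */
    return atomic_compare_exchange_strong(&g_head, &snapshot, desired);
}
```

With a bare (unversioned) head, the final CAS in an A→B→A sequence would wrongly succeed; with the counter it fails and the caller retries with a fresh snapshot.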

Posted (edited)

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

Edited by pyscripter

1 hour ago, pyscripter said:

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

I updated the GitHub repo with your changes and am running two more day-long tests (Win32 + Win64), both with 1,000 threads and a 20ms pop timeout, on a physical Windows 10 machine. My last 24-hour Win32 stress test completed without failure.

 

https://github.com/ideasawakened/DelphiKB/commits/master/Delphi Tests/Source/rtl/common/System.Generics.Collections/TThreadedQueue/StressTestPopItem

 

On 5/30/2020 at 8:44 PM, pyscripter said:

I have improved the fix to deal with TMonitor.Wait being called during unit finalization. A corner case, but it did come up in a real application.

 

See attached:

MonitorWaitStackFix.pas

Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64, each running 1,000 threads with a 20ms timeout, with your latest MonitorWaitStackFix.pas.

 

3 hours ago, Darian Miller said:

Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64, each running 1,000 threads with a 20ms timeout, with your latest MonitorWaitStackFix.pas.

Thank you very much!

 

I have two questions here (if you have time):

1) What is your test bed (CPU and system)?

2) What are your observations about the failing point with regard to timeout/threads? E.g. minimum timeout with x threads fails in n seconds; does a 0 or 1ms timeout with fewer threads than available logical cores fail? Etc.

 

I am not asking you to repeat the test for 24 hours, but failing in minutes versus failing in seconds is more of a concern to me.

 

Thank you again.

13 hours ago, Kas Ob. said:

1) What is your test bed (CPU and system)?

2) What are your observations about the failing point with regard to timeout/threads? E.g. minimum timeout with x threads fails in n seconds; does a 0 or 1ms timeout with fewer threads than available logical cores fail? Etc.

 

1. Windows 10 build 18363, i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM.

2. pyscripter explained earlier in this thread why a low pop timeout fails, which is why there's now a minimum of 20ms in the stress tester.

 

The test originally failed less in a virtual environment (VMware Workstation), so I switched to testing on a physical machine.

 

On Win32, you can run into problems if you throw many more threads at it, though likely more with Windows than with Delphi. I tested with 1,000 threads without any issue.

Win64 could handle 50,000 threads without failing.

 

 

39 minutes ago, Darian Miller said:

1. Windows 10 build 18363, i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM.

2. pyscripter explained earlier in this thread why a low pop timeout fails, which is why there's now a minimum of 20ms in the stress tester.

The test originally failed less in a virtual environment (VMware Workstation), so I switched to testing on a physical machine.

On Win32, you can run into problems if you throw many more threads at it, though likely more with Windows than with Delphi. I tested with 1,000 threads without any issue.

Win64 could handle 50,000 threads without failing.

 

 

If it failed in a virtual environment, then the code is presumably defective.


@Darian Miller Thank you.

 

I am trying to make sense of my observations. It is morning now and I turned on my PC less than an hour ago: it worked fine for about 10 minutes with 256 threads and a 2ms timeout, but failed at 1ms immediately. Yesterday, before shutting down and after the PC had been stressed all day, 16ms was failing and 20ms looked stable, while yesterday morning 1ms worked for more than 10 minutes without failing.

 

Windows 10 build 1803 (17134.706), i7-2600K.

5 hours ago, Kas Ob. said:

It worked fine for about 10 minutes with 256 threads and a 2ms timeout, but failed at 1ms immediately. Yesterday, before shutting down and after the PC had been stressed all day, 16ms was failing and 20ms looked stable, while yesterday morning 1ms worked for more than 10 minutes without failing.

It seems to me that these failures have more to do with the test not properly taking the timer resolution into account than with any remaining problem in TMonitor.

As has already been mentioned, it's perfectly normal, and expected, for a wait function to return early: https://docs.microsoft.com/da-dk/windows/win32/sync/wait-functions?redirectedfrom=MSDN#wait-functions-and-time-out-intervals
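As a rule of thumb for such tests, the earliest legitimate timeout return is roughly the requested interval minus one system timer tick. A small C sketch (names are mine; the 16ms figure approximates the default 64 Hz Windows timer tick of 15.625 ms, which applications can lower via timeBeginPeriod):

```c
#include <stdint.h>

/* Rule-of-thumb model, not an API contract: a Windows wait may report a
   timeout up to one timer tick before the requested interval elapses.
   Default tick is 1000/64 = 15.625 ms; rounded up to 16 here. */
#define DEFAULT_TIMER_TICK_MS 16  /* assumed default resolution */

static int64_t earliest_timeout_ms(int64_t requested_ms)
{
    int64_t earliest = requested_ms - DEFAULT_TIMER_TICK_MS;
    return earliest > 0 ? earliest : 0;
}
```

This is also why the advice earlier in the thread is to keep POP_TIMEOUT above 15: with a timeout at or below one tick, a "zero elapsed time" return can be legitimate quantization rather than a bug.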


I did some testing on that recently, and most of the time the wait functions work as expected, while on some virtual machines they return early. The test cases were run on a few Windows boxes and on two VMware Fusion/Mac VMs. They worked OK on all but one: one Fusion/Mac combo consistently returned early from the wait.

