pyscripter 689 Posted May 19, 2020 (edited) I am attaching an updated fix using the corrected InterlockedCompareExchange128 by @Anders Melander. A note about @Darian Miller stress test from Github. If you run the stress test with the fix you will see failures reported. Fist a couple of important pointers: WaitForSingleObject and the rest of Windows wait functions by default use the system clock resolution and not the high-resolution timer like QueryPerformanceCounter that the StopWatch is using. The resolution of clock timer is between 7-15ms at least in Windows 7. The StopWatch ElapsedMilliseconds is a rounded figure. So when it is zero it does not mean that the elapsed time is zero. I have confirmed that in the cases where the stress test reports a failure, this is not Delphi's fault but what the WaitForSingleObject returns. To run the stress test in a meaningful way either change the POP-TIMEOUT to a value greater than 15 or change the TTestThread.Execute() so that it just reports cases in which the vStopWatch.ElapsedMilliseconds is zero but it does not stop the test: e.g. procedure TTestThread.Execute(); var Item: TItem; vWaitResult:TWaitResult; vStopwatch:TStopwatch; begin pubSyncrhonizedStart.WaitFor(); while not Terminated do begin if fQueue.ShutDown then Break; if pubTestCompletionCheck.WaitFor(0) = wrSignaled then Break; vStopwatch := TStopwatch.StartNew; vWaitResult := FQueue.PopItem( Item ); vStopWatch.Stop(); //Note: Reason for ACCEPTABLE_VARIANCE_MS is that on some low timeout values (like 200ms) it occasionally times out a little early (like 180ms) if (vWaitResult = wrTimeout) and (vStopWatch.ElapsedMilliseconds > 0) and (vStopWatch.ElapsedMilliseconds >= POP_TIMEOUT-ACCEPTABLE_TIMEOUTVARIANCE_MS) then begin //successful PopItem operation as we aren't adding anything into our queue Continue; end else begin LogIt('TTestThread ERROR: TThreadedQueue.PopItem returned [' + TRttiEnumerationType.GetName(vWaitResult) + '] unexpectedly after [' + IntToStr(vStopWatch.ElapsedMilliseconds) + 'ms]'); //pubTestCompletionCheck.SetEvent(); //Break; end; end; end; I have tested with a couple of thousand threads in both win32 and win64 and it behaves as expected. MonitorWaitStackFix.pas Edited May 19, 2020 by pyscripter 1 Share this post Link to post
Anders Melander 1782 Posted May 19, 2020 4 minutes ago, pyscripter said: I have tested with a couple of thousand threads in both win32 and win64 and it behaves as expected. Thanks for the update. Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible. Share this post Link to post
pyscripter 689 Posted May 19, 2020 (edited) 6 minutes ago, Anders Melander said: Note that more threads does not necessarily result in a higher likelihood of failure. More threads means more overhead (context switching and so on) and less time spent actually exercising the code we want to test. I think the the goal should be to loop as fast as possible (i.e. perform as many operations as possible), utilizing all available CPUs, with as few context switches and as little kernel mode as possible. Correct in general but in this particular test and given the timeout the greater the number of threads (up to a point) the greater the number of calls to NewWaitObj (which calls the stack push and pop). You can confirm by running the stress test. Edited May 19, 2020 by pyscripter Share this post Link to post
Leif Uneus 43 Posted May 19, 2020 @pyscripter Note that in InterlockedCompareExchange128, the last line (MOVZX EAX,AL) is redundant. The function returns a boolean, one byte in size. And, in InterlockedCompareExchange, should it not be following code for the 32 bit ? Result := InterlockedCompareExchange64(Int64(Dest), Int64(NewValue), Int64(CurrentValue)) = Int64(CurrentValue); Share this post Link to post
David Heffernan 2345 Posted May 19, 2020 17 minutes ago, Leif Uneus said: The function returns a boolean, one byte in size. Wrong. It returns a Bool, which is LongBool, the 4 byte boolean. The right fix is for Emba to change it to Boolean. Share this post Link to post
Leif Uneus 43 Posted May 19, 2020 2 minutes ago, David Heffernan said: Wrong. It returns a Bool, which is LongBool, the 4 byte boolean. The right fix is for Emba to change it to Boolean. Sorry @DavidHeffernan, but in the file posted by @pyscript his proposed function InterlockedCompareExchange128 returns a boolean. Share this post Link to post
pyscripter 689 Posted May 19, 2020 4 minutes ago, David Heffernan said: Wrong. It returns a Bool, which is LongBool, the 4 byte boolean. The right fix is for Emba to change it to Boolean. Correct, but in the fix it is declered as boolean. So I guess MOVZX EAX,AL is not needed. I am reattaching the amended file. MonitorWaitStackFix.pas Share this post Link to post
David Heffernan 2345 Posted May 19, 2020 43 minutes ago, Leif Uneus said: Sorry @DavidHeffernan, but in the file posted by @pyscript his proposed function InterlockedCompareExchange128 returns a boolean. OK. I've been working to the definition in the Delphi RTL. Share this post Link to post
pyscripter 689 Posted May 21, 2020 On 5/18/2020 at 12:37 PM, Anders Melander said: Anyway, here's my final (hopefully) fixed version of InterlockedCompareExchange128 which conforms to the Win32 API documentation: Are the MOV instructions after LOCK CMPXCHG16B [R10] needed? The following passes the stress test: {$IFDEF Win64} function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean; // The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention: // RCX Destination // RDX ExchangeHigh // R8 ExchangeLow // R9 ComparandResult // // CMPXCHG16B requires the following register setup: // RDX:RAX ComparandResult.High:ComparandResult.Low // RCX:RBX ExchangeHigh:ExchangeLow // See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b asm .PUSHNV RBX MOV R10,Destination // RCX MOV RBX,ExchangeLow // R8 MOV RCX,ExchangeHigh // RDX MOV RDX,[ComparandResult+8] // R9 MOV RAX,[ComparandResult] // R9 LOCK CMPXCHG16B [R10] // MOV [ComparandResult+8],RDX // R9 // MOV [ComparandResult],RAX // R9 SETZ AL end; {$ENDIF Win64} Share this post Link to post
Anders Melander 1782 Posted May 21, 2020 5 hours ago, pyscripter said: Are the MOV instructions after LOCK CMPXCHG16B [R10] needed? Yes, so you can revert your change in RSP-28272. They copy the result back to the ComparandResult parameter (or rather the 128 bit value it points to). See the InterlockedCompareExchange128 documentation: Quote ComparandResult The value to compare to. This parameter is an array of two 64-bit integers considered as a 128-bit field. On output, this is overwritten with the original value of the destination. 5 hours ago, pyscripter said: The following passes the stress test: Because the stress test does not use return value in the ComparandResult parameter. My test code (see an earlier post) includes a test of InterlockedCompareExchange128 which does validate that ComparandResult is set correctly. 1 Share this post Link to post
pyscripter 689 Posted May 21, 2020 @Anders MelanderThanks for the explanation. I have updated my comment in the RSP. Share this post Link to post
Anders Melander 1782 Posted May 26, 2020 Not fixed in 10.4: InterlockedCompareExchange128 EventCache Push and Pop ABA problem I guess we were too late 😞 Share this post Link to post
Uwe Raabe 2057 Posted May 26, 2020 Just now, Anders Melander said: I guess we were too late By magnitudes Share this post Link to post
Darian Miller 361 Posted May 26, 2020 9 minutes ago, Anders Melander said: Not fixed in 10.4: InterlockedCompareExchange128 EventCache Push and Pop ABA problem I guess we were too late 😞 There's always hope for 10.4.1! 1 Share this post Link to post
Kryvich 165 Posted May 26, 2020 35 minutes ago, Darian Miller said: There's always hope for 10.4.1! Rather hope for 10.5. They are unlikely to break DCU compatibility in 10.4.1. Share this post Link to post
Anders Melander 1782 Posted May 26, 2020 2 minutes ago, Kryvich said: They are unlikely to break DCU compatibility in 10.4.1 You are right about that but they can fix TMonitor (the ABA problem) by using a private, correct implementation of InterlockedCompareExchange128 without breaking DCU compatibility. They could also just implement a temporary fix for InterlockedCompareExchange128 that maintains the incorrect return type (bool) but returns the correct value. 2 Share this post Link to post
pyscripter 689 Posted May 31, 2020 (edited) I have improved the fix to deal with TMonitor.Wait being called during unit finalization. Corner case but it did come up in a real application. See attached: MonitorWaitStackFix.pas Edited May 31, 2020 by pyscripter Share this post Link to post
Darian Miller 361 Posted May 31, 2020 1 hour ago, pyscripter said: I have improved the fix to deal with TMonitor.Wait being called during unit finalization. Corner case but it did come up in a real application. See attached: MonitorWaitStackFix.pas I updated the GitHub repo with your changes and am running two more day-long tests (win32+win64) both with 1,000 threads and 20ms pop-timeout on Windows 10 physical machine. My last 24-hour Win32 stress test completed without failure. https://github.com/ideasawakened/DelphiKB/commits/master/Delphi Tests/Source/rtl/common/System.Generics.Collections/TThreadedQueue/StressTestPopItem 1 1 Share this post Link to post
Darian Miller 361 Posted June 1, 2020 On 5/30/2020 at 8:44 PM, pyscripter said: I have improved the fix to deal with TMonitor.Wait being called during unit finalization. Corner case but it did come up in a real application. See attached: MonitorWaitStackFix.pas Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64 running 1,000 threads each with a 20ms timeout with your latest MonitorWaitStackFix.pas. 1 Share this post Link to post
Guest Posted June 1, 2020 3 hours ago, Darian Miller said: Two stress tests just completed without error after running for 24 hours straight: Win32 and Win64 running 1,000 threads each with a 20ms timeout with your latest MonitorWaitStackFix.pas. Thank you very much ! I have two question here, (if you have time) 1) What is your test bed ? (CPU and System) 2) What is your observations about the failing point that concern timeout/threads ? eg. minimum timeout with x threads will fail in n seconds , does 0 or 1 ms timeout with number less than available logical cores fail ?, etc.. I am not asking to repeat the test for 24h, but failing in minutes against failing in seconds is more a concern to me. Thank you again. Share this post Link to post
Darian Miller 361 Posted June 1, 2020 13 hours ago, Kas Ob. said: 1) What is your test bed ? (CPU and System) 2) What is your observations about the failing point that concern timeout/threads ? eg. minimum timeout with x threads will fail in n seconds , does 0 or 1 ms timeout with number less than available logical cores fail ?, etc.. 1. Windows 10 build 18363. i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM. 2. pyscripter explained why a low pop timeout fails earlier in this thread and is why there's a minimum of 20ms in the stress tester now. The test originally failed less in a virtual environment (VMWare Workstation) so I switched to testing on physical machine. On Win32, you can try throwing more threads and run into problems, likely more with Windows than Delphi. I tested with 1,000 threads without any issue. Win64 could handle 50,000 threads without failing. Share this post Link to post
David Heffernan 2345 Posted June 1, 2020 39 minutes ago, Darian Miller said: 1. Windows 10 build 18363. i7-7700K 4.2GHz, 4 cores, 8 logical processors, 64GB RAM. 2. pyscripter explained why a low pop timeout fails earlier in this thread and is why there's a minimum of 20ms in the stress tester now. The test originally failed less in a virtual environment (VMWare Workstation) so I switched to testing on physical machine. On Win32, you can try throwing more threads and run into problems, likely more with Windows than Delphi. I tested with 1,000 threads without any issue. Win64 could handle 50,000 threads without failing. If it failed in a virtual environment, then the code is presumably defective Share this post Link to post
Guest Posted June 2, 2020 @Darian Miller Thank you, I am trying to make sense of my observation, now it is morning and i just turned on my PC less an hour ago, it does worked fine for about 10 minutes with 256 thread with timeout at 2ms but failed at 1ms immediately, yesterday before shutting down and after the PC was stressed all day, the 16ms was failing and 20 looked stable, while yesterday in the morning 1ms worked for more than 10 minutes without fail. Windows 10 Build 1803 ( 17134.706), i7-2600K . Share this post Link to post
Anders Melander 1782 Posted June 2, 2020 5 hours ago, Kas Ob. said: it does worked fine for about 10 minutes with 256 thread with timeout at 2ms but failed at 1ms immediately, yesterday before shutting down and after the PC was stressed all day, the 16ms was failing and 20 looked stable, while yesterday in the morning 1ms worked for more than 10 minutes without fail. It seems to me that these failures have more to do with the test not properly taking the timer resolution into account, than any possible remaining problems with TMonitor. Like it has already been mentioned it's perfectly normal, and expected, for a wait function to return early: https://docs.microsoft.com/da-dk/windows/win32/sync/wait-functions?redirectedfrom=MSDN#wait-functions-and-time-out-intervals 1 Share this post Link to post
Primož Gabrijelčič 223 Posted June 2, 2020 Did some testing on that recently and most of the time wait functions work as expected while on some virtual machines they return early. The test cases were running on few Windows boxes and on two VMWare Fusion/Mac VMs. Worked OK on all but one. One Fusion/Mac combo consistently returned early from wait. Share this post Link to post