pyscripter 689 Posted May 17, 2020 (edited) @Anders MelanderCurrently the only way to pass the tests in 64bits is to use the modified version of the CAS function which now looks like this: function CAS(const oldData: pointer; oldReference: NativeInt; newData: pointer; newReference: NativeInt; var destination): boolean; asm {$IFNDEF CPUX64} ... {$ELSE CPUX64} .pushnv rbx mov rax, oldData mov rbx, newData mov rcx, newReference mov r8, [destination] lock cmpxchg16b [r8] {$ENDIF CPUX64} setz al end; { CAS } Your implementation of InterlockedCompareExchange128 does not pass the tests, same as Delphi's own implementation. It would be nice to come up with a working InterlockedCompareExchange128 and submit that as a separate bug report with a working solution. Edited May 17, 2020 by pyscripter Share this post Link to post
Darian Miller 361 Posted May 17, 2020 17 hours ago, pyscripter said: Over here win 32 300 threads 10 minutes: ... Sometimes with many threads say more than 300 I get an early failure before all threads are created and I am not sure why. Once all the threads are created I get no errors. If you add Sleep(1000) at the top of TTestThread.Execute() to give it time for all the threads to be created these startup errors are avoided. I set the max_worker_threads to 200 in Win32 and it bombed after 4 seconds and only 78 threads created. It's definitely hit-n-miss 2020-05-17 10.41.16: StressTestPopItem Start: Waiting up to [60] seconds for PopItem to prematurely timeout. 2020-05-17 10.41.16: Note: Using [10] as PopTimeout on TThreadedQueue creation 2020-05-17 10.41.17: TThreadCreator Start: Creating up to [200] threads 2020-05-17 10.41.17: Note: Creating threads with a StackSize of [65536] 2020-05-17 10.41.20: TTestThread ERROR: TThreadedQueue.PopItem returned [wrTimeout] unexpectedly after [0ms] 2020-05-17 10.41.20: TThreadCreator Note: aborting thread creation at [78] threads as a failure has already been detected. 2020-05-17 10.41.20: TThreadCreator End: [78] worker threads created 2020-05-17 10.41.21: StressTestPopItem End: After [4] seconds, a failure of PopItem was detected in at least one thread 2020-05-17 10.41.21: Hit <enter> to exit Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 I got rid of the AV in CMPXCHG16B by allocating the stack head on the heap thus ensuring correct alignment, but I'm still experiencing problems with 64-bit. At some point, when freeing a block returned by Pop, I get an EInvalidPointer exception from FastMM. Other times I end up with a negative count of items on the stack. I can reproduce with just two threads! Occasionally the error occurs after less than 20 operations. Test source attached. UnitEventStackTest.pas UnitEventStackTest.dfm Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 37 minutes ago, Koru said: Reported https://quality.embarcadero.com/browse/RSB-4115 Already openned No access. Share this post Link to post
pyscripter 689 Posted May 17, 2020 @Darian Miller Please try with this version. It uses the OnmiThreadLibrary CAS function in both 32-bit and 64-bit. Works solidly here. iaStressTest.TThreadedQueue.PopItem.pas Share this post Link to post
Koru 2 Posted May 17, 2020 Hi, sorry for the inconvenience maybe because this is a report for the beta? Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 40 minutes ago, pyscripter said: Currently the only way to pass the tests in 64bits is to use the modified version of the CAS function That function is too obscure; I'm not using it. The oldReference parameter is unused and as I said it's not setting the RDX register. Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 Just now, Koru said: maybe because this is a report for the beta? Probably. It's not a problem but of course it would be nice it we had access now that we're hacking at the problem. Share this post Link to post
Darian Miller 361 Posted May 17, 2020 3 minutes ago, pyscripter said: @Darian Miller Please try with this version. It uses the OnmiThreadLibrary CAS function in both 32-bit and 64-bit. Works solidly here. iaStressTest.TThreadedQueue.PopItem.pas I set the max thread count to 250 and it failed at 183 threads in 10 seconds. So I dropped it down to 175 and it lasted longer 2020-05-17 11.19.17: StressTestPopItem Start: Waiting up to [300] seconds for PopItem to prematurely timeout. 2020-05-17 11.19.17: Note: Using [10] as PopTimeout on TThreadedQueue creation 2020-05-17 11.19.18: TThreadCreator Start: Creating up to [175] threads 2020-05-17 11.19.18: Note: Creating threads with a StackSize of [65536] 2020-05-17 11.19.32: TThreadCreator End: [175] worker threads created 2020-05-17 11.19.43: New Wait objects created: 100000 2020-05-17 11.19.54: New Wait objects created: 200000 2020-05-17 11.20.06: New Wait objects created: 300000 2020-05-17 11.20.17: TTestThread ERROR: TThreadedQueue.PopItem returned [wrTimeout] unexpectedly after [0ms] Share this post Link to post
pyscripter 689 Posted May 17, 2020 @Darian MillerI must add that I am testing with the beta in case it makes a difference. Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 7 minutes ago, Anders Melander said: That function is too obscure; I'm not using it. Well, I just tried it anyway. With the CAS version my test passes with 100 threads each doing 10 million operations. I'll have to compare the assembler of the calling code - but first I have some large quantities of pizza to prepare and eat and then I'm watching Astralis anihilate G2. Share this post Link to post
pyscripter 689 Posted May 17, 2020 22 minutes ago, Anders Melander said: Test source attached. UnitEventStackTest.pas UnitEventStackTest.dfm If you remove the packed from from the TEventStack definition there will be no need to allocate on the heap. Please read https://stackoverflow.com/questions/8460862/what-does-packed-now-forces-byte-alignment-of-records-mean. Share this post Link to post
Darian Miller 361 Posted May 17, 2020 (edited) 20 hours ago, pyscripter said: Sometimes with many threads say more than 300 I get an early failure before all threads are created and I am not sure why. Once all the threads are created I get no errors. If you add Sleep(1000) at the top of TTestThread.Execute() to give it time for all the threads to be created these startup errors are avoided. I removed your sleep and used an event to start all threads after they have all been created, your latest additions quickly fail in Win32 using the latest version of Delphi. iaStressTest.TThreadedQueue.PopItem.pas Edited May 17, 2020 by Darian Miller Share this post Link to post
Anders Melander 1782 Posted May 17, 2020 4 hours ago, pyscripter said: If you remove the packed from from the TEventStack definition there will be no need to allocate on the heap. Ah, that was why I couldn't get the alignment working. The "packed" wasn't there from the start but at some point I think I must have copied your changes back into my own. Share this post Link to post
pyscripter 689 Posted May 18, 2020 I have packaged the fix as a stand-alone unit (see attached). You just add this unit to your project. Uses the CAS function from OmniThreadLibrary. It would still be nice to fix InterlockedCompareExchange128 . MonitorWaitStackFix.pas Share this post Link to post
David Heffernan 2345 Posted May 18, 2020 7 hours ago, pyscripter said: I have packaged the fix as a stand-alone unit (see attached). You just add this unit to your project. Uses the CAS function from OmniThreadLibrary. It would still be nice to fix InterlockedCompareExchange128 . MonitorWaitStackFix.pas I submitted a QP report for InterlockedCompareExchange128 not restoring rbx 1 Share this post Link to post
Anders Melander 1782 Posted May 18, 2020 1 minute ago, David Heffernan said: I submitted a QP report for InterlockedCompareExchange128 not restoring rbx Thanks but that can't be the only problem with it. My tests fail even with that detail fixed but passes with the OmniThreadLibrary CAS. I can't spot what the problem is. Share this post Link to post
David Heffernan 2345 Posted May 18, 2020 1 minute ago, Anders Melander said: Thanks but that can't be the only problem with it. My tests fail even with that detail fixed but passes with the OmniThreadLibrary CAS. I can't spot what the problem is. Has anybody looked at the equivalent for msvc, gcc, clang, etc? Share this post Link to post
Anders Melander 1782 Posted May 18, 2020 7 minutes ago, David Heffernan said: Has anybody looked at the equivalent for msvc, gcc, clang, etc? I've looked at several but they just wrap CMPXCHG16B like CAS does. I think I have an old version of msvc installed somewhere but I think it predates InterlockedCompareExchange128. Share this post Link to post
Leif Uneus 43 Posted May 18, 2020 Maybe someone can understand this Q&A at SO: CMPXCHG16B correct? https://stackoverflow.com/q/4825400/576719 Or ask a new question at SO. Share this post Link to post
Anders Melander 1782 Posted May 18, 2020 (edited) There's one thing that bothers me with both InterlockedCompareExchange128 and CAS. They both return a boolean indicating if the swap was performed or not. They correctly do so by examining ZF after calling CMPXCHG16B. If the flag is set they set AL to 1, otherwise clear it to zero: CMPXCHG16B [...] SETZ AL The problem is that the caller examines the value of AX to get the boolean result: call InterlockedCompareExchange128 test eax,eax jnz ... Now what about the value that just happens to be in AH? I would think that we'd need to clear AX before calling SETZ: CMPXCHG16B [...] XOR EAX, EAX SETZ AL Edit: Whoops. XOR EAX,EAX sets ZF. Edited May 18, 2020 by Anders Melander Share this post Link to post
Anders Melander 1782 Posted May 18, 2020 Aaaaand there's an additional problem with InterlockedCompareExchange128: It doesn't return any value in the ComparandResult parameter. It's supposed to return the new value on success and the current value on failure. Anyway, here's my final (hopefully) fixed version of InterlockedCompareExchange128 which conforms to the Win32 API documentation: {$IFDEF Win64} function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean; // The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention: // RCX Destination // RDX ExchangeHigh // R8 ExchangeLow // R9 ComparandResult // // CMPXCHG16B requires the following register setup: // RDX:RAX ComparandResult.High:ComparandResult.Low // RCX:RBX ExchangeHigh:ExchangeLow // See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b asm .NOFRAME .PUSHNV RBX MOV R10,Destination // RCX MOV RBX,ExchangeLow // R8 MOV RCX,ExchangeHigh // RDX MOV RDX,[ComparandResult+8] // R9 MOV RAX,[ComparandResult] // R9 LOCK CMPXCHG16B [R10] MOV [ComparandResult+8],RDX // R9 MOV [ComparandResult],RAX // R9 SETZ BL XOR EAX, EAX MOV AL,BL end; {$ENDIF Win64} and here's the test I used against it: type T128 = record Low: pointer; High: pointer; end; procedure TestInterlockedCompareExchange128; const Value: T128 = (low: pointer($0010203040506070); high: pointer($8090A0B0C0D0E0F)); var Dest, OldValue, NewValue: T128; begin (* ** Test success *) Dest := Value; OldValue := Value; NewValue.Low := pointer($ABABABABABABABAB); NewValue.High := pointer($1212121212121212); // InterlockedCompareExchange128 should return True Assert(InterlockedCompareExchange128(@Dest, Int64(NewValue.High), Int64(NewValue.Low), @OldValue), 'Success expected'); // Dest contains new value Assert(Dest.Low = NewValue.Low); Assert(Dest.High = NewValue.High); // Comparand contains new value = old value Assert(OldValue.Low = Value.Low); Assert(OldValue.High = Value.High); (* ** Test failure *) Dest := Value; OldValue.Low := pointer($ABABABABABABABAB); OldValue.High := pointer($1212121212121212); NewValue.Low := nil; NewValue.High := nil; // InterlockedCompareExchange128 should return False Assert(not InterlockedCompareExchange128(@Dest, Int64(NewValue.High), Int64(NewValue.Low), @OldValue), 'Failure expected'); // Dest contains original value Assert(Dest.Low = Value.Low); Assert(Dest.High = Value.High); // Comparand contains original value Assert(OldValue.Low = Value.Low); Assert(OldValue.High = Value.High); end; With this fix my event stack test passes no matter how hard I hit it. Share this post Link to post
David Heffernan 2345 Posted May 18, 2020 (edited) SETZ BL XOR EAX, EAX MOV AL,BL Don't you just need SETZ AL MOVZX EAX, AL Also, .NOFRAME is wrong here. You need the frame to preserve RBX. At least that's how I understand it. Edited May 18, 2020 by David Heffernan 1 Share this post Link to post
David Heffernan 2345 Posted May 18, 2020 (edited) It's astonishing that Emba's InterlockedCompareExchange128 returns a BOOL, i.e. a 4 byte boolean, LongBool. They are presumably trying to copy this one from MS https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128 which is actually implemented as an intrinsic. But look at the MS function's return type. Yup, it's BOOLEAN which is a single byte. The Emba version should use Boolean rather than Bool as the return type. It's a screw-up from top to bottom. Anyone might think that they just write this stuff and don't test it ...... Edited May 18, 2020 by David Heffernan Share this post Link to post
Anders Melander 1782 Posted May 18, 2020 4 minutes ago, David Heffernan said: Don't you just need SETZ AL MOVZX EAX, AL Yes. You are right. 4 minutes ago, David Heffernan said: Also, .NOFRAME is wrong here. You need the frame to preserve RBX. At least that's how I understand it. As I read it .NOFRAME tells the compiler that I'm not calling anything and don't need a stack frame. Of course that means I'm relying on the existing stack frame having room for the single register we're pushing. I guess removing .NOFRAME would be the safe thing to do. 1 Share this post Link to post