Jump to content

pyscripter

Members
  • Content Count

    983
  • Joined

  • Last visited

  • Days Won

    62

Everything posted by pyscripter

  1. pyscripter

    You RAD Studio 10.4 Sydney appreciated features and bug fixes

    Also the improvements in Vcl.Styles and DPI awareness are welcome. There are still a few outstanding issues: InputQuery scaling and styling issues Styled menus DPI scaling TCustomizeDlg not fully VCL styled
  2. pyscripter

    You RAD Studio 10.4 Sydney appreciated features and bug fixes

    Right now the quality portal is updated at a frenetic rate. There was a very large number of fixes, I think more than in any other previous version, and I was quite pleased that quite a few of my reports have been resolved, some of the unexpectedly: System.AnsiStrings AdjustLineBreaks not exported Memory leak in TInvokeableVariantType.DispInvoke Optimize TInvokeableVariantType.DispInvoke Wrongly premultiplied TWICImage when assigning a Bitmap with aDefined TButtonedEdit is not styled properly System.Threading got some attention with the famous Cancel and Wait issue resolved. Issues I would like to see fixed now and consider critical are: TThreadedQueue and TMonitor issue, possible solution InterlockedCompareExchange128 doesn't restore RBX Threading - Incorrect calculation of IdleWorkerThreadCount
  3. pyscripter

    MMX supports Delphi 10.4 Sydney

    That was quick!
  4. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @Anders MelanderThanks for the explanation. I have updated my comment in the RSP.
  5. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Are the MOV instructions after LOCK CMPXCHG16B [R10] needed? The following passes the stress test: {$IFDEF Win64} function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean; // The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention: // RCX Destination // RDX ExchangeHigh // R8 ExchangeLow // R9 ComparandResult // // CMPXCHG16B requires the following register setup: // RDX:RAX ComparandResult.High:ComparandResult.Low // RCX:RBX ExchangeHigh:ExchangeLow // See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b asm .PUSHNV RBX MOV R10,Destination // RCX MOV RBX,ExchangeLow // R8 MOV RCX,ExchangeHigh // RDX MOV RDX,[ComparandResult+8] // R9 MOV RAX,[ComparandResult] // R9 LOCK CMPXCHG16B [R10] // MOV [ComparandResult+8],RDX // R9 // MOV [ComparandResult],RAX // R9 SETZ AL end; {$ENDIF Win64}
  6. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Correct, but in the fix it is declered as boolean. So I guess MOVZX EAX,AL is not needed. I am reattaching the amended file. MonitorWaitStackFix.pas
  7. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Correct in general but in this particular test and given the timeout the greater the number of threads (up to a point) the greater the number of calls to NewWaitObj (which calls the stack push and pop). You can confirm by running the stress test.
  8. pyscripter

    Revisiting TThreadedQueue and TMonitor

    I am attaching an updated fix using the corrected InterlockedCompareExchange128 by @Anders Melander. A note about @Darian Miller stress test from Github. If you run the stress test with the fix you will see failures reported. Fist a couple of important pointers: WaitForSingleObject and the rest of Windows wait functions by default use the system clock resolution and not the high-resolution timer like QueryPerformanceCounter that the StopWatch is using. The resolution of clock timer is between 7-15ms at least in Windows 7. The StopWatch ElapsedMilliseconds is a rounded figure. So when it is zero it does not mean that the elapsed time is zero. I have confirmed that in the cases where the stress test reports a failure, this is not Delphi's fault but what the WaitForSingleObject returns. To run the stress test in a meaningful way either change the POP-TIMEOUT to a value greater than 15 or change the TTestThread.Execute() so that it just reports cases in which the vStopWatch.ElapsedMilliseconds is zero but it does not stop the test: e.g. procedure TTestThread.Execute(); var Item: TItem; vWaitResult:TWaitResult; vStopwatch:TStopwatch; begin pubSyncrhonizedStart.WaitFor(); while not Terminated do begin if fQueue.ShutDown then Break; if pubTestCompletionCheck.WaitFor(0) = wrSignaled then Break; vStopwatch := TStopwatch.StartNew; vWaitResult := FQueue.PopItem( Item ); vStopWatch.Stop(); //Note: Reason for ACCEPTABLE_VARIANCE_MS is that on some low timeout values (like 200ms) it occasionally times out a little early (like 180ms) if (vWaitResult = wrTimeout) and (vStopWatch.ElapsedMilliseconds > 0) and (vStopWatch.ElapsedMilliseconds >= POP_TIMEOUT-ACCEPTABLE_TIMEOUTVARIANCE_MS) then begin //successful PopItem operation as we aren't adding anything into our queue Continue; end else begin LogIt('TTestThread ERROR: TThreadedQueue.PopItem returned [' + TRttiEnumerationType.GetName(vWaitResult) + '] unexpectedly after [' + IntToStr(vStopWatch.ElapsedMilliseconds) + 'ms]'); //pubTestCompletionCheck.SetEvent(); //Break; end; end; end; I have tested with a couple of thousand threads in both win32 and win64 and it behaves as expected. MonitorWaitStackFix.pas
  9. pyscripter

    Revisiting TThreadedQueue and TMonitor

    I have packaged the fix as a stand-alone unit (see attached). You just add this unit to your project. Uses the CAS function from OmniThreadLibrary. It would still be nice to fix InterlockedCompareExchange128 . MonitorWaitStackFix.pas
  10. pyscripter

    Revisiting TThreadedQueue and TMonitor

    If you remove the packed from from the TEventStack definition there will be no need to allocate on the heap. Please read https://stackoverflow.com/questions/8460862/what-does-packed-now-forces-byte-alignment-of-records-mean.
  11. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @Darian MillerI must add that I am testing with the beta in case it makes a difference.
  12. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @Darian Miller Please try with this version. It uses the OnmiThreadLibrary CAS function in both 32-bit and 64-bit. Works solidly here. iaStressTest.TThreadedQueue.PopItem.pas
  13. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @Anders MelanderCurrently the only way to pass the tests in 64bits is to use the modified version of the CAS function which now looks like this: function CAS(const oldData: pointer; oldReference: NativeInt; newData: pointer; newReference: NativeInt; var destination): boolean; asm {$IFNDEF CPUX64} ... {$ELSE CPUX64} .pushnv rbx mov rax, oldData mov rbx, newData mov rcx, newReference mov r8, [destination] lock cmpxchg16b [r8] {$ENDIF CPUX64} setz al end; { CAS } Your implementation of InterlockedCompareExchange128 does not pass the tests, same as Delphi's own implementation. It would be nice to come up with a working InterlockedCompareExchange128 and submit that as a separate bug report with a working solution.
  14. pyscripter

    Revisiting TThreadedQueue and TMonitor

    SetMinimumBlockAlignment does nothing in 64bits (see the source code). Alignment is fixed at mba16Byte always.
  15. pyscripter

    Revisiting TThreadedQueue and TMonitor

    If I replace the push and pop rbx with .pushnv rbx and remove the .noframe the code crashes. How is one supposed to use .pushnv? function CAS(const oldData: pointer; oldReference: NativeInt; newData: pointer; newReference: NativeInt; var destination): boolean; asm {$IFNDEF CPUX64} {$ELSE CPUX64} .pushnv rbx //rsp := rsp - 8 ! mov rax, oldData mov rbx, newData mov rcx, newReference mov r8, [destination + 8] //+8 with respect to .noframe lock cmpxchg16b [r8] {$ENDIF CPUX64} setz al end; { CAS }
  16. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @David HeffernanThanks. Could you please post here what the function should be?
  17. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Well there is the Delphi implementation which does not work: function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): Bool; stdcall; asm MOV R10,RCX MOV RBX,R8 MOV RCX,RDX MOV RDX,[R9+8] MOV RAX,[R9] LOCK CMPXCHG16B [R10] SETZ AL end; and the CAS function from OmniThreadLibrary which works well. function CAS(const oldData: pointer; oldReference: NativeInt; newData: pointer; newReference: NativeInt; var destination): boolean; asm {$IFNDEF CPUX64} push edi push ebx mov ebx, newData mov ecx, newReference mov edi, destination lock cmpxchg8b qword ptr [edi] pop ebx pop edi {$ELSE CPUX64} .noframe push rbx //rsp := rsp - 8 ! mov rax, oldData mov rbx, newData mov rcx, newReference mov r8, [destination + 8] //+8 with respect to .noframe lock cmpxchg16b [r8] pop rbx {$ENDIF CPUX64} setz al end; { CAS } @David Heffernanhas said that both are wrong (see earlier comments). By the way Allen Bauer explains .PUSHNV here.
  18. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @David HeffernanI know very little about assembler and I will leave this to experts to resolve. However I do see push rbx and pop rbx in the code. Does this not save and restore rbx? This code has been in the master branch of OmniThreadLibrary for ages and should have been tested extensively. It does appear to work well we the TMonitor fix.
  19. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Success! If I use the CAS function from OmniThreadLibrary it works. So it appears that Delphi's InterlockedCompareExchange128 is broken. Passed a test with 1000 threads and 50 million calls to NewWaitObj. I am attaching the revised test code which should now work in both 32bits and 64bits. We should now try to convince Embarcadero to include the fixes in the Delphi RTL. iaStressTest.TThreadedQueue.PopItem.pas
  20. pyscripter

    Revisiting TThreadedQueue and TMonitor

    @Kas Ob.Thanks for the info about alignment. Actually Delphi does offer support for alignment (undocumented feature see @Uwe Raabe response in this Stackoverflow question). If you change the definition of TEventStack to TEventStack = record Head: PEventItemHolder; Counter: NativeInt; procedure Push(EventItem: PEventItemHolder); function Pop: PEventItemHolder; end align {$IFDEF CPUX64}16{$ELSE CPUX64}8{$ENDIF CPUX64}; There are not longer crashes. The test still fails though and I am investigating why. And by the way. There is also the [Align(8)] attribute which works with fields, global variables and records it seems.
  21. pyscripter

    Revisiting TThreadedQueue and TMonitor

    For reference this is from fpc. function InterlockedCompareExchange128(var Target: Int128Rec; NewValue: Int128Rec; Comperand: Int128Rec): Int128Rec; assembler; { win64: rcx ... pointer to result rdx ... target r8 ... NewValue r9 ... Comperand } {$ifdef win64} asm pushq %rbx { store result pointer for later use } pushq %rcx { load new value } movq (%r8),%rbx movq 8(%r8),%rcx { save target pointer for later use } movq %rdx,%r8 { load comperand } movq (%r9),%rax movq 8(%r9),%rdx {$ifdef oldbinutils} .byte 0xF0,0x49,0x0F,0xC7,0x08 {$else} lock cmpxchg16b (%r8) {$endif} { restore result pointer } popq %rcx { store result } movq %rax,(%rcx) movq %rdx,8(%rcx) popq %rbx end; Any chance of anyone coming up with a working InterlockedCompareExchange128 for Delphi?
  22. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Over here win 32 300 threads 10 minutes: 2020-05-17 01.01.08: StressTestPopItem Start: Waiting up to [600] seconds for PopItem to prematurely timeout. 2020-05-17 01.01.08: Note: Using [10] as PopTimeout on TThreadedQueue creation 2020-05-17 01.01.09: TThreadCreator Start: Creating up to [300] threads 2020-05-17 01.01.09: Note: Creating threads with a StackSize of [65536] 2020-05-17 01.01.09: TThreadCreator End: [300] worker threads created ... 2020-05-17 01.11.05: New Wait objects created: 14800000 2020-05-17 01.11.08: StressTestPopItem End: Overall maximum time limit reached for this test without an error detected...we will call this a success! 2020-05-17 01.11.08: Hit <enter> to exit NewWaitObj was called 15 million times. and Win32 2000 threads 30 seconds. 2020-05-17 01.30.28: StressTestPopItem Start: Waiting up to [30] seconds for PopItem to prematurely timeout. 2020-05-17 01.30.28: Note: Using [10] as PopTimeout on TThreadedQueue creation 2020-05-17 01.30.29: TThreadCreator Start: Creating up to [2000] threads 2020-05-17 01.30.29: Note: Creating threads with a StackSize of [65536] 2020-05-17 01.30.29: TThreadCreator End: [2000] worker threads created ... 2020-05-17 01.30.58: New Wait objects created: 5000000 2020-05-17 01.30.58: StressTestPopItem End: Overall maximum time limit reached for this test without an error detected...we will call this a success! 2020-05-17 01.30.58: Hit <enter> to exit NewWaitObj was called 5 million times in 30 seconds. Sometimes with many threads say more than 300 I get an early failure before all threads are created and I am not sure why. Once all the threads are created I get no errors. If you add Sleep(1000) at the top of TTestThread.Execute() to give it time for all the threads to be created these startup errors are avoided.
  23. pyscripter

    Revisiting TThreadedQueue and TMonitor

    I am attaching a new version of the test code incorporating the suggestions of @Anders Melander in case anyone wants to try see attached iaStressTest.TThreadedQueue.PopItem that can be used with the original stress test of @Darian Miller). iaStressTest.TThreadedQueue.PopItem.pas
  24. pyscripter

    Revisiting TThreadedQueue and TMonitor

    function InterlockedCompareExchage(var Dest: TEventStack; NewValue, CurrentValue: TEventStack): Boolean; begin {$IFDEF CPUX64} Result := InterlockedCompareExchange128(@Dest, Int64(NewValue.Head), NewValue.Counter, @CurrentValue); {$ELSE CPUX64} Result := InterlockedCompareExchange64(Int64(Dest), Int64(NewValue), Int64(CurrentValue)) = Int64(CurrentValue); {$ENDIF CPUX64} end; I have tried the above. Works well in 32 bits, but crashes in 64bits. The implementation of InterlockedCompareExchange128 in WinApi.Windows.pas is: function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): Bool; stdcall; asm MOV R10,RCX MOV RBX,R8 MOV RCX,RDX MOV RDX,[R9+8] MOV RAX,[R9] LOCK CMPXCHG16B [R10] SETZ AL end;
  25. pyscripter

    Revisiting TThreadedQueue and TMonitor

    Then I hope the weather will be bad 😀
×