Jump to content
pyscripter

Revisiting TThreadedQueue and TMonitor

Recommended Posts

Posted (edited)

@Anders MelanderCurrently the only way to pass the tests in 64bits is to use the modified version of the CAS function which now looks like this:

function CAS(const oldData: pointer; oldReference: NativeInt; newData: pointer;
  newReference: NativeInt; var destination): boolean;
asm
{$IFNDEF CPUX64}
...
{$ELSE CPUX64}
  .pushnv rbx
  mov   rax, oldData
  mov   rbx, newData
  mov   rcx, newReference
  mov   r8, [destination]
  lock cmpxchg16b [r8]
{$ENDIF CPUX64}
  setz  al
end; { CAS }

Your implementation of InterlockedCompareExchange128 does not pass the tests, same as Delphi's own implementation.  It would be nice to come up with a working InterlockedCompareExchange128 and submit that as a separate bug report with a working solution.

Edited by pyscripter

Share this post


Link to post
17 hours ago, pyscripter said:

Over here win 32 300 threads 10 minutes:

...

Sometimes with many threads say more than 300 I get an early failure before all threads are created and I am not sure why.  Once all the threads are created I get no errors.  If you add Sleep(1000) at the top of  TTestThread.Execute() to give it time for all the threads to be created these startup errors are avoided.

 

I set the max_worker_threads to 200 in Win32 and it bombed after 4 seconds and only 78 threads created.  It's definitely hit-n-miss

 

2020-05-17 10.41.16: StressTestPopItem Start: Waiting up to [60] seconds for PopItem to prematurely timeout.
2020-05-17 10.41.16: Note: Using [10] as PopTimeout on TThreadedQueue creation
2020-05-17 10.41.17: TThreadCreator Start: Creating up to [200] threads
2020-05-17 10.41.17: Note: Creating threads with a StackSize of [65536]
2020-05-17 10.41.20: TTestThread ERROR: TThreadedQueue.PopItem returned [wrTimeout] unexpectedly after [0ms]
2020-05-17 10.41.20: TThreadCreator Note: aborting thread creation at [78] threads as a failure has already been detected.
2020-05-17 10.41.20: TThreadCreator End: [78] worker threads created
2020-05-17 10.41.21: StressTestPopItem End: After [4] seconds, a failure of PopItem was detected in at least one thread
2020-05-17 10.41.21: Hit <enter> to exit

Share this post


Link to post

I got rid of the AV in CMPXCHG16B by allocating the stack head on the heap thus ensuring correct alignment, but I'm still experiencing problems with 64-bit. At some point, when freeing a block returned by Pop, I get an EInvalidPointer exception from FastMM. Other times I end up with a negative count of items on the stack. I can reproduce with just two threads! Occasionally the error occurs after less than 20 operations.

 

Test source attached.

UnitEventStackTest.pas

UnitEventStackTest.dfm

Share this post


Link to post

Hi, sorry for the inconvenience

maybe because this is a report for the beta?

Share this post


Link to post
40 minutes ago, pyscripter said:

Currently the only way to pass the tests in 64bits is to use the modified version of the CAS function

That function is too obscure; I'm not using it. The oldReference parameter is unused and as I said it's not setting the RDX register.

Share this post


Link to post
Just now, Koru said:

maybe because this is a report for the beta?

Probably. It's not a problem but of course it would be nice it we had access now that we're hacking at the problem.

Share this post


Link to post
3 minutes ago, pyscripter said:

@Darian Miller Please try with this version.  It uses the OnmiThreadLibrary CAS function in both 32-bit and 64-bit.  Works solidly here.

iaStressTest.TThreadedQueue.PopItem.pas

 

I set the max thread count to 250 and it failed at 183 threads in 10 seconds.   So I dropped it down to 175 and it lasted longer

 

2020-05-17 11.19.17: StressTestPopItem Start: Waiting up to [300] seconds for PopItem to prematurely timeout.
2020-05-17 11.19.17: Note: Using [10] as PopTimeout on TThreadedQueue creation
2020-05-17 11.19.18: TThreadCreator Start: Creating up to [175] threads
2020-05-17 11.19.18: Note: Creating threads with a StackSize of [65536]
2020-05-17 11.19.32: TThreadCreator End: [175] worker threads created
2020-05-17 11.19.43: New Wait objects created: 100000
2020-05-17 11.19.54: New Wait objects created: 200000
2020-05-17 11.20.06: New Wait objects created: 300000
2020-05-17 11.20.17: TTestThread ERROR: TThreadedQueue.PopItem returned [wrTimeout] unexpectedly after [0ms]

 

Share this post


Link to post
7 minutes ago, Anders Melander said:

That function is too obscure; I'm not using it.

Well, I just tried it anyway. With the CAS version my test passes with 100 threads each doing 10 million operations.

I'll have to compare the assembler of the calling code - but first I have some large quantities of pizza to prepare and eat and then I'm watching Astralis anihilate G2.

Share this post


Link to post
Posted (edited)
20 hours ago, pyscripter said:

Sometimes with many threads say more than 300 I get an early failure before all threads are created and I am not sure why.  Once all the threads are created I get no errors.  If you add Sleep(1000) at the top of  TTestThread.Execute() to give it time for all the threads to be created these startup errors are avoided.

I removed your sleep and used an event to start all threads after they have all been created, your latest additions quickly fail in Win32 using the latest version of Delphi.

 

 

iaStressTest.TThreadedQueue.PopItem.pas

Edited by Darian Miller

Share this post


Link to post
4 hours ago, pyscripter said:

If you remove the packed from from the TEventStack definition there will be no need to allocate on the heap.

Ah, that was why I couldn't get the alignment working. The "packed" wasn't there from the start but at some point I think I must have copied your changes back into my own.

Share this post


Link to post

I have packaged the fix as a stand-alone unit (see attached).  You just add this unit to your project.  Uses the CAS function from OmniThreadLibrary.  It would still be nice to fix InterlockedCompareExchange128 .

MonitorWaitStackFix.pas

Share this post


Link to post
7 hours ago, pyscripter said:

I have packaged the fix as a stand-alone unit (see attached).  You just add this unit to your project.  Uses the CAS function from OmniThreadLibrary.  It would still be nice to fix InterlockedCompareExchange128 .

MonitorWaitStackFix.pas

I submitted a QP report for InterlockedCompareExchange128 not restoring rbx

  • Thanks 1

Share this post


Link to post
1 minute ago, David Heffernan said:

I submitted a QP report for InterlockedCompareExchange128 not restoring rbx

Thanks but that can't be the only problem with it.

My tests fail even with that detail fixed but passes with the OmniThreadLibrary CAS. I can't spot what the problem is.

Share this post


Link to post
1 minute ago, Anders Melander said:

Thanks but that can't be the only problem with it.

My tests fail even with that detail fixed but passes with the OmniThreadLibrary CAS. I can't spot what the problem is.

Has anybody looked at the equivalent for msvc, gcc, clang, etc? 

Share this post


Link to post
7 minutes ago, David Heffernan said:

Has anybody looked at the equivalent for msvc, gcc, clang, etc? 

I've looked at several but they just wrap CMPXCHG16B like CAS does.

I think I have an old version of msvc installed somewhere but I think it predates InterlockedCompareExchange128.

Share this post


Link to post
Posted (edited)

There's one thing that bothers me with both InterlockedCompareExchange128 and CAS. They both return a boolean indicating if the swap was performed or not. They correctly do so by examining ZF after calling CMPXCHG16B. If the flag is set they set AL to 1, otherwise clear it to zero:

CMPXCHG16B [...]
SETZ AL

The problem is that the caller examines the value of AX to get the boolean result:

call InterlockedCompareExchange128
test eax,eax
jnz ...

Now what about the value that just happens to be in AH? I would think that we'd need to clear AX before calling SETZ:

CMPXCHG16B [...]
XOR EAX, EAX
SETZ AL

Edit: Whoops. XOR EAX,EAX sets ZF.

Edited by Anders Melander

Share this post


Link to post

Aaaaand there's an additional problem with InterlockedCompareExchange128: It doesn't return any value in the ComparandResult parameter. It's supposed to return the new value on success and the current value on failure.

 

Anyway, here's my final (hopefully) fixed version of InterlockedCompareExchange128 which conforms to the Win32 API documentation:

{$IFDEF Win64}
function InterlockedCompareExchange128(Destination: Pointer; ExchangeHigh, ExchangeLow: Int64; ComparandResult: Pointer): boolean;
// The parameters are in the RCX, RDX, R8 and R9 registers per the MS x64 calling convention:
//   RCX        Destination
//   RDX        ExchangeHigh
//   R8         ExchangeLow
//   R9         ComparandResult
//
// CMPXCHG16B requires the following register setup:
//   RDX:RAX    ComparandResult.High:ComparandResult.Low
//   RCX:RBX    ExchangeHigh:ExchangeLow
// See: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
asm
      .NOFRAME
      .PUSHNV RBX
      MOV   R10,Destination             // RCX
      MOV   RBX,ExchangeLow             // R8
      MOV   RCX,ExchangeHigh            // RDX
      MOV   RDX,[ComparandResult+8]     // R9
      MOV   RAX,[ComparandResult]       // R9
 LOCK CMPXCHG16B [R10]
      MOV   [ComparandResult+8],RDX     // R9
      MOV   [ComparandResult],RAX       // R9
      SETZ  BL
      XOR   EAX, EAX
      MOV   AL,BL
end;
{$ENDIF Win64}

and here's the test I used against it:

type
  T128 = record
    Low: pointer;
    High: pointer;
  end;

procedure TestInterlockedCompareExchange128;
const
  Value: T128 = (low: pointer($0010203040506070); high: pointer($8090A0B0C0D0E0F));
var
  Dest, OldValue, NewValue: T128;
begin
  (*
  ** Test success
  *)
  Dest := Value;
  OldValue := Value;
  NewValue.Low := pointer($ABABABABABABABAB);
  NewValue.High := pointer($1212121212121212);

  // InterlockedCompareExchange128 should return True
  Assert(InterlockedCompareExchange128(@Dest, Int64(NewValue.High), Int64(NewValue.Low), @OldValue), 'Success expected');

  // Dest contains new value
  Assert(Dest.Low = NewValue.Low);
  Assert(Dest.High = NewValue.High);
  // Comparand contains new value = old value
  Assert(OldValue.Low = Value.Low);
  Assert(OldValue.High = Value.High);

  (*
  ** Test failure
  *)
  Dest := Value;
  OldValue.Low := pointer($ABABABABABABABAB);
  OldValue.High := pointer($1212121212121212);
  NewValue.Low := nil;
  NewValue.High := nil;

  // InterlockedCompareExchange128 should return False
  Assert(not InterlockedCompareExchange128(@Dest, Int64(NewValue.High), Int64(NewValue.Low), @OldValue), 'Failure expected');

  // Dest contains original value
  Assert(Dest.Low = Value.Low);
  Assert(Dest.High = Value.High);
  // Comparand contains original value
  Assert(OldValue.Low = Value.Low);
  Assert(OldValue.High = Value.High);
end;

 

With this fix my event stack test passes no matter how hard I hit it.

Share this post


Link to post
Posted (edited)
      SETZ  BL
      XOR   EAX, EAX
      MOV   AL,BL

Don't you just need

      SETZ  AL
      MOVZX EAX, AL

Also, .NOFRAME is wrong here. You need the frame to preserve RBX. At least that's how I understand it.

Edited by David Heffernan
  • Thanks 1

Share this post


Link to post
Posted (edited)

It's astonishing that Emba's InterlockedCompareExchange128 returns a BOOL, i.e. a 4 byte boolean, LongBool. They are presumably trying to copy this one from MS https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128 which is actually implemented as an intrinsic.

 

But look at the MS function's return type.  Yup, it's BOOLEAN which is a single byte.  The Emba version should use Boolean rather than Bool as the return type.  It's a screw-up from top to bottom.

 

Anyone might think that they just write this stuff and don't test it ......

Edited by David Heffernan

Share this post


Link to post
4 minutes ago, David Heffernan said:

Don't you just need 


      SETZ  AL
      MOVZX EAX, AL

Yes. You are right.

 

4 minutes ago, David Heffernan said:

Also, .NOFRAME is wrong here. You need the frame to preserve RBX. At least that's how I understand it.

As I read it .NOFRAME tells the compiler that I'm not calling anything and don't need a stack frame. Of course that means I'm relying on the existing stack frame having room for the single register we're pushing. I guess removing .NOFRAME would be the safe thing to do.

  • Like 1

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×