Jump to content

Recommended Posts

I am reinventing the wheel again (yes, I like doing that).


Can you see anything wrong with these SpinLock functions? (for Win32 only)

///<summary>
/// simple spin lock function, Lk must have been initialized with 0 before first use </summary>
procedure doLock(var _Lk: Integer);
asm
  mov edx, eax
  mov ecx, 1
@Loop:
  mov eax, 0
  lock cmpxchg dword ptr [edx], ecx
  jnz @Loop
end;

///<summary>
/// simple spin unlock function, Lk must have been initialized with 0 before first use </summary>
procedure doUnLock(var _Lk: Integer);
asm
  lock dec dword ptr[eax];
end;

///<summary>
/// simple spin trylock function, Lk must have been initialized with 0 before first use
/// @returns True if the lock could be acquired, False otherwise </summary>
function doTryLock(var _Lk: Integer): Boolean;
asm
  mov edx, eax
  mov ecx, 1
  mov eax, 0
  lock cmpxchg dword ptr [edx], ecx
  setz al
end;

doLock and doUnlock are taken from https://vitaliburkov.wordpress.com/2011/10/28/parallel-programming-with-delphi-part-ii-resolving-race-conditions/

doTryLock is based on doLock and uses the setz opcode to return a boolean rather than looping.

 

I have done tests and they seem to work, but it's always difficult to actually test multi threaded code.

Share this post


Link to post
36 minutes ago, A.M. Hoornweg said:

One problem I see is that on a single-core CPU, doLock will burn up all remaining CPU time of the time slice on contention. 

Yes, thats how a SpinLock is supposed to work, isn't it?

 

That's why I added the doTryLock function which could be used like this:

while not doTryLock(Lck) do
  Sleep(0);
Edited by dummzeuch

Share this post


Link to post
16 minutes ago, David Heffernan said:

Aren't you meant to use the YIELD instruction (if I've remembered it's name right) 

You mean instead of Sleep()?

 

I haven't thought much about that case yet.

Using doTryLock it with sleep which gives a much higher throughput than doLock or a Critical Section in my tests on my computer. But I am not yet sure whether this makes any difference in a real world scenario.

 

Apparently there the is the Pause instruction on Intel CPUs:

https://stackoverflow.com/questions/4725676/how-does-x86-pause-instruction-work-in-spinlock-and-can-it-be-used-in-other-sc

 

I'll have to look into it.

Edited by dummzeuch

Share this post


Link to post

You are writing your own spinlock to get better performance - and then you use Sleep() which is guaranteed to cause a context switch...?

Unless you really know what you're doing then you better of using the synchronization primitives provided by the OS.

 

1 hour ago, dummzeuch said:

Sleep(0);

https://devblogs.microsoft.com/oldnewthing/20051004-09/?p=33923

http://joeduffyblog.com/2006/08/22/priorityinduced-starvation-why-sleep1-is-better-than-sleep0-and-the-windows-balance-set-manager/

 

Apparently this can't be repeated often enough.

  • Like 2

Share this post


Link to post
5 minutes ago, Anders Melander said:

You are writing your own spinlock to get better performance - and then you use Sleep() which is guaranteed to cause a context switch...?

I'm writing my own spinlock to get a feel for what can be done. One of my programs shows some odd behavior when using a critical section so I tried to see whether this changes when switching to a spin lock. It seemed to work at first, but my last test failed miserably so I'm "going home" now.

 

6 minutes ago, Anders Melander said:

Unless you really know what you're doing then you better of using the synchronization primitives provided by the OS.

Quite possible, but since these require API calls, they have a significant overhead in the best case (no spinning required).

 

I knew about Sleep(0) vs. Sleep(1), but wasn't that supposed to be changed in Windows Vista and later? (reading the second link) OK, apparently not. But since the threads in question are running with the same priority, Sleep(0) should work fine.

 

Apart from that: the above code was just an example, how to prevent running a single CPU system at 100%.

Share this post


Link to post
15 hours ago, dummzeuch said:

Yes, thats how a SpinLock is supposed to work, isn't it?

The purpose of a spinlock is to obtain the lock as quickly as possible.   On a single core system, doLock should yield on contention because the thread that owns the lock is waiting to be resumed.  On multi-core systems it can keep spinning.  

 

 

In the initialization of the unit, call GetProcessAffinityMask() and count the bits in the integer to obtain the number of cores available to the program. 

 

 

 

  • Like 2

Share this post


Link to post
17 minutes ago, A.M. Hoornweg said:

In the initialization of the unit, call GetProcessAffinityMask() and count the bits in the integer to obtain the number of cores available to the program.

... or simply assume a multi core system. When was the last time you saw a single core (Windows-) system in the wild?

 

OK, a program could be restricted to a single core by the user or the OS, so there is that. And there are VMs which can also be configured to run on a single core even on a multi core system.

  • Like 1

Share this post


Link to post
19 hours ago, dummzeuch said:

I knew about Sleep(0) vs. Sleep(1), but wasn't that supposed to be changed in Windows Vista and later? (reading the second link) OK, apparently not. But since the threads in question are running with the same priority, Sleep(0) should work fine. 

Yeah, right. Think again, Thomas.

 

I did some tests and it turned out that in these tests(!) on my computer(!) Sleep(1) increased the throughput by more than 50% over Sleep(0), even though all threads were running in the same application and with the same priority:

 

Sleep(0) 5 Writers 5 Readers
Run1:   Writes: 5738 Reads: 4683 [avg. per ms]
Run2:   Writes: 5662 Reads: 4716 [avg. per ms]

 

Sleep(1) 5 Writers 5 Readers
Run1:   Writes: 10444 Reads: 7528 [avg. per ms]
Run2:   Writes:  8587 Reads: 9411 [avg. per ms]

 

Note that it is quite possible that I bungled the tests. They are not as consistent as I would have expected.

Edited by dummzeuch

Share this post


Link to post

FWIW this is my spin lock

 

TSpinLock = class
  private
    FLock: Integer;
  public
    procedure Acquire;
    procedure Release;
  end;

procedure TSpinLock.Acquire;
begin
  Assert(AlignedOn32bitBoundary(@FLock));
  while AtomicCmpExchange(FLock, 1, 0)<>0 do begin
    //signal to CPU that we are spinning; see, for example, http://msdn.microsoft.com/en-us/magazine/cc163726.aspx
    YieldProcessor;
  end;
end;

procedure TSpinLock.Release;
begin
  AtomicExchange(FLock, 0);
end;

 

It's only useful on x86/x64 because Emba haven't provided an implementation of YieldProcessor for other processors!

 

I suspect that professional spin locks use different yield methods depending on how long they have been spinning.

Edited by David Heffernan
  • Like 1
  • Thanks 1

Share this post


Link to post

The equivalent of x86 PAUSE on ARM would be YIELD.

Perhaps there is no ARM version of YieldProcessor because the ASM block only is usable under x86?

 

 

Share this post


Link to post
1 hour ago, David Heffernan said:

FWIW this is my spin lock

 


TSpinLock = class
  private
    FLock: Integer;
  public
    procedure Acquire;
    procedure Release;
  end;

procedure TSpinLock.Acquire;
begin
  Assert(AlignedOn32bitBoundary(@FLock));
  while AtomicCmpExchange(FLock, 1, 0)<>0 do begin
    //signal to CPU that we are spinning; see, for example, http://msdn.microsoft.com/en-us/magazine/cc163726.aspx
    YieldProcessor;
  end;
end;

procedure TSpinLock.Release;
begin
  AtomicExchange(FLock, 0);
end;

 

It's only useful on x86/x64 because Emba haven't provided an implementation of YieldProcessor for other processors!

 

I suspect that professional spin locks use different yield methods depending on how long they have been spinning.

David, can you please provide the code for AlignedOn32bitBoundary() func? 

Thanks.

 

Edited by FloppyDesk

Share this post


Link to post

I read somewhere (when I was searching for 'Yield'), that Yield doesn't exist on Arm processors and there simply translates to NOP. Not sure whether I understood that correctly because at the time I wasn't really interested in Arm.

Edited by dummzeuch

Share this post


Link to post
1 hour ago, David Heffernan said:

FWIW this is my spin lock


procedure TSpinLock.Acquire;
begin
  Assert(AlignedOn32bitBoundary(@FLock));
  while AtomicCmpExchange(FLock, 1, 0)<>0 do begin
    //signal to CPU that we are spinning; see, for example, http://msdn.microsoft.com/en-us/magazine/cc163726.aspx
    YieldProcessor;
  end;
end;

 

Hm, shouldn't the assertion go into a constructor?

Share this post


Link to post
Guest
3 hours ago, dummzeuch said:

I did some tests and it turned out that in these tests(!) on my computer(!) Sleep(1) increased the throughput by more than 50% over Sleep(0),

That can and might happen in different scenarios.

 

and here my two cents:

1) Sleep (1) will ruin the performance in production ! and that is simply because you are testing with Windows timer accuracy is been modified by the Delphi IDE to 1 instead 15, repeating the same test between Sleep(0) and Sleep(1) with all IDE's closed will give you shocking result, so never use Sleep(1) unless it is well thought and needed, it has its merit but definitely not in synchronizing code.  

2) Sleep(0) render the execution to the system in unpredicted behaviour, not in bad way but in uncontrolled way, will try to explain with my English, Sleep(0) will give a hint to the OS that you want to wait , just like that, and the OS is free to choose to render the execution or not and here huge difference will be seen based on the priority of the thread against all other threads running on the OS, my Windows at this moment has +1600 threads, and priority will play huge rule with Sleep(0) and it different from the SwitchToThread.

3) What you really want to use is SwitchToThread https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-switchtothread this will instruct the OS to give the execution time slice to another thread as mentioned in the documentation based on the priority, but will activate any ready to run thread or one already stopped, <didn't use Sleep(x)>, think of SwitchToThread as sort of telling the OS that there is another thread disturbing me please try to give it a kick, and that is the main difference from Sleep(0).

4) Also many think SpinLock is faster than CS, both are slow and fast based on the usage place and the code protected, CS might cause travel to the kernel and cost thousands cycles or even hundreds thousands of them but again not always, while with spinlock it might burn the same count of the cycles without any productive gain, simply because the protected section does have long execution time ( more cycles), as example you protecting section of code that lets say will calculate the sum of 1000 integers, and lets say that algorithm takes 5k cycle, then any thread waiting with spinlock on that code in theory will burn the same amount, i think the idea is clear now , with more threads waiting and burning cycles in place the CPU will be 100% with only one thread calculating integers at a time, even the OS threads will suffer, so use SpinLock only on very short executed code, in other words with critical sections you are running at risk of wasting time and underloading the CPU, with spinlocks your are running at risk of overloading the CPU and also wasting time.

Share this post


Link to post
2 hours ago, dummzeuch said:

Hm, shouldn't the assertion go into a constructor?

It could but I don't have one and I'd have to add one, which i guess is why I put it there. 

Share this post


Link to post
33 minutes ago, Kas Ob. said:

That can and might happen in different scenarios.

re 1: Then I probably need to repeat these tests on a system where the Delphi IDE hasn't run (yet), because I am not sure whether it had been running before the tests or not.

 

re 2/3: There are lots of discussions on the web about Sleep(0) vs. Sleep(1) vs. SwitchToThread and there seems to be no definite answer which one to use in a spin lock. It all depends on the circumstances.

 

re 4: Again, this depends on the circumstances. With a small protected section and low contention a spin lock might be way faster than a critical section. I was trying to find out whether that might solve some issues I had with a particular program. It turned out that it didn't (but again, I need to repeat the test taking into account what I have learned in the meantime), but this had started me into the topic.

Edited by dummzeuch

Share this post


Link to post
Guest
57 minutes ago, dummzeuch said:

re 2/3: There are lots of discussions on the web about Sleep(0) vs. Sleep(1) vs. SwitchToThread and there seems to be no definite answer which one to use in a spin lock. It all depends on the circumstances.

Exactly, too much of discussions, but what i wrote is simplified and shortened description of what i am familiar with, as it is mostly depends on usage case (the length of the code, threads counts, does it involve another nested wait loop, is there I/O operation involved ..etc) 

For me Sleep(1) is out of question in spinlock, simply because critical section is better and faster here, Sleep(0) and SwitchToThread will involve the OS, and i explained a point in the little details on how OS react to them, and also there is PAUSE ( assembly instruction) which was something around 10 cycle, but recently Intel made it around 170 cycle in modern CPU's (modern Xeon too), this change a lot in spinlock and Intel advocates it as spinlock targeted instruction, but yet it will decrease the power consumption on looping without any OS call.

For me, i use few different spinlocks, hand made for specific code, some depend only on PAUSE, some spin in place with breaks, and some does use SwitchToThread, and in few cases i did loop on PAUSE with counter up to 4000 before calling SwitchToThread (assuming 4000*10 cycle should means there is a hug somewhere and the OS is starving), as it made sense .

 

1 hour ago, dummzeuch said:

re 4: Again, this depends on the circumstances. With a small protected section and low contention a spin lock might be way faster than a critical section. I was trying to find out whether that might solve some issues I had with a particular program. It turned out that it didn't (but again, I need to repeat the test taking into account what I have learned in the meantime), but this had started me into the topic.

Yes you are right there, the circumstances, if you can change the design by different approach, which i believe in many cases out-of-the-box thinking will help more, like sometimes you can use spin lock only to copy the data to the stack or newly allocated memory per thread might be faster, here spinlock will act as gate in enter and exit in two places before the processing and after might be possible approach hence will be faster as the lock is protecting small portion of code to copy in and copy out only, even if on copy out the data should be discarded as outdated, but sometimes it is faster, also the memory to receive the data is preallocated before the spinlock protected code, so the spin lock is merely protecting a copy process, but again it is one example.

 

1 hour ago, dummzeuch said:

re 1: Then I probably need to repeat these tests on a system where the Delphi IDE hasn't run (yet), because I am not sure whether it had been running before the tests or not.

The best way to benchmark for threading is by using complete blanc Hyper-V Windows without the shell, you can copy the binaries using PowerShell and execute them, it is time consuming process and require spare device or a dedicated/hosted one, just in case you want to do it at best accuracy available.

Share this post


Link to post
8 hours ago, dummzeuch said:

I read somewhere (when I was searching for 'Yield'), that Yield doesn't exist on Arm processors and there simply translates to NOP. Not sure whether I understood that correctly because at the time I wasn't really interested in Arm.

https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/yield

It was added in V6. 

Share this post


Link to post
On 2/24/2021 at 4:37 PM, FloppyDesk said:

David, can you please provide the code for AlignedOn32bitBoundary() func?

Since I just wrote one myself, here you go:

 

program AlignedOn32BitBoundaryTest;

{$APPTYPE CONSOLE}

uses
  SysUtils;
  
{$IF not declared(UIntPtr)}
type
{$IF SizeOf(Pointer)=4}
  UIntPtr = UInt32;
{$ELSE}
  UIntPtr = UInt64;
{$IFEND}
{$IFEND}

function IsAlignedOn32bitBoundary(_Ptr: Pointer): Boolean;
begin
  Result := ((UIntPtr(_Ptr) and $3) = 0);
end;

procedure CheckAlignedOn32BitBoundary(const _s: string; _Ptr: Pointer);
begin
  if IsAlignedOn32bitBoundary(_Ptr) then
    WriteLn(_s + ' is aligned on a 32 bit boundary')
  else
    WriteLn(_s + ' is not aligned on a 32 bit boundary')
end;

type
  TSomeRect = packed record
    SomeByte: byte;
    SomeCardinal: Cardinal;
  end;

var
  SomeRect: TSomeRect;
begin
  CheckAlignedOn32BitBoundary('SomeByte', @SomeRect.SomeByte);
  CheckAlignedOn32BitBoundary('SomeCardinal', @SomeRect.SomeCardinal);
  WriteLn('press enter');
  ReadLn;
end.

 

Share this post


Link to post
On 2/24/2021 at 10:38 AM, dummzeuch said:

I read somewhere (when I was searching for 'Yield'), that Yield doesn't exist on Arm processors and there simply translates to NOP. Not sure whether I understood that correctly because at the time I wasn't really interested in Arm.

 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×