dummzeuch

TCriticalSection and cache line size


In 2011 Eric Grange blogged about a problem with the then current implementation of TCriticalSection and TMonitor:

 

Quote

 

TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole can even end up performing worse than if your threads had been serialized.

 

This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you’ll have cache conflicts aplenty between the cores running the threads.

How severe can that be? Well, it depends on how many cores you have, but the more cores you have, the more severe it can get. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it’s related to dynamic memory allocation.
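As a quick illustration (not from Eric's post): a tiny console program that checks whether two consecutively created instances happen to fall into the same cache line slot. The 64 here is just an assumed line size, and the check is deliberately crude.

program CheckCacheLineSharing;
{$APPTYPE CONSOLE}

uses
  SyncObjs;

var
  cs1, cs2: TCriticalSection;
begin
  cs1 := TCriticalSection.Create;
  cs2 := TCriticalSection.Create;
  try
    // Crude check: if both start addresses fall into the same 64-byte slot,
    // the two instances certainly share a cache line.
    if NativeUInt(Pointer(cs1)) div 64 = NativeUInt(Pointer(cs2)) div 64 then
      Writeln('cs1 and cs2 share a cache line')
    else
      Writeln('cs1 and cs2 start on different cache lines');
  finally
    cs2.Free;
    cs1.Free;
  end;
end.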

 

According to Allen Bauer this was fixed for TMonitor in Delphi XE2:

Quote

TMonitor in XE2 will now always allocate a block the cache line size or the size of the TMonitor, whichever is greater.

I just looked at the code, and it has indeed been done. But apparently nothing has changed in TCriticalSection.

 

Is this still relevant?

 

If yes, is there a way to change the allocated memory of an object instance dynamically?

I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill.

13 minutes ago, David Heffernan said:

It's pretty unlikely to encounter this issue in real code. 

Unfortunately, unlikely does not mean impossible. And if there is an easy fix, why not implement it? (I'm on my own time now, so I don't have to account to my employer for the time spent.)

 

Eric Grange proposed an array of bytes that increases the size of a TCriticalSection instance to 128 bytes. That probably still works for all processors currently available (on my current computer the cache line size is apparently 64 bytes), but CPUs evolve.
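For reference, Eric's workaround was (roughly) a padded subclass along these lines; the padding size is only meant to push the instance size towards 128 bytes so that two instances can no longer share a cache line:

uses
  SyncObjs;

type
  TFixedCriticalSection = class(TCriticalSection)
  private
    FDummy: array[0..95] of Byte;  // never used, only pads the instance
  end;

Instead of hard-coding a size, the cache line size can also be queried at runtime.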

 

This should do the trick:

unit u_dzCriticalSection;

interface

uses
  Windows,
  SyncObjs;

type
  TdzCriticalSection = class(TCriticalSection)
  public
    class function NewInstance: TObject; override;
  end;

implementation

function GetCacheLineSize: Integer;
var
  ProcInfo, CurInfo: PSystemLogicalProcessorInformation;
  Len: DWORD;
begin
  Len := 0;
  if (GetProcAddress(GetModuleHandle(kernel32), 'GetLogicalProcessorInformation') <> nil) and
    not GetLogicalProcessorInformation(nil, Len) and (GetLastError = ERROR_INSUFFICIENT_BUFFER) then begin
    GetMem(ProcInfo, Len);
    try
      GetLogicalProcessorInformation(ProcInfo, Len);
      CurInfo := ProcInfo;
      while Len > 0 do begin
        if (CurInfo.Relationship = RelationCache) and (CurInfo.Cache.Level = 1) then begin
          Result := CurInfo.Cache.LineSize;
          Exit;
        end;
        Inc(CurInfo);
        Dec(Len, SizeOf(CurInfo^));
      end;
    finally
      FreeMem(ProcInfo);
    end;
  end;
  Result := 64;
end;

var
  CacheLineSize: Integer;

{ TdzCriticalSection }

class function TdzCriticalSection.NewInstance: TObject;
// see
// http://delphitools.info/2011/11/30/fixing-tcriticalsection/
// for an explanation why this could speed up execution on multi core systems
var
  InstSize: Integer;
begin
  InstSize := InstanceSize;
  if InstSize < CacheLineSize then
    InstSize := CacheLineSize;
  Result := InitInstance(GetMemory(InstSize));
end;

initialization
  CacheLineSize := GetCacheLineSize;
end.
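Usage stays exactly the same as with a plain TCriticalSection; only the class name changes, e.g.:

uses
  u_dzCriticalSection;

procedure Demo;
var
  cs: TdzCriticalSection;
begin
  cs := TdzCriticalSection.Create;
  try
    cs.Enter;
    try
      // protected code goes here
    finally
      cs.Leave;
    end;
  finally
    cs.Free;
  end;
end;

As far as I can tell no FreeInstance override is needed, because TObject.FreeInstance releases the instance with FreeMem, which pairs with the GetMemory call in NewInstance.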

 

1 hour ago, dummzeuch said:

Is this still relevant?

 

If yes, is there a way to change the allocated memory of an object instance dynamically?

I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill.

I would use TRTLCriticalSection, which is a record defined in Windows.pas:

  TRTLCriticalSection = record
    DebugInfo: PRTLCriticalSectionDebug;
    LockCount: Longint;
    RecursionCount: Longint;
    OwningThread: THandle;
    LockSemaphore: THandle;
    Reserved: ULONG_PTR;
  end;

A record helper is provided in SyncObjs.pas:

  TCriticalSectionHelper = record helper for TRTLCriticalSection
    procedure Initialize; inline;
    procedure Destroy; inline;
    procedure Free; inline;
    procedure Enter; inline;
    procedure Leave; inline;
    function TryEnter: Boolean; inline;
  end;

According to Eric, it is much less prone to this issue (I don't have the reference, though), and it does not incur the overhead of object creation/destruction.
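A minimal sketch of what embedding the record directly into the class that needs it might look like (TCounter is just a made-up example; the helper methods map to the corresponding InitializeCriticalSection/EnterCriticalSection/... calls):

uses
  Windows,
  SyncObjs;

type
  TCounter = class
  private
    FLock: TRTLCriticalSection;
    FValue: Integer;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Increment;
  end;

constructor TCounter.Create;
begin
  inherited Create;
  FLock.Initialize;
end;

destructor TCounter.Destroy;
begin
  FLock.Free;  // DeleteCriticalSection
  inherited Destroy;
end;

procedure TCounter.Increment;
begin
  FLock.Enter;
  try
    Inc(FValue);
  finally
    FLock.Leave;
  end;
end;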

 

The SpinLock is another option, and so is the Windows slim reader/writer (SRW) lock. And of course you can use TMonitor, which nowadays is comparable in performance to a critical section (SpinLock and SRW locks are faster).
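A minimal TMonitor sketch for comparison (TJobQueue and its fields are made-up names); any object instance can act as the lock, so no separate lock object is allocated at all:

uses
  Generics.Collections;

type
  TJobQueue = class
  private
    FItems: TList<string>;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Push(const Item: string);
  end;

constructor TJobQueue.Create;
begin
  inherited Create;
  FItems := TList<string>.Create;
end;

destructor TJobQueue.Destroy;
begin
  FItems.Free;
  inherited Destroy;
end;

procedure TJobQueue.Push(const Item: string);
begin
  // The queue simply locks on itself.
  TMonitor.Enter(Self);
  try
    FItems.Add(Item);
  finally
    TMonitor.Exit(Self);
  end;
end;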

 

For anyone interested, here is a record wrapper for SRW locks:

  (*
    Multiple Read Exclusive Write lock based on Windows slim reader/writer
    (SRW) Locks. Can also be used instead of a critical section.
    Limitations: non-reentrant, not "fair"
  *)
  TSlimMREWSync = record
  private
    Lock: TRTLSRWLock;
  public
    procedure Create;
    procedure BeginRead;
    procedure EndRead;
    procedure BeginWrite;
    procedure EndWrite;
    function TryRead: Boolean;
    function TryWrite: Boolean;
  end;
  
  { TSlimMREWSync }

procedure TSlimMREWSync.BeginRead;
begin
  AcquireSRWLockShared(Lock);
end;

procedure TSlimMREWSync.BeginWrite;
begin
  AcquireSRWLockExclusive(Lock);
end;

procedure TSlimMREWSync.Create;
begin
  InitializeSRWLock(Lock);
end;

procedure TSlimMREWSync.EndRead;
begin
  ReleaseSRWLockShared(Lock);
end;

procedure TSlimMREWSync.EndWrite;
begin
  ReleaseSRWLockExclusive(Lock);
end;

function TSlimMREWSync.TryRead: Boolean;
begin
  Result := TryAcquireSRWLockShared(Lock);
end;

function TSlimMREWSync.TryWrite: Boolean;
begin
  Result := TryAcquireSRWLockExclusive(Lock);
end;
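Usage could look like this (the TStringList-based settings are just a made-up example of shared data; remember the caveat in the comment above, the lock is non-reentrant):

uses
  Classes;

var
  Config: TStringList;
  ConfigLock: TSlimMREWSync;

function GetSetting(const Name: string): string;
begin
  ConfigLock.BeginRead;
  try
    Result := Config.Values[Name];
  finally
    ConfigLock.EndRead;
  end;
end;

procedure SetSetting(const Name, Value: string);
begin
  ConfigLock.BeginWrite;
  try
    Config.Values[Name] := Value;
  finally
    ConfigLock.EndWrite;
  end;
end;

// during startup:
//   Config := TStringList.Create;
//   ConfigLock.Create;  // calls InitializeSRWLock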

 

 

 

 

 

 

 

Guest
21 minutes ago, Arnaud Bouchez said:

The cache line performance issue is still relevant.

I couldn't agree more.

 

One thing though: the fix proposed by Eric pins the size to 128 bytes. To my knowledge only Itanium processors have that cache line size, so that part is relevant only for FreePascal (Delphi does not target Itanium). On the other hand, the IBM Z series has a cache line size of 256 bytes, and FreePascal does support IBM processors, as can be seen here:

ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_naspa_2016_04_z_sw_performance_optimization.pdf    page 8

https://wiki.lazarus.freepascal.org/ZSeries

 

So maybe it is better to stick to 64 bytes for Delphi and/or expand it to 256 bytes for FPC.

 

Bonus fact: in that IBM zSeries PDF, look at page 10 for the cache miss penalty table!

 


If you really worry about performance in this area then don't use TCriticalSection but embed the TRTLCriticalSection (or the appropriate thing for POSIX) into the object where you need it - that way you also eliminate an indirection. And make sure that the memory you allocate from the memory manager is cache-line aligned (it won't be by default).
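A sketch of that idea (TWorker and the field names are made up, 64 bytes is an assumed line size): surrounding the embedded record with one cache line of padding on each side guarantees that nothing else shares a line with the lock, even though the memory manager does not hand out cache-line-aligned blocks.

uses
  Windows;

type
  TWorker = class
  private
    FPadBefore: array[0..63] of Byte;  // keeps FLock off the preceding cache line
    FLock: TRTLCriticalSection;        // Initialize/Delete in constructor/destructor
    FPadAfter: array[0..63] of Byte;   // ...and off the following one
    FCounter: Integer;
  end;

Actual cache-line-aligned allocation would take more work (over-allocate and keep the original pointer around so it can be freed), which is why the padding is the simpler sketch here.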

Edited by Stefan Glienke

13 hours ago, David Heffernan said:

It's pretty unlikely to encounter this issue in real code. 

It happened to Eric at least with DWS. He actually monitored it before fixing it.
Of course, he was dealing with a language compiler and execution stack, but I guess a lot of other kinds of projects may have very tiny objects - if the SOLID principles are properly followed.

12 hours ago, Arnaud Bouchez said:

The cache line performance issue is still relevant. And it could happen in practice, e.g. when initialization in a raw

"In a raw"? Did yo mean "in a row", as in "initialize all the objects in one go, so many are positioned sequentially in the same memory block" ?

 

Or is there a meaning of "raw" that I don't know?

Guest

It is a very rare case, but a nasty one when it happens, and it can easily be avoided. Even if you are not initializing them in a row, there is no guarantee that the memory manager keeps them apart after hours or days. I witnessed it a few times: it shows up as a sudden drop in throughput that then returns to normal. After some time monitoring internet traffic and CPU usage and comparing that with the logs, tracking it down led me to a critical section, and from the addresses of those objects in the logs I decided never to leave them closer together than 64 bytes.

 

It is harder to find when a critical section created by a 3rd-party library conflicts with one of your own.

2 hours ago, Arnaud Bouchez said:

It happened to Eric at least with DWS. He actually monitored it before fixing it.
Of course, he was dealing with a language compiler and execution stack, but I guess a lot of other kinds of projects may have very tiny objects - if the SOLID principles are properly followed.

I'm not denying the existence of the problem.  I'm just putting up a flag to say that I believe that in a great many codebases you are very unlikely to encounter it.

 

So, don't get too stressed about this, and perhaps don't worry about trying to fix the problem unless you know for sure that you are affected by it.

