dummzeuch 1505 Posted July 26, 2020 In 2011 Eric Grange blogged about a problem with the then current implementation of TCriticalSection and TMonitor: Quote TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole can even end up performing worse than if your threads had been serialized. This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you’ll have cache conflicts aplenty between the cores running the threads. How severe can that be? Well, it depends on how many cores you have, but the more cores you have, the more severe it can get. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it’s related to dynamic memory allocation. According to Allen Bauer this was fixed for TMonitor in Delphi XE2: Quote TMonitor in XE2 will now always allocate a block the cache line size or the size of the TMonitor, whichever is greater. I just looked at the code, it indeed has been done. But apparently nothing has been changed in TCriticalSection. Is this still relevant? If yes, is there a way to change the allocated memory of an object instance dynamically? I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill. Share this post Link to post
David Heffernan 2345 Posted July 26, 2020 It's pretty unlikely to encounter this issue in real code. Share this post Link to post
dummzeuch 1505 Posted July 26, 2020 13 minutes ago, David Heffernan said: It's pretty unlikely to encounter this issue in real code. Unfortunately unlikely does not mean impossible. And if there is an easy fix, why not implement it? (I'm on my own time now, so I don't have to account for the time spent to my employer.) Eric Grange proposed adding an array of bytes that pads a TCriticalSection instance to 128 bytes. This probably still works for all processors currently available (on my current computer the cache line is apparently 64 bytes), but CPUs evolve. This should do the trick:

unit u_dzCriticalSection;

interface

uses
  Windows,
  SyncObjs;

type
  TdzCriticalSection = class(TCriticalSection)
  public
    class function NewInstance: TObject; override;
  end;

implementation

function GetCacheLineSize: Integer;
var
  ProcInfo, CurInfo: PSystemLogicalProcessorInformation;
  Len: DWORD;
begin
  Len := 0;
  if (GetProcAddress(GetModuleHandle(kernel32), 'GetLogicalProcessorInformation') <> nil)
    and not GetLogicalProcessorInformation(nil, Len)
    and (GetLastError = ERROR_INSUFFICIENT_BUFFER) then begin
    GetMem(ProcInfo, Len);
    try
      GetLogicalProcessorInformation(ProcInfo, Len);
      CurInfo := ProcInfo;
      while Len > 0 do begin
        if (CurInfo.Relationship = RelationCache) and (CurInfo.Cache.Level = 1) then begin
          Result := CurInfo.Cache.LineSize;
          Exit;
        end;
        Inc(CurInfo);
        Dec(Len, SizeOf(CurInfo^));
      end;
    finally
      FreeMem(ProcInfo);
    end;
  end;
  // fall back to a common value if the API is unavailable or reports no L1 cache
  Result := 64;
end;

var
  CacheLineSize: Integer;

{ TdzCriticalSection }

class function TdzCriticalSection.NewInstance: TObject;
// see
// http://delphitools.info/2011/11/30/fixing-tcriticalsection/
// for an explanation why this could speed up execution on multi core systems
var
  InstSize: Integer;
begin
  InstSize := InstanceSize;
  if InstSize < CacheLineSize then
    InstSize := CacheLineSize;
  Result := InitInstance(GetMemory(InstSize));
end;

initialization
  CacheLineSize := GetCacheLineSize;
end.

Share this post Link to post
pyscripter 689 Posted July 26, 2020 1 hour ago, dummzeuch said: In 2011 Eric Grange blogged about a problem with the then current implementation of TCriticalSection and TMonitor: According to Allen Bauer this was fixed for TMonitor in Delphi XE2: I just looked at the code, it indeed has been done. But apparently nothing has been changed in TCriticalSection. Is this still relevant? If yes, is there a way to change the allocated memory of an object instance dynamically? I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill. I would use TRTLCriticalSection, which is a record defined in Windows.pas:

TRTLCriticalSection = record
  DebugInfo: PRTLCriticalSectionDebug;
  LockCount: Longint;
  RecursionCount: Longint;
  OwningThread: THandle;
  LockSemaphore: THandle;
  Reserved: ULONG_PTR;
end;

A record helper is provided in SyncObjs.pas:

TCriticalSectionHelper = record helper for TRTLCriticalSection
  procedure Initialize; inline;
  procedure Destroy; inline;
  procedure Free; inline;
  procedure Enter; inline;
  procedure Leave; inline;
  function TryEnter: Boolean; inline;
end;

According to Eric it is much less prone to this issue (I don't have the reference though) and does not incur the overhead of object creation/destruction. The SpinLock is another option. So is the Windows slim reader/writer (SRW) lock. And of course you can use TMonitor, which nowadays is comparable in performance to the critical section (SpinLock and SRW locks are faster). For anyone interested, here is a record wrapper for SRW locks:

(* Multiple Read Exclusive Write lock based on Windows slim reader/writer
   (SRW) locks. Can also be used instead of a critical section.
   Limitations: non-reentrant, not "fair" *)
TSlimMREWSync = record
private
  Lock: TRTLSRWLock;
public
  procedure Create;
  procedure BeginRead;
  procedure EndRead;
  procedure BeginWrite;
  procedure EndWrite;
  function TryRead: Boolean;
  function TryWrite: Boolean;
end;

{ TSlimMREWSync }

procedure TSlimMREWSync.BeginRead;
begin
  AcquireSRWLockShared(Lock);
end;

procedure TSlimMREWSync.BeginWrite;
begin
  AcquireSRWLockExclusive(Lock);
end;

procedure TSlimMREWSync.Create;
begin
  InitializeSRWLock(Lock);
end;

procedure TSlimMREWSync.EndRead;
begin
  ReleaseSRWLockShared(Lock);
end;

procedure TSlimMREWSync.EndWrite;
begin
  ReleaseSRWLockExclusive(Lock);
end;

function TSlimMREWSync.TryRead: Boolean;
begin
  Result := TryAcquireSRWLockShared(Lock);
end;

function TSlimMREWSync.TryWrite: Boolean;
begin
  Result := TryAcquireSRWLockExclusive(Lock);
end;

1 Share this post Link to post
Arnaud Bouchez 407 Posted July 26, 2020 The cache line performance issue is still relevant. And it could happen in practice, e.g. when initialization in a raw. In mORMot, we used the filler to store some variants; see https://synopse.info/files/html/Synopse mORMot Framework SAD 1.18.html#TITL_184 Share this post Link to post
Guest Posted July 26, 2020 21 minutes ago, Arnaud Bouchez said: The cache line performance issue is still relevant. I can't agree more. One thing though: the fix Eric proposed pads the size to 128 bytes. To my knowledge only Itanium processors have that cache line size, so that part is only relevant for FreePascal. On the other hand, for the IBM Z series the cache line size is 256 bytes, and FreePascal does support IBM processors, as I can see here: ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_naspa_2016_04_z_sw_performance_optimization.pdf page 8 https://wiki.lazarus.freepascal.org/ZSeries So maybe it is better to stick to 64 bytes for Delphi and/or expand it to 256 for FPC. Bonus fact: in that IBM zSeries PDF, look at page 10 to see the cache-miss penalty table! Share this post Link to post
Stefan Glienke 2002 Posted July 26, 2020 (edited) If you really worry about performance in this area then don't use TCriticalSection but embed the TRTLCriticalSection (or the appropriate thing for POSIX) into the object where you need it - that way you also eliminated an indirection. And make sure that your allocated memory from the memory manager is aligned cacheline friendly (it won't by default) Edited July 26, 2020 by Stefan Glienke 1 1 Share this post Link to post
Arnaud Bouchez 407 Posted July 27, 2020 13 hours ago, David Heffernan said: It's pretty unlikely to encounter this issue in real code. It occurred to Eric at least with DWS. He actually monitored it before fixing it. Of course, he dealt with a language compiler and execution stack, but I guess a lot of other kinds of projects may have very tiny objects - if the SOLID principles are properly followed. Share this post Link to post
dummzeuch 1505 Posted July 27, 2020 12 hours ago, Arnaud Bouchez said: The cache line performance issue is still relevant. And it could happen in practice, e.g. when initialization in a raw "In a raw"? Did you mean "in a row", as in "initialize all the objects in one go, so many are positioned sequentially in the same memory block"? Or is there a meaning of "raw" that I don't know? Share this post Link to post
Arnaud Bouchez 407 Posted July 27, 2020 You are right - initialize all objects in one go. My mistake. 🙂 Share this post Link to post
Guest Posted July 27, 2020 It is a very rare case, but a nasty one when it happens, and it can easily be avoided. Even if you are not initializing the objects in a row, there is no guarantee that the memory manager keeps them apart after hours or days. I witnessed it a few times: it shows as a sudden drop in throughput that later returns to normal. After some time monitoring internet traffic and CPU load and comparing that with the logs, tracking it down led me to a critical section. From the addresses of those objects in the logs, I decided never to leave them closer together than 64 bytes. It is harder to find when a critical section created by a third-party library conflicts with one of your own. Share this post Link to post
David Heffernan 2345 Posted July 27, 2020 2 hours ago, Arnaud Bouchez said: It occurred to Eric at least with DWS. He actually monitored it before fixing it. Of course, he dealt with a language compiler and execution stack, but I guess a lot of other kind of projects may have very tiny objects - if the SOLID principles are properly followed. I'm not denying the existence of the problem. I'm just putting up a flag to say that I believe that in a great many codebases you are very unlikely to encounter it. So, don't get too stressed about this, and perhaps don't worry about trying to fix the problem unless you know for sure that you are affected by it. Share this post Link to post