dummzeuch 1505 Posted July 26, 2020 In 2011 Eric Grange blogged about a problem with the then current implementation of TCriticalSection and TMonitor: Quote TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole can even end up performing worse than if your threads had been serialized. This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you’ll have cache conflicts aplenty between the cores running the threads. How severe can that be? Well, it depends on how many cores you have, but the more cores you have, the more severe it can get. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it’s related to dynamic memory allocation. According to Allen Bauer this was fixed for TMonitor in Delphi XE2: Quote TMonitor in XE2 will now always allocate a block the cache line size or the size of the TMonitor, whichever is greater. I just looked at the code, it indeed has been done. But apparently nothing has been changed in TCriticalSection. Is this still relevant? If yes, is there a way to change the allocated memory of an object instance dynamically? I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill. Share this post Link to post
David Heffernan 2345 Posted July 26, 2020 It's pretty unlikely to encounter this issue in real code. Share this post Link to post
dummzeuch 1505 Posted July 26, 2020 13 minutes ago, David Heffernan said: It's pretty unlikely to encounter this issue in real code. Unfortunately unlikely does not mean impossible. And if there is an easy fix, why not implement it? (I'm on my own time now, so I don't have to account for the time spent to my employer.) Eric Grange proposed adding an array of bytes that pads a TCriticalSection instance to 128 bytes. This probably still works for all processors currently available (on my current computer the cache line is apparently 64 bytes), but CPUs evolve. This should do the trick:

unit u_dzCriticalSection;

interface

uses
  Windows,
  SyncObjs;

type
  TdzCriticalSection = class(TCriticalSection)
  public
    class function NewInstance: TObject; override;
  end;

implementation

function GetCacheLineSize: Integer;
var
  ProcInfo, CurInfo: PSystemLogicalProcessorInformation;
  Len: DWORD;
begin
  Len := 0;
  if (GetProcAddress(GetModuleHandle(kernel32), 'GetLogicalProcessorInformation') <> nil)
    and not GetLogicalProcessorInformation(nil, Len)
    and (GetLastError = ERROR_INSUFFICIENT_BUFFER) then begin
    GetMem(ProcInfo, Len);
    try
      GetLogicalProcessorInformation(ProcInfo, Len);
      CurInfo := ProcInfo;
      while Len > 0 do begin
        if (CurInfo.Relationship = RelationCache) and (CurInfo.Cache.Level = 1) then begin
          Result := CurInfo.Cache.LineSize;
          Exit;
        end;
        Inc(CurInfo);
        Dec(Len, SizeOf(CurInfo^));
      end;
    finally
      FreeMem(ProcInfo);
    end;
  end;
  // fall back to a common value if the API is unavailable or reports no L1 cache
  Result := 64;
end;

var
  CacheLineSize: Integer;

{ TdzCriticalSection }

class function TdzCriticalSection.NewInstance: TObject;
// see
// http://delphitools.info/2011/11/30/fixing-tcriticalsection/
// for an explanation why this could speed up execution on multi core systems
var
  InstSize: Integer;
begin
  InstSize := InstanceSize;
  if InstSize < CacheLineSize then
    InstSize := CacheLineSize;
  Result := InitInstance(GetMemory(InstSize));
end;

initialization
  CacheLineSize := GetCacheLineSize;
end.

Share this post Link to post
pyscripter 689 Posted July 26, 2020 1 hour ago, dummzeuch said: In 2011 Eric Grange blogged about a problem with the then current implementation of TCriticalSection and TMonitor: According to Allen Bauer this was fixed for TMonitor in Delphi XE2: I just looked at the code, it indeed has been done. But apparently nothing has been changed in TCriticalSection. Is this still relevant? If yes, is there a way to change the allocated memory of an object instance dynamically? I just found that David Millington blogged about Custom object memory allocation in Delphi which seems to fit the bill. I would use TRTLCriticalSection, which is a record defined in Windows.pas:

TRTLCriticalSection = record
  DebugInfo: PRTLCriticalSectionDebug;
  LockCount: Longint;
  RecursionCount: Longint;
  OwningThread: THandle;
  LockSemaphore: THandle;
  Reserved: ULONG_PTR;
end;

A record helper is provided in SyncObjs.pas:

TCriticalSectionHelper = record helper for TRTLCriticalSection
  procedure Initialize; inline;
  procedure Destroy; inline;
  procedure Free; inline;
  procedure Enter; inline;
  procedure Leave; inline;
  function TryEnter: Boolean; inline;
end;

According to Eric it is much less prone to this issue (I don't have the reference though) and does not incur the overhead of object creation/destruction. The SpinLock is another option. So is the Windows slim reader/writer (SRW) lock. And of course you can use TMonitor, which nowadays is comparable in performance to the critical section (SpinLock and SRW locks are faster). For anyone interested, here is a record wrapper for SRW locks:

(* Multiple Read Exclusive Write lock based on Windows slim reader/writer
   (SRW) locks. Can also be used instead of a critical section.
   Limitations: non-reentrant, not "fair" *)
TSlimMREWSync = record
private
  Lock: TRTLSRWLock;
public
  procedure Create;
  procedure BeginRead;
  procedure EndRead;
  procedure BeginWrite;
  procedure EndWrite;
  function TryRead: Boolean;
  function TryWrite: Boolean;
end;

{ TSlimMREWSync }

procedure TSlimMREWSync.BeginRead;
begin
  AcquireSRWLockShared(Lock);
end;

procedure TSlimMREWSync.BeginWrite;
begin
  AcquireSRWLockExclusive(Lock);
end;

procedure TSlimMREWSync.Create;
begin
  InitializeSRWLock(Lock);
end;

procedure TSlimMREWSync.EndRead;
begin
  ReleaseSRWLockShared(Lock);
end;

procedure TSlimMREWSync.EndWrite;
begin
  ReleaseSRWLockExclusive(Lock);
end;

function TSlimMREWSync.TryRead: Boolean;
begin
  Result := TryAcquireSRWLockShared(Lock);
end;

function TSlimMREWSync.TryWrite: Boolean;
begin
  Result := TryAcquireSRWLockExclusive(Lock);
end;

1 Share this post Link to post
Arnaud Bouchez 407 Posted July 26, 2020 The cache line performance issue is still relevant. And it could happen in practice, e.g. when initialization in a raw. In mORMot, we used the filler to store some variants; see https://synopse.info/files/html/Synopse mORMot Framework SAD 1.18.html#TITL_184 Share this post Link to post
Guest Posted July 26, 2020 21 minutes ago, Arnaud Bouchez said: The cache line performance issue is still relevant. I can't agree more. One thing though: the fix Eric proposed pads the size to 128 bytes. To my knowledge only Itanium processors have that cache line size, so that part is only relevant for FreePascal. On the other hand, for the IBM Z series the cache line size is 256 bytes, and FreePascal does support IBM processors, as I can see here: ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_naspa_2016_04_z_sw_performance_optimization.pdf page 8 https://wiki.lazarus.freepascal.org/ZSeries So maybe it is better to stick to 64 bytes for Delphi and/or expand it to 256 for FPC. Bonus fact: in that IBM zSeries PDF, look at page 10 to see the cache-miss penalty table! Share this post Link to post
Stefan Glienke 2002 Posted July 26, 2020 (edited) If you really worry about performance in this area then don't use TCriticalSection but embed the TRTLCriticalSection (or the appropriate thing for POSIX) into the object where you need it - that way you also eliminated an indirection. And make sure that your allocated memory from the memory manager is aligned cacheline friendly (it won't by default) Edited July 26, 2020 by Stefan Glienke 1 1 Share this post Link to post
Arnaud Bouchez 407 Posted July 27, 2020 13 hours ago, David Heffernan said: It's pretty unlikely to encounter this issue in real code. It occurred to Eric at least with DWS. He actually monitored it before fixing it. Of course, he dealt with a language compiler and execution stack, but I guess a lot of other kinds of projects may have very tiny objects - if the SOLID principles are properly followed. Share this post Link to post
dummzeuch 1505 Posted July 27, 2020 12 hours ago, Arnaud Bouchez said: The cache line performance issue is still relevant. And it could happen in practice, e.g. when initialization in a raw "In a raw"? Did you mean "in a row", as in "initialize all the objects in one go, so many are positioned sequentially in the same memory block"? Or is there a meaning of "raw" that I don't know? Share this post Link to post
Arnaud Bouchez 407 Posted July 27, 2020 You are right - initialize all objects in one go. My mistake. 🙂 Share this post Link to post
Guest Posted July 27, 2020 It is a very rare case, but a nasty one when it happens, and it can easily be avoided. Even if you are not initializing the objects in a row, there is no guarantee that the memory manager keeps them apart after hours or days. I witnessed it a few times: it shows as a sudden drop in throughput that later returns to normal. After some time monitoring internet traffic and CPU load and comparing that with the logs, tracking it down led me to a critical section. From the addresses of those objects in the logs, I decided never to leave them closer together than 64 bytes. It is harder to find when a critical section created by a third-party library conflicts with one of your own. Share this post Link to post
David Heffernan 2345 Posted July 27, 2020 2 hours ago, Arnaud Bouchez said: It occurred to Eric at least with DWS. He actually monitored it before fixing it. Of course, he dealt with a language compiler and execution stack, but I guess a lot of other kind of projects may have very tiny objects - if the SOLID principles are properly followed. I'm not denying the existence of the problem. I'm just putting up a flag to say that I believe that in a great many codebases you are very unlikely to encounter it. So, don't get too stressed about this, and perhaps don't worry about trying to fix the problem unless you know for sure that you are affected by it. Share this post Link to post