RDP1974

64-bit RTL patches with Intel oneAPI and TBB


hi,

I have built the libraries from the latest sources of https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

I had zero warnings or problems during compilation.

Here are the files: https://github.com/RDP1974/Delphi64RTL

Note that the TBB allocator is very good at detecting memory errors such as double frees or overruns.

In multithreaded apps such as web applications you will get a large performance improvement.

Btw. the Intel license is totally permissive, free to distribute and deploy everywhere.

Please let me know if you discover any errors.

 

Quick test with a WebBroker Indy app producing a plain HTML page:

 

program Project1;
uses
  RDPMM64,
  Vcl.Forms,
  Web.WebReq,
...

 

procedure TWebModule1.WebModule1DefaultHandlerAction(Sender: TObject;
  Request: TWebRequest; Response: TWebResponse; var Handled: Boolean);
begin
  Response.Content :=
    '<html>' +
    '<head><title>Web Server Application</title></head>' +
    '<body>Web Server Application '+FormatDateTime('yyyymmdd.hhnnss',Now)+'</body>' +
    '</html>';

end;

 

Hyper-V VM: i9 CPU, Windows Server 2022, 16 cores

Host: i9 CPU, Windows 10 Pro

 

Apache bench ab -n 1000 -c 100 -k -r http://localhost:8080/

 

Delphi 11 (default MM):

Concurrency Level:      100
Time taken for tests:   1.845 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes

Requests per second:    542.04 [#/sec] (mean)
Time per request:       184.488 [ms] (mean)
Time per request:       1.845 [ms] (mean, across all concurrent requests)
Transfer rate:          132.33 [Kbytes/sec] received

 

Delphi 11 (with Intel libs):

Concurrency Level:      100
Time taken for tests:   0.297 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes
Requests per second:    3364.56 [#/sec] (mean)
Time per request:       29.722 [ms] (mean)
Time per request:       0.297 [ms] (mean, across all concurrent requests)
Transfer rate:          821.42 [Kbytes/sec] received

Edited by RDP1974

First of all, there is no 10.5 - I assume you meant 11.

 

That's like the mother of pointless benchmarks.

We know that the default MM is prone to problems with heavy multithreading. Test with FastMM5, because that one has addressed the issue.
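
For reference, switching the first post's benchmark to FastMM5 is just a matter of listing it as the very first unit of the project file (a minimal sketch mirroring the project file from the first post, assuming FastMM5.pas from its repository is on the search path):

program Project1;
uses
  FastMM5,   // must be the first unit so the memory manager is installed before any allocation
  Vcl.Forms,
  Web.WebReq,
...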

As for using TBB, I think it was mentioned in another thread that the memory footprint should also be considered.

Part of the improvement might come from the better system routines such as Move (FillChar has already been improved in 11.1) - I have been working on an improved version but I am afraid we will not get it before Delphi 12.

 

Also - I think I mentioned this before: in the age of supply chain attacks and malicious code being distributed via open source platforms, I would be very careful about using some binaries. You mentioned before how you compiled them: why are you reluctant to post the code on GitHub so everyone can compile it themselves?

Edited by Stefan Glienke

Hi,

 

About the allocator: it is widely used in industry, e.g. in video games and server apps. The large initial footprint is due to a caching TLS thread pool, and it is negligible IMHO.

But it's not only the MM; the patch also replaces fundamental RTL routines:

 

function SeaZero(Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar 0 (zeromem)
function SeaCopy(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer; // copymem
function SeaMove(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer;  //movemem
function SeaSet(Val: Byte; Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar with #char
function SeaFind(const Psrc: PByte; Len: NativeUint; const Pfind: PByte; Lenfind: NativeUint; Pindex: PNativeUint): Integer; cdecl; //very fast Pos()
function SeaCompare(const Psrc1: PByte; const Psrc2: PByte; Len: NativeInt; Presult: PNativeInt): Integer; cdecl; // comparemem
function SeaUpperCase(const PSrcDst: PByte; Len: NativeUint): Integer; cdecl;  // uppercase Latin 8bit
function SeaReplace(const Psrc: PByte; Pdst: PByte; Len: NativeUint; oldVal: Byte; ipp8u: Byte): Integer; cdecl; // char replace
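
For illustration only, this is roughly how one of those imports could be wrapped as a drop-in Move replacement. A minimal sketch, not the actual RDPMM64/SeaRTL source; the DLL name 'SeaRTL64.dll' and the cdecl calling convention of SeaMove are assumptions here:

unit SeaMoveSketch;

interface

procedure FastMove(const Source; var Dest; Count: NativeInt);

implementation

// Assumed import: the real DLL name and calling convention may differ.
function SeaMove(const Psrc: PByte; Pdst: PByte; Len: NativeUInt): Integer; cdecl;
  external 'SeaRTL64.dll';

procedure FastMove(const Source; var Dest; Count: NativeInt);
begin
  // SeaMove is a movemem-style copy (see the declaration above), so it can
  // stand in for System.Move, which also allows overlapping ranges.
  if Count > 0 then
    SeaMove(PByte(@Source), PByte(@Dest), NativeUInt(Count));
end;

end.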

 

Zlib is also patched to use SIMD instructions (5x faster than the default gzip).

 

The sources are from these tools: https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

I cannot post the Intel sources, of course.

I can tell you that the DLLs were built with an up-to-date, clean Visual Studio 2022 toolchain, without touching the Intel sources, and produced with zero warnings. Absolutely clean.

kind regards

R.

 

Btw. Delphi 12 with FMM5 and enhanced Move, FillChar and Pos will solve everything!

Edited by RDP1974

On 8/25/2022 at 8:58 AM, RDP1974 said:

I can tell you that the DLLs were built with an up-to-date, clean Visual Studio 2022 toolchain, without touching the Intel sources, and produced with zero warnings. Absolutely clean.

Then please put the DLL projects into the repo with documentation on how to get the missing pieces to produce them ourselves instead of distributing binaries.

 

As for the RTL routines - as mentioned, FillChar has already been improved in 11.1 (and hopefully will get some further final improvements - some of which are detailed here: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/).

Pos has also been improved (mostly Win64 is affected by this, as the new code is basically the pure Pascal version of what was the asm implementation for Win32). Hopefully for 12 we will get the improved Move I have been working on (using SSE2, as that is the instruction set the RTL can assume is supported). Because Move handles forward/backward copies and overlap, it covers what memcpy and memmove do.
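
As a reminder of what that buys: the core of an overlap-safe Move is choosing the copy direction. A simplified pure Pascal sketch (byte-by-byte instead of SSE2, only to show the logic):

procedure SafeMove(const Source; var Dest; Count: NativeInt);
var
  S, D: PByte;
  I: NativeInt;
begin
  S := PByte(@Source);
  D := PByte(@Dest);
  if (S = D) or (Count <= 0) then
    Exit;
  if (NativeUInt(D) < NativeUInt(S)) or (NativeUInt(D) >= NativeUInt(S) + NativeUInt(Count)) then
  begin
    // No harmful overlap: copy forward (what memcpy does).
    for I := 0 to Count - 1 do
      D[I] := S[I];
  end
  else
  begin
    // Dest overlaps the tail of Source: copy backward (what memmove guarantees).
    for I := Count - 1 downto 0 do
      D[I] := S[I];
  end;
end;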

 

That leaves CompareMem (which I already looked into in the context of faster string comparison), UpperCase and Replace.

Edited by Stefan Glienke

Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.

On 9/1/2022 at 11:18 AM, Sherlock said:

Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.

They fixed the Ryzen problems with BIOS updates. I'm running the library (SeaMM & SeaRTL) on my Ryzen 5800 and it's working fine!


That's another reason why precompiled binaries are bad - if I had to guess, I would say they are compiled for CPUs that support AVX, which Nehalem did not have.


hi,

 

The lib automatically adapts its functions to the instruction set of the CPU, from SSE2 to AVX-512.
 

Personally I use it on servers and also, for example, in desktop apps with DevExpress VCL grids, FireDAC, etc. It runs in production on i3, i5, i7 and i9 machines without glitches and is absolutely reliable.

 

I have had thousands of downloads without any problems reported, but thanks.

 

Note that, obviously, it runs in 64-bit only.

 

Consider that with some libraries the memory allocator can raise exceptions; in that case you should check the source and fix it. Anyway, that rarely happens, and only with bad code in debug mode.

 

Btw. I'm not endorsed by Intel; it's simply that these libs are excellent and widely used in industry.

 

Kind regards.

 

 

Btw. for the sources, simply install the products from the links; there is a CMake build for the allocator and a Python script for the RTL.

 

Btw. of course I prefer native Pascal roots, and I'm waiting for the enhancements in the coming Delphi updates.

Edited by RDP1974


Many people have asked me about an allocator compiled statically as an object file, with LLVM Clang for example.

I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects:

many of them produce C++ libs, others use C runtime functions not available under Delphi, or special TLS APIs not implemented in the linker, or call the Visual C runtime (in the Windows port).
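
For anyone who wants to try it, this is the shape such an experiment takes in Delphi; the object file and symbol names below are purely hypothetical. With the allocators mentioned, this is exactly the point where the missing C runtime symbols or the TLS directive show up as linker errors:

unit StaticAllocSketch;

interface

function c_malloc(size: NativeUInt): Pointer; cdecl;
procedure c_free(p: Pointer); cdecl;

implementation

{$IFDEF WIN64}
  {$L myalloc_win64.obj}   // hypothetical object file built with a C compiler
{$ENDIF}

// Hypothetical symbols exported by the object file.
function c_malloc(size: NativeUInt): Pointer; cdecl; external;
procedure c_free(p: Pointer); cdecl; external;

// Linking typically fails right here: the .obj references C runtime functions
// (memset, __chkstk, ...) or a .tls section that the Delphi linker cannot
// resolve, hence the unsatisfied external / $TLS errors.

end.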
 

Edited by RDP1974

5 hours ago, RDP1974 said:

Many people have asked me about an allocator compiled statically as an object file, with LLVM Clang for example.

I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects:

many of them produce C++ libs, others use C runtime functions not available under Delphi, or special TLS APIs not implemented in the linker, or call the Visual C runtime (in the Windows port).
 

This is surely correct. It's strange that people wouldn't be happy to have the allocator in a DLL.


I did a test with FMM5 and the performance with ApacheBench and WebBroker is similar to the Intel allocator. But I don't know about its reliability and fragmentation over time.

I have not benchmarked old Pascal projects such as NexusMM or ScaleMM2; those projects seem abandoned.


From the C allocator world I have tried the most used ones:

hoard,

jemalloc,

tcmalloc,

mimalloc,

rpmalloc,

umm_malloc,

tbbmalloc:

none of these can be linked statically with $L inside Delphi (without a DLL dependency), not even using Visual C wrappers (the main problem is the $TLS linker error).

 


Been using FastMM5 in production for over 2 years now and never looked back.

I don't know of any reliability or fragmentation issues.


Out of curiosity I have tested this basic allocator built on the Windows heap functions (without the Visual C runtime), and its performance is comparable to FMM5 and tbbmalloc.

However, I don't know whether Windows correctly manages fragmentation and paging under the hood.

 

unit MSHeap;

// Minimal memory manager that forwards every allocation to the Win32 process heap.

{$O+}

interface

uses
  Windows;

implementation

var
  ProcessHeap: THandle;
  
function SysGetMem(Size: NativeInt): Pointer;
begin
  Result := HeapAlloc(ProcessHeap, 0, Size);
end;

function SysFreeMem(P: Pointer): Integer;
begin
  HeapFree(ProcessHeap, 0, P);
  Result := 0;
end;

function SysReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  Result := HeapReAlloc(ProcessHeap, 0, P, Size);
end;

function SysAllocMem(Size: NativeInt): Pointer;
begin
  // AllocMem must return zero-initialized memory.
  Result := HeapAlloc(ProcessHeap, 0, Size);

  if (Result <> nil) then
    FillChar(Result^, Size, #0);
end;

function SysRegisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  // Expected-leak registration is not supported by this simple MM.
  Result := False;
end;

function SysUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx =
  (
    GetMem: SysGetmem;
    FreeMem: SysFreeMem;
    ReallocMem: SysReAllocMem;
    AllocMem: SysAllocMem;
    RegisterExpectedMemoryLeak: SysRegisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: SysUnregisterExpectedMemoryLeak
  );

initialization
  ProcessHeap := GetProcessHeap;
  SetMemoryManager(MemoryManager);

end.
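
Like any replacement memory manager, it only takes effect if it is installed before the first allocation, i.e. the unit has to be the first one in the project file:

program Project1;
uses
  MSHeap,   // must be first so SetMemoryManager runs before anything is allocated
  Vcl.Forms,
...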


https://docs.microsoft.com/it-it/windows/win32/memory/heap-functions

https://docs.microsoft.com/it-it/windows/win32/memory/low-fragmentation-heap interesting feature

https://docs.microsoft.com/it-IT/windows/win32/api/heapapi/nf-heapapi-heapsetinformation - we could try flag 3 (HeapOptimizeResources) to shrink the caches

Enable the low-fragmentation heap (LFH). Starting with Windows Vista, the LFH is enabled by default but this call does not cause an error.

ULONG HeapInformation = HEAP_LFH;

bResult = HeapSetInformation(hHeap, HeapCompatibilityInformation, &HeapInformation, sizeof(HeapInformation));

HeapOptimizeResources (value 3):
If HeapSetInformation is called with HeapHandle set to NULL, then all heaps in the process with a low-fragmentation heap (LFH) will have their caches optimized, and the memory will be decommitted if possible.
If a heap pointer is supplied in HeapHandle, then only that heap will be optimized.

Note that the HEAP_OPTIMIZE_RESOURCES_INFORMATION structure passed in HeapInformation must be properly initialized.
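
Translated to Delphi, the two calls quoted above look roughly like this (a sketch; HeapSetInformation and the related constants are declared manually here, in case the Winapi.Windows unit in use does not expose them):

unit HeapTuning;

interface

procedure TuneProcessHeap;

implementation

uses
  Winapi.Windows;

const
  HEAP_LFH = 2;
  HeapCompatibilityInformation = 0;
  HeapOptimizeResources = 3;
  HEAP_OPTIMIZE_RESOURCES_CURRENT_VERSION = 1;

type
  THeapOptimizeResourcesInformation = record
    Version: DWORD;
    Flags: DWORD;
  end;

function HeapSetInformation(HeapHandle: THandle; HeapInformationClass: Integer;
  HeapInformation: Pointer; HeapInformationLength: SIZE_T): BOOL; stdcall;
  external kernel32 name 'HeapSetInformation';

procedure TuneProcessHeap;
var
  LFH: ULONG;
  Opt: THeapOptimizeResourcesInformation;
begin
  // Request the low-fragmentation heap (already the default since Vista; the call is harmless).
  LFH := HEAP_LFH;
  HeapSetInformation(GetProcessHeap, HeapCompatibilityInformation, @LFH, SizeOf(LFH));

  // "Flag 3": optimize the LFH caches; HeapHandle = 0 targets every heap in the process.
  Opt.Version := HEAP_OPTIMIZE_RESOURCES_CURRENT_VERSION;
  Opt.Flags := 0;
  HeapSetInformation(0, HeapOptimizeResources, @Opt, SizeOf(Opt));
end;

end.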

 

 

Well, if it is reliable, maybe it could be a good solution for a coming Delphi update? Look, actually 11.5 also performs poorly in multithreaded scenarios.

Edited by RDP1974


Look here: https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

 

Rust calls the WinAPI heap directly;

it also says there that layering another allocator on top of the Windows allocator is not the right approach.

 

So it is fine to use the WinAPI allocator directly, as explained before. I will use it for my next projects and observe its behavior (if the Rust language uses it directly, it seems the better way, and gets rid of the default MM).

 

kind regards


The WinAPI heap is what I use for my MM, with the added twist that I have distinct heaps for each NUMA node, so that I can arrange for threads to get memory that is efficient to access on NUMA machines.
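
That idea can be sketched with plain WinAPI calls: one private heap per NUMA node, and each allocation goes to the heap of the node the calling thread is currently running on. This is only a rough illustration of the approach (not the actual MM described above); it ignores processor groups and only shows the allocation side, while a real MM also has to remember which heap owns each block so it can free and reallocate correctly:

unit NumaHeapsSketch;

interface

function NumaGetMem(Size: NativeInt): Pointer;

implementation

uses
  Winapi.Windows;

// Declared manually; recent Winapi.Windows versions may already expose these.
function GetNumaHighestNodeNumber(var HighestNodeNumber: ULONG): BOOL; stdcall;
  external kernel32;
function GetNumaProcessorNode(Processor: Byte; var NodeNumber: Byte): BOOL; stdcall;
  external kernel32;
function GetCurrentProcessorNumber: DWORD; stdcall;
  external kernel32;

var
  NodeHeaps: array of THandle;   // one growable private heap per NUMA node

function NumaGetMem(Size: NativeInt): Pointer;
var
  Node: Byte;
begin
  // Allocate from the heap of the node this thread runs on, so access stays node-local.
  if not GetNumaProcessorNode(Byte(GetCurrentProcessorNumber), Node) then
    Node := 0;
  Result := HeapAlloc(NodeHeaps[Node], 0, Size);
end;

procedure CreateNodeHeaps;
var
  Highest: ULONG;
  I: Integer;
begin
  if not GetNumaHighestNodeNumber(Highest) then
    Highest := 0;
  SetLength(NodeHeaps, Highest + 1);
  for I := 0 to High(NodeHeaps) do
    NodeHeaps[I] := HeapCreate(0, 0, 0);
end;

initialization
  CreateNodeHeaps;

end.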


hi,

https://github.com/RDP1974/DelphiMSHeap

Please, can somebody do a speed test with a single-threaded application?

I did a test (see attachment) and single-thread performance is identical

(and with a multithreaded web app it's quicker than Intel tbbmalloc)

thank you

Btw. I made small changes, such as the inline directive and zeroing memory in SysAllocMem within the API call.

MM32.jpg

MM64.jpg

top = 32-bit, bottom = 64-bit

left = default Delphi MM
right = MSHeap
(Delphi 11.2, i9 CPU, Windows 10)

 

pokerBench.rar

 

WebBroker test (see the first post at the beginning):

Concurrency Level:      100
Time taken for tests:   0.269 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes
Requests per second:    3716.52 [#/sec] (mean)
Time per request:       26.907 [ms] (mean)
Time per request:       0.269 [ms] (mean, across all concurrent requests)
Transfer rate:          907.35 [Kbytes/sec] received

 

Edited by RDP1974

It still surprises me that people are surprised by how much performance improves under heavy multithreading when not using the default MM.

AFAIK mORMot does not use the default MM anyway.

