RDP1974

64-bit RTL patches with Intel oneAPI and TBB


hi,

I have built the libraries from the latest sources of https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

I had zero warnings or problems during compilation.

Here are the files: https://github.com/RDP1974/Delphi64RTL

Note that the TBB allocator is very good at detecting memory errors such as double frees or overruns.

In multithreaded apps such as web applications you will get a large performance improvement.

Btw. the Intel license is totally permissive, free to distribute and deploy everywhere.

Please let me know if you discover any errors.

 

Quick test with a WebBroker Indy app producing a plain HTML page:

 

program Project1;
uses
  RDPMM64,
  Vcl.Forms,
  Web.WebReq,
...

 

procedure TWebModule1.WebModule1DefaultHandlerAction(Sender: TObject;
  Request: TWebRequest; Response: TWebResponse; var Handled: Boolean);
begin
  Response.Content :=
    '<html>' +
    '<head><title>Web Server Application</title></head>' +
    '<body>Web Server Application '+FormatDateTime('yyyymmdd.hhnnss',Now)+'</body>' +
    '</html>';

end;

 

Hyper-V VM: i9 CPU, Windows Server 2022, 16 cores

Host: i9 CPU, Windows 10 Pro

 

Apache bench ab -n 1000 -c 100 -k -r http://localhost:8080/

 

Delphi 11 (default MM):

Concurrency Level:      100
Time taken for tests:   1.845 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes

Requests per second:    542.04 [#/sec] (mean)
Time per request:       184.488 [ms] (mean)
Time per request:       1.845 [ms] (mean, across all concurrent requests)
Transfer rate:          132.33 [Kbytes/sec] received

 

Delphi 11 (with Intel libs):

Concurrency Level:      100
Time taken for tests:   0.297 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes
Requests per second:    3364.56 [#/sec] (mean)
Time per request:       29.722 [ms] (mean)
Time per request:       0.297 [ms] (mean, across all concurrent requests)
Transfer rate:          821.42 [Kbytes/sec] received

Edited by RDP1974

First of all, there is no 10.5 - I assume you meant 11.

 

That's like the mother of pointless benchmarks.

We know that the default MM is prone to problems with heavy multithreading. Test with FastMM5, because that one has addressed the issue.
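
For reference, switching the first post's benchmark to FastMM5 is just a matter of listing it as the very first unit of the project file (a minimal sketch mirroring the project file from the first post, assuming FastMM5.pas from its repository is on the search path):

program Project1;
uses
  FastMM5,   // must be the first unit so the memory manager is installed before any allocation
  Vcl.Forms,
  Web.WebReq,
...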

As for using TBB, I think it was mentioned in another thread that the memory footprint should also be considered.

Part of the improvement might come from the better system routines such as Move (FillChar has already been improved in 11.1) - I have been working on an improved version but I am afraid we will not get it before Delphi 12.

 

Also - I think I mentioned this before: in the age of supply chain attacks and malicious code being distributed via open source platforms, I would be very careful about using some binaries. You mentioned before how you compiled them: why are you reluctant to post the code on GitHub so everyone can compile it themselves?

Edited by Stefan Glienke

Hi,

 

About the allocator: it is widely used in industry, e.g. in video games and server apps. The large initial footprint is due to a caching TLS thread pool, and it is negligible IMHO.

But it's not only the MM; the patch also replaces fundamental RTL routines:

 

function SeaZero(Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar 0 (zeromem)
function SeaCopy(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer; // copymem
function SeaMove(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer;  //movemem
function SeaSet(Val: Byte; Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar with #char
function SeaFind(const Psrc: PByte; Len: NativeUint; const Pfind: PByte; Lenfind: NativeUint; Pindex: PNativeUint): Integer; cdecl; //very fast Pos()
function SeaCompare(const Psrc1: PByte; const Psrc2: PByte; Len: NativeInt; Presult: PNativeInt): Integer; cdecl; // comparemem
function SeaUpperCase(const PSrcDst: PByte; Len: NativeUint): Integer; cdecl;  // uppercase Latin 8bit
function SeaReplace(const Psrc: PByte; Pdst: PByte; Len: NativeUint; oldVal: Byte; ipp8u: Byte): Integer; cdecl; // char replace
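
For illustration only, this is roughly how one of those imports could be wrapped as a drop-in Move replacement. A minimal sketch, not the actual RDPMM64/SeaRTL source; the DLL name 'SeaRTL64.dll' and the cdecl calling convention of SeaMove are assumptions here:

unit SeaMoveSketch;

interface

procedure FastMove(const Source; var Dest; Count: NativeInt);

implementation

// Assumed import: the real DLL name and calling convention may differ.
function SeaMove(const Psrc: PByte; Pdst: PByte; Len: NativeUInt): Integer; cdecl;
  external 'SeaRTL64.dll';

procedure FastMove(const Source; var Dest; Count: NativeInt);
begin
  // SeaMove is a movemem-style copy (see the declaration above), so it can
  // stand in for System.Move, which also allows overlapping ranges.
  if Count > 0 then
    SeaMove(PByte(@Source), PByte(@Dest), NativeUInt(Count));
end;

end.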

 

Zlib is also patched to use SIMD instructions (5x faster than the default gzip).

 

The sources are from these tools: https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

I cannot post the Intel sources, of course.

I can tell you that the DLLs were built with an up-to-date, clean Visual Studio 2022 toolchain, without touching the Intel sources, and produced with zero warnings. Absolutely clean.

kind regards

R.

 

Btw. Delphi 12 with FMM5 and enhanced Move, FillChar and Pos will solve everything!

Edited by RDP1974

On 8/25/2022 at 8:58 AM, RDP1974 said:

I can tell you that the DLLs were built with an up-to-date, clean Visual Studio 2022 toolchain, without touching the Intel sources, and produced with zero warnings. Absolutely clean.

Then please put the DLL projects into the repo with documentation on how to get the missing pieces to produce them ourselves instead of distributing binaries.

 

As for the RTL routines - as mentioned, FillChar has already been improved in 11.1 (and hopefully will get some further final improvements - some of which are detailed here: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/).

Pos has also been improved (mostly Win64 is affected by this, as the new code is basically the pure Pascal version of what was the asm implementation for Win32). Hopefully for 12 we will get the improved Move I have been working on (using SSE2, as that is the instruction set the RTL can assume is supported). Because Move handles forward/backward copies and overlap, it covers what memcpy and memmove do.
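
As a reminder of what that buys: the core of an overlap-safe Move is choosing the copy direction. A simplified pure Pascal sketch (byte-by-byte instead of SSE2, only to show the logic):

procedure SafeMove(const Source; var Dest; Count: NativeInt);
var
  S, D: PByte;
  I: NativeInt;
begin
  S := PByte(@Source);
  D := PByte(@Dest);
  if (S = D) or (Count <= 0) then
    Exit;
  if (NativeUInt(D) < NativeUInt(S)) or (NativeUInt(D) >= NativeUInt(S) + NativeUInt(Count)) then
  begin
    // No harmful overlap: copy forward (what memcpy does).
    for I := 0 to Count - 1 do
      D[I] := S[I];
  end
  else
  begin
    // Dest overlaps the tail of Source: copy backward (what memmove guarantees).
    for I := Count - 1 downto 0 do
      D[I] := S[I];
  end;
end;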

 

That leaves CompareMem (which I already looked into in the context of faster string comparison), UpperCase and Replace.

Edited by Stefan Glienke

Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.

On 9/1/2022 at 11:18 AM, Sherlock said:

Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.

They fixed the Ryzen problems with BIOS updates. I'm running the library (SeaMM & SeaRTL) on my Ryzen 5800 and it's working fine!


That's another reason why precompiled binaries are bad - if I had to guess, I would say they are compiled for CPUs that support AVX, which Nehalem did not have.


hi,

 

The lib automatically adapts its functions to the instruction set of the CPU, from SSE2 to AVX-512.
 

Personally I use it on servers and also, for example, in desktop apps with DevExpress VCL grids, FireDAC, etc. It runs in production on i3, i5, i7 and i9 machines without glitches and is absolutely reliable.

 

I have had thousands of downloads without any problems reported, but thanks.

 

Note that, obviously, it runs in 64-bit only.

 

Consider that with some libraries the memory allocator can raise exceptions; in that case you should check the source and fix it. Anyway, that rarely happens, and only with bad code in debug mode.

 

Btw. I'm not endorsed by Intel; it's simply that these libs are excellent and widely used in industry.

 

Kind regards.

 

 

Btw. for the sources, simply install the products from the links; there is a CMake build for the allocator and a Python script for the RTL.

 

Btw. of course I prefer native Pascal roots, and I'm waiting for the enhancements in the coming Delphi updates.

Edited by RDP1974


Many people have asked me about an allocator compiled statically as an object file, with LLVM Clang for example.

I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects:

many of them produce C++ libs, others use C runtime functions not available under Delphi, or special TLS APIs not implemented in the linker, or call the Visual C runtime (in the Windows port).
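
For anyone who wants to try it, this is the shape such an experiment takes in Delphi; the object file and symbol names below are purely hypothetical. With the allocators mentioned, this is exactly the point where the missing C runtime symbols or the TLS directive show up as linker errors:

unit StaticAllocSketch;

interface

function c_malloc(size: NativeUInt): Pointer; cdecl;
procedure c_free(p: Pointer); cdecl;

implementation

{$IFDEF WIN64}
  {$L myalloc_win64.obj}   // hypothetical object file built with a C compiler
{$ENDIF}

// Hypothetical symbols exported by the object file.
function c_malloc(size: NativeUInt): Pointer; cdecl; external;
procedure c_free(p: Pointer); cdecl; external;

// Linking typically fails right here: the .obj references C runtime functions
// (memset, __chkstk, ...) or a .tls section that the Delphi linker cannot
// resolve, hence the unsatisfied external / $TLS errors.

end.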
 

Edited by RDP1974

5 hours ago, RDP1974 said:

Many people have asked me about an allocator compiled statically as an object file, with LLVM Clang for example.

I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects:

many of them produce C++ libs, others use C runtime functions not available under Delphi, or special TLS APIs not implemented in the linker, or call the Visual C runtime (in the Windows port).
 

This is surely correct. It's strange that people wouldn't be happy to have the allocator in a DLL.


I did a test with FMM5 and the performance with ApacheBench and WebBroker is similar to the Intel allocator. But I don't know about its reliability and fragmentation over time.

I have not benchmarked old Pascal projects such as NexusMM or ScaleMM2; those projects seem abandoned.


From the C allocator world I have tried the most used ones:

hoard,

jemalloc,

tcmalloc,

mimalloc,

rpmalloc,

umm_malloc,

tbbmalloc:

none of these can be linked statically with $L inside Delphi (without a DLL dependency), not even using Visual C wrappers (the main problem is the $TLS linker error).

 


Been using FastMM5 in production for over 2 years now and never looked back.

I don't know of any reliability or fragmentation issues.


Out of curiosity I have tested this basic allocator built on the Windows heap functions (without the Visual C runtime), and its performance is comparable to FMM5 and tbbmalloc.

However, I don't know whether Windows correctly manages fragmentation and paging under the hood.

 

unit MSHeap;

// Minimal memory manager that forwards every allocation to the Win32 process heap.

{$O+}

interface

uses
  Windows;

implementation

var
  ProcessHeap: THandle;
  
function SysGetMem(Size: NativeInt): Pointer;
begin
  Result := HeapAlloc(ProcessHeap, 0, Size);
end;

function SysFreeMem(P: Pointer): Integer;
begin
  HeapFree(ProcessHeap, 0, P);
  Result := 0;
end;

function SysReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  Result := HeapReAlloc(ProcessHeap, 0, P, Size);
end;

function SysAllocMem(Size: NativeInt): Pointer;
begin
  // AllocMem must return zero-initialized memory.
  Result := HeapAlloc(ProcessHeap, 0, Size);

  if (Result <> nil) then
    FillChar(Result^, Size, #0);
end;

function SysRegisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  // Expected-leak registration is not supported by this simple MM.
  Result := False;
end;

function SysUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx =
  (
    GetMem: SysGetmem;
    FreeMem: SysFreeMem;
    ReallocMem: SysReAllocMem;
    AllocMem: SysAllocMem;
    RegisterExpectedMemoryLeak: SysRegisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: SysUnregisterExpectedMemoryLeak
  );

initialization
  ProcessHeap := GetProcessHeap;
  SetMemoryManager(MemoryManager);

end.
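
Like any replacement memory manager, it only takes effect if it is installed before the first allocation, i.e. the unit has to be the first one in the project file:

program Project1;
uses
  MSHeap,   // must be first so SetMemoryManager runs before anything is allocated
  Vcl.Forms,
...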


https://docs.microsoft.com/it-it/windows/win32/memory/heap-functions

https://docs.microsoft.com/it-it/windows/win32/memory/low-fragmentation-heap interesting feature

https://docs.microsoft.com/it-IT/windows/win32/api/heapapi/nf-heapapi-heapsetinformation - we could try flag 3 (HeapOptimizeResources) to shrink the caches

Enable the low-fragmentation heap (LFH). Starting with Windows Vista, the LFH is enabled by default but this call does not cause an error.

ULONG HeapInformation = HEAP_LFH;

bResult = HeapSetInformation(hHeap, HeapCompatibilityInformation, &HeapInformation, sizeof(HeapInformation));

HeapOptimizeResources (value 3):
If HeapSetInformation is called with HeapHandle set to NULL, then all heaps in the process with a low-fragmentation heap (LFH) will have their caches optimized, and the memory will be decommitted if possible.
If a heap pointer is supplied in HeapHandle, then only that heap will be optimized.

Note that the HEAP_OPTIMIZE_RESOURCES_INFORMATION structure passed in HeapInformation must be properly initialized.
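
Translated to Delphi, the two calls quoted above look roughly like this (a sketch; HeapSetInformation and the related constants are declared manually here, in case the Winapi.Windows unit in use does not expose them):

unit HeapTuning;

interface

procedure TuneProcessHeap;

implementation

uses
  Winapi.Windows;

const
  HEAP_LFH = 2;
  HeapCompatibilityInformation = 0;
  HeapOptimizeResources = 3;
  HEAP_OPTIMIZE_RESOURCES_CURRENT_VERSION = 1;

type
  THeapOptimizeResourcesInformation = record
    Version: DWORD;
    Flags: DWORD;
  end;

function HeapSetInformation(HeapHandle: THandle; HeapInformationClass: Integer;
  HeapInformation: Pointer; HeapInformationLength: SIZE_T): BOOL; stdcall;
  external kernel32 name 'HeapSetInformation';

procedure TuneProcessHeap;
var
  LFH: ULONG;
  Opt: THeapOptimizeResourcesInformation;
begin
  // Request the low-fragmentation heap (already the default since Vista; the call is harmless).
  LFH := HEAP_LFH;
  HeapSetInformation(GetProcessHeap, HeapCompatibilityInformation, @LFH, SizeOf(LFH));

  // "Flag 3": optimize the LFH caches; HeapHandle = 0 targets every heap in the process.
  Opt.Version := HEAP_OPTIMIZE_RESOURCES_CURRENT_VERSION;
  Opt.Flags := 0;
  HeapSetInformation(0, HeapOptimizeResources, @Opt, SizeOf(Opt));
end;

end.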

 

 

Well, if it is reliable, maybe it could be a good solution for a coming Delphi update? Look, actually 11.5 also performs poorly in multithreaded scenarios.

Edited by RDP1974


Look here: https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

 

Rust calls the WinAPI heap directly;

it also says there that layering another allocator on top of the Windows allocator is not the right approach.

 

So it is fine to use the WinAPI allocator directly, as explained before. I will use it for my next projects and observe its behavior (if the Rust language uses it directly, it seems the better way, and gets rid of the default MM).

 

kind regards


The WinAPI heap is what I use for my MM, with the added twist that I have distinct heaps for each NUMA node, so that I can arrange for threads to get memory that is efficient to access on NUMA machines.
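
That idea can be sketched with plain WinAPI calls: one private heap per NUMA node, and each allocation goes to the heap of the node the calling thread is currently running on. This is only a rough illustration of the approach (not the actual MM described above); it ignores processor groups and only shows the allocation side, while a real MM also has to remember which heap owns each block so it can free and reallocate correctly:

unit NumaHeapsSketch;

interface

function NumaGetMem(Size: NativeInt): Pointer;

implementation

uses
  Winapi.Windows;

// Declared manually; recent Winapi.Windows versions may already expose these.
function GetNumaHighestNodeNumber(var HighestNodeNumber: ULONG): BOOL; stdcall;
  external kernel32;
function GetNumaProcessorNode(Processor: Byte; var NodeNumber: Byte): BOOL; stdcall;
  external kernel32;
function GetCurrentProcessorNumber: DWORD; stdcall;
  external kernel32;

var
  NodeHeaps: array of THandle;   // one growable private heap per NUMA node

function NumaGetMem(Size: NativeInt): Pointer;
var
  Node: Byte;
begin
  // Allocate from the heap of the node this thread runs on, so access stays node-local.
  if not GetNumaProcessorNode(Byte(GetCurrentProcessorNumber), Node) then
    Node := 0;
  Result := HeapAlloc(NodeHeaps[Node], 0, Size);
end;

procedure CreateNodeHeaps;
var
  Highest: ULONG;
  I: Integer;
begin
  if not GetNumaHighestNodeNumber(Highest) then
    Highest := 0;
  SetLength(NodeHeaps, Highest + 1);
  for I := 0 to High(NodeHeaps) do
    NodeHeaps[I] := HeapCreate(0, 0, 0);
end;

initialization
  CreateNodeHeaps;

end.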


hi,

https://github.com/RDP1974/DelphiMSHeap

Please, can somebody do a speed test with a single-threaded application?

I did a test (see attachment) and single-thread performance is identical

(and with a multithreaded web app it's quicker than Intel tbbmalloc)

thank you

Btw. I made small changes, such as the inline directive and zeroing memory in SysAllocMem within the API call.

MM32.jpg

MM64.jpg

top = 32-bit, bottom = 64-bit

left = default Delphi MM
right = MSHeap
(Delphi 11.2, i9 CPU, Windows 10)

 

pokerBench.rar

 

WebBroker test (see the first post at the beginning):

Concurrency Level:      100
Time taken for tests:   0.269 seconds
Complete requests:      1000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      250000 bytes
HTML transferred:       114000 bytes
Requests per second:    3716.52 [#/sec] (mean)
Time per request:       26.907 [ms] (mean)
Time per request:       0.269 [ms] (mean, across all concurrent requests)
Transfer rate:          907.35 [Kbytes/sec] received

 

Edited by RDP1974

It still surprises me that people are surprised by how much performance improves under heavy multithreading when not using the default MM.

AFAIK mORMot does not use the default MM anyway.

