RDP1974, posted August 24, 2022 (edited August 25, 2022)

Hi,

I have built the libraries from the latest sources of
https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
and
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

They compiled with zero warnings or problems. The files are here: https://github.com/RDP1974/Delphi64RTL

Note that the TBB allocator readily exposes memory errors such as double frees or overruns. In multithreaded apps such as web applications you will get a large performance improvement.

Btw, the Intel license is fully permissive: free to distribute and deploy everywhere. Please let me know if you discover errors.

Quick test with a WebBroker Indy app producing a plain page:

    program Project1;
    uses
      RDPMM64,
      Vcl.Forms,
      Web.WebReq,
      ...

    procedure TWebModule1.WebModule1DefaultHandlerAction(Sender: TObject;
      Request: TWebRequest; Response: TWebResponse; var Handled: Boolean);
    begin
      Response.Content :=
        '<html>' +
        '<head><title>Web Server Application</title></head>' +
        '<body>Web Server Application ' + FormatDateTime('yyyymmdd.hhnnss', Now) + '</body>' +
        '</html>';
    end;

Test machine: Hyper-V guest, i9 CPU, Windows Server 2022, 16 cores; host: i9 CPU, Windows 10 Pro.

Apache Bench:

    ab -n 1000 -c 100 -k -r http://localhost:8080/

Delphi 11 default:

    Concurrency Level:      100
    Time taken for tests:   1.845 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    0
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    542.04 [#/sec] (mean)
    Time per request:       184.488 [ms] (mean)
    Time per request:       1.845 [ms] (mean, across all concurrent requests)
    Transfer rate:          132.33 [Kbytes/sec] received

Delphi 11 (with Intel libs):

    Concurrency Level:      100
    Time taken for tests:   0.297 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    0
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    3364.56 [#/sec] (mean)
    Time per request:       29.722 [ms] (mean)
    Time per request:       0.297 [ms] (mean, across all concurrent requests)
    Transfer rate:          821.42 [Kbytes/sec] received
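As a cross-check of the ab figures above: requests per second is simply completed requests divided by total time, the mean time per request is concurrency times total time divided by requests, and the speedup is the ratio of the two throughputs (3364.56 / 542.04, roughly 6.2x). The sketch below only redoes that arithmetic. The function names are made up for illustration and are not part of ab or of the libraries.

```pascal
program AbArithmeticSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Throughput: completed requests divided by wall-clock seconds.
function RequestsPerSecond(Requests: Integer; Seconds: Double): Double;
begin
  Result := Requests / Seconds;
end;

// Mean time per request in milliseconds as seen by one of the
// concurrent clients: concurrency * total time / requests.
function MeanTimePerRequestMs(Concurrency, Requests: Integer; Seconds: Double): Double;
begin
  Result := Concurrency * Seconds * 1000.0 / Requests;
end;

// How many times higher the second throughput is than the first.
function Speedup(BaselineRps, ImprovedRps: Double): Double;
begin
  Result := ImprovedRps / BaselineRps;
end;

begin
  // Inputs are the rounded values printed by ab in the post above, so the
  // results match the reported figures only up to rounding.
  Writeln(Format('default MM : %.1f req/s, %.1f ms per request',
    [RequestsPerSecond(1000, 1.845), MeanTimePerRequestMs(100, 1000, 1.845)]));
  Writeln(Format('Intel libs : %.1f req/s, %.1f ms per request',
    [RequestsPerSecond(1000, 0.297), MeanTimePerRequestMs(100, 1000, 0.297)]));
  Writeln(Format('speedup    : %.1fx', [Speedup(542.04, 3364.56)]));
end.
```

Both runs cover only 1000 requests and finish in under two seconds, so the ratio is a rough indication rather than a stable measurement.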
Stefan Glienke, posted August 25, 2022 (edited August 25, 2022)

First off, there is no 10.5 - I assume you meant 11.

That's like the mother of pointless benchmarks. We know that the default MM is prone to problems with heavy multithreading. Test with FastMM5, because that has addressed that issue. As for using TBB, I think it was mentioned in another thread that the memory footprint should also be considered.

Part of the improvement might come from the better system routines such as Move (FillChar has already been improved in 11.1) - I have been working on an improved version, but I am afraid we will not get it before Delphi 12.

Also - I think I mentioned this before: in the age of supply chain attacks and malicious code being distributed via open source platforms, I would be very careful about using some binaries. You mentioned before how you compiled them: why are you reluctant to post the code on GitHub so everyone can compile it themselves?
RDP1974, posted August 25, 2022 (edited August 25, 2022)

Hi,

About the allocator: it is widely used in industry, e.g. for video games and server apps. The large initial footprint is due to a caching TLS thread pool and is negligible, IMHO.

But there is not only the MM; the patch also replaces fundamental RTL routines:

    function SeaZero(Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar 0 (zeromem)
    function SeaCopy(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer; // copymem
    function SeaMove(const Psrc: PByte; Pdst: PByte; Len: NativeUint): Integer; // movemem
    function SeaSet(Val: Byte; Pdst: PByte; Len: NativeUint): Integer; cdecl; // fillchar with #char
    function SeaFind(const Psrc: PByte; Len: NativeUint; const Pfind: PByte; Lenfind: NativeUint; Pindex: PNativeUint): Integer; cdecl; // very fast Pos()
    function SeaCompare(const Psrc1: PByte; const Psrc2: PByte; Len: NativeInt; Presult: PNativeInt): Integer; cdecl; // comparemem
    function SeaUpperCase(const PSrcDst: PByte; Len: NativeUint): Integer; cdecl; // uppercase Latin 8bit
    function SeaReplace(const Psrc: PByte; Pdst: PByte; Len: NativeUint; oldVal: Byte; ipp8u: Byte): Integer; cdecl; // char replace

Then zlib is patched too, to use SIMD instructions (5x faster than default gzip).

The sources are from these tools:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
and
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html

I cannot post the Intel sources, of course. I can tell that the DLLs are done perfectly with an updated and clean Visual Studio 2022 toolchain, without touching the Intel sources, produced with zero warnings. Absolutely clean.

Kind regards,
R.

Btw, Delphi 12 with FMM5 and an enhanced Move, FillChar and Pos will solve everything!
Fr0sT.Brutal, posted August 25, 2022

Would be interesting to test against FastMM5.
Stefan Glienke, posted August 30, 2022 (edited August 30, 2022)

On 8/25/2022 at 8:58 AM, RDP1974 said:
> I can tell that the DLLs are done perfectly with an updated and clean Visual Studio 2022 toolchain, without touching the Intel sources, produced with zero warnings. Absolutely clean.

Then please put the DLL projects into the repo, with documentation on how to get the missing pieces, so we can produce them ourselves instead of you distributing binaries.

As for the RTL routines: as mentioned, FillChar has already been improved in 11.1 (and will hopefully get some final improvements, some of which are detailed here: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/). Pos has also been improved (mostly Win64 is affected, as the new code is basically the purepascal version of what was the asm implementation for Win32). Hopefully for 12 we will get the improved Move I was working on (using SSE2, as that is the instruction set the RTL can assume). Because Move handles forward/backward copying and overlap, it covers what memcpy and memmove do. That leaves CompareMem, which I have already looked into in the context of faster string comparison, UpperCase and Replace.
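To illustrate the remark that Move covers both the memcpy and the memmove case: a move routine only has to pick the copy direction based on how the two ranges overlap. The following is a minimal byte-loop sketch of that idea, assuming nothing about the actual RTL code. It is not the RTL implementation and not the SSE2 version mentioned above, and `SketchMove` is a hypothetical name.

```pascal
program OverlapMoveSketch;
{$APPTYPE CONSOLE}
{$POINTERMATH ON}

uses
  System.SysUtils;

// Hypothetical illustration only: a byte-by-byte move that chooses its copy
// direction so overlapping source and destination ranges stay correct.
procedure SketchMove(const Source; var Dest; Count: NativeInt);
var
  S, D: PByte;
  I: NativeInt;
begin
  S := PByte(@Source);
  D := PByte(@Dest);
  if (Count <= 0) or (S = D) then
    Exit;
  if (NativeUInt(D) > NativeUInt(S)) and
     (NativeUInt(D) - NativeUInt(S) < NativeUInt(Count)) then
  begin
    // Dest starts inside the source range: copy from the end backward so
    // every byte is read before it gets overwritten.
    for I := Count - 1 downto 0 do
      D[I] := S[I];
  end
  else
  begin
    // Ranges are disjoint, or Dest lies before Source: a plain forward
    // copy (the memcpy case) is safe.
    for I := 0 to Count - 1 do
      D[I] := S[I];
  end;
end;

var
  Buf: array[0..7] of Byte = (1, 2, 3, 4, 5, 6, 7, 8);
  K: Integer;
begin
  // Shift the first five bytes two positions to the right; the source
  // range [0..4] and the destination range [2..6] overlap.
  SketchMove(Buf[0], Buf[2], 5);
  for K := Low(Buf) to High(Buf) do
    Write(Buf[K], ' ');
  Writeln;
end.
```

After the call the buffer holds 1 2 1 2 3 4 5 8. A naive forward-only loop would have produced 1 2 1 2 1 2 1 8, because it would re-read bytes it had already overwritten.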
Sherlock, posted September 1, 2022

Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.
chmichael, posted September 2, 2022

On 9/1/2022 at 11:18 AM, Sherlock said:
> Out of curiosity: How would binaries created with this fare on a Ryzen processor? Are Intel and AMD still 100% compatible? AFAIK Ryzen had problems on Win11 well into this summer, but that was because of TPM.

They fixed the Ryzen problems with BIOS updates. I'm running the library (SeaMM & SeaRTL) on my Ryzen 5800 and it's working fine!
Stefan Glienke, posted September 2, 2022

That's another reason why precompiled binaries are bad - if I had to guess, I would say they are compiled for CPUs that support AVX, which Nehalem did not have.
RDP1974, posted September 4, 2022 (edited September 4, 2022)

Hi,

The lib automatically adapts its functions to the instruction set of the CPU, from SSE2 to AVX-512. Personally I use it on servers and also, for example, in desktop apps with DevExpress VCL grids, FireDAC, etc. It runs in production on i3, i5, i7 and i9 without glitches and is absolutely reliable. I have had thousands of downloads without problems reported, but thanks.

Note that it obviously runs in 64-bit only. Be aware that with some libraries the mmalloc can raise exceptions; you should look at the source and correct it. Anyway, this happens rarely, with bad code in debug mode.

Btw, I'm not endorsed by Intel; these libs are simply excellent and widely used in industry. Kind regards.

Btw, for the sources, simply install the products from the links; there is a CMake build for mmalloc and a Python script for the RTL.

Btw, of course I prefer native Pascal, and I am waiting for the enhancements in the next Delphi updates.
RDP1974, posted September 5, 2022 (edited September 5, 2022)

Many people have asked me about an allocator compiled statically as an object file, with LLVM clang for example. I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects: many of them produce C++ libs, others use C runtime functions not available under Delphi, use a special $tls API not implemented in the linker, or call the Visual C runtime (in the Windows port).
David Heffernan, posted September 5, 2022

5 hours ago, RDP1974 said:
> Many people have asked me about an allocator compiled statically as an object file, with LLVM clang for example. I have seen many builds from many authors, but as far as I have tested them, it is not possible to produce compatible static objects: many of them produce C++ libs, others use C runtime functions not available under Delphi, use a special $tls API not implemented in the linker, or call the Visual C runtime (in the Windows port).

This is surely correct. It's strange that people wouldn't be happy to have the allocator in a DLL.
RDP1974, posted September 6, 2022

https://github.com/YWtheGod/LIBC

This is an interesting project, but the C sources of the objects are not provided, so I will not try to use it.

There is System.Win.Crtl, but it still needs the Visual C runtime redistributable in the OS.
RDP1974, posted September 6, 2022

I did a test with FMM5, and its performance with ApacheBench and WebBroker is similar to the Intel allocator. But I don't know about its reliability and fragmentation over time.

I have not benchmarked the older Pascal projects such as NexusMM or ScaleMM2; those projects seem abandoned.

On the C allocator side, I have tried the most used ones: hoard, jemalloc, tcmalloc, mimalloc, rpmalloc, umm_malloc, tbbmalloc. None of these can be linked statically with $L inside Delphi (without a DLL dependency), not even using Visual C wrappers (the main problem is the $TLS linker error).
Stefan Glienke, posted September 6, 2022

Been using FastMM5 in production for over 2 years now and never looked back. I don't know of any reliability or fragmentation issues.
RDP1974, posted September 6, 2022

Out of curiosity, I have tested this basic allocator built on the Windows heap functions (without the Visual C runtime), and its performance is comparable to FMM5 and tbbmalloc. However, I don't know whether Windows manages fragmentation and paging correctly under the hood.

    unit MSHeap;

    {$O+}

    interface

    uses
      Windows;

    implementation

    var
      ProcessHeap: THandle;

    // Allocate Size bytes from the default process heap.
    function SysGetMem(Size: NativeInt): Pointer;
    begin
      Result := HeapAlloc(ProcessHeap, 0, Size);
    end;

    // The RTL expects 0 on success and a non-zero value on failure.
    function SysFreeMem(P: Pointer): Integer;
    begin
      if HeapFree(ProcessHeap, 0, P) then
        Result := 0
      else
        Result := 1;
    end;

    // Resize an existing block; the heap may move it.
    function SysReallocMem(P: Pointer; Size: NativeInt): Pointer;
    begin
      Result := HeapReAlloc(ProcessHeap, 0, P, Size);
    end;

    // Allocate a block and fill it with zeros.
    function SysAllocMem(Size: NativeInt): Pointer;
    begin
      Result := HeapAlloc(ProcessHeap, 0, Size);
      if Result <> nil then
        FillChar(Result^, Size, #0);
    end;

    // Leak registration is not supported by this minimal manager.
    function SysRegisterExpectedMemoryLeak(P: Pointer): Boolean;
    begin
      Result := False;
    end;

    function SysUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
    begin
      Result := False;
    end;

    const
      MemoryManager: TMemoryManagerEx = (
        GetMem: SysGetMem;
        FreeMem: SysFreeMem;
        ReallocMem: SysReallocMem;
        AllocMem: SysAllocMem;
        RegisterExpectedMemoryLeak: SysRegisterExpectedMemoryLeak;
        UnregisterExpectedMemoryLeak: SysUnregisterExpectedMemoryLeak
      );

    initialization
      ProcessHeap := GetProcessHeap;
      SetMemoryManager(MemoryManager);

    end.
RDP1974, posted September 6, 2022 (edited September 6, 2022)

https://docs.microsoft.com/it-it/windows/win32/memory/heap-functions
https://docs.microsoft.com/it-it/windows/win32/memory/low-fragmentation-heap

Interesting feature:
https://docs.microsoft.com/it-IT/windows/win32/api/heapapi/nf-heapapi-heapsetinformation

// we can try to use flag 3 (optimize) to shrink the cache

    // Enable the low-fragmentation heap (LFH). Starting with Windows Vista,
    // the LFH is enabled by default but this call does not cause an error.
    //
    HeapInformation = HEAP_LFH;
    bResult = HeapSetInformation(hHeap,
                                 HeapCompatibilityInformation,
                                 &HeapInformation,
                                 sizeof(HeapInformation));

HeapOptimizeResources (3): If HeapSetInformation is called with HeapHandle set to NULL, then all heaps in the process with a low-fragmentation heap (LFH) will have their caches optimized, and the memory will be decommitted if possible. If a heap pointer is supplied in HeapHandle, then only that heap will be optimized. Note that the HEAP_OPTIMIZE_RESOURCES_INFORMATION structure passed in HeapInformation must be properly initialized.

Well, if it is reliable, could it be a good solution for the next Delphi update? Note that currently 11.5 also performs poorly in multithreaded scenarios.
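For Delphi readers, the C fragment from the Microsoft documentation quoted above could be translated roughly as follows. This is a sketch under stated assumptions: the kernel32 function is imported locally under the hypothetical name `SketchHeapSetInformation`, and the two constants are declared locally with the values from the Windows SDK headers, in case the RTL version in use does not expose them. It creates a private heap so the call has a visible purpose, since the documentation states that the LFH is already the default.

```pascal
program EnableLfhSketch;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

const
  // Values from the Windows SDK headers, declared locally (assumption: the
  // Delphi RTL in use may or may not declare equivalents).
  SketchHeapCompatibilityInformation = 0; // HEAP_INFORMATION_CLASS member
  SketchHeapLfh = 2;                      // selects the low-fragmentation heap

// Local import of kernel32's HeapSetInformation under a hypothetical name.
function SketchHeapSetInformation(HeapHandle: THandle;
  HeapInformationClass: Cardinal; HeapInformation: Pointer;
  HeapInformationLength: NativeUInt): BOOL; stdcall;
  external 'kernel32.dll' name 'HeapSetInformation';

var
  Heap: THandle;
  Mode: Cardinal;
  P: Pointer;
begin
  // Private growable heap: initial and maximum size 0 select the defaults.
  Heap := HeapCreate(0, 0, 0);
  if Heap = 0 then
  begin
    Writeln('HeapCreate failed, error ', GetLastError);
    Halt(1);
  end;

  // Ask for the LFH on this heap, mirroring the documentation sample.
  Mode := SketchHeapLfh;
  if SketchHeapSetInformation(Heap, SketchHeapCompatibilityInformation,
    @Mode, SizeOf(Mode)) then
    Writeln('LFH requested for the private heap')
  else
    Writeln('HeapSetInformation failed, error ', GetLastError);

  // Use the heap once, then clean up.
  P := HeapAlloc(Heap, 0, 64);
  if P <> nil then
    HeapFree(Heap, 0, P);
  HeapDestroy(Heap);
end.
```

The call can report failure when the process runs with heap debugging enabled, for example when started from a debugger, so the result is printed rather than treated as fatal.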
RDP1974, posted September 9, 2022

Look here: https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

Rust calls the Windows API directly for the heap; the thread also says that layering an allocator on top of the Windows allocator is not the right approach. So it is OK to use the Windows API allocator directly, as explained before. I will use it in my next projects and observe its behavior (but if the Rust language uses it directly, it seems a better way, and a way to get rid of the default MM).

Kind regards
David Heffernan, posted September 9, 2022

The winapi heap is what I use for my MM. With an added twist that I have distinct heaps for each NUMA node so that I can arrange that threads get memory that is efficient to access on NUMA machines.
RDP1974, posted September 10, 2022 (edited September 10, 2022)

Hi,

https://github.com/RDP1974/DelphiMSHeap

Could somebody please do a speed test with a single-threaded application? I did a test (see attachment), and single-thread performance is identical (and with a multithreaded web app it is quicker than Intel tbbmalloc). Thank you.

Btw, I made small changes, such as the inline directive and zeroing memory within the API call in SysAllocMem.

Attachment legend: top = 32-bit, bottom = 64-bit; left = Delphi default MM, right = Delphi MSHeap (Delphi 11.2, i9 CPU, Windows 10).

pokerBench.rar

WebBroker test (see the first post of the thread):

    Concurrency Level:      100
    Time taken for tests:   0.269 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    0
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    3716.52 [#/sec] (mean)
    Time per request:       26.907 [ms] (mean)
    Time per request:       0.269 [ms] (mean, across all concurrent requests)
    Transfer rate:          907.35 [Kbytes/sec] received
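The "zeroing within the API call" change mentioned above is not shown in the thread. One plausible reading, sketched here as an assumption and not taken from the linked repository, is to pass the heap's zero-fill flag to `HeapAlloc` so that the separate `FillChar` pass disappears. `SketchAllocMem` and `AllZero` are hypothetical names, and the flag value is declared locally with the value of `HEAP_ZERO_MEMORY` from the Windows SDK.

```pascal
program ZeroedHeapAllocSketch;
{$APPTYPE CONSOLE}
{$POINTERMATH ON}

uses
  Winapi.Windows;

const
  // Value of HEAP_ZERO_MEMORY in the Windows SDK, declared locally so the
  // sketch does not depend on which constants the RTL exposes.
  SketchHeapZeroMemory = $00000008;

// Hypothetical AllocMem callback: the heap zeroes the block itself, so no
// separate FillChar pass over the memory is needed.
function SketchAllocMem(Size: NativeInt): Pointer;
begin
  Result := HeapAlloc(GetProcessHeap, SketchHeapZeroMemory, Size);
end;

// Helper for the demonstration: True if the first Size bytes at P are zero.
function AllZero(P: PByte; Size: NativeInt): Boolean;
var
  I: NativeInt;
begin
  Result := True;
  for I := 0 to Size - 1 do
    if P[I] <> 0 then
      Exit(False);
end;

var
  Block: Pointer;
begin
  Block := SketchAllocMem(256);
  if Block = nil then
    Halt(1);
  Writeln('zero-filled: ', AllZero(PByte(Block), 256));
  HeapFree(GetProcessHeap, 0, Block);
end.
```

Whether this is measurably faster than allocating and then calling `FillChar` would need to be benchmarked on the workload in question.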
Stefan Glienke, posted September 11, 2022

Tests that run for approx one second are really the way to go when deciding on the proper memory manager. 😂
RDP1974, posted September 11, 2022

I have just asked the DMVC group if they can run a test on a real application.

Kind regards
RDP1974, posted September 16, 2022

https://gist.github.com/danieleteti/1422ef290e20e9529106ae7c9aed0968?fbclid=IwAR1iokzrUKdV2hm-gp63ufE9g-DQFsrt3Qyvwaga8TzKafgigYSKJmjc344

From the DMVC FB group: from 353 to 4869.

https://www.facebook.com/groups/delphimvcframework
Edwin Yip, posted September 18, 2022

On 9/16/2022 at 6:31 PM, RDP1974 said:
> https://gist.github.com/danieleteti/1422ef290e20e9529106ae7c9aed0968?fbclid=IwAR1iokzrUKdV2hm-gp63ufE9g-DQFsrt3Qyvwaga8TzKafgigYSKJmjc344
> From the DMVC FB group: from 353 to 4869.
> https://www.facebook.com/groups/delphimvcframework

Wow, that's a 10x improvement! Not sure how it affects the performance of a framework like mORMot @Arnaud Bouchez :)
Stefan Glienke, posted September 18, 2022

It still surprises me that people are surprised by how much performance improves under heavy multithreading when not using the default MM. AFAIK mORMot does not use the default MM anyway.