RDP1974 40 Posted February 7, 2020

Hello,

I routinely deploy a custom Intel TBB memory manager and IPP with Delphi 64-bit server apps, and I am very satisfied with the results (see https://github.com/RDP1974/Sea-Delphi-RTL-IIS-Filter). I have found FastMM4 to be slow, and it gives me many fatal errors under multithreaded stress tests (especially the latest AVX-512 fork).

Many people have asked me for a 32-bit version of Intel TBB malloc, and many others have asked me to embed the code instead of distributing an external DLL. The problems are:
- it is impossible to build static objects from the Intel libs without relying on the MSVCRT redistributable; furthermore, the Delphi linker cannot handle the objs (see below)
- I have tried the 32-bit TBB DLL, but it doesn't work and gives fatal errors at runtime
- I have tried two other good allocators, compiled with Clang into static objects, but Delphi32 can't link them due to architecture limits ($ThreadLocalStorage functions not managed)

The allocators I tried, after reviewing dozens, are:
- https://github.com/mjansson/rpmalloc
- https://github.com/microsoft/mimalloc

Now the nasty :) question: would somebody like to join me in writing a native Delphi Pascal version of rpmalloc (it seems the easier and cleaner of the two, IMHO)? I have very little spare time, but I think we can do it. Any opinions?

Regards. (Sorry for my poor English.)
Roberto Della Pasqua
www.dellapasqua.com
Stefan Glienke 2002 Posted February 7, 2020

How about fixing the issues that FastMM4 has rather than inventing yet another new, mostly untested and unproven memory manager?
RDP1974 40 Posted February 7, 2020

We really need a new model for the MM. FastMM4 is bloated; it would be cool to have a lock-free allocator using threadvar and/or the TLS API, plus a small preallocated thread pool, so Delphi can compete with other high-performance languages.

As far as I have tested, what Delphi needs to be perfect again is:
- a TLS lock-free allocator
- SIMD FillChar, Move, Pos

I'm asking Mr. Allen of Grizly :-P

Regards.
Clément 148 Posted February 7, 2020

Have you compared your results with https://github.com/maximmasiutin/FastMM4-AVX?
David Heffernan 2345 Posted February 7, 2020

4 hours ago, RDP1974 said:
"should be cool to have a lock-free allocator using threadvar and/or TLS API"

Threadvar is implemented on top of TLS on Windows. How is a lock-free design going to handle deallocations made from a thread other than the one that allocated the memory? But hey, if you want to write this code, go for it.
Arnaud Bouchez 407 Posted February 7, 2020 (edited)

@David Heffernan You just store the ThreadID within the memory block information, or you use a per-convention identification of the memory buffer. Cross-thread deallocations also usually require a ThreadEnded-like event handler, which doesn't exist in Delphi IIRC (but does exist in FPC), so you need to hack TThread.

@RDP1974 Last time I checked, FastMM4 (trunk or AVX2 fork) doesn't work well on Linux (at least under FPC). Under Delphi + Linux, FastMM4 is not used at all: it just calls libc free/malloc IIRC. I am not convinced the slowness comes from the libc heap, which is very good in our tests, but rather from Delphi/Linux not being optimized (yet). Other MMs like BrainMM or our ScaleMM are not designed for Linux.

We also tried a lot of allocators on Linux, in the context of highly multi-threaded servers: see https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas. In a nutshell, see https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas#L57 for some numbers.

The big issue with those C-based allocators, which is not listed in those comments, apart from using a lot of RAM, is that they stop the executable as soon as some GPF occurs: e.g. a double free will raise SIGABRT! So they are NOT usable in production unless you use them with Valgrind and proper debugging.

We fell back to using the FPC default heap, which is a bit slower and consumes a lot of RAM (since it has a per-thread heap for smaller blocks) but is very stable. It is written in plain Pascal.

The main idea about performance is to avoid memory allocation as much as possible, which is what we tried with mORMot from the ground up: for instance, we define most of the temp strings on the stack, not on the heap.

I don't think that rewriting a C allocator in Pascal would be much faster. It is very likely to be slower. Only a pure asm version may have some noticeable benefits, just like FastMM4.

And, personally, I wouldn't invest in Delphi for Linux for server processes: FPC is so much more stable, faster and better maintained... for free!

Edited February 7, 2020 by Arnaud Bouchez
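The "store the owner in the block information" idea above can be sketched in C, the language rpmalloc and mimalloc are written in. This is only an illustration under assumptions, not the code of any allocator named in this thread: `heap`, `block_header`, `sketch_alloc` and `sketch_free` are hypothetical names, there is a single size class, and the heap is passed explicitly where a real allocator would fetch it from TLS. The owner's free list needs no synchronization; a free from another thread pushes the block onto a lock-free stack that the owner drains later.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

enum { PAYLOAD_SIZE = 64 };  /* single size class, to keep the sketch short */

struct heap;

/* Header stored in front of every payload (hypothetical layout). */
typedef struct block_header {
    struct heap *owner;          /* heap that handed out this block */
    struct block_header *next;   /* link while the block sits on a free list */
} block_header;

/* One heap per thread (hypothetical). */
typedef struct heap {
    block_header *local_free;             /* touched by the owning thread only */
    _Atomic(block_header *) remote_free;  /* blocks pushed back by other threads */
} heap;

void heap_init(heap *h)
{
    h->local_free = NULL;
    atomic_init(&h->remote_free, (block_header *)NULL);
}

/* Called by the owning thread only. */
void *sketch_alloc(heap *h)
{
    block_header *b = h->local_free;
    if (b == NULL) {
        /* Local list is empty: take everything other threads handed back. */
        b = atomic_exchange_explicit(&h->remote_free, (block_header *)NULL,
                                     memory_order_acquire);
    }
    if (b != NULL) {
        h->local_free = b->next;          /* rest of the chain stays local */
    } else {
        b = malloc(sizeof *b + PAYLOAD_SIZE);
        if (b == NULL)
            return NULL;
    }
    b->owner = h;
    return b + 1;                         /* payload starts after the header */
}

/* May be called by any thread; 'caller' is the calling thread's own heap. */
void sketch_free(heap *caller, void *p)
{
    assert(p != NULL);
    block_header *b = (block_header *)p - 1;
    heap *owner = b->owner;
    if (owner == caller) {
        b->next = owner->local_free;      /* same thread: no atomics needed */
        owner->local_free = b;
        return;
    }
    /* Cross-thread free: lock-free push onto the owner's remote stack. */
    block_header *head = atomic_load_explicit(&owner->remote_free,
                                              memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&owner->remote_free,
                                                    &head, b,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}
```

Because other threads only ever push and the owner only ever takes the whole list in one exchange, the remote stack avoids the ABA problem that a general lock-free pop would have. What the sketch does not cover is the case where the owning thread has already ended, which is the reason a thread-end hook is needed.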
RDP1974 40 Posted February 8, 2020

Hi, I'm using FPC on ARM Linux and it is very nice, but:
- the community releases updates so slowly that it seems to have stopped
- the source code quality of the RTL and of the class libraries as a whole is a lot better and more polished in Delphi
- RTTI at runtime?
- I did a test, and Delphi 64 was twice as fast as FPC in low-level loops, sets, arrays and collections
- Delphi has a high-quality database layer, IMHO
Remy Lebeau 1393 Posted February 8, 2020

4 hours ago, Arnaud Bouchez said:
"You just store the ThreadID within the memory block information, or you use a per-convention identification of the memory buffer. Cross-thread deallocations also usually require a ThreadEnded-like event handler, which doesn't exist on Delphi IIRC - but does exist on FPC - so need to hack TThread."

And how are you going to handle the case where a thread allocates some memory, then that thread terminates/dies and its ThreadID gets reused by a new thread, and then that new thread wants to deallocate the earlier memory? Storing the ThreadID in the memory metadata may not suffice, and if you use TLS storage for the memory then you lose the original memory altogether.

Even if the SAME thread does the allocating and deallocating, you still need some kind of thread-safe mechanism to synchronize (de)allocations with OTHER threads, so they don't try to reuse/trample the same memory block while it is still being used.
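One common way around the ThreadID-reuse problem raised here is to tag blocks with a pointer to a heap object rather than with the OS thread ID, and to let a heap outlive its thread. The C sketch below is an assumption-laden illustration, not how any memory manager named in this thread is actually implemented: `heap_acquire`, `heap_abandon`, `tagged_alloc` and `block_owner` are hypothetical names. A thread-start hook would call `heap_acquire`, and a thread-end hook would call `heap_abandon`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread heap; free lists are omitted to keep this short. */
typedef struct heap {
    struct heap *next_orphan;   /* link while the heap has no owning thread */
} heap;

/* Header in front of each payload; the union keeps the payload aligned. */
typedef union block_header {
    heap *owner;                /* a heap pointer, NOT an OS thread ID */
    max_align_t align;
} block_header;

/* Orphaned heaps. Thread start/end is rare, so a tiny spinlock is used here;
   the allocation hot path never takes it. */
static atomic_flag orphan_lock = ATOMIC_FLAG_INIT;
static heap *orphans = NULL;

static void lock_orphans(void)
{
    while (atomic_flag_test_and_set_explicit(&orphan_lock, memory_order_acquire)) {
        /* spin */
    }
}

static void unlock_orphans(void)
{
    atomic_flag_clear_explicit(&orphan_lock, memory_order_release);
}

/* Thread-start hook: adopt an orphaned heap if there is one, else create one. */
heap *heap_acquire(void)
{
    lock_orphans();
    heap *h = orphans;
    if (h != NULL)
        orphans = h->next_orphan;
    unlock_orphans();
    if (h == NULL)
        h = calloc(1, sizeof *h);
    if (h != NULL)
        h->next_orphan = NULL;
    return h;
}

/* Thread-end hook: the heap object is kept alive and parked for adoption. */
void heap_abandon(heap *h)
{
    assert(h != NULL);
    lock_orphans();
    h->next_orphan = orphans;
    orphans = h;
    unlock_orphans();
}

/* Allocate a block tagged with the heap that owns it. */
void *tagged_alloc(heap *h, size_t size)
{
    block_header *b = malloc(sizeof *b + size);
    if (b == NULL)
        return NULL;
    b->owner = h;
    return b + 1;
}

/* Recover the owning heap from a payload pointer. */
heap *block_owner(void *p)
{
    return ((block_header *)p - 1)->owner;
}
```

With this layout, a block allocated by a thread that has since died still points at a valid heap object, whichever thread adopts that heap next and whatever thread ID the OS hands out. Synchronizing the free lists themselves between threads is a separate problem that this sketch leaves out.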
Guest Posted February 8, 2020

13 hours ago, David Heffernan said:
"How is lock free going to handle deallocations made from a different thread from that which allocated the memory?"

10 hours ago, Arnaud Bouchez said:
"You just store the ThreadID within the memory block information, or you use a per-convention identification of the memory buffer. Cross-thread deallocations also usually require a ThreadEnded-like event handler, which doesn't exist on Delphi IIRC - but does exist on FPC - so need to hack TThread."

That is exactly what ScaleMM does, and exactly how it is designed: it has an MM per thread while still having a global MM to coordinate between them. That is why it is very fast in multithreaded applications; in fact, it is several times faster than FastMM.

10 hours ago, Arnaud Bouchez said:
"Cross-thread deallocations also usually require a ThreadEnded-like event handler, which doesn't exist on Delphi IIRC - but does exist on FPC - so need to hack TThread."

There are EndThreadProc and SystemThreadEndProc since Delphi 2009 at least; there are also hooks for starting threads, BeginThreadProc and TSystemThreadFuncProc.

10 hours ago, Arnaud Bouchez said:
"I don't think that re-writing a C allocator into pascal would be much faster. It is very likely to be slower."

I think so too. I would recommend fixing ScaleMM, BrainMM, or both, as they both have rare, dangerous bugs. I could reproduce them in a controlled environment, so I stopped using them.

10 hours ago, Arnaud Bouchez said:
"And, personally, I wouldn't invest into Delphi for Linux for server process: FPC is so much stable, faster and better maintained... for free!"

True, and a fact.

5 hours ago, Remy Lebeau said:
"And how are you going to handle the case where a thread allocates some memory, then that thread terminates/dies and its ThreadID gets reused by a new thread, and then that new thread wants to deallocate the earlier memory? Storing the ThreadID in the memory metadata may not suffice, and if you use TLS storage for the memory then you lose the original memory altogether."

Here again I suggest having a look at ScaleMM and testing it with your heaviest application to see how it performs, as it handles that brilliantly.

One thing to note: older EurekaLog versions (not the latest ones) do not get along with ScaleMM. There is no memory leak, but thread-allocated memory is not freed until the application exits, due to a conflict in how both hook EndThread. So if you want to test ScaleMM, make sure you have the latest EL, or disable it.
Guest Posted February 8, 2020

One more thing about ScaleMM: a really heavy stress test with RTC SDK ( client/server) that takes +3.5 minutes to complete with ScaleMM it takes 40 seconds, that is the advantage of those small single thread MM.
Tommi Prami 130 Posted February 10, 2020

On 2/8/2020 at 9:29 AM, Kas Ob. said:
"One more thing about ScaleMM: a really heavy stress test with RTC SDK ( client/server) that takes +3.5 minutes to complete with ScaleMM it takes 40 seconds, that is the advantage of those small single thread MM."

Could you clarify which one takes which time? I can read that both ways 😄
Guest Posted February 10, 2020

13 minutes ago, Tommi Prami said:
"Could you clarify which takes which time. I can read that both ways"

Reading what I wrote, I found it unclear too 😞

The stock MM is the one taking more than three and a half minutes, while ScaleMM finishes my stress test in 41 seconds, to be exact. The stress test is multithreaded with 32 threads and an extreme quantity of string processing operations (assignment, comparison and concatenation). The most impressive thing about ScaleMM is its low thread contention compared to FastMM.
David Heffernan 2345 Posted February 10, 2020

Imagine how fast it would be if you wrote code that didn't stress the heap allocator. That's the real route to performance. Real world benchmarks are the only ones that matter.
Guest Posted February 10, 2020

It is fast. I always profile and benchmark, but IMHO both benchmarking and profiling results have accuracy problems when parallelism is involved: the results start to drift from expectations, especially when strings are involved, and there is little that can be done about it. That is why I found that a timed, repeatable stress test can also be used to measure how well code is optimized for multithreading.

By "little that can be done" I mean the MM's involvement in handling strings, with the repeated thread contention that comes with it.
Arnaud Bouchez 407 Posted February 17, 2020

@David Heffernan Yes, the fastest heap is the one not used. I tend to allocate a lot of temporary small buffers (e.g. for number-to-text conversion) on the stack instead of using a temporary string. See http://blog.synopse.info/post/2011/05/20/How-to-write-fast-multi-thread-Delphi-applications
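To make the stack-buffer idea concrete, here is a minimal C sketch of number-to-text conversion that never touches the heap. It is an illustration only and not code from mORMot; `u32_to_text`, `text_length_of` and `U32_TEXT_MAX` are hypothetical names. The caller supplies the output buffer, which can be a local array, so no allocator (and therefore no allocator contention between threads) is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Enough room for "4294967295" plus the terminating NUL. */
enum { U32_TEXT_MAX = 11 };

/* Writes the decimal text of 'value' into 'out' and returns its length,
   or 0 if 'out_size' is too small. No heap allocation takes place. */
size_t u32_to_text(uint32_t value, char *out, size_t out_size)
{
    assert(out != NULL);
    char tmp[U32_TEXT_MAX - 1];   /* digits, least significant first */
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    if (out_size < n + 1)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];  /* reverse into the caller's buffer */
    out[n] = '\0';
    return n;
}

/* Example caller: the scratch buffer is a local array on the stack. */
size_t text_length_of(uint32_t value)
{
    char scratch[U32_TEXT_MAX];
    return u32_to_text(value, scratch, sizeof scratch);
}
```

The trade-off is that the caller has to know an upper bound on the text length, which is easy for fixed-width integers and harder for general strings.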