Arnaud Bouchez 407 Posted September 20, 2022 (edited)

All those tests on localhost on Windows are not very representative. If you want something fast and scalable, use a Linux server, and do not test over the loopback, where the OS itself bypasses much of the network stack.

Changing the MM in mORMot tests never gives a 10x improvement, because the framework tries to avoid heap allocation as much as possible.

@Edwin Yip @Stefan Glienke If you don't make any memory allocation, then you have the best performance. Our THttpAsyncServer event-driven HTTP server tries to minimize memory allocation, and we get very high numbers. https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas

If I understand correctly, the performance went from 353 to 4869 requests per second with ab. I need to emphasize that ab is not a good benchmarking tool for high-performance numbers. You need to use something more scalable like wrk.

With a mORMot 2 HTTP server on Linux, a wrk benchmark gives requests-per-second numbers much higher than those, even with the default FPC memory manager. And if we use the FastMM4-based mORMot MM (which is tuned for multithreading) we reach 100K per second.
On my old Core i5 7200u laptop:

    abouchez@aaa:~/$ wrk -c 100 -d 15s -t 4 http://localhost:8080/plaintext
    Running 15s test @ http://localhost:8080/plaintext
      4 threads and 100 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     1.41ms    3.74ms  45.25ms   93.57%
        Req/Sec    30.84k     6.58k   48.49k    65.72%
      1845696 requests in 15.09s, 288.67MB read
    Requests/sec: 122341.58
    Transfer/sec:     19.13MB

Server code is available in https://github.com/synopse/mORMot2/tree/master/ex/techempower-bench

If I run the test with ab, I get:

    $ ab -c 100 -n 10000 http://localhost:8080/plaintext
    This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Licensed to The Apache Software Foundation, http://www.apache.org/

    Server Software:        mORMot2
    Server Hostname:        localhost
    Server Port:            8080
    Document Path:          /plaintext
    Document Length:        13 bytes
    Concurrency Level:      100
    Time taken for tests:   0.616 seconds
    Complete requests:      10000
    Failed requests:        0
    Total transferred:      1590000 bytes
    HTML transferred:       130000 bytes
    Requests per second:    16245.71 [#/sec] (mean)
    Time per request:       6.155 [ms] (mean)
    Time per request:       0.062 [ms] (mean, across all concurrent requests)
    Transfer rate:          2522.53 [Kbytes/sec] received

As you can see, ab is not very good at scaling on multiple threads, especially because by default it does NOT keep the connection alive.
So if you add the -k switch, you get keep-alive connections, which is closer to the actual use of a server, I guess:

    $ ab -k -c 100 -n 100000 http://localhost:8080/plaintext
    This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Licensed to The Apache Software Foundation, http://www.apache.org/

    Server Software:        mORMot2
    Server Hostname:        localhost
    Server Port:            8080
    Document Path:          /plaintext
    Document Length:        13 bytes
    Concurrency Level:      100
    Time taken for tests:   1.284 seconds
    Complete requests:      100000
    Failed requests:        0
    Keep-Alive requests:    100000
    Total transferred:      16400000 bytes
    HTML transferred:       1300000 bytes
    Requests per second:    77879.68 [#/sec] (mean)
    Time per request:       1.284 [ms] (mean)
    Time per request:       0.013 [ms] (mean, across all concurrent requests)
    Transfer rate:          12472.92 [Kbytes/sec] received

Therefore, even with keep-alive, ab reaches only roughly two thirds of the requests-per-second rate that wrk achieves (77.9K versus 122.3K). Just forget about ab.
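The three throughput figures quoted above can be compared with a few lines of arithmetic. This is only a sketch: the constants are copied from the wrk and ab outputs in this post, and the helper name `mean_time_per_request_ms` is made up for illustration. It also shows that ab's "Time per request (mean)" is simply the concurrency level divided by the requests-per-second rate.

```python
# Figures copied from the wrk and ab runs quoted in this post.
WRK_RPS = 122341.58           # wrk -c 100 -t 4
AB_RPS = 16245.71             # ab -c 100, no keep-alive
AB_KEEPALIVE_RPS = 77879.68   # ab -k -c 100
CONCURRENCY = 100


def mean_time_per_request_ms(concurrency, requests_per_second):
    """Mean latency seen by one client when `concurrency` clients share the rate."""
    return concurrency / requests_per_second * 1000.0


# Fraction of the wrk rate that each ab run reaches.
ab_share = AB_RPS / WRK_RPS                       # about 0.13
ab_keepalive_share = AB_KEEPALIVE_RPS / WRK_RPS   # about 0.64

print(f"ab without -k: {ab_share:.0%} of wrk")
print(f"ab with -k:    {ab_keepalive_share:.0%} of wrk")
print(f"ab mean time per request: "
      f"{mean_time_per_request_ms(CONCURRENCY, AB_RPS):.3f} ms")
```

The gap between the two ab runs is the cost of opening a new TCP connection for every request, which is why the keep-alive setting should always be stated alongside a benchmark number.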
And when I close my test server, I have the following stats:

    {
      "ApiVersion": "Debian Linux 5.10.0 epoll",
      "ServerName": "mORMot2 (Linux)",
      "ProcessName": "8080",
      "SockPort": "8080",
      "ServerKeepAliveTimeOut": 300000,
      "HeadersDefaultBufferSize": 2048,
      "HeadersMaximumSize": 65535,
      "Async": {
        "ThreadPoolCount": 16,
        "ConnectionHigh": 100,
        "Clients": {
          "ReadCount": 1784548,
          "WriteCount": 1627649,
          "ReadBytes": 95843130,
          "WriteBytes": 267528514,
          "Total": 10313
        },
        "Server": {
          "Server": "0.0.0.0",
          "Port": "8080",
          "RawSocket": 5,
          "TimeOut": 10000
        },
        "Accepted": 10313,
        "MaxConnections": 7777777,
        "MaxPending": 100000
      }
    }

    Flags: SERVER assumulthrd lockless erms debug repmemleak
    Small: blocks=14K size=997KB (part of Medium arena)
    Medium: 10MB/21MB peak=21MB current=8 alloc=17 free=9 sleep=0
    Large: 0B/0B peak=0B current=0 alloc=0 free=0 sleep=0
    Small Blocks since beginning: 503K/46MB (as small=43/46 tiny=56/56)
      64=306K 48=104K 32=39K 96=11K 160=9K 192=5K 320=5K 448=5K 2176=5K 256=4K 80=1K 144=1K 128=888 1056=609 112=469 1152=450
    Small Blocks current: 14K/997KB
      64=10K 48=3K 352=194 32=172 112=128 128=85 96=83 80=76 16=38 176=23 192=16 576=14 144=12 880=10 272=10 160=9

The last block is the detailed statistics reported by our FastMM4 fork for FPC, written in x86_64 asm. https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas

The peak MM consumption was 21MB of memory. Compare it with what WebBroker consumes in a similar test.

In particular, "sleep=0" indicates that there was NO contention at all during the whole server process. That comes just from adding some enhancements to the original FastMM4 code, like a thread-safe round-robin of small-block arenas. This run had memory-leak reporting (none reported here) and debug/stats mode enabled, so you could save a few % by disabling those features.

To conclude, it seems that it is the WebBroker technology which is fundamentally broken in terms of performance, not the MM itself.

I would also consider how many MB of memory the processes are consuming.
I suspect the MSHeap consumes more than FastMM4. In that respect, from my tests, Intel TBB was a nightmare, not usable on production servers.

(Sorry if I was a bit long, but it is a subject I like very much)

Edited September 20, 2022 by Arnaud Bouchez
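The "thread-safe round-robin of small-block arenas" and the "sleep=0" counter mentioned in the post above can be pictured with a toy model. This is a hedged sketch, not the mORMot code (which is x86_64 asm): the names `Arena`, `RoundRobinArenas` and `sleep_count` are invented here, and the model only counts allocations instead of handing out memory. Each caller takes a ticket, starts at arena `ticket mod N`, and moves on to the next arena when the lock is busy; only when every arena is busy does it block, and that is the event a "sleep" counter would record.

```python
import itertools
import threading


class Arena:
    """One independent pool of small blocks, guarded by its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.allocations = 0


class RoundRobinArenas:
    """Toy model: spread allocations over several arenas to avoid contention."""

    def __init__(self, arena_count=4):
        self.arenas = [Arena() for _ in range(arena_count)]
        # next() on itertools.count is effectively atomic in CPython;
        # a real allocator would use an atomic increment instruction.
        self._ticket = itertools.count()
        self._stats_lock = threading.Lock()
        self.sleep_count = 0  # number of times a caller had to block

    def allocate(self):
        """Return the index of the arena that served this allocation."""
        count = len(self.arenas)
        start = next(self._ticket) % count
        # Try each arena once, starting from our round-robin slot.
        for offset in range(count):
            index = (start + offset) % count
            arena = self.arenas[index]
            if arena.lock.acquire(blocking=False):
                try:
                    arena.allocations += 1
                    return index
                finally:
                    arena.lock.release()
        # Every arena was busy: record the contention, then wait.
        with self._stats_lock:
            self.sleep_count += 1
        with self.arenas[start].lock:
            self.arenas[start].allocations += 1
        return start


pool = RoundRobinArenas(arena_count=4)
served_by = [pool.allocate() for _ in range(8)]
print(served_by, "sleep =", pool.sleep_count)
```

With more arenas than simultaneously allocating threads, the blocking path is rarely reached, which is one plausible reading of why the stats above report no sleeps.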
RDP1974 40 Posted September 20, 2022 (edited)

Can I ask: some "ninja" coders say it is a mistake to layer a memory manager over/inside the OS heap manager. C, C++ and Rust call the heap directly: the Heap API under Windows and the glibc malloc* API under GNU. The latest OSes handle fragmentation etc. correctly, without the need for an MM layer.

Can I ask your opinion? Thanks

Edited September 20, 2022 by RDP1974
Arnaud Bouchez 407 Posted September 20, 2022 (edited)

22 minutes ago, RDP1974 said:

    Can I ask: some "ninja" coders say it is a mistake to layer a memory manager over/inside the OS heap manager. C, C++ and Rust call the heap directly: the Heap API under Windows and the glibc malloc* API under GNU. The latest OSes handle fragmentation etc. correctly, without the need for an MM layer.

FastMM4, MSHeap, TBB or the libc fpalloc do not encapsulate the OS heap manager: they use low-level OS calls like VirtualAlloc or mmap() to reserve big blocks of memory (a few MB), then split them and manage the smaller blocks themselves. My guess is that you are confusing the two.

About MSHeap, I guess it is documented in https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf

Edited September 20, 2022 by Arnaud Bouchez
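The "reserve a big block from the OS, then split it" pattern described in the post above can be sketched in a few lines. This is an illustration only, under stated assumptions: `BumpArena` is a made-up name, it uses the simplest possible splitting strategy (a bump pointer that never frees), and real managers such as FastMM4 keep free lists per block size instead. Python's `mmap.mmap(-1, size)` asks the OS for an anonymous mapping, which is the same family of call as mmap() on Linux or VirtualAlloc on Windows.

```python
import mmap


class BumpArena:
    """Reserve one large anonymous mapping, then carve small blocks out of it."""

    def __init__(self, size=4 * 1024 * 1024):
        # One OS call reserves the whole region (a few MB, as in the post).
        self._map = mmap.mmap(-1, size)
        self._view = memoryview(self._map)
        self._size = size
        self.used = 0  # offset of the first free byte

    def allocate(self, nbytes, align=16):
        """Return a writable view of `nbytes` bytes, aligned to `align`."""
        start = (self.used + align - 1) & ~(align - 1)
        end = start + nbytes
        if end > self._size:
            raise MemoryError("arena exhausted")
        self.used = end
        return self._view[start:end]


arena = BumpArena()
first = arena.allocate(24)    # served from offset 0
second = arena.allocate(100)  # served from offset 32 (next 16-byte boundary)
first[:5] = b"hello"
print(bytes(first[:5]), "bytes used:", arena.used)
```

Both small allocations are served without any further OS call, which is the point being made: the manager sits next to the OS heap, not on top of it.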
RDP1974 40 Posted September 20, 2022 (edited)

"they use low-level OS calls like VirtualAlloc or mmap() to reserve big blocks of memory (a few MB), then split them and manage smaller blocks"

I know this, but look here: https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

I wonder whether we could use the OS heap API directly as the default memory manager.

BTW, under Delphi for Linux, WebBroker as an Apache module, for example, has dozens of times the performance of Windows (with the default MM) (but publishing that would be a license violation: the Windows Server agreement forbids using/showing performance benchmarks...).

BTW, just for the sake of discussion, the WebBroker classes are excellently coded: they "pass" natively through "httpd extensions" such as ISAPI or Apache modules, but with native sockets they use Indy, a one-thread-per-connection blocking architecture that doesn't perform well under Windows (ancient BSD-style architecture).

For sure Windows Server has performance and scalability comparable to Linux when using kernel servers (e.g. winhttp) or I/O completion ports with async overlapped I/O (very hard to code, though).

BTW, TBB is widely used in industry; of course its RAM use is high.

Kind regards

Edited September 21, 2022 by RDP1974
Arnaud Bouchez 407 Posted September 27, 2022 (edited)

On 9/20/2022 at 10:29 PM, RDP1974 said:

    For sure Windows Server has performance and scalability comparable to Linux when using kernel servers (e.g. winhttp) or I/O completion ports with async overlapped I/O (very hard to code, though). BTW, TBB is widely used in industry; of course its RAM use is high.

From my tests running REST services on the same hardware, a Linux server using epoll is always much faster than http.sys. By a huge amount.

My remark against WebBroker was not about its coding architecture; it was about its actual memory pressure and performance overhead. And I don't understand why Apache may still be used for any benchmark. 🙂

About Rust/malloc/heap: this is because the MS CRT malloc() is poorly coded. At best, it redirects to the MS heap. It has nothing to do with our discussion.

Edited September 27, 2022 by Arnaud Bouchez
RDP1974 40 Posted September 27, 2022

Anyway (IMHO), Delphi should have an updated default allocator that is multithreading-friendly (especially for web applications).

Kind regards
eivindbakkestuen 47 Posted September 28, 2022

On 9/7/2022 at 2:24 AM, RDP1974 said:

    Old projects in Pascal code such as NexusMM or scalemm2: I have not benchmarked them; those projects seem abandoned.

Nexus MM is alive and well, as it is part of the NexusDB database library, which gets regular updates. Not sure where the abandoned impression comes from.

Eivind
RDP1974 40 Posted September 28, 2022

Sorry! I'm an old customer of Nexus, and let me tell you that your code is a masterpiece!

Kind regards
RDP1974 40 Posted October 22, 2022

On 9/20/2022 at 10:29 PM, RDP1974 said:

    BTW, under Delphi for Linux, WebBroker as an Apache module, for example, has dozens of times the performance of Windows (with the default MM)

I tested a WebBroker console app on Windows 2022 and Ubuntu 18: they have similar performance (around 4k reqs/sec).

Kind regards
Fons N 17 Posted October 22, 2022

On 9/28/2022 at 10:01 AM, eivindbakkestuen said:

    Not sure where the abandoned impression comes from

Go to Nexus Memory Manager | NexusDB and click on Version History. This is what you see. Not too difficult to understand why someone gets the impression that it is abandoned.
eivindbakkestuen 47 Posted October 22, 2022

8 hours ago, Fons N said:

    Go to Nexus Memory Manager | NexusDB and click on Version History. This is what you see. Not too difficult to understand why someone gets the impression that it is abandoned.

Thanks, that is genuinely useful feedback. In this case, it's the history page that hasn't been updated; the product still gets regular updates since it is the backbone of our NexusDB Database products. I've removed that page pending a proper page update.

For anyone looking, the change log in our tracker always has up-to-date information. https://www.nexusdb.com/mantis/changelog_page.php
RDP1974 40 Posted October 24, 2022

If you wish, provide me with a copy of this MM and I will benchmark it. Agreed?
eivindbakkestuen 47 Posted October 28, 2022

@RDP1974 Please see private message for exchange of details
RDP1974 40 Posted November 3, 2022

Hi Eivind, I am posting the results of my test here (Delphi 11.2.1, i9 Win11 22H2 host, Windows Server 2022 Hyper-V).

In single thread we obtain similar performance among the default MM, MSHeap and Nexus (attached image: MSHEAP, NEXUS, DEFAULT MM).

In a multithreaded web server it's excellent:

MSHEAP
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    2933.46 [#/sec] (mean)
    Transfer rate:          716.18 [Kbytes/sec] received

NEXUS
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    2943.01 [#/sec] (mean)
    Transfer rate:          718.51 [Kbytes/sec] received

DEFAULT
    Total transferred:      250000 bytes
    HTML transferred:       114000 bytes
    Requests per second:    291.80 [#/sec] (mean)
    Transfer rate:          71.24 [Kbytes/sec] received
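For readers who want the multithreaded result as a single number, the speedup over the default MM follows directly from the requests-per-second figures posted above. The snippet is only arithmetic on those three values; the dictionary and variable names are illustrative.

```python
# Requests per second from the multithreaded web server runs posted above.
requests_per_second = {
    "MSHEAP": 2933.46,
    "NEXUS": 2943.01,
    "DEFAULT": 291.80,
}

baseline = requests_per_second["DEFAULT"]

# Speedup of each memory manager relative to the default Delphi MM.
speedup = {name: rps / baseline for name, rps in requests_per_second.items()}

for name, factor in speedup.items():
    print(f"{name:8s} {factor:5.2f}x")
```

Both replacement managers come out at roughly ten times the default MM in this particular test, and within about 0.3% of each other.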
Stefan Glienke 2002 Posted November 3, 2022

That Poker Benchmark is completely pointless as it has almost zero memory allocations - the majority of CPU time is spent sorting cards and stuff.
RDP1974 40 Posted November 3, 2022

Where can I find a general-purpose "benchmark" or stress test?
RDP1974 40 Posted November 4, 2022

On 11/3/2022 at 12:06 PM, RDP1974 said:

    Hi Eivind, I am posting the results of my test here (Delphi 11.2.1, i9 Win11 22H2 host, Windows Server 2022 Hyper-V). In single thread we obtain similar performance among the default MM, MSHeap and Nexus (attached image: MSHEAP, NEXUS, DEFAULT MM). In a multithreaded web server it's excellent:

    MSHEAP
        Total transferred:      250000 bytes
        HTML transferred:       114000 bytes
        Requests per second:    2933.46 [#/sec] (mean)
        Transfer rate:          716.18 [Kbytes/sec] received

    NEXUS
        Total transferred:      250000 bytes
        HTML transferred:       114000 bytes
        Requests per second:    2943.01 [#/sec] (mean)
        Transfer rate:          718.51 [Kbytes/sec] received

    DEFAULT
        Total transferred:      250000 bytes
        HTML transferred:       114000 bytes
        Requests per second:    291.80 [#/sec] (mean)
        Transfer rate:          71.24 [Kbytes/sec] received

It seems odd to me too that MSHeap and the default MM produce an identical score, though you can see that the numbers inside are different. Really, I didn't touch the images, I only took print-screens. Probably the source is using GetTickCount, which is not very accurate, producing identical results. For sure, with a high-resolution API such as QueryPerformance*, the results would be more accurate. Anyway, this is not a good tool for a memory stress test.
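The guess above, that a coarse tick counter can make two different runs report the same score, is easy to illustrate. This is a sketch under loud assumptions: the 15.625 ms tick is the typical default Windows timer granularity rather than a value taken from the benchmark, the two run durations are invented, and nothing here confirms that the Poker Benchmark actually uses GetTickCount. The second part only prints what the local Python runtime reports for a coarse and a high-resolution clock.

```python
import time

TICK_MS = 15.625  # assumed tick granularity (typical Windows default, 64 Hz)


def coarse_elapsed_ms(true_elapsed_ms, tick_ms=TICK_MS):
    """Elapsed time as seen by a counter that only advances once per tick.

    Assumes the measurement starts exactly on a tick boundary.
    """
    return (true_elapsed_ms // tick_ms) * tick_ms


# Two hypothetical runs that really differ by 8.7 ms.
run_a_ms = 3410.2
run_b_ms = 3418.9

print(coarse_elapsed_ms(run_a_ms), coarse_elapsed_ms(run_b_ms))

# What this machine's clocks claim about themselves.
for clock_name in ("monotonic", "perf_counter"):
    info = time.get_clock_info(clock_name)
    print(clock_name, info.implementation, info.resolution)
```

Both hypothetical runs land in the same tick and therefore read identically, whereas a counter in the QueryPerformanceCounter class would separate them.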