RDP1974

64bit RTL patches with Intel OneApi and TBB


All those localhost tests on Windows are not very representative.
If you want something fast and scalable, use a Linux server, and don't benchmark over the loopback interface, which the OS itself largely short-circuits.

 

Changing the MM never yields 10x improvements in mORMot tests, because the framework tries to avoid heap allocations as much as possible.

@Edwin Yip @Stefan Glienke

 

If you don't make any memory allocations, you get the best performance.

Our event-driven THttpAsyncServer HTTP server minimizes memory allocations, and we reach very high numbers.

https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas


If I understand correctly, performance went from 353 to 4869 requests per second with ab.

I need to emphasize that ab is not a good benchmarking tool for high-performance servers.
Use something more scalable, like wrk.


With a mORMot 2 HTTP server on Linux, a wrk benchmark reports far more requests per second than that, even with the default FPC memory manager. And if we use the FastMM4-based mORMot MM (which is tuned for multithreading), we reach more than 100K requests per second.
On my old Core i5 7200u laptop:

abouchez@aaa:~/$ wrk -c 100 -d 15s -t 4 http://localhost:8080/plaintext
Running 15s test @ http://localhost:8080/plaintext
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41ms    3.74ms  45.25ms   93.57%
    Req/Sec    30.84k     6.58k   48.49k    65.72%
  1845696 requests in 15.09s, 288.67MB read
Requests/sec: 122341.58
Transfer/sec:     19.13MB

Server code is available at https://github.com/synopse/mORMot2/tree/master/ex/techempower-bench

 

If I run the test with ab, I get:

$ ab -c 100 -n 10000  http://localhost:8080/plaintext
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Server Software:        mORMot2
Server Hostname:        localhost
Server Port:            8080

Document Path:          /plaintext
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   0.616 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1590000 bytes
HTML transferred:       130000 bytes
Requests per second:    16245.71 [#/sec] (mean)
Time per request:       6.155 [ms] (mean)
Time per request:       0.062 [ms] (mean, across all concurrent requests)
Transfer rate:          2522.53 [Kbytes/sec] received

As you can see, ab does not scale well over multiple threads, especially because by default it does NOT keep the connections alive.

 

So if you add the -k switch, connections are kept alive, which is closer to how a real server is actually used, I guess:

$ ab -k -c 100 -n 100000  http://localhost:8080/plaintext
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Server Software:        mORMot2
Server Hostname:        localhost
Server Port:            8080

Document Path:          /plaintext
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   1.284 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      16400000 bytes
HTML transferred:       1300000 bytes
Requests per second:    77879.68 [#/sec] (mean)
Time per request:       1.284 [ms] (mean)
Time per request:       0.013 [ms] (mean, across all concurrent requests)
Transfer rate:          12472.92 [Kbytes/sec] received

Therefore, ab achieves only about two thirds of the request rate that wrk reaches here (77,880 vs 122,342 requests per second).
Just forget about ab.

 

And when I close my test server, I have the following stats:

{
	"ApiVersion": "Debian Linux 5.10.0 epoll",
	"ServerName": "mORMot2 (Linux)",
	"ProcessName": "8080",
	"SockPort": "8080",
	"ServerKeepAliveTimeOut": 300000,
	"HeadersDefaultBufferSize": 2048,
	"HeadersMaximumSize": 65535,
	"Async": 
	{
		"ThreadPoolCount": 16,
		"ConnectionHigh": 100,
		"Clients": 
		{
			"ReadCount": 1784548,
			"WriteCount": 1627649,
			"ReadBytes": 95843130,
			"WriteBytes": 267528514,
			"Total": 10313
		},
		"Server": 
		{
			"Server": "0.0.0.0",
			"Port": "8080",
			"RawSocket": 5,
			"TimeOut": 10000
		},
		"Accepted": 10313,
		"MaxConnections": 7777777,
		"MaxPending": 100000
	}
}
 
 Flags: SERVER  assumulthrd lockless erms debug repmemleak
 Small:  blocks=14K size=997KB (part of Medium arena)
 Medium: 10MB/21MB    peak=21MB current=8 alloc=17 free=9 sleep=0
 Large:  0B/0B    peak=0B current=0 alloc=0 free=0 sleep=0
 Small Blocks since beginning: 503K/46MB (as small=43/46 tiny=56/56)
  64=306K  48=104K  32=39K  96=11K  160=9K  192=5K  320=5K  448=5K
  2176=5K  256=4K  80=1K  144=1K  128=888  1056=609  112=469  1152=450
 Small Blocks current: 14K/997KB
  64=10K  48=3K  352=194  32=172  112=128  128=85  96=83  80=76
  16=38  176=23  192=16  576=14  144=12  880=10  272=10  160=9

The last block is the detailed output of our FastMM4 fork for FPC, written in x86_64 asm.
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas
The peak MM consumption was 21MB of memory. Compare that with what WebBroker consumes in a similar test.
In particular, "sleep=0" indicates that there was NO contention at all during the whole server run - achieved just by adding some enhancements to the original FastMM4 code, such as a thread-safe round-robin over the small-block arenas.
This run had memory-leak reporting enabled (none reported here) and debug/stats mode, so you could save a few percent by disabling those features.
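To make the round-robin idea concrete, here is a minimal C sketch of the pattern; the names and structure are my own hypothetical illustration, not the mormot.core.fpcx64mm implementation. Several small-block arenas each get their own lock, and a thread that finds one arena busy simply tries the next one instead of sleeping, which is how a loaded server can still report sleep=0.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sketch of a thread-safe round-robin over small-block
 * arenas: each arena has its own spin lock, and a busy arena is
 * skipped rather than slept on. */
#define ARENA_COUNT 8

static atomic_flag arena_lock[ARENA_COUNT];

/* Try the arenas after `last` in round-robin order; return the index
 * of the arena we locked, or -1 if all are busy (only then would the
 * memory manager actually have to wait). */
int pick_arena(int last) {
    for (int i = 1; i <= ARENA_COUNT; i++) {
        int n = (last + i) % ARENA_COUNT;
        if (!atomic_flag_test_and_set(&arena_lock[n]))
            return n;   /* locked: caller allocates, then clears it */
    }
    return -1;
}
```

Under low contention every thread almost always grabs a free arena on the first try, so the lock acquisition never blocks.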

 

To conclude, it seems that it is the WebBroker technology which is fundamentally broken in terms of performance, not the MM itself.


I would also look at how many MB of memory the processes consume. I suspect MSHeap consumes more than FastMM4. In that respect, Intel TBB was a nightmare in my tests: not usable on production servers.

 

(Sorry if I was a bit long, but it is a subject I like very much)

Edited by Arnaud Bouchez

May I ask: "ninja" coders say it is a mistake to wrap a memory manager over, or inside, the OS heap manager.

C, C++ and Rust use the heap calls directly: the Heap API under Windows, the glibc malloc* API under GNU/Linux.

These latest OSes are said to handle fragmentation and so on correctly by themselves, without needing an extra MM layer.

May I ask your opinion?

Thanks

 

Edited by RDP1974

22 minutes ago, RDP1974 said:

May I ask: "ninja" coders say it is a mistake to wrap a memory manager over, or inside, the OS heap manager.

C, C++ and Rust use the heap calls directly: the Heap API under Windows, the glibc malloc* API under GNU/Linux.

These latest OSes are said to handle fragmentation and so on correctly by themselves, without needing an extra MM layer.

FastMM4, MSHeap, TBB or the libc fpalloc do not encapsulate the OS heap manager: they use low-level OS calls like VirtualAlloc() or mmap() to reserve big blocks of memory (a few MB), then split them and manage the smaller blocks themselves.
My guess is that you are confusing the two levels.
 

About MSHeap, I guess it is documented in https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf

Edited by Arnaud Bouchez


"they use low-level OS calls like VirtualAlloc or mmap() to reserve big blocks of memory (a few MB), then split them and manage smaller blocks"

I know this, but look there

https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

I wonder if we could use the OS heap API directly as the default memory manager.

BTW, under Delphi for Linux, WebBroker as an Apache module, for example, performs dozens of times better than on Windows (with the default MM).
(But showing that would be a license violation: the Windows Server agreement forbids publishing performance benchmarks...)

 

BTW, just as an aside: the WebBroker classes are excellently coded, and they "pass through" natively as "httpd extensions" such as ISAPI or Apache modules. But for native sockets they use Indy, a one-thread-per-connection blocking architecture that doesn't perform well under Windows (an ancient BSD-style design). For sure, Windows Server can offer comparable performance and scalability to Linux when using kernel servers (e.g. WinHTTP) or I/O completion ports with async overlapped I/O (although that is very hard to code).

BTW, TBB is widely used in industry; of course its RAM use is high.

kind regards

Edited by RDP1974

On 9/20/2022 at 10:29 PM, RDP1974 said:

For sure, Windows Server can offer comparable performance and scalability to Linux when using kernel servers (e.g. WinHTTP) or I/O completion ports with async overlapped I/O (although that is very hard to code).

BTW, TBB is widely used in industry; of course its RAM use is high.

From my tests running REST services on the same hardware, a Linux server using epoll is always much faster than http.sys, by a huge amount.
My remark against WebBroker was not about its coding architecture, but about its actual memory pressure and performance overhead.
And I still don't understand why Apache would be used for any benchmark. 🙂
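For readers unfamiliar with the epoll pattern such event-driven servers rely on, here is a tiny Linux-only C demo; it is my own hypothetical illustration, not the mORMot source. One epoll instance watches a descriptor and reports when it becomes readable, so a small thread pool can serve thousands of sockets instead of dedicating one blocking thread per connection.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register `fd` with a fresh epoll instance and wait up to
 * `timeout_ms` for it to become readable; returns the number of
 * ready descriptors (0 on timeout). */
int count_ready(int fd, int timeout_ms) {
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
    struct epoll_event ready[64];
    int n = epoll_wait(efd, ready, 64, timeout_ms);
    close(efd);
    return n;
}
```

A real server keeps one long-lived epoll instance with thousands of registered sockets and loops on epoll_wait(), dispatching each ready descriptor to a worker thread.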

 

About Rust/malloc/HeapAlloc: this is because the MS CRT malloc() is poorly coded; at best, it just redirects to the MS heap.
It has nothing to do with our discussion.

Edited by Arnaud Bouchez


Anyway, IMHO Delphi should ship an updated, multithreading-friendly default allocator (especially for web applications).

kind regards

