RDP1974

64bit RTL patches with Intel OneApi and TBB


All those localhost tests on Windows are not very representative.
If you want something fast and scalable, use a Linux server, and don't benchmark over the loopback interface, which the OS itself largely short-circuits.

 

Changing the MM never yields 10x improvements in mORMot tests, because the framework tries to avoid heap allocations as much as possible.

@Edwin Yip @Stefan Glienke

 

If you don't make any memory allocations, you get the best performance.

Our event-driven THttpAsyncServer HTTP server minimizes memory allocations, and we reach very high numbers.

https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas


If I understand correctly, performance went from 353 to 4869 requests per second with ab.

I need to emphasize that ab is not a good benchmarking tool for high-performance servers.
Use something more scalable, like wrk.


With a mORMot 2 HTTP server on Linux, a wrk benchmark reports far more requests per second than that, even with the default FPC memory manager. And if we use the FastMM4-based mORMot MM (which is tuned for multithreading), we reach more than 100K requests per second.
On my old Core i5 7200u laptop:

abouchez@aaa:~/$ wrk -c 100 -d 15s -t 4 http://localhost:8080/plaintext
Running 15s test @ http://localhost:8080/plaintext
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41ms    3.74ms  45.25ms   93.57%
    Req/Sec    30.84k     6.58k   48.49k    65.72%
  1845696 requests in 15.09s, 288.67MB read
Requests/sec: 122341.58
Transfer/sec:     19.13MB

Server code is available at https://github.com/synopse/mORMot2/tree/master/ex/techempower-bench

 

If I run the test with ab, I get:

$ ab -c 100 -n 10000  http://localhost:8080/plaintext
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Server Software:        mORMot2
Server Hostname:        localhost
Server Port:            8080

Document Path:          /plaintext
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   0.616 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1590000 bytes
HTML transferred:       130000 bytes
Requests per second:    16245.71 [#/sec] (mean)
Time per request:       6.155 [ms] (mean)
Time per request:       0.062 [ms] (mean, across all concurrent requests)
Transfer rate:          2522.53 [Kbytes/sec] received

As you can see, ab does not scale well over multiple threads, especially because by default it does NOT keep the connections alive.

 

So if you add the -k switch, connections are kept alive, which is closer to how a real server is actually used, I guess:

$ ab -k -c 100 -n 100000  http://localhost:8080/plaintext
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Server Software:        mORMot2
Server Hostname:        localhost
Server Port:            8080

Document Path:          /plaintext
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   1.284 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      16400000 bytes
HTML transferred:       1300000 bytes
Requests per second:    77879.68 [#/sec] (mean)
Time per request:       1.284 [ms] (mean)
Time per request:       0.013 [ms] (mean, across all concurrent requests)
Transfer rate:          12472.92 [Kbytes/sec] received

Therefore, ab achieves only about two thirds of the request rate that wrk reaches here (77,880 vs 122,342 requests per second).
Just forget about ab.

 

And when I close my test server, I have the following stats:

{
	"ApiVersion": "Debian Linux 5.10.0 epoll",
	"ServerName": "mORMot2 (Linux)",
	"ProcessName": "8080",
	"SockPort": "8080",
	"ServerKeepAliveTimeOut": 300000,
	"HeadersDefaultBufferSize": 2048,
	"HeadersMaximumSize": 65535,
	"Async": 
	{
		"ThreadPoolCount": 16,
		"ConnectionHigh": 100,
		"Clients": 
		{
			"ReadCount": 1784548,
			"WriteCount": 1627649,
			"ReadBytes": 95843130,
			"WriteBytes": 267528514,
			"Total": 10313
		},
		"Server": 
		{
			"Server": "0.0.0.0",
			"Port": "8080",
			"RawSocket": 5,
			"TimeOut": 10000
		},
		"Accepted": 10313,
		"MaxConnections": 7777777,
		"MaxPending": 100000
	}
}
 
 Flags: SERVER  assumulthrd lockless erms debug repmemleak
 Small:  blocks=14K size=997KB (part of Medium arena)
 Medium: 10MB/21MB    peak=21MB current=8 alloc=17 free=9 sleep=0
 Large:  0B/0B    peak=0B current=0 alloc=0 free=0 sleep=0
 Small Blocks since beginning: 503K/46MB (as small=43/46 tiny=56/56)
  64=306K  48=104K  32=39K  96=11K  160=9K  192=5K  320=5K  448=5K
  2176=5K  256=4K  80=1K  144=1K  128=888  1056=609  112=469  1152=450
 Small Blocks current: 14K/997KB
  64=10K  48=3K  352=194  32=172  112=128  128=85  96=83  80=76
  16=38  176=23  192=16  576=14  144=12  880=10  272=10  160=9

The last block is the detailed output of our FastMM4 fork for FPC, written in x86_64 asm.
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas
The peak MM consumption was 21MB of memory. Compare that with what WebBroker consumes in a similar test.
In particular, "sleep=0" indicates that there was NO contention at all during the whole server run - achieved just by adding some enhancements to the original FastMM4 code, such as a thread-safe round-robin over the small-block arenas.
This run had memory-leak reporting enabled (none reported here) and debug/stats mode, so you could save a few percent by disabling those features.
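To make the round-robin idea concrete, here is a minimal C sketch of the pattern; the names and structure are my own hypothetical illustration, not the mormot.core.fpcx64mm implementation. Several small-block arenas each get their own lock, and a thread that finds one arena busy simply tries the next one instead of sleeping, which is how a loaded server can still report sleep=0.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sketch of a thread-safe round-robin over small-block
 * arenas: each arena has its own spin lock, and a busy arena is
 * skipped rather than slept on. */
#define ARENA_COUNT 8

static atomic_flag arena_lock[ARENA_COUNT];

/* Try the arenas after `last` in round-robin order; return the index
 * of the arena we locked, or -1 if all are busy (only then would the
 * memory manager actually have to wait). */
int pick_arena(int last) {
    for (int i = 1; i <= ARENA_COUNT; i++) {
        int n = (last + i) % ARENA_COUNT;
        if (!atomic_flag_test_and_set(&arena_lock[n]))
            return n;   /* locked: caller allocates, then clears it */
    }
    return -1;
}
```

Under low contention every thread almost always grabs a free arena on the first try, so the lock acquisition never blocks.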

 

To conclude, it seems that it is the WebBroker technology which is fundamentally broken in terms of performance, not the MM itself.


I would also look at how many MB of memory the processes consume. I suspect MSHeap consumes more than FastMM4. In that respect, Intel TBB was a nightmare in my tests: not usable on production servers.

 

(Sorry if I was a bit long, but it is a subject I like very much)

Edited by Arnaud Bouchez

May I ask: "ninja" coders say it is a mistake to wrap a memory manager over, or inside, the OS heap manager.

C, C++ and Rust use the heap calls directly: the Heap API under Windows, the glibc malloc* API under GNU/Linux.

These latest OSes are said to handle fragmentation and so on correctly by themselves, without needing an extra MM layer.

May I ask your opinion?

Thanks

 

Edited by RDP1974

22 minutes ago, RDP1974 said:

May I ask: "ninja" coders say it is a mistake to wrap a memory manager over, or inside, the OS heap manager.

C, C++ and Rust use the heap calls directly: the Heap API under Windows, the glibc malloc* API under GNU/Linux.

These latest OSes are said to handle fragmentation and so on correctly by themselves, without needing an extra MM layer.

FastMM4, MSHeap, TBB or the libc fpalloc do not encapsulate the OS heap manager: they use low-level OS calls like VirtualAlloc() or mmap() to reserve big blocks of memory (a few MB), then split them and manage the smaller blocks themselves.
My guess is that you are confusing the two levels.
 

About MSHeap, I guess it is documented in https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf

Edited by Arnaud Bouchez


"they use low-level OS calls like VirtualAlloc or mmap() to reserve big blocks of memory (a few MB), then split them and manage smaller blocks"

I know this, but look there

https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

I wonder if we could use the OS heap API directly as the default memory manager.

BTW, under Delphi for Linux, WebBroker as an Apache module, for example, performs dozens of times better than on Windows (with the default MM).
(But showing that would be a license violation: the Windows Server agreement forbids publishing performance benchmarks...)

 

BTW, just as an aside: the WebBroker classes are excellently coded, and they "pass through" natively as "httpd extensions" such as ISAPI or Apache modules. But for native sockets they use Indy, a one-thread-per-connection blocking architecture that doesn't perform well under Windows (an ancient BSD-style design). For sure, Windows Server can offer comparable performance and scalability to Linux when using kernel servers (e.g. WinHTTP) or I/O completion ports with async overlapped I/O (although that is very hard to code).

BTW, TBB is widely used in industry; of course its RAM use is high.

kind regards

Edited by RDP1974

On 9/20/2022 at 10:29 PM, RDP1974 said:

For sure, Windows Server can offer comparable performance and scalability to Linux when using kernel servers (e.g. WinHTTP) or I/O completion ports with async overlapped I/O (although that is very hard to code).

BTW, TBB is widely used in industry; of course its RAM use is high.

From my tests running REST services on the same hardware, a Linux server using epoll is always much faster than http.sys, by a huge amount.
My remark against WebBroker was not about its coding architecture, but about its actual memory pressure and performance overhead.
And I still don't understand why Apache would be used for any benchmark. 🙂
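For readers unfamiliar with the epoll pattern such event-driven servers rely on, here is a tiny Linux-only C demo; it is my own hypothetical illustration, not the mORMot source. One epoll instance watches a descriptor and reports when it becomes readable, so a small thread pool can serve thousands of sockets instead of dedicating one blocking thread per connection.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register `fd` with a fresh epoll instance and wait up to
 * `timeout_ms` for it to become readable; returns the number of
 * ready descriptors (0 on timeout). */
int count_ready(int fd, int timeout_ms) {
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
    struct epoll_event ready[64];
    int n = epoll_wait(efd, ready, 64, timeout_ms);
    close(efd);
    return n;
}
```

A real server keeps one long-lived epoll instance with thousands of registered sockets and loops on epoll_wait(), dispatching each ready descriptor to a worker thread.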

 

About Rust/malloc/HeapAlloc: this is because the MS CRT malloc() is poorly coded; at best, it just redirects to the MS heap.
It has nothing to do with our discussion.

Edited by Arnaud Bouchez


Anyway, IMHO Delphi should ship an updated, multithreading-friendly default allocator (especially for web applications).

kind regards

