RDP1974 40 Posted August 26, 2020

Just sharing some findings. I'm benchmarking a single-threaded app (a poker app) and a multithreaded app (WebBroker HTTP) with different memory managers. I tested FastMM5 against TBB+IPP: https://github.com/RDP1974

FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FM4 problems), but TBB is 5x faster than FM5 under the TParallel class.

TBB is as fast as FM4/FM5 in single-threaded code with // RedirectCode(@System.Move, @Move2); enabled in RDPSimd64, because small moves are faster that way than under TBB's SIMD path (branch-condition penalty).

So: do we wait for FM5 to fix the TParallel contention? Or for Delphi AVX support in the Synopse MM?
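For context, runtime redirection like RDPSimd64's RedirectCode is usually done by patching the first instructions of the target routine with a jump to the replacement. Here is a minimal Win64 sketch of that technique; it is an illustration only (the record layout and the field-by-field stores are my assumptions, not RDPSimd64's actual source):

uses Winapi.Windows;

type
  PJump = ^TJump;
  TJump = packed record
    MovRax: Word;   // $B848 = "mov rax, imm64" (little-endian bytes 48 B8)
    Addr: UInt64;   // absolute address of the replacement routine
    JmpRax: Word;   // $E0FF = "jmp rax" (little-endian bytes FF E0)
  end;

procedure RedirectCode(OldProc, NewProc: Pointer);
var
  OldProtect: DWORD;
begin
  VirtualProtect(OldProc, SizeOf(TJump), PAGE_EXECUTE_READWRITE, OldProtect);
  // Field-by-field stores on purpose: calling Move here would be fatal
  // when the routine being patched is System.Move itself.
  PJump(OldProc).MovRax := $B848;
  PJump(OldProc).Addr := UInt64(NewProc);
  PJump(OldProc).JmpRax := $E0FF;
  VirtualProtect(OldProc, SizeOf(TJump), OldProtect, OldProtect);
  FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(TJump));
end;

After this runs, every call to OldProc lands in NewProc, which is how a patched Move2 can take over System.Move for small copies.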
Stefan Glienke 2002 Posted August 26, 2020

"X is n times faster than Y" leads nowhere unless you profile the bottleneck of Y in your benchmark. Run it under VTune or uProf and report your findings.

As for your MM: it might be faster, but nobody serious will download some random DLLs from the internet (that are not even signed) and put them into production.
RDP1974 40 Posted August 26, 2020

The "5x quicker TParallel.For" figure comes from the FastMM5 benchmark utility results.

The DLLs? They are built from the Intel libraries; I only wrote the wrappers. They are plain, optimized-config DLL builds of the royalty-free Intel TBB and IPP packages, with no custom source changes. You can compile them yourself; I put them in the repository because many people cannot build them, or don't have the time.

https://github.com/oneapi-src/oneTBB/releases
https://github.com/oneapi-src/oneTBB/archive/v2020.3.zip -> see the TBBMalloc folder

For the RTL SIMD patches: https://software.seek.intel.com/performance-libraries -> IPP. Run the utility to build a custom DLL and export:

'ippsZero_8u';
'ippsCopy_8u';
'ippsMove_8u';
'ippsSet_8u';
'ippsFind_8u';
'ippsCompare_8u';
'ippsUppercaseLatin_8u_I';
'ippsReplaceC_8u';

For the zlib acceleration (3x-5x quicker than Windows gzip; a WebBroker helper is provided): extract IPP under Linux, see the readme for how to patch zlib, take the patched sources and compile them with MS VC++.

kind regards
R.

btw. I have been using them with my customers for a couple of years and never had any trouble.
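To make the wrapper idea concrete, here is a minimal sketch of Delphi bindings for a few of the exports listed above. The DLL name and the stdcall convention are my assumptions (check the header the IPP custom-DLL utility generates), and Move2 is the kind of RTL replacement RDPSimd64 wires in, not its actual source:

type
  IppStatus = Integer;  // 0 = ippStsNoErr

const
  IppDll = 'ipp_custom.dll';  // hypothetical name; use your generated DLL

function ippsZero_8u(pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;
function ippsCopy_8u(pSrc, pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;
function ippsMove_8u(pSrc, pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;

// A System.Move replacement can delegate to ippsMove_8u, which handles
// overlapping ranges the same way Move does:
procedure Move2(const Source; var Dest; Count: NativeInt);
begin
  if Count > 0 then
    ippsMove_8u(PByte(@Source), PByte(@Dest), Integer(Count));
end;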
Stefan Glienke 2002 Posted August 27, 2020

Thanks, it might be worth putting that info into the GitHub readme, because right now it looks like the Sea*.dll files are yours, yet neither their code nor their source can be found in the repo. Also, apart from the raw speed numbers, do you have total/peak allocated memory comparisons for the tests you mention?

Edit: what is the "FastMM5 benchmark utility" you are referring to? I see no benchmark in the FastMM5 repo.
RDP1974 40 Posted August 27, 2020

Thanks, I will send the info tomorrow. I'm waiting for Embarcadero to update the linker to accommodate the C $tls API; then we can use static libs without a DLL dependency.
Arnaud Bouchez 407 Posted August 27, 2020

11 hours ago, RDP1974 said:
"FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FM4 problems), but TBB is 5x faster than FM5 under the TParallel class. TBB is as fast as FM4/FM5 in single-threaded code."

TBB is fast in benchmarks, but in our experience it is not usable in production on a server. TBB consumes A LOT of memory, much more than FM4/FM5 and the alternatives. The numbers for a real multi-threaded Linux server are a show-stopper for TBB: in production on a big multi-Xeon server, RAM consumption after a few hours of stabilisation is glibc = 2.6 GB vs TBB = 170 GB - 60 times more memory! And with almost no actual performance boost. This mORMot service handles terabytes of incoming data, sent in blocks every second, over thousands of simultaneous HTTPS connections. See https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas#L55

So never trust any benchmark. Try with your real workload.

Quote:
"or Delphi AVX support for Synopse MM?"

What we found out with https://github.com/synopse/mORMot/blob/master/SynFPCx64MM.pas may be interesting for the discussion. In practice, using AVX for medium-block moves/reallocs changes nothing compared to an inlined SSE2 move (tiny/small/medium blocks) or a non-temporal move (the movntdq opcode instead of plain mov, for large blocks). For large blocks, in-place reallocation via mremap/VirtualAlloc is the better approach: relying on the OS and performing no move at all is faster than AVX/AVX2/AVX512.

SynFPCx64MM is currently FPC-only. It is used in production on heavily loaded servers. It is based on the FastMM4 design, fully optimized in x86_64 asm, but with a lock-less round-robin algorithm for tiny blocks (<= 256 bytes) and an optional lock-less list for FreeMem - which are the bottlenecks on most real servers. It has several spinning alternatives in case of contention. And it is really Open Source - unlike FastMM5.

We may publish a Delphi-compatible version in the next weeks.
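A minimal sketch of the "no move" large-block grow on Windows (mremap plays the same role on Linux). This illustrates the technique only and is not SynFPCx64MM's actual code; it assumes the allocator reserved MaxSize of address space up front and committed only what it needed:

uses Winapi.Windows;

// Reserve the full range, commit only the first part.
function LargeAlloc(CommitSize, MaxSize: NativeUInt): Pointer;
begin
  Result := VirtualAlloc(nil, MaxSize, MEM_RESERVE, PAGE_READWRITE);
  if Result <> nil then
    Result := VirtualAlloc(Result, CommitSize, MEM_COMMIT, PAGE_READWRITE);
end;

// Grow by committing the already-reserved pages just past the current
// end: when this succeeds, the realloc moved no data at all.
function TryGrowInPlace(P: Pointer; OldSize, NewSize: NativeUInt): Boolean;
begin
  Result := VirtualAlloc(Pointer(NativeUInt(P) + OldSize),
    NewSize - OldSize, MEM_COMMIT, PAGE_READWRITE) <> nil;
end;

Only when TryGrowInPlace returns False does the allocator fall back to allocate-and-copy, which is where the SSE2/non-temporal move paths come in.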
RDP1974 40 Posted August 27, 2020

45 minutes ago, Arnaud Bouchez said:
"We may publish a Delphi-compatible version in the next weeks."

Thanks
RDP1974 40 Posted August 27, 2020

But according to Intel's own sources: similar or slightly bigger peak consumption and up to a 17x speedup; see test 3: https://software.intel.com/content/www/us/en/develop/articles/controlling-memory-consumption-with-intel-threading-building-blocks-intel-tbb-scalable.html
RDP1974 40 Posted August 27, 2020

This is also very interesting: https://github.com/daanx/mimalloc-bench

Anyway, I agree that it's better to have a native Pascal MM than a C-based one. https://github.com/d-mozulyov/BrainMM was cool, but under Win64 it crashes somewhere.

Hi Arnaud, may I ask: in your Synopse software, do you use the Windows I/O completion ports API with a thread pool and overlapped WSA* read/write calls?

TBBmalloc has sophisticated caching; it can be (mostly, with some exceptions) cleaned via scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0). Strictly speaking, per-thread caches of free objects are cleaned on thread termination. (A sketch of a Delphi binding for this call follows after this post.)

btw. I'm testing other languages and frameworks, but for Windows apps Delphi is still fantastic - the best way!

I did a real benchmark with heavy tasks (in practice, preallocating the heap); please check the attached Pascal benchmark source. Results on an i9, Win10, Delphi 10.3.3:

8 threads:
Win32 default = 15.5M
Win64 default = 15.8M
Win64 + TBB+IPP (RDP64) = 19.3M

1 thread:
Win32 default = 4.2M
Win64 default = 4.3M
Win64 + TBB+IPP (RDP64) = 5.2M

So, in this test, TBB+IPP is on average 25% faster than the default MM in a single-threaded app and 35% faster multithreaded.

It would be nice to see FPC64 and Linux results. Anyway, this is only for amusement.

kind regards
R.

BenchPoker.rar

P.S. TParallel.For seems to have race problems; I'll test using plain TThreads - results using threads should be higher.
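As promised above, a hedged sketch of a Delphi binding for TBBmalloc's cache cleaning. The constant mirrors the ScalableAllocationCmd enum in tbb/scalable_allocator.h, and the cdecl convention and DLL name are assumptions; verify both against the TBB version you ship:

const
  TbbMallocDll = 'tbbmalloc.dll';
  TBBMALLOC_CLEAN_ALL_BUFFERS = 1;  // check against your TBB headers

function scalable_allocation_command(cmd: Integer; param: Pointer): Integer;
  cdecl; external TbbMallocDll;

// Ask TBBmalloc to return cached free blocks to the OS, e.g. after a
// large batch job has finished:
procedure TrimTbbCaches;
begin
  scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, nil);
end;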
Fr0sT.Brutal 900 Posted August 27, 2020

Dude, you should probably start a blog instead.
RDP1974 40 Posted August 27, 2020

OK, I don't know how to use TParallel (SetMinWorkerThreads) with 16 threads (see the sketch after this post for one way to do it). Using BeginThread on the i9, the results double (I suppose there is a limit of 8 threads in TParallel's default thread pool):

NumThreads := StrToInt(Th.Text);
SetLength(ArrThreads, NumThreads);
SW := TStopwatch.Create;
SW.Start;
for i := 0 to NumThreads - 1 do
  ArrThreads[i] := BeginThread(nil, 0, Addr(ThExecute), nil, 0, dm);
WaitForMultipleObjects(NumThreads, @ArrThreads[0], True, INFINITE);
SW.Stop;
lbxResults.Items.Add('All done in ' + SW.ElapsedMilliseconds.ToString + ' msec');
lbxResults.Items.Add('Tot hands ' + TotHands.ToString);
lbxResults.Items.Add('Hands for second ' + (TotHands / SW.ElapsedMilliseconds * 1000).ToString);
for i := 0 to NumThreads - 1 do
  CloseHandle(ArrThreads[i]);

But you are right, this is a useless benchmark; a real app scenario would be a database, a web app, etc., where threadvar (TLS) comes back into play. I'll stop bothering the forum. cu
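One way to get TParallel.For onto 16 workers, for what it's worth: give it a dedicated TThreadPool instead of the default one. A hedged sketch (ThExecute stands in for the benchmark body from the attached source; pool sizing rules vary between Delphi versions, so check the Boolean results of the Set*WorkerThreads calls):

uses System.Threading;

procedure RunBench16(NumThreads: Integer);
var
  Pool: TThreadPool;
begin
  Pool := TThreadPool.Create;
  try
    // Raise both bounds; each returns False if the value is rejected.
    Pool.SetMinWorkerThreads(NumThreads);
    Pool.SetMaxWorkerThreads(NumThreads);
    TParallel.For(0, NumThreads - 1,
      procedure(i: Integer)
      begin
        ThExecute(nil);  // hypothetical: one benchmark task per index
      end,
      Pool);
  finally
    Pool.Free;
  end;
end;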
Stefan Glienke 2002 Posted August 27, 2020

I know English is not your native language, but to be honest, just throwing around numbers and all kinds of different benchmarks is totally confusing for me. It would be very helpful if you could present your findings in a more structured way, as I really find the topic interesting, but it is very exhausting to follow you.
RDP1974 40 Posted August 27, 2020

Sorry, it's because I'm in a hurry. I'll try to improve my syntax 🙂
Lars Fosdal 1792 Posted August 27, 2020

A man in a hurry is always late.
Arnaud Bouchez 407 Posted August 27, 2020

11 hours ago, RDP1974 said:
"can I ask, in your Synopse software do you use Windows API IoCompletionPorts with thread pool and WSA* Overlapped I/O read write calls?"

On Windows, we use http.sys kernel mode, which scales better than anything else on this platform. It is faster than IOCP since it runs in the kernel.

On Linux, we use our own thread pool of socket servers, with an nginx frontend as reverse proxy on the Unix socket loopback, handling HTTPS and HTTP/2. This is very safe and scalable.

And don't trust micro-benchmarks. Even worse, don't write your own benchmark. It won't be as good as measuring a real application. As I wrote, Intel TBB is a no-go for real server work due to its huge memory consumption. If you have to run specific API calls to release the memory, that is a big design flaw - it may be considered a bug (we don't want the application to stall as it would with a GC) - and we would never do it.

To be more precise, we use long-living threads from thread pools. So in practice the threads are never released, and memory allocation and release happen in different threads: one thread pool handles the socket communication, then another thread pool consumes the data and releases the memory. This is a scenario typical of most event-driven servers running on multi-core CPUs, with a proven ring-oriented architecture. Perhaps Intel TBB is not very good at releasing memory with such a pattern - whereas our SynFPCx64MM is very efficient in this case. And we almost never realloc - we just alloc/free, using the stack as a working buffer if necessary. (A small sketch of this cross-thread alloc/free pattern follows after this post.)
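A minimal sketch of that cross-thread pattern (my illustration, not mORMot code): buffers are allocated by one pool's threads and released by another's, which is exactly the FreeMem path a server MM has to make cheap:

uses System.SysUtils, System.Classes, System.Generics.Collections;

var
  Queue: TThreadedQueue<TBytes>;

// Socket pool side: allocate an incoming block and hand it over.
procedure ProducerStep;
var
  Buf: TBytes;
begin
  SetLength(Buf, 64 * 1024);  // the allocation happens in this thread
  Queue.PushItem(Buf);
end;

// Worker pool side: process the block, then drop the last reference.
procedure ConsumerStep;
var
  Buf: TBytes;
begin
  Buf := Queue.PopItem;       // blocks until a buffer arrives
  // ... consume Buf ...
  Buf := nil;                 // the free happens in this thread
end;

begin
  Queue := TThreadedQueue<TBytes>.Create(1024, INFINITE, INFINITE);
  TThread.CreateAnonymousThread(ProducerStep).Start;
  TThread.CreateAnonymousThread(ConsumerStep).Start;
  ReadLn;  // keep the demo alive
end.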
RDP1974 40 Posted August 28, 2020

Hi, look here: https://blog.digitaltundra.com/?p=902 - another MM in Pascal code, and free. In my test (i9, 16 threads) it is the fastest among all the MMs tested (Tundra vs. default). It uses a threadvar (TLS) cache for each thread (a tiny sketch of the idea follows after this post).
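A tiny sketch of the threadvar per-thread cache idea (an illustration, not the Tundra MM's actual code): each thread keeps its own free list for one block size, so the hot path needs no lock at all. Real MMs keep one list per size class and must also handle blocks freed by a thread other than the allocator, which is the hard part this sketch skips:

type
  PFreeBlock = ^TFreeBlock;
  TFreeBlock = record
    Next: PFreeBlock;
  end;

threadvar
  SmallFreeList: PFreeBlock;        // TLS: one list per thread

function AllocSmall(Size: NativeUInt): Pointer;
begin
  if SmallFreeList <> nil then
  begin
    Result := SmallFreeList;        // lock-free: only this thread touches it
    SmallFreeList := SmallFreeList.Next;
  end
  else
    GetMem(Result, Size);           // cache miss: fall back to the global MM
end;

procedure FreeSmall(P: Pointer);
begin
  PFreeBlock(P).Next := SmallFreeList;   // push back onto this thread's list
  SmallFreeList := PFreeBlock(P);
end;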
RDP1974 40 Posted September 2, 2020

On 8/28/2020 at 11:07 AM, RDP1974 said:
"Hi, look here: https://blog.digitaltundra.com/?p=902 - another MM in Pascal code, and free. In my test (i9, 16 threads) it is the fastest among all the MMs tested."

OK, I did a test with FastMM5: with 16 threads the results are identical to BigBrainMM, and single-threaded FastMM5 is a little better (2501 vs 2727), about 8% quicker.
Anders Melander 1782 Posted September 2, 2020

36 minutes ago, Lars Fosdal said:
[image]

Why is that man trying to swallow an invisible shoe?
Guest Posted September 22, 2020

@Lars Fosdal, it would be interesting to know the ratio of your forum rep regarding jokes vs. actually helping out...
Lars Fosdal 1792 Posted September 22, 2020

Clearly, it is based on jokes alone.
Sherlock 663 Posted September 22, 2020

34 minutes ago, Lars Fosdal said:
"Clearly, it is based on jokes alone."

Same here. Plus the tons of bots that just randomly like our posts.
Attila Kovacs 629 Posted September 22, 2020

@Dany Marmur Just look it up:
[screenshot]