RDP1974

A couple of MM tests


Just sharing, for discussion only.

 

I'm benchmarking a single-threaded app (a poker app) and a multithreaded app (a WebBroker HTTP server) with different memory managers.

 

I ran a test comparing FastMM5 against TBB+IPP (https://github.com/RDP1974).

 

FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FastMM4 problems), but

TBB is 5x faster than FastMM5 under the TParallel class.

TBB is as fast as FastMM4/FastMM5 in a single thread,

with RedirectCode(@System.Move, @Move2); in RDPSimd64, because small moves are faster than under the TBB SIMD path (condition penalty).
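The "condition penalty" point can be illustrated with a hypothetical dispatching move routine (a sketch in C, not the actual RDPSimd64 code; the 16-byte threshold is an illustrative assumption): for tiny copies, the branch and setup cost of a SIMD path can exceed the copy itself, so small sizes take a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: dispatch small moves to a simple loop and
   delegate larger ones to the (possibly SIMD-accelerated) memmove. */
static void *small_aware_move(void *dst, const void *src, size_t n)
{
    if (n < 16) {
        /* Byte-by-byte copy: no alignment checks, no SIMD setup cost. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (d < s) {
            while (n--) *d++ = *s++;
        } else {                    /* copy backwards to handle overlap */
            d += n; s += n;
            while (n--) *--d = *--s;
        }
        return dst;
    }
    return memmove(dst, src, n);    /* large blocks: library/SIMD path */
}
```

A patched System.Move would play the same role in Delphi: branch once on size, then fall through to the cheapest path.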

 

So:

should we wait for FastMM5 to fix the TParallel contention?

Or for Delphi AVX support in the Synopse MM?

Edited by RDP1974


"X is n times faster than Y" leads nowhere unless you profile the bottleneck of Y in your benchmark. Run it under VTune or uProf and report your findings.

 

As for your MM: it might be faster, but nobody serious will download random DLLs from the internet (which are not even signed) and put them into production.

Edited by Stefan Glienke

The "5x quicker TParallel.For" figure comes from the FastMM5 benchmark utility results.

 

The DLLs? They are from the Intel libraries; I only wrote the wrappers.

They are plain, optimally configured DLLs compiled from the royalty-free Intel TBB and IPP packages; no custom source-code changes were made.

You can compile them yourself; I put them in the repository because many people cannot build them or don't have the time.

 

https://github.com/oneapi-src/oneTBB/releases

https://github.com/oneapi-src/oneTBB/archive/v2020.3.zip

-> see the TBBMalloc folder

 

For the RTL SIMD patches:

https://software.seek.intel.com/performance-libraries

-> IPP

Run the utility to build a custom DLL exporting:

'ippsZero_8u';
'ippsCopy_8u';
'ippsMove_8u';
'ippsSet_8u';
'ippsFind_8u';
'ippsCompare_8u';
'ippsUppercaseLatin_8u_I';
'ippsReplaceC_8u';
 

For the ZLIB acceleration (3x-5x quicker than Windows gzip; a WebBroker helper is provided): extract IPP under Linux, see the readme for how to patch zlib, take the patched sources, and compile them with MS VC++.

 

kind regards

R.

 

btw, I have used them for a couple of years with my customers and never had any trouble.

 

Edited by RDP1974

Thanks; it might be worth putting that info into the GitHub readme, because right now it looks like the Sea*.dll files are yours, while neither their code nor their source can be found in the repo.

 

Also, apart from the raw speed numbers, do you have total/peak allocated-memory comparisons for the tests you mention?

 

Edit: what is the "FastMM5 benchmark utility" you are referring to? I see no benchmark in the FastMM5 repo.

Edited by Stefan Glienke

Thanks, I will send the info tomorrow.

I'm waiting for Embarcadero to update the linker to accommodate the C $tls API; then we can use static libs without the DLL dependency.

 

 

11 hours ago, RDP1974 said:

FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FastMM4 problems), but

TBB is 5x faster than FastMM5 under the TParallel class.

TBB is as fast as FastMM4/FastMM5 in a single thread.

TBB is fast in benchmarks, but from our experience it is not usable in production on a server.

TBB consumes A LOT of memory, much more than FM4/FM5 and the alternatives.

 

The numbers for a real multi-threaded Linux server are a showstopper for using TBB.

In production on a huge multi-Xeon server, RAM consumption after a few hours of stabilisation is glibc = 2.6 GB vs TBB = 170 GB, about 65 times more memory! With almost no actual performance boost.

This mORMot service handles TB of incoming data, sent in blocks every second, with thousands of simultaneous HTTPS connections.

See https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas#L55

 

So never trust any benchmark.
Try with your real workload.

 

Quote

Delphi AVX support for Synopse MM?

What we found out with https://github.com/synopse/mORMot/blob/master/SynFPCx64MM.pas may be interesting for the discussion.

Using AVX for medium-block moves/reallocs makes no practical difference compared with an inlined SSE2 move (for tiny/small/medium blocks) or a non-temporal move (using the movntdq opcode instead of plain mov, for large blocks).
For large blocks, in-place reallocation via mremap/VirtualAlloc is the better approach: relying on the OS and performing no move at all is faster than AVX/AVX2/AVX512.
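The in-place reallocation idea can be sketched on Linux with mremap (a hypothetical illustration of the general technique, not SynFPCx64MM's actual code): growing a large mmap-ed block lets the kernel remap pages instead of copying bytes.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Sketch: grow a large block via the kernel instead of alloc+copy.
   MREMAP_MAYMOVE lets the kernel relocate the *mapping* if the adjacent
   address space is taken - still no user-land byte copy. */
static void *grow_large_block(void *p, size_t old_size, size_t new_size)
{
    return mremap(p, old_size, new_size, MREMAP_MAYMOVE);
}

/* Returns 0 when the grown block still holds the original data. */
int demo(void)
{
    size_t old_sz = 1 << 20, new_sz = 4 << 20;   /* 1 MiB -> 4 MiB */
    char *p = mmap(NULL, old_sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    memset(p, 0xAB, old_sz);                     /* fill the old block */
    p = grow_large_block(p, old_sz, new_sz);
    if (p == MAP_FAILED) return -1;
    int ok = ((unsigned char)p[old_sz - 1] == 0xAB); /* data preserved */
    munmap(p, new_sz);
    return ok ? 0 : 1;
}
```

On Windows the analogous trick is reserving address space up front with VirtualAlloc(MEM_RESERVE) and committing more pages in place as the block grows.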

 

SynFPCx64MM is currently FPC-only, and is used in production on heavily loaded servers.
It is based on the FastMM4 design, fully optimized in x86_64 asm, but with a lockless round-robin algorithm for tiny blocks (<= 256 bytes) and an optional lockless list for FreeMem, which are the bottlenecks for most real servers. It has several spinning alternatives in case of contention.
And it is truly Open Source, unlike FastMM5.
We may publish a Delphi-compatible version in the next few weeks.
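The round-robin idea for tiny blocks can be reduced to a toy C11 sketch (an assumption about the general technique; the real MM does this in hand-written asm with different sizing): each allocation atomically picks the next of N small-block arenas, spreading contention across several locks instead of serializing on one.

```c
#include <stdatomic.h>

#define NUM_TINY_ARENAS 4   /* illustrative; a real MM sizes this per CPU */

/* One small-block arena; the free lists etc. are elided. */
typedef struct { int id; } TinyArena;

static TinyArena arenas[NUM_TINY_ARENAS] = { {0}, {1}, {2}, {3} };
static atomic_uint rr_counter;   /* shared round-robin cursor */

/* Lock-free arena selection: a single relaxed atomic increment
   distributes concurrent tiny allocations over the arenas, so two
   threads rarely contend for the same arena. */
static TinyArena *pick_tiny_arena(void)
{
    unsigned n = atomic_fetch_add_explicit(&rr_counter, 1,
                                           memory_order_relaxed);
    return &arenas[n % NUM_TINY_ARENAS];
}
```

The point is that the selection step itself never blocks; only the per-arena work can contend, and there are N of those.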

Edited by Arnaud Bouchez
45 minutes ago, Arnaud Bouchez said:

We may publish a Delphi-compatible version in the next weeks.

Thanks


This is also very interesting: https://github.com/daanx/mimalloc-bench

Anyway, I agree it's better to have a native Pascal MM than a C-based one.

 

https://github.com/d-mozulyov/BrainMM was cool, but it crashes somewhere under Win64.

 

Hi Arnaud,

may I ask: in your Synopse software, do you use the Windows I/O completion ports API with a thread pool and WSA* overlapped I/O read/write calls?

 

TBBmalloc has sophisticated caching; it can be (mostly, with some exceptions) cleaned via scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0). Strictly speaking, per-thread caches of free objects are cleaned on thread termination.

 

btw, I'm testing other languages and frameworks, but for Windows apps Delphi is still fantastic, the best way!

 

I did a real benchmark with huge tasks (in practice, preallocating the heap); please check the attached Pascal benchmark source.

 

Results on an i9, Win10, Delphi 10.3.3:

 

Win32 default, 8 threads (screenshot)

Win64 default, 8 threads (screenshot)

Win64 with TBB+IPP (RDP64), 8 threads (screenshot)

Win64 with TBB+IPP (RDP64), 1 thread (screenshot)

Win64 default, 1 thread (screenshot)

Win32 default, 1 thread (screenshot)

 

So:

8 threads

Win32 default = 15.5M

Win64 default = 15.8M

Win64 + RDP = 19.3M

 

1 thread

Win32 default = 4.2M

Win64 default = 4.3M

Win64 + RDP = 5.2M

 

So, with this test, TBB+IPP is roughly 20-25% faster than the default MM, in both the single-threaded and the multithreaded runs.

 

It would be nice to see FPC64 and Linux results.

 

Anyway, this is just for amusement.

kind regards

R.

 

Attachment: BenchPoker.rar

 

Anyway, TParallel.For seems to have race problems.

I'll test using plain TThreads; the results using threads should be higher.


OK, I don't know how to use TParallel (SetMinWorkerThreads).

With 16 threads, using BeginThread on the i9, the results double (I suppose there is a limit of 8 threads in the TThread pool).

 

  NumThreads := StrToInt(Th.Text);
  SetLength(ArrThreads, NumThreads);
  SW := TStopWatch.Create;
  SW.Start;
  for i := 0 to NumThreads - 1 do
    ArrThreads[i] := BeginThread(nil, 0, Addr(ThExecute), nil, 0, dm);

  WaitForMultipleObjects(NumThreads, @ArrThreads[0], True, INFINITE);
  SW.Stop;

  lbxResults.Items.Add('All done in ' + SW.ElapsedMilliseconds.ToString + ' msec');
  lbxResults.Items.Add('Tot hands ' + TotHands.ToString);
  lbxResults.Items.Add('Hands per second ' + (TotHands / SW.ElapsedMilliseconds * 1000).ToString);

  for i := 0 to NumThreads - 1 do
    CloseHandle(ArrThreads[i]);

 

But you are right, this is a useless benchmark; a real app scenario would be a database, web app, etc., where threadvar (TLS) is used.

 

Back to work now; I'll stop bothering the forum.

cu

 

Edited by RDP1974


I know English is not your native language, but to be honest, just throwing around numbers and all kinds of different benchmarks is totally confusing for me. It would be very helpful if you could present your findings in a more structured way, as I really find the topic interesting, but it is very exhausting to follow you.

11 hours ago, RDP1974 said:

may I ask: in your Synopse software, do you use the Windows I/O completion ports API with a thread pool and WSA* overlapped I/O read/write calls?

On Windows, we use the http.sys kernel-mode server, which scales better than anything else on this platform. It is faster than IOCP since it runs in the kernel.

On Linux, we use our own thread pool of socket servers, with an nginx frontend as a reverse proxy on the Unix socket loopback, handling HTTPS and HTTP/2. This is very safe and scalable.

 

And don't trust micro-benchmarks. Even worse, don't write your own benchmark. It won't be as good as measuring a real application.

As I wrote, Intel TBB is a no-go for real server work due to its huge memory consumption. If you have to run specific API calls to release the memory, that is a big design flaw, which may be considered a bug (we don't want the application to stall as it would with a GC), and we would never do it.

To be more precise, we use long-living threads from thread pools. So in practice the threads are never released, and memory allocation and memory release happen in different threads: one thread pool handles the socket communication, then another thread pool consumes the data and releases the memory. This is a scenario typical of most event-driven servers running on multi-core CPUs, with a proven ring-oriented architecture. Perhaps Intel TBB is not very good at releasing memory with such a pattern, whereas our SynFPCx64MM is very efficient in this case. And we almost never realloc: just alloc/free, using the stack as a working buffer if necessary.
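The allocate-in-one-pool, free-in-another pattern can be reduced to a minimal pthreads sketch (hypothetical; plain malloc/free stand in for any particular MM, and a mutex-guarded array stands in for a real lock-free ring): every block is freed by a different thread than the one that allocated it, which is exactly what stresses per-thread allocator caches.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 1000

/* Hand-off queue: the producer allocates, the consumer frees. */
static void *slots[NBLOCKS];
static int head, tail;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)            /* "socket" pool: allocates */
{
    (void)arg;
    for (int i = 0; i < NBLOCKS; i++) {
        void *p = malloc(256);              /* tiny block: the hot case */
        memset(p, 0, 256);
        pthread_mutex_lock(&mtx);
        slots[head++] = p;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void *consumer(void *arg)            /* "worker" pool: frees */
{
    (void)arg;
    int freed = 0;
    while (freed < NBLOCKS) {
        void *p = NULL;
        pthread_mutex_lock(&mtx);
        if (tail < head) p = slots[tail++];
        pthread_mutex_unlock(&mtx);
        if (p) { free(p); freed++; }        /* freed by a *different* thread */
    }
    return NULL;
}

/* Returns 1 when every block was handed off and released. */
int run_handoff(void)
{
    pthread_t prod, cons;
    head = tail = 0;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return head == NBLOCKS && tail == NBLOCKS;
}
```

An allocator whose free lists are strictly thread-local has to migrate every one of these blocks back to its owner (or leak cache), which is where the memory blow-up described above can come from.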

Edited by Arnaud Bouchez
On 8/28/2020 at 11:07 AM, RDP1974 said:

 

Hi,

look here: https://blog.digitaltundra.com/?p=902

Another MM: Pascal code, free.

In my test on the i9 with 16 threads, it is the fastest among all the MMs tested.

(screenshot: Tundra vs default)

It uses threadvar TLS for each thread's cache.

OK, I did a test with FastMM5: with 16 threads the results are identical to BigBrainMM, and with a single thread it is a little better (2501 vs 2727), about 8% quicker.

Guest

@Lars Fosdal, it would be interesting to know the ratio of your forum rep regarding jokes vs. actually helping out...

34 minutes ago, Lars Fosdal said:

Clearly, it is based on jokes alone. 

Same here. Plus the tons of bots that just randomly like our posts.

