RDP1974 40 Posted August 26, 2020

Just sharing some findings. I'm benchmarking a single-threaded app (a poker app) and a multithreaded app (WebBroker HTTP) with different memory managers. I tested FastMM5 against TBB+IPP: https://github.com/RDP1974

FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FM4 problems), but TBB is 5x faster than FM5 under the TParallel class.

TBB is as fast as FM4/FM5 in single-threaded code with // RedirectCode(@System.Move, @Move2); enabled in RDPSimd64, because small moves are faster that way than under TBB's SIMD path (branch-condition penalty).

So: do we wait for FM5 to fix the TParallel contention? Or for Delphi AVX support in the Synopse MM?
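For context, runtime redirection like RDPSimd64's RedirectCode is usually done by patching the first instructions of the target routine with a jump to the replacement. Here is a minimal Win64 sketch of that technique; it is an illustration only (the record layout and the field-by-field stores are my assumptions, not RDPSimd64's actual source):

uses Winapi.Windows;

type
  PJump = ^TJump;
  TJump = packed record
    MovRax: Word;   // $B848 = "mov rax, imm64" (little-endian bytes 48 B8)
    Addr: UInt64;   // absolute address of the replacement routine
    JmpRax: Word;   // $E0FF = "jmp rax" (little-endian bytes FF E0)
  end;

procedure RedirectCode(OldProc, NewProc: Pointer);
var
  OldProtect: DWORD;
begin
  VirtualProtect(OldProc, SizeOf(TJump), PAGE_EXECUTE_READWRITE, OldProtect);
  // Field-by-field stores on purpose: calling Move here would be fatal
  // when the routine being patched is System.Move itself.
  PJump(OldProc).MovRax := $B848;
  PJump(OldProc).Addr := UInt64(NewProc);
  PJump(OldProc).JmpRax := $E0FF;
  VirtualProtect(OldProc, SizeOf(TJump), OldProtect, OldProtect);
  FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(TJump));
end;

After this runs, every call to OldProc lands in NewProc, which is how a patched Move2 can take over System.Move for small copies.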
Stefan Glienke 2002 Posted August 26, 2020

"X is n times faster than Y" leads nowhere unless you profile the bottleneck of Y in your benchmark. Run it under VTune or uProf and report your findings.

As for your MM: it might be faster, but nobody serious will download some random DLLs from the internet (that are not even signed) and put them into production.
RDP1974 40 Posted August 26, 2020

The "5x quicker TParallel.For" figure comes from the FastMM5 benchmark utility results.

The DLLs? They are built from the Intel libraries; I only wrote the wrappers. They are plain, optimized-config DLL builds of the royalty-free Intel TBB and IPP packages, with no custom source changes. You can compile them yourself; I put them in the repository because many people cannot build them, or don't have the time.

https://github.com/oneapi-src/oneTBB/releases
https://github.com/oneapi-src/oneTBB/archive/v2020.3.zip -> see the TBBMalloc folder

For the RTL SIMD patches: https://software.seek.intel.com/performance-libraries -> IPP. Run the utility to build a custom DLL and export:

'ippsZero_8u';
'ippsCopy_8u';
'ippsMove_8u';
'ippsSet_8u';
'ippsFind_8u';
'ippsCompare_8u';
'ippsUppercaseLatin_8u_I';
'ippsReplaceC_8u';

For the zlib acceleration (3x-5x quicker than Windows gzip; a WebBroker helper is provided): extract IPP under Linux, see the readme for how to patch zlib, take the patched sources and compile them with MS VC++.

kind regards
R.

btw. I have been using them with my customers for a couple of years and never had any trouble.
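To make the wrapper idea concrete, here is a minimal sketch of Delphi bindings for a few of the exports listed above. The DLL name and the stdcall convention are my assumptions (check the header the IPP custom-DLL utility generates), and Move2 is the kind of RTL replacement RDPSimd64 wires in, not its actual source:

type
  IppStatus = Integer;  // 0 = ippStsNoErr

const
  IppDll = 'ipp_custom.dll';  // hypothetical name; use your generated DLL

function ippsZero_8u(pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;
function ippsCopy_8u(pSrc, pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;
function ippsMove_8u(pSrc, pDst: PByte; len: Integer): IppStatus;
  stdcall; external IppDll;

// A System.Move replacement can delegate to ippsMove_8u, which handles
// overlapping ranges the same way Move does:
procedure Move2(const Source; var Dest; Count: NativeInt);
begin
  if Count > 0 then
    ippsMove_8u(PByte(@Source), PByte(@Dest), Integer(Count));
end;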
Stefan Glienke 2002 Posted August 27, 2020

Thanks, it might be worth putting that info into the GitHub readme, because right now it looks like the Sea*.dll files are yours, yet neither their code nor their source can be found in the repo. Also, apart from the raw speed numbers, do you have total/peak allocated memory comparisons for the tests you mention?

Edit: what is the "FastMM5 benchmark utility" you are referring to? I see no benchmark in the FastMM5 repo.
RDP1974 40 Posted August 27, 2020

Thanks, I will send the info tomorrow. I'm waiting for Embarcadero to update the linker to accommodate the C $tls API; then we can use static libs without a DLL dependency.
Arnaud Bouchez 407 Posted August 27, 2020

11 hours ago, RDP1974 said:
"FastMM5 is as fast as TBB under WebBroker with ApacheBench at 100 concurrent users (finally overcoming the FM4 problems), but TBB is 5x faster than FM5 under the TParallel class. TBB is as fast as FM4/FM5 in single-threaded code."

TBB is fast in benchmarks, but in our experience it is not usable in production on a server. TBB consumes A LOT of memory, much more than FM4/FM5 and the alternatives. The numbers for a real multi-threaded Linux server are a show-stopper for TBB: in production on a big multi-Xeon server, RAM consumption after a few hours of stabilisation is glibc = 2.6 GB vs TBB = 170 GB - 60 times more memory! And with almost no actual performance boost. This mORMot service handles terabytes of incoming data, sent in blocks every second, over thousands of simultaneous HTTPS connections. See https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas#L55

So never trust any benchmark. Try with your real workload.

Quote:
"or Delphi AVX support for Synopse MM?"

What we found out with https://github.com/synopse/mORMot/blob/master/SynFPCx64MM.pas may be interesting for the discussion. In practice, using AVX for medium-block moves/reallocs changes nothing compared to an inlined SSE2 move (tiny/small/medium blocks) or a non-temporal move (the movntdq opcode instead of plain mov, for large blocks). For large blocks, in-place reallocation via mremap/VirtualAlloc is the better approach: relying on the OS and performing no move at all is faster than AVX/AVX2/AVX512.

SynFPCx64MM is currently FPC-only. It is used in production on heavily loaded servers. It is based on the FastMM4 design, fully optimized in x86_64 asm, but with a lock-less round-robin algorithm for tiny blocks (<= 256 bytes) and an optional lock-less list for FreeMem - which are the bottlenecks on most real servers. It has several spinning alternatives in case of contention. And it is really Open Source - unlike FastMM5.

We may publish a Delphi-compatible version in the next weeks.
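A minimal sketch of the "no move" large-block grow on Windows (mremap plays the same role on Linux). This illustrates the technique only and is not SynFPCx64MM's actual code; it assumes the allocator reserved MaxSize of address space up front and committed only what it needed:

uses Winapi.Windows;

// Reserve the full range, commit only the first part.
function LargeAlloc(CommitSize, MaxSize: NativeUInt): Pointer;
begin
  Result := VirtualAlloc(nil, MaxSize, MEM_RESERVE, PAGE_READWRITE);
  if Result <> nil then
    Result := VirtualAlloc(Result, CommitSize, MEM_COMMIT, PAGE_READWRITE);
end;

// Grow by committing the already-reserved pages just past the current
// end: when this succeeds, the realloc moved no data at all.
function TryGrowInPlace(P: Pointer; OldSize, NewSize: NativeUInt): Boolean;
begin
  Result := VirtualAlloc(Pointer(NativeUInt(P) + OldSize),
    NewSize - OldSize, MEM_COMMIT, PAGE_READWRITE) <> nil;
end;

Only when TryGrowInPlace returns False does the allocator fall back to allocate-and-copy, which is where the SSE2/non-temporal move paths come in.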
RDP1974 40 Posted August 27, 2020

45 minutes ago, Arnaud Bouchez said:
"We may publish a Delphi-compatible version in the next weeks."

Thanks
RDP1974 40 Posted August 27, 2020

But according to Intel's own sources: similar or slightly bigger peak consumption and up to a 17x speedup; see test 3: https://software.intel.com/content/www/us/en/develop/articles/controlling-memory-consumption-with-intel-threading-building-blocks-intel-tbb-scalable.html
RDP1974 40 Posted August 27, 2020

This is also very interesting: https://github.com/daanx/mimalloc-bench

Anyway, I agree that it's better to have a native Pascal MM than a C-based one. https://github.com/d-mozulyov/BrainMM was cool, but under Win64 it crashes somewhere.

Hi Arnaud, may I ask: in your Synopse software, do you use the Windows I/O completion ports API with a thread pool and overlapped WSA* read/write calls?

TBBmalloc has sophisticated caching; it can be (mostly, with some exceptions) cleaned via scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0). Strictly speaking, per-thread caches of free objects are cleaned on thread termination. (A sketch of a Delphi binding for this call follows after this post.)

btw. I'm testing other languages and frameworks, but for Windows apps Delphi is still fantastic - the best way!

I did a real benchmark with heavy tasks (in practice, preallocating the heap); please check the attached Pascal benchmark source. Results on an i9, Win10, Delphi 10.3.3:

8 threads:
Win32 default = 15.5M
Win64 default = 15.8M
Win64 + TBB+IPP (RDP64) = 19.3M

1 thread:
Win32 default = 4.2M
Win64 default = 4.3M
Win64 + TBB+IPP (RDP64) = 5.2M

So, in this test, TBB+IPP is on average 25% faster than the default MM in a single-threaded app and 35% faster multithreaded.

It would be nice to see FPC64 and Linux results. Anyway, this is only for amusement.

kind regards
R.

BenchPoker.rar

P.S. TParallel.For seems to have race problems; I'll test using plain TThreads - results using threads should be higher.
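As promised above, a hedged sketch of a Delphi binding for TBBmalloc's cache cleaning. The constant mirrors the ScalableAllocationCmd enum in tbb/scalable_allocator.h, and the cdecl convention and DLL name are assumptions; verify both against the TBB version you ship:

const
  TbbMallocDll = 'tbbmalloc.dll';
  TBBMALLOC_CLEAN_ALL_BUFFERS = 1;  // check against your TBB headers

function scalable_allocation_command(cmd: Integer; param: Pointer): Integer;
  cdecl; external TbbMallocDll;

// Ask TBBmalloc to return cached free blocks to the OS, e.g. after a
// large batch job has finished:
procedure TrimTbbCaches;
begin
  scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, nil);
end;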
Fr0sT.Brutal 900 Posted August 27, 2020

Dude, you should probably start a blog instead.
RDP1974 40 Posted August 27, 2020

OK, I don't know how to use TParallel (SetMinWorkerThreads) with 16 threads (see the sketch after this post for one way to do it). Using BeginThread on the i9, the results double (I suppose there is a limit of 8 threads in TParallel's default thread pool):

NumThreads := StrToInt(Th.Text);
SetLength(ArrThreads, NumThreads);
SW := TStopwatch.Create;
SW.Start;
for i := 0 to NumThreads - 1 do
  ArrThreads[i] := BeginThread(nil, 0, Addr(ThExecute), nil, 0, dm);
WaitForMultipleObjects(NumThreads, @ArrThreads[0], True, INFINITE);
SW.Stop;
lbxResults.Items.Add('All done in ' + SW.ElapsedMilliseconds.ToString + ' msec');
lbxResults.Items.Add('Tot hands ' + TotHands.ToString);
lbxResults.Items.Add('Hands for second ' + (TotHands / SW.ElapsedMilliseconds * 1000).ToString);
for i := 0 to NumThreads - 1 do
  CloseHandle(ArrThreads[i]);

But you are right, this is a useless benchmark; a real app scenario would be a database, a web app, etc., where threadvar (TLS) comes back into play. I'll stop bothering the forum. cu
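One way to get TParallel.For onto 16 workers, for what it's worth: give it a dedicated TThreadPool instead of the default one. A hedged sketch (ThExecute stands in for the benchmark body from the attached source; pool sizing rules vary between Delphi versions, so check the Boolean results of the Set*WorkerThreads calls):

uses System.Threading;

procedure RunBench16(NumThreads: Integer);
var
  Pool: TThreadPool;
begin
  Pool := TThreadPool.Create;
  try
    // Raise both bounds; each returns False if the value is rejected.
    Pool.SetMinWorkerThreads(NumThreads);
    Pool.SetMaxWorkerThreads(NumThreads);
    TParallel.For(0, NumThreads - 1,
      procedure(i: Integer)
      begin
        ThExecute(nil);  // hypothetical: one benchmark task per index
      end,
      Pool);
  finally
    Pool.Free;
  end;
end;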
Stefan Glienke 2002 Posted August 27, 2020

I know English is not your native language, but to be honest, just throwing around numbers and all kinds of different benchmarks is totally confusing for me. It would be very helpful if you could present your findings in a more structured way, as I really find the topic interesting, but it is very exhausting to follow you.
RDP1974 40 Posted August 27, 2020

Sorry, it's because I'm in a hurry. I'll try to improve my syntax 🙂
Lars Fosdal 1792 Posted August 27, 2020

A man in a hurry is always late.
Arnaud Bouchez 407 Posted August 27, 2020

11 hours ago, RDP1974 said:
"can I ask, in your Synopse software do you use Windows API IoCompletionPorts with thread pool and WSA* Overlapped I/O read write calls?"

On Windows, we use http.sys kernel mode, which scales better than anything else on this platform. It is faster than IOCP since it runs in the kernel.

On Linux, we use our own thread pool of socket servers, with an nginx frontend as reverse proxy on the Unix socket loopback, handling HTTPS and HTTP/2. This is very safe and scalable.

And don't trust micro-benchmarks. Even worse, don't write your own benchmark. It won't be as good as measuring a real application. As I wrote, Intel TBB is a no-go for real server work due to its huge memory consumption. If you have to run specific API calls to release the memory, that is a big design flaw - it may be considered a bug (we don't want the application to stall as it would with a GC) - and we would never do it.

To be more precise, we use long-living threads from thread pools. So in practice the threads are never released, and memory allocation and release happen in different threads: one thread pool handles the socket communication, then another thread pool consumes the data and releases the memory. This is a scenario typical of most event-driven servers running on multi-core CPUs, with a proven ring-oriented architecture. Perhaps Intel TBB is not very good at releasing memory with such a pattern - whereas our SynFPCx64MM is very efficient in this case. And we almost never realloc - we just alloc/free, using the stack as a working buffer if necessary. (A small sketch of this cross-thread alloc/free pattern follows after this post.)
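A minimal sketch of that cross-thread pattern (my illustration, not mORMot code): buffers are allocated by one pool's threads and released by another's, which is exactly the FreeMem path a server MM has to make cheap:

uses System.SysUtils, System.Classes, System.Generics.Collections;

var
  Queue: TThreadedQueue<TBytes>;

// Socket pool side: allocate an incoming block and hand it over.
procedure ProducerStep;
var
  Buf: TBytes;
begin
  SetLength(Buf, 64 * 1024);  // the allocation happens in this thread
  Queue.PushItem(Buf);
end;

// Worker pool side: process the block, then drop the last reference.
procedure ConsumerStep;
var
  Buf: TBytes;
begin
  Buf := Queue.PopItem;       // blocks until a buffer arrives
  // ... consume Buf ...
  Buf := nil;                 // the free happens in this thread
end;

begin
  Queue := TThreadedQueue<TBytes>.Create(1024, INFINITE, INFINITE);
  TThread.CreateAnonymousThread(ProducerStep).Start;
  TThread.CreateAnonymousThread(ConsumerStep).Start;
  ReadLn;  // keep the demo alive
end.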
RDP1974 40 Posted August 28, 2020

Hi, look here: https://blog.digitaltundra.com/?p=902 - another MM in Pascal code, and free. In my test (i9, 16 threads) it is the fastest among all the MMs tested (Tundra vs. default). It uses a threadvar (TLS) cache for each thread (a tiny sketch of the idea follows after this post).
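A tiny sketch of the threadvar per-thread cache idea (an illustration, not the Tundra MM's actual code): each thread keeps its own free list for one block size, so the hot path needs no lock at all. Real MMs keep one list per size class and must also handle blocks freed by a thread other than the allocator, which is the hard part this sketch skips:

type
  PFreeBlock = ^TFreeBlock;
  TFreeBlock = record
    Next: PFreeBlock;
  end;

threadvar
  SmallFreeList: PFreeBlock;        // TLS: one list per thread

function AllocSmall(Size: NativeUInt): Pointer;
begin
  if SmallFreeList <> nil then
  begin
    Result := SmallFreeList;        // lock-free: only this thread touches it
    SmallFreeList := SmallFreeList.Next;
  end
  else
    GetMem(Result, Size);           // cache miss: fall back to the global MM
end;

procedure FreeSmall(P: Pointer);
begin
  PFreeBlock(P).Next := SmallFreeList;   // push back onto this thread's list
  SmallFreeList := PFreeBlock(P);
end;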
RDP1974 40 Posted September 2, 2020

On 8/28/2020 at 11:07 AM, RDP1974 said:
"Hi, look here: https://blog.digitaltundra.com/?p=902 - another MM in Pascal code, and free. In my test (i9, 16 threads) it is the fastest among all the MMs tested."

OK, I did a test with FastMM5: with 16 threads the results are identical to BigBrainMM, and single-threaded FastMM5 is a little better (2501 vs 2727), about 8% quicker.
Anders Melander 1782 Posted September 2, 2020

36 minutes ago, Lars Fosdal said:
[image]

Why is that man trying to swallow an invisible shoe?
Guest Posted September 22, 2020

@Lars Fosdal, it would be interesting to know the ratio of your forum rep regarding jokes vs. actually helping out...
Lars Fosdal 1792 Posted September 22, 2020

Clearly, it is based on jokes alone.
Sherlock 663 Posted September 22, 2020

34 minutes ago, Lars Fosdal said:
"Clearly, it is based on jokes alone."

Same here. Plus the tons of bots that just randomly like our posts.
Attila Kovacs 629 Posted September 22, 2020

@Dany Marmur Just look it up:
[screenshot]