RDP1974 40 Posted October 20

Hi, I did a quick benchmark of a single-threaded application, using the attached poker game. It is of course not exhaustive and covers only a small subset of things. The Windows heap manager seems to have been enhanced: using the heap directly is now faster than the default MM or Intel TBB.

Host: i9-9900KF, Win11 24H2, Delphi 12.2.1, FPC 3.2.2, release mode, single thread, x64 console mode.

Build                      Total Time (Seconds)   Hands Per Second
D12 default                1,031                  2520814,74296799
D12 Intel TBB (rdpmm64)    1,281                  2028852,45901639
D12 msheap                 0,984                  2641219,51219512
latest FPC Lazarus         2,25                   1155093,33333333

Every run printed the same hand evaluation:

Hand Evaluation      Expected   Actual
Royal Flushes:       4          4
Straight Flushes:    36         36
Four of a Kinds:     624        624
Full Houses:         3744       3744
Flushes:             5108       5108
Straights:           10200      10200
Three of a Kinds:    54912      54912
Two Pairs:           123552     123552
One Pairs:           1098240    1098240
Other:               1302540    1302540
Total Hands:         2598960    2598960

Many RTLs use the Windows heap directly, such as Rust, Clang and others. It resists fragmentation, so I suppose it is okay to use it directly. It also performs very well in multithreaded scenarios such as WebBroker apps. Look here if you wish: https://github.com/RDP1974/

Sorry to bore you with these things, I am just curious to squeeze out whatever performance is possible. By the way, do you know a more complete real-world test scenario than this?

wldd49.zip
RDP1974 40 Posted October 20

This is on a Windows Server 2022 21H2 Hyper-V guest:

Executable                      Total Time (Seconds)   Hands Per Second
C:\Exes>cmd3_def                1,062                  2447231,63841808
C:\Exes>cmd3_fm5 (fastmm5)      1                      2598960
C:\Exes>cmd3_rdp64              1,344                  1933750
C:\Exes>cmd3_laz                2,266                  1146937,33451015
C:\Exes>cmd3_msheap             0,984                  2641219,51219512

Every run printed identical hand evaluation output, with the Actual counts matching the Expected counts in all ten categories (Total Hands: 2598960).
Tommi Prami 130 Posted October 21

Interesting... I hope someone else gets involved and helps make the benchmarks more comprehensive.

-Tee-
darnocian 84 Posted October 21

I have actually witnessed similar behaviour with the MS memory manager. For one of my projects, I also switched to using the MS memory manager. I did benchmarks a while ago, but will look into publishing something in future.
Der schöne Günther 316 Posted October 21

20 hours ago, RDP1974 said: "it resists fragmentation"

Can somebody shed some light on that? I don't really know much about heap fragmentation, but it is one of my worst nightmares.
RDP1974 40 Posted October 21

https://learn.microsoft.com/en-us/windows/win32/memory/low-fragmentation-heap
https://illmatics.com/Understanding_the_LFH.pdf
https://www.softwareverify.com/blog/memory-fragmentation-your-worst-nightmare/
https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

I don't know if we need an intermediate allocator or if we can use the Win API heap directly.
Stefan Glienke 2002 Posted October 21 (edited)

That benchmark proves almost (*) nothing. The only places where it allocates are during form creation, where it builds the card deck, and later when it prints the output into the TListBox.

(*) The only thing affected here is the possible layout of the card objects in the heap, as they are all read during the hand-processing code. The difference that you can observe here between Delphi builds using different memory managers is most likely caused by the amount of overhead the respective memory manager uses: less overhead fits more card objects within the same memory pages, and so more of them (most likely all on modern processors) fit into the L1 cache.

As for this particular code: removing the name of the cards from the object and only building it when it is needed for some UI would probably speed up the code more than anything else, because you get rid of 20 bytes for every object (Name is a string[19]). On my CPU this takes the code from ~900ms down to ~680ms, simply because it does not need to copy the strings in CopyCardFromDeck.

Circumventing the getter and setter of TList (which contribute around 25% of the remaining time) brings it down to 460ms.

And after that we are not done with string stuff: in every loop iteration it calls Hand.SetHighValues, which produces a name for the cards in the hand. Removing that gets me down to ~400ms.

Now, because I have a run and SamplingProfiler open already, I see that one of the scorers is now TcaaPokerHand.CopyCardFromDeck: the Items getter is not inlined, which causes it to be called 10 times for the same 2 card objects. Changing that gets me down to 270ms.

But how about avoiding repeated access to the same object in the 2 lists altogether? 230ms.

I could go on because I see a lot more room to optimize, but I think I made my point. Instead of fiddling with the memory manager, one should first look at whether heap allocations are even the issue, and then identify unnecessary work and eliminate it.

... change TcaaEvaluationCard to be 8 bytes in size (that avoids the compiler generating a movsd/movsb instruction; it simply does an 8-byte mov) -> 160ms

Edited October 21 by Stefan Glienke
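To make the size argument above concrete, here is a minimal sketch of how a string[19] field inflates a record. The real field layout of TcaaEvaluationCard is not shown in this thread, so TNamedCard and TCompactCard and their fields are hypothetical stand-ins, not the benchmark's types.

```pascal
program CardSizeDemo; // hypothetical demo, not part of the benchmark
{$APPTYPE CONSOLE}

type
  // A short string of up to 19 characters: 1 length byte + 19 character bytes.
  TCardName = string[19];

  // Assumed layout: a card that carries its display name in every instance.
  TNamedCard = record
    Value: Integer;
    Suit: Integer;
    Name: TCardName;
  end;

  // Assumed layout: the same card without the name. Two 4-byte integers
  // give 8 bytes, which can be copied with a single 8-byte move on x64.
  TCompactCard = record
    Value: Integer;
    Suit: Integer;
  end;

begin
  Writeln('SizeOf(TCardName)    = ', SizeOf(TCardName));    // 20
  Writeln('SizeOf(TNamedCard)   = ', SizeOf(TNamedCard));
  Writeln('SizeOf(TCompactCard) = ', SizeOf(TCompactCard)); // 8
end.
```

The display name could then be built on demand from Value and Suit only when the UI needs it, which is the change described in the post.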
pyscripter 689 Posted October 21

I am also sceptical. However, I see that in one of the most starred GitHub Delphi repos, danieleteti/delphimvcframework (DMVCFramework for short, a popular and powerful framework for web APIs in Delphi that supports RESTful and JSON-RPC web API development), the IDE expert project generator includes MSHeap by default. @Daniele Teti Could you please enlighten us as to what led you to this choice?

By the way, the RTL can be configured to use MSHeap if compiled with the SIMPLEHEAP conditional define. The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap.
David Heffernan 2345 Posted October 22

19 hours ago, pyscripter said: "By the way the RTL can be configured to use MSHeap, if compiled with the SIMPLEHEAP conditional define. The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap."

Rather than recompiling the RTL, you can register a custom memory manager, which is very easy to do with MSHeap as the backing allocator. And then you can add simple memory-leak reporting on shutdown. I actually do use the MS heap for my MM because I want to implement a per-thread MM so that I can get decent performance on NUMA systems.
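As a rough illustration of registering a custom memory manager backed by the Windows heap, here is a minimal sketch. It is not David's code: the unit name HeapBackedMM and all helper names are hypothetical, it assumes a recent Delphi (TMemoryManagerEx with NativeInt sizes), and it uses the default process heap rather than a per-thread heap.

```pascal
unit HeapBackedMM; // hypothetical unit name

interface

implementation

uses
  Winapi.Windows;

var
  ProcessHeap: THandle;
  HeapMM: TMemoryManagerEx;

function HeapGetMem(Size: NativeInt): Pointer;
begin
  Result := HeapAlloc(ProcessHeap, 0, NativeUInt(Size));
end;

function HeapFreeMem(P: Pointer): Integer;
begin
  // The RTL expects 0 on success and a non-zero value on failure.
  if HeapFree(ProcessHeap, 0, P) then
    Result := 0
  else
    Result := 1;
end;

function HeapReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  Result := HeapReAlloc(ProcessHeap, 0, P, NativeUInt(Size));
end;

function HeapAllocMem(Size: NativeInt): Pointer;
begin
  // AllocMem must return zero-filled memory.
  Result := HeapAlloc(ProcessHeap, HEAP_ZERO_MEMORY, NativeUInt(Size));
end;

function IgnoreExpectedLeak(P: Pointer): Boolean;
begin
  // This sketch keeps no leak registry, so registration is declined.
  Result := False;
end;

initialization
  ProcessHeap := GetProcessHeap;
  HeapMM.GetMem := HeapGetMem;
  HeapMM.FreeMem := HeapFreeMem;
  HeapMM.ReallocMem := HeapReallocMem;
  HeapMM.AllocMem := HeapAllocMem;
  HeapMM.RegisterExpectedMemoryLeak := IgnoreExpectedLeak;
  HeapMM.UnregisterExpectedMemoryLeak := IgnoreExpectedLeak;
  SetMemoryManager(HeapMM);
end.
```

A unit like this would need to be the first entry in the project's uses clause, so that no block allocated by the previous manager ends up being freed through the new one.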
pyscripter 689 Posted October 22

1 hour ago, David Heffernan said: "And then you can add simple memory leaks on shutdown reporting."

Do you have any code you can share?
David Heffernan 2345 Posted October 22

You can increment a counter when allocating new blocks and decrement it when deallocating, and then bleat if the counter is non-zero on exit.
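The counter idea can be sketched as a thin wrapper around whatever memory manager is currently installed. This is an assumption-laden illustration, not David's implementation: the unit name LeakCounter, the LiveBlockCount function and the message text are all hypothetical, and it assumes a Delphi version that provides TMemoryManagerEx and the AtomicIncrement/AtomicDecrement intrinsics.

```pascal
unit LeakCounter; // hypothetical unit name

interface

// Number of blocks allocated but not yet freed since this unit was installed.
function LiveBlockCount: Integer;

implementation

uses
  Winapi.Windows;

var
  OldMM, CountingMM: TMemoryManagerEx;
  LiveBlocks: Integer;

function LiveBlockCount: Integer;
begin
  Result := LiveBlocks;
end;

function CountingGetMem(Size: NativeInt): Pointer;
begin
  Result := OldMM.GetMem(Size);
  if Result <> nil then
    AtomicIncrement(LiveBlocks);
end;

function CountingAllocMem(Size: NativeInt): Pointer;
begin
  Result := OldMM.AllocMem(Size);
  if Result <> nil then
    AtomicIncrement(LiveBlocks);
end;

function CountingFreeMem(P: Pointer): Integer;
begin
  Result := OldMM.FreeMem(P);
  if Result = 0 then
    AtomicDecrement(LiveBlocks);
end;

function CountingReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  // A resize still leaves exactly one live block, so the counter is unchanged.
  Result := OldMM.ReallocMem(P, Size);
end;

initialization
  GetMemoryManager(OldMM);
  CountingMM := OldMM; // keep the expected-leak callbacks of the old manager
  CountingMM.GetMem := CountingGetMem;
  CountingMM.AllocMem := CountingAllocMem;
  CountingMM.FreeMem := CountingFreeMem;
  CountingMM.ReallocMem := CountingReallocMem;
  SetMemoryManager(CountingMM);

finalization
  // String literals are passed as constants, so no heap allocation happens here.
  if LiveBlocks <> 0 then
    MessageBox(0, 'Live block counter is non-zero at exit: possible memory leak.',
      'Leak check', MB_OK or MB_ICONWARNING);
end.
```

Placing the unit first in the project's uses clause makes it initialize early and finalize late. Blocks allocated before it is installed, or released after its finalization runs, are not balanced in the counter, so a sketch like this can produce false positives.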
thatlr 1 Posted October 22

3 hours ago, pyscripter said: "Do you have any code you can share?"

Here is my MemTest unit, which I use together with WinMemMgr in all my production applications: https://github.com/thatlr/Delphi-SupportUnits/blob/main/source/MemTest.pas