win11 24h2 msheap fastest

RDP1974 · October 20, 2024

hi,

I did a quick single-threaded benchmark using the attached poker game. Of course it is not exhaustive; it exercises only a small subset of things.

The Windows heap manager seems to have been improved: using the heap directly is now faster than the default MM or Intel TBB (host: i9-9900KF, Win11 24H2, Delphi 12.2.1, FPC 3.2.2, release mode).

Single-threaded, x64, console mode:

D12 default:

Total Time: 1,031 Seconds
Hands Per Second 2520814,74296799
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

D12 intel tbb (rdpmm64):

Total Time: 1,281 Seconds
Hands Per Second 2028852,45901639
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

D12 msheap:

Total Time: 0,984 Seconds
Hands Per Second 2641219,51219512
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

latest FPC lazarus:

Total Time: 2,25 Seconds
Hands Per Second 1155093,33333333
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

Many RTLs, such as those of Rust, Clang and others, use the Windows heap directly. Since it resists fragmentation, I suppose it is okay to use it directly. It also performs very well in multithreaded applications such as WebBroker apps.

look here if you wish

https://github.com/RDP1974/

Sorry to bore you with these things, I'm just curious to squeeze out as much performance as possible.
Btw, do you know a more complete real-world test scenario than this?

wldd49.zip

RDP1974 · October 20, 2024

This is on a Windows Server 2022 (21H2) Hyper-V guest:

C:\Exes>cmd3_def
Total Time: 1,062 Seconds
Hands Per Second 2447231,63841808
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

C:\Exes>cmd3_fm5 (fastmm5)
Total Time: 1 Seconds
Hands Per Second 2598960
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

C:\Exes>cmd3_rdp64
Total Time: 1,344 Seconds
Hands Per Second 1933750
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

C:\Exes>cmd3_laz
Total Time: 2,266 Seconds
Hands Per Second 1146937,33451015
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960

C:\Exes>cmd3_msheap
Total Time: 0,984 Seconds
Hands Per Second 2641219,51219512
Hand Evaluation Expected Actual
Royal Flushes: 4 4
Straight Flushes: 36 36
Four of a Kinds: 624 624
Full Houses: 3744 3744
Flushes: 5108 5108
Straights: 10200 10200
Three of a Kinds: 54912 54912
Two Pairs: 123552 123552
One Pairs: 1098240 1098240
Other: 1302540 1302540
Total Hands: 2598960 2598960
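
The "Hands Per Second" figures in both runs are simply the total number of hands divided by the elapsed time, and 2598960 is the number of distinct 5-card hands in a 52-card deck. A minimal Python sketch (not part of the attached Delphi project; the dictionary keys are just labels taken from the executable names above) that recomputes the throughput for the Hyper-V run and expresses each build relative to the default memory manager:

```python
import math

# Every 5-card hand from a 52-card deck is evaluated exactly once.
TOTAL_HANDS = math.comb(52, 5)  # 2598960, matching "Total Hands" above

# Elapsed seconds copied from the Windows Server 2022 run above.
elapsed_seconds = {
    "cmd3_def": 1.062,
    "cmd3_fm5": 1.000,
    "cmd3_rdp64": 1.344,
    "cmd3_laz": 2.266,
    "cmd3_msheap": 0.984,
}


def hands_per_second(seconds: float) -> float:
    """Throughput as reported by the benchmark: hands / elapsed time."""
    return TOTAL_HANDS / seconds


def relative_to(baseline: str, timings: dict) -> dict:
    """Elapsed time of each build as a fraction of the baseline build."""
    base = timings[baseline]
    return {name: secs / base for name, secs in timings.items()}


if __name__ == "__main__":
    ratios = relative_to("cmd3_def", elapsed_seconds)
    for name, secs in elapsed_seconds.items():
        print(f"{name:12s} {hands_per_second(secs):14.2f} hands/s  "
              f"{ratios[name]:.3f}x default time")
```

On these numbers the msheap build needs roughly 93% of the default build's time, which is a difference of about 80 ms on a one-second run.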

Tommi Prami · October 21, 2024

Interesting...

I hope someone else gets involved and helps make the benchmarks more comprehensive.

-Tee-

darnocian · October 21, 2024

I have witnessed similar behaviour actually with the MS memory manager. For one of my projects, I also switched to using the MS memory manager. I did benchmarks a while ago, but will look into publishing something in future.

Der schöne Günther · October 21, 2024

20 hours ago, RDP1974 said:

it resists fragmentation

Can somebody shed some light on that?

I don't really know much about heap fragmentation, but it is one of my worst nightmares.

RDP1974 · October 21, 2024

https://learn.microsoft.com/en-us/windows/win32/memory/low-fragmentation-heap

https://illmatics.com/Understanding_the_LFH.pdf

https://www.softwareverify.com/blog/memory-fragmentation-your-worst-nightmare/

https://users.rust-lang.org/t/why-dont-windows-targets-use-malloc-instead-of-heapalloc/57936

I don't know if we need an intermediate allocator or if we can use the Win API heap directly.

Stefan Glienke · October 21, 2024

That benchmark proves almost (*) nothing, the only point where it allocates is during the form creation where it builds the card deck and later when it prints the output into the TListBox.

(*) The only thing affected here is the possible layout of the card objects in the heap, as they are all read during the hand-processing code. The difference that you can observe here between Delphi builds using different memory managers is most likely caused by the amount of overhead the respective memory manager uses: less overhead fits more card objects into the same memory pages, so more of them (most likely all on modern processors) fit into L1 cache.

As for this particular code - removing the name of the Cards from the object and only building it when it is needed for some UI would probably speed up the code more than anything else, because you get rid of 20 bytes for every object (Name is a string[19]) - on my CPU this makes the code go down from ~900ms to ~680ms - simply because it does not need to copy the strings in CopyCardFromDeck.

Circumventing the getter and setter of TList (which contribute around 25% of the remaining time) brings it down to 460ms.

And after that we are not done with string stuff - in every loop iteration, it calls Hand.SetHighValues which produces a name for the cards on the hand - removing that gets me down to ~400.

Now, because I already have a run and SamplingProfiler open, I see that one of the top scorers is now TcaaPokerHand.CopyCardFromDeck - the Items getter is not inlined, which causes it to be called 10 times for the same 2 card objects. Changing that gets me down to 270ms.

But how about avoiding repeated access to the same object in the 2 lists altogether? 230ms
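
A minimal sketch of the last two steps, written in Python as a language-neutral model rather than as the actual Delphi code: `Card`, `CountingList` and both copy functions are hypothetical stand-ins, not types from the poker project. It only shows why fetching each object once into a local avoids repeated getter calls.

```python
class CountingList:
    """List wrapper whose item getter counts how often it is called."""

    def __init__(self, items):
        self._items = list(items)
        self.getter_calls = 0

    def __getitem__(self, index):
        self.getter_calls += 1
        return self._items[index]


class Card:
    """Hypothetical card with two fields (an assumption for illustration)."""

    def __init__(self, value, suit):
        self.value = value
        self.suit = suit


def copy_card_repeated_lookup(deck, hand, deck_index, hand_index):
    # Every field access goes through the list getter again.
    hand[hand_index].value = deck[deck_index].value
    hand[hand_index].suit = deck[deck_index].suit


def copy_card_hoisted(deck, hand, deck_index, hand_index):
    # Look up each object once, then work on the local references.
    source = deck[deck_index]
    target = hand[hand_index]
    target.value = source.value
    target.suit = source.suit
```

With two fields the repeated version calls each getter twice per copy and the hoisted version once; the gap grows with the number of fields copied.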

I could go on because I see a lot more room to optimize - but I think I made my point. Instead of fiddling with the memory manager one should first look if heap allocations are even the issue. And then identify unnecessary work and eliminate that.

... change TcaaEvaluationCard to be 8 bytes in size (that way the compiler does not emit a movsd/movsb instruction but simply does an 8-byte mov) -> 160ms
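
The size arguments above (20 bytes saved by dropping a string[19], and a record of exactly 8 bytes) can be made concrete with a small layout calculation. This is a sketch using Python's ctypes as a stand-in for Delphi records: the field names, the two 32-bit integer fields and the 64-byte cache line are assumptions for illustration, and the real TcaaEvaluationCard declaration in the attached project may differ.

```python
import ctypes

CACHE_LINE_BYTES = 64  # typical x86-64 cache line size (assumption)
DECK_SIZE = 52


class CardWithName(ctypes.Structure):
    """Hypothetical card layout that carries its display name inline."""
    _fields_ = [
        ("value", ctypes.c_int32),
        ("suit", ctypes.c_int32),
        # Delphi string[19] = 1 length byte + 19 characters = 20 bytes.
        ("name", ctypes.c_char * 20),
    ]


class CompactCard(ctypes.Structure):
    """Hypothetical 8-byte layout: the name is built only when needed."""
    _fields_ = [
        ("value", ctypes.c_int32),
        ("suit", ctypes.c_int32),
    ]


def layout_report(record_type) -> dict:
    """Size of one record and how densely it packs into memory."""
    size = ctypes.sizeof(record_type)
    return {
        "bytes_per_card": size,
        "cards_per_cache_line": CACHE_LINE_BYTES // size,
        "bytes_per_deck": size * DECK_SIZE,
    }


if __name__ == "__main__":
    for record_type in (CardWithName, CompactCard):
        print(record_type.__name__, layout_report(record_type))
```

Under these assumptions the compact record is 8 bytes instead of 28, so every copy moves less data and more cards share a cache line.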

Edited October 21, 2024 by Stefan Glienke

pyscripter · October 21, 2024

I am also sceptical. However, I see that the IDE expert project generator of one of the most starred GitHub Delphi repos, danieleteti/delphimvcframework (DMVCFramework, a popular framework for web APIs in Delphi), includes MSHeap by default.

@Daniele Teti Could you please enlighten us as to what led you to this choice?

By the way the RTL can be configured to use MSHeap, if compiled with the SIMPLEHEAP conditional define. The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap.

David Heffernan · October 22, 2024

19 hours ago, pyscripter said:

By the way the RTL can be configured to use MSHeap, if compiled with the SIMPLEHEAP conditional define. The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap.

Rather than recompiling the RTL, you can register a custom memory manager backed by MSHeap, which is very easy to do. And then you can add simple memory leaks on shutdown reporting.

I actually do use the MS heap for my MM because I want to implement a per thread MM so that I can get decent performance on NUMA systems.

pyscripter · October 22, 2024

1 hour ago, David Heffernan said:

And then you can add simple memory leaks on shutdown reporting.

Do you have any code you can share?

David Heffernan · October 22, 2024

You can increment a counter when allocating new blocks and decrement it when deallocating, and then bleat if the counter is non-zero on exit.
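
A minimal sketch of that counter idea, written in Python as a language-neutral model and not as Delphi code: `CountingHeap`, its integer handles and `report_on_exit` are hypothetical names. In a Delphi memory manager the same bookkeeping would presumably live in the allocation and deallocation callbacks registered with SetMemoryManager, using an atomic counter instead of the lock used here.

```python
import threading


class CountingHeap:
    """Toy allocator wrapper that tracks the number of live blocks."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic counter
        self._live_blocks = 0
        self._blocks = {}              # handle -> backing storage
        self._next_handle = 1

    def alloc(self, size: int) -> int:
        """Allocate a block and return an opaque handle to it."""
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._blocks[handle] = bytearray(size)
            self._live_blocks += 1     # one more block outstanding
            return handle

    def free(self, handle: int) -> None:
        """Release a block; raises KeyError on an unknown or double free."""
        with self._lock:
            del self._blocks[handle]
            self._live_blocks -= 1     # one fewer block outstanding

    @property
    def live_blocks(self) -> int:
        with self._lock:
            return self._live_blocks

    def report_on_exit(self) -> str:
        """Return a complaint if anything is still allocated, else ''."""
        count = self.live_blocks
        return f"memory leak: {count} block(s) not freed" if count else ""
```

A counter like this only says that something leaked, not what; it is the cheapest possible replacement for ReportMemoryLeaksOnShutdown.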

thatlr · October 22, 2024

3 hours ago, pyscripter said:

Do you have any code you can share?

Here is my MemTest unit, which I use together with WinMemMgr in all my production applications: https://github.com/thatlr/Delphi-SupportUnits/blob/main/source/MemTest.pas
