RDP1974

win11 24h2 msheap fastest

Hi,

I ran a quick benchmark of a single-threaded application (the attached poker game; of course it is not exhaustive and only exercises a small subset of things).

The Windows heap manager seems to have been improved: using the Windows heap directly is now faster than the default MM or Intel TBB (host: i9-9900KF, Win11 24H2, Delphi 12.2.1, FPC 3.2.2, release mode).

 

single thread x64 console mode->

 

D12 default:

Total Time: 1,031 Seconds
Hands Per Second  2520814,74296799
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960

 

D12 intel tbb (rdpmm64):

Total Time: 1,281 Seconds
Hands Per Second  2028852,45901639
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960

 

D12 msheap:

Total Time: 0,984 Seconds
Hands Per Second  2641219,51219512
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960

 

latest FPC lazarus:

Total Time: 2,25 Seconds
Hands Per Second  1155093,33333333
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960
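
As a rough cross-check of the four runs above: the hands-per-second figure is the total number of hands divided by the elapsed time, and the total of 2598960 is the number of distinct 5-card hands from a 52-card deck. The following is a minimal Python sketch, not part of the benchmark itself; the function names are made up for illustration and the timings are copied from the output above.

```python
import math

# Every 5-card hand from a 52-card deck is evaluated once.
TOTAL_HANDS = math.comb(52, 5)  # 2598960

# Elapsed seconds as reported in the runs above (decimal comma -> point).
timings = {
    "D12 default": 1.031,
    "D12 intel tbb (rdpmm64)": 1.281,
    "D12 msheap": 0.984,
    "latest FPC lazarus": 2.25,
}


def hands_per_second(seconds: float) -> float:
    """Throughput as printed by the benchmark: hands divided by time."""
    return TOTAL_HANDS / seconds


def relative_to(baseline: str, name: str) -> float:
    """Elapsed time of `name` as a fraction of the baseline's elapsed time."""
    return timings[name] / timings[baseline]


for name, seconds in timings.items():
    print(f"{name:26s} {hands_per_second(seconds):14.2f} hands/s  "
          f"{relative_to('D12 default', name):.3f}x default time")
```

On these numbers the msheap build needs roughly 95% of the default build's time, a gap of under 50 ms on a single run of about one second.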

 

Many runtimes, such as those of Rust, Clang and others, use the Windows heap directly, and it resists fragmentation, so I suppose it is okay to use it directly. It also behaves very well in multithreaded scenarios such as WebBroker apps.

 

look here if you wish

https://github.com/RDP1974/

 

Sorry to bore you with these things; I am just curious to squeeze out as much performance as possible.
Btw, do you know a more complete real-world test scenario than this?

wldd49.zip


The same test in a Windows Server 2022 (21H2) Hyper-V guest:

 

C:\Exes>cmd3_def
Total Time: 1,062 Seconds
Hands Per Second  2447231,63841808
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960


C:\Exes>cmd3_fm5 (fastmm5)
Total Time: 1 Seconds
Hands Per Second  2598960
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960


C:\Exes>cmd3_rdp64
Total Time: 1,344 Seconds
Hands Per Second  1933750
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960


C:\Exes>cmd3_laz
Total Time: 2,266 Seconds
Hands Per Second  1146937,33451015
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960


C:\Exes>cmd3_msheap
Total Time: 0,984 Seconds
Hands Per Second  2641219,51219512
Hand Evaluation  Expected Actual
Royal Flushes:          4 4
Straight Flushes:      36 36
Four of a Kinds:      624 624
Full Houses:         3744 3744
Flushes:             5108 5108
Straights:          10200 10200
Three of a Kinds:   54912 54912
Two Pairs:         123552 123552
One Pairs:        1098240 1098240
Other:            1302540 1302540
Total Hands:      2598960 2598960


I have actually seen similar behaviour with the MS memory manager. For one of my projects, I also switched to using the MS memory manager. I ran benchmarks a while ago and will look into publishing something in the future.

20 hours ago, RDP1974 said:

it resists fragmentation

Can somebody shed some light on that?

I don't really know much about heap fragmentation, but it is one of my worst nightmares.


That benchmark proves almost (*) nothing: the only places where it allocates are during form creation, where it builds the card deck, and later when it prints the output into the TListBox.

(*) The only thing affected here is the possible layout of the card objects in the heap, as they are all read during the hand-processing code. The difference you can observe between Delphi builds using different memory managers is most likely caused by the amount of overhead the respective memory manager uses: less overhead means more card objects fit within the same memory pages, and thus more of them (most likely all on modern processors) fit into L1 cache.

 

As for this particular code: removing the name of the cards from the object and only building it when it is needed for some UI would probably speed up the code more than anything else, because you get rid of 20 bytes for every object (Name is a string[19]). On my CPU this takes the code from ~900ms down to ~680ms, simply because it no longer needs to copy the strings in CopyCardFromDeck.

 

Circumventing the getter and setter of TList (which contribute around 25% of the remaining time) brings it down to 460ms.

And after that we are not done with string stuff: in every loop iteration it calls Hand.SetHighValues, which produces a name for the cards in the hand. Removing that gets me down to ~400ms.

 

Since I already have a run open in SamplingProfiler, I can see that one of the top scorers is now TcaaPokerHand.CopyCardFromDeck: the Items getter is not inlined, which causes it to be called 10 times for the same 2 card objects. Changing that gets me down to 270ms.

But how about avoiding repeated access to the same object in the 2 lists altogether? 230ms
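
The thread does not show CopyCardFromDeck itself, so the following is only a language-neutral sketch in Python of the pattern described here: fetching the same two objects through a list getter for every field versus fetching each once and reusing the reference. `CountingList`, the field names, and both copy functions are hypothetical and exist only to make the number of getter calls visible.

```python
class CountingList:
    """Minimal list wrapper whose item getter counts its own calls."""

    def __init__(self, items):
        self._items = list(items)
        self.getter_calls = 0

    def get(self, index):
        self.getter_calls += 1
        return self._items[index]


# Hypothetical card fields; the real record layout is not shown in the thread.
FIELDS = ("rank", "suit", "value", "deck_index", "sort_key")


def copy_card_naive(deck, hand, src, dst):
    # Each field copy goes through both getters again: 5 fields x 2 = 10 calls.
    for field in FIELDS:
        hand.get(dst)[field] = deck.get(src)[field]


def copy_card_hoisted(deck, hand, src, dst):
    # Fetch each object once, then work on the local references: 2 calls.
    source = deck.get(src)
    target = hand.get(dst)
    for field in FIELDS:
        target[field] = source[field]


def make_lists():
    """Build a 52-card deck and an empty 5-card hand for the comparison."""
    deck = CountingList(
        [{f: i * 10 + n for n, f in enumerate(FIELDS)} for i in range(52)]
    )
    hand = CountingList([dict.fromkeys(FIELDS, 0) for _ in range(5)])
    return deck, hand
```

Both functions leave the hand in the same state; only the number of getter calls differs, which matters when the getter is a non-inlined call in a loop that runs millions of times.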

 

I could go on because I see a lot more room to optimize, but I think I made my point. Instead of fiddling with the memory manager, one should first check whether heap allocations are even the issue, and then identify unnecessary work and eliminate it.

 

... changing TcaaEvaluationCard to be 8 bytes in size (so that the compiler no longer emits movsd/movsb instructions but simply does an 8-byte mov) -> 160ms
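
To illustrate the size argument, here is a small Python sketch using ctypes to model two record layouts. The field names and types are assumptions, since the actual declaration of TcaaEvaluationCard is not shown in this thread; the only detail taken from the post is that a string[19] occupies 20 bytes (one length byte plus 19 characters). The 64-byte cache line is likewise an assumption that holds for typical x86-64 CPUs.

```python
import ctypes

CACHE_LINE_BYTES = 64  # typical x86-64 cache line size (assumption)


class CardWithName(ctypes.Structure):
    # Hypothetical layout: two 32-bit fields plus a Delphi-style string[19],
    # which occupies 20 bytes (1 length byte + 19 characters).
    _fields_ = [
        ("rank", ctypes.c_int32),
        ("suit", ctypes.c_int32),
        ("name", ctypes.c_ubyte * 20),
    ]


class CompactCard(ctypes.Structure):
    # Same hypothetical fields without the embedded name: 8 bytes in total,
    # so one copy can be a single 64-bit move.
    _fields_ = [
        ("rank", ctypes.c_int32),
        ("suit", ctypes.c_int32),
    ]


def cards_per_cache_line(card_type) -> int:
    """How many whole records of this type fit into one cache line."""
    return CACHE_LINE_BYTES // ctypes.sizeof(card_type)


for card_type in (CardWithName, CompactCard):
    print(card_type.__name__, ctypes.sizeof(card_type),
          cards_per_cache_line(card_type))
```

Under these assumptions the record with the embedded name takes 28 bytes and the compact one 8 bytes, so four times as many compact records fit into a cache line.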

Edited by Stefan Glienke

I am also sceptical. However, I see that in one of the most starred GitHub Delphi repos, danieleteti/delphimvcframework (DMVCFramework, a popular framework for web APIs in Delphi that supports RESTful and JSON-RPC development), the IDE expert project generator includes MSHeap by default.

 

@Daniele Teti Could you please enlighten us as to what led you to this choice?

 

By the way the RTL can be configured to use MSHeap, if compiled with the SIMPLEHEAP conditional define.  The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap.


 

19 hours ago, pyscripter said:

By the way the RTL can be configured to use MSHeap, if compiled with the SIMPLEHEAP conditional define.  The downside is that ReportMemoryLeaksOnShutdown is not available with MSHeap.

 

Rather than recompiling the RTL, you can register a custom memory manager backed by MSHeap, which is very easy to do. And then you can add simple memory leaks on shutdown reporting.

 

I actually do use the MS heap for my MM because I want to implement a per thread MM so that I can get decent performance on NUMA systems.

1 hour ago, David Heffernan said:

And then you can add simple memory leaks on shutdown reporting.

Do you have any code you can share?


You can increment a counter when allocating new blocks, decrement it when deallocating, and then bleat if the counter is non-zero on exit.
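
No code was posted for this, so here is only a language-neutral sketch in Python of the counting idea described above. `CountingHeap`, `get_mem`, `free_mem`, and `leak_report` are hypothetical names, and `bytearray` merely stands in for the real backing allocator (in the setup discussed here that would be the Windows heap).

```python
class CountingHeap:
    """Wraps allocate/free callables and tracks how many blocks are live."""

    def __init__(self, allocate, free):
        self._allocate = allocate
        self._free = free
        self.live_blocks = 0

    def get_mem(self, size):
        block = self._allocate(size)
        if block is not None:
            self.live_blocks += 1  # count only successful allocations
        return block

    def free_mem(self, block):
        self._free(block)
        self.live_blocks -= 1

    def leak_report(self):
        """Return a complaint if blocks are still live, else None."""
        if self.live_blocks != 0:
            return f"memory leak: {self.live_blocks} block(s) never freed"
        return None


# Stand-in backing allocator for the sketch; a real one would call the OS heap.
heap = CountingHeap(allocate=bytearray, free=lambda block: None)

a = heap.get_mem(16)
b = heap.get_mem(32)
heap.free_mem(a)
print(heap.leak_report())  # b was never freed, so a leak is reported
```

A reallocation leaves the counter unchanged because the number of live blocks stays the same. In a multithreaded program the counter updates would need to be atomic, and the check belongs as late in shutdown as possible so that normal cleanup is not reported as a leak.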
