Günther Schoch

FastMM5 now released by Pierre le Riche (small background story)


Hi

 

I would like to give you some background information on today's release of FastMM5.

 

Exactly 2 years ago we (gs-soft) met Marco Cantu to discuss some important points in the long-term Delphi strategy. Efficient memory management in heavily multi-threaded applications was one of these topics.

 

At that time Marco did not see much chance that the Delphi development team could improve the situation, but he put us in touch with Pierre le Riche for further discussion.

Since then Pierre has invested a lot of design effort and time in the new FastMM5 to overcome these limitations. We (gs-soft) sponsored part of this large piece of work.

 

More than a year and several redesigns later, the FastMM5 Alpha was ready. Since then several Betas have been running on our test platforms, and we have even shipped some of our products using the Beta. In the last 4 months we have also used our AMD EPYC 64/128 CPU based servers to learn more about fine-tuning the FastMM5 environment. Other Beta testers were involved as well to cross-check the stability.

 

Pierre still sees several areas for improvement, but in our shared opinion FastMM5 is now ready to be used by other companies.

Release 5.00 is available for download at https://github.com/pleriche/FastMM5/blob/master/README.md

Please be aware that Pierre decided to go for a dual licensing model. The reasoning is clear: a good product needs maintenance, and maintenance costs money.

Have fun with testing!

 

Günther Schoch / gs-soft AG


What do you think about FPC + Linux support, which is a good environment for multi-threaded servers?

The FPC built-in heap is good, but tends to consume a lot of memory with a lot of threads: it maintains small per-thread heaps using a threadvar, whereas FastMM5 uses several arenas which are shared among all threads (I guess the idea is inspired by ptmalloc, the glibc allocator).
I tried all the best-known C alternatives, and I was not convinced. The only stable and non-bloated memory manager is the one in glibc. But the slightest memory access violation tends to kill/abort the process, so it is not good in production.
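To make the two designs concrete, here is a minimal Delphi sketch of the contrast. It is an illustration only, not code from FPC's heap or from FastMM5: the names `ArenaCount`, `ThreadLocalFreeBytes`, `ArenaLocked`, `TryLockAnyArena` and `UnlockArena` are all hypothetical, and a real allocator does far more than lock flags.

```delphi
program ArenaSketch;

{$APPTYPE CONSOLE}

const
  ArenaCount = 4; // hypothetical value, not FastMM5's actual arena count

threadvar
  // Per-thread model: each thread owns private state, so no locking is needed,
  // but memory cached by one thread cannot be reused by another.
  ThreadLocalFreeBytes: NativeInt;

var
  // Shared-arena model: a small fixed set of arenas, each guarded by a lock flag.
  ArenaLocked: array[0..ArenaCount - 1] of Integer;

// Returns the index of the first arena that could be locked, or -1 if all are busy.
function TryLockAnyArena: Integer;
var
  I: Integer;
begin
  for I := 0 to ArenaCount - 1 do
    if AtomicCmpExchange(ArenaLocked[I], 1, 0) = 0 then
      Exit(I);
  Result := -1;
end;

procedure UnlockArena(AIndex: Integer);
begin
  AtomicExchange(ArenaLocked[AIndex], 0);
end;

var
  First, Second: Integer;
begin
  First := TryLockAnyArena;  // no arena is taken yet, so this locks arena 0
  Second := TryLockAnyArena; // arena 0 is busy, so this moves on to arena 1
  Writeln('First: ', First, ', second: ', Second);
  UnlockArena(Second);
  UnlockArena(First);
end.
```

With shared arenas the memory overhead is bounded by the number of arenas rather than the number of threads, at the cost of an atomic operation per lock attempt.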

 

I could definitely help with the Linux/FPC syscalls and the low-level Intel asm, to include FPC/Linux support in FastMM5.

But perhaps I would go in this direction only if using FPC as the compiler doesn't require a commercial license.

What do you think?

Edited by Arnaud Bouchez
7 hours ago, Arnaud Bouchez said:

What do you think about FPC + Linux support, which is a good environment for multi-threaded servers?

... But perhaps I would go in this direction only if using FPC as the compiler doesn't require a commercial license.

The next steps depend on Pierre le Riche and are also influenced by the commercial side. In a first phase, though, we are focusing on getting 5.0x running optimally for all use cases under Win32 and Win64.


Actually I don't see any difference between Rio and FastMM5. Can someone explain, with simple examples, where I can win?


I can't find a link to the FastMM4Options.inc file. Does FastMM5 no longer have a configuration file?

46 minutes ago, Jacek Laskowski said:

I can't find a link to the FastMM4Options.inc file. Does FastMM5 no longer have a configuration file?

From the introduction:

 

  • It is fully configurable runtime. There is no need to change conditional defines and recompile to change options. (It is however backward compatible with many of the version 4 conditional defines.)
  • It may be configured runtime to favour speed, memory usage efficiency or a blend of the two via the FastMM_SetOptimizationStrategy call.
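As a rough sketch of what that runtime configuration can look like in a project file: `FastMM_SetOptimizationStrategy` and `mmosOptimizeForSpeed` appear elsewhere in this thread, while the other two enum names in the comment are from memory and should be checked against FastMM5.pas. The program name is made up, and FastMM5.pas is assumed to be on the search path.

```delphi
program StrategySketch;

{$APPTYPE CONSOLE}

uses
  FastMM5, // assumption: FastMM5.pas is on the search path; keep it first in the uses clause
  System.SysUtils;

begin
  // Choose the speed/memory trade-off at startup instead of editing
  // conditional defines and recompiling.
  FastMM_SetOptimizationStrategy(mmosOptimizeForSpeed);
  // Other values I believe exist (verify in FastMM5.pas):
  //   mmosBalanced, mmosOptimizeForLowMemoryUsage
  Writeln('Strategy set; run the workload here.');
end.
```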

7 hours ago, dkprojektai said:

Can someone explain, with simple examples, where I can win?

It's explained in the readme.

It would be simpler if you described what you're doing and what your experience is, than us explaining every possible improved scenario.

21 hours ago, dkprojektai said:

Actually I don't see any difference between Rio and FastMM5. Can someone explain, with simple examples, where I can win?

We expected that developers with some heavily multi-threaded products would just replace the FastMM4 unit reference with FastMM5 and retest. If you see a speed gain or a drop in CPU consumption, then your application was limited by the design of FastMM4.

But I see that a small, simple introductory example will help. We are working on that sample and will publish it soon. But please do not blame us later if the sample is not realistic for good code 🙂. Most of the time you have the chance to micro-optimize your own code to remove the hot spots. But in certain cases that is no longer possible (e.g. libraries provided by Delphi or a 3rd party, or just a lack of time and resources).
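For the "replace the unit reference and retest" step, one possible way to keep both builds around is a conditional uses clause in the project file. This is only a sketch: the define name `USE_FASTMM5` and the program name are made up for illustration, and both FastMM4.pas and FastMM5.pas are assumed to be on the search path.

```delphi
program MMSwitchSketch;

{$APPTYPE CONSOLE}

// Hypothetical define for this sketch; remove it (or set it in the
// project options) to build against FastMM4 instead.
{$DEFINE USE_FASTMM5}

uses
  // The memory manager unit has to stay first in the uses clause.
  {$IFDEF USE_FASTMM5}
  FastMM5,
  {$ELSE}
  FastMM4,
  {$ENDIF}
  System.SysUtils;

begin
  {$IFDEF USE_FASTMM5}
  Writeln('Running with FastMM5');
  {$ELSE}
  Writeln('Running with FastMM4');
  {$ENDIF}
  // Run the same multi-threaded workload in both builds and compare
  // elapsed time and CPU consumption.
end.
```

Be aware that the IDE sometimes rewrites the uses clause of a .dpr, so conditional sections there may need to be restored by hand.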

1 hour ago, Günther Schoch said:

OK - I have now attached a small first test (FastMM5ConsoleTest.dpr) that compares FastMM5 with FastMM4. Please first read the attached SpeedTestMM5_Sample1.pdf to understand more about the background and motivation.

Thank you Gunther,

 

But I have a question. I just ran your test on Seattle, and I also tested it with ScaleMM2 and BrainMM. BrainMM was only 10-20% faster on 32/64-bit, while ScaleMM2 performed twice as fast as FastMM5 on 32-bit and, like BrainMM, around 10-20% faster on 64-bit. Here I think ScaleMM is missing a few things to perform the same on 32 and 64 bits. Can you confirm this result, or am I missing something?

 

Note: I am not saying that ScaleMM or BrainMM are stable for production. They may or may not be OK, but in the past I saw some unexplained behaviour that made me exclude them from real-world usage.

33 minutes ago, Kas Ob. said:

I think ScaleMM is missing a few things to perform the same on 32 and 64 bits

For me SMM2 is faster on x64 than on x86. SMM2 also uses much more RAM than FMM5, so it's possible that FMM5 could be tuned to use more RAM and even less CPU (?).

Edited by Attila Kovacs

33 minutes ago, Attila Kovacs said:

For me SMM2 is faster on x64 than on x86. SMM2 also uses much more RAM than FMM5,

That is interesting!

My CPU is an i7-2600K, what is yours? ScaleMM2 on 64-bit is almost twice as slow as on 32-bit, but still faster than FastMM5. I just rechecked. Also, I don't have Rio, if that matters.

 

35 minutes ago, Attila Kovacs said:

so it's possible that FMM5 could be tuned to use more RAM and even less CPU (?).

I don't think so! Or at least it will not be an easy task to begin with, as the two take very different approaches.

 

Adding NUMA support to ScaleMM, though, would be much easier and a clearer, simpler design, as it is built to have a memory manager per thread.


@Kas Ob. Okay, I don't know how interesting it is.

I've downloaded the SMM2 sources and changed the 'xxx..' stuff in the test dpr to some lorem ipsum text. My CPU is an i7-3930K.

The two fastest runs:

// Parallel For used : 8857900 ticks 32-bit
// Parallel For used : 7113633 ticks 64-bit

 

Edit: Under Berlin U2.

Edited by Attila Kovacs

19 minutes ago, Attila Kovacs said:

Okay, I don't know how interesting it is.

Here is my result, showing that something is wrong on my side!

This is the result of comparing FastMM5 (32 and 64 have the same result):

 

[Image: FastMM5 test results]

 

And the result of the same EXEs on Hyper-V with little CPU power (1 virtual processor), running on an E5-1650v3:

 

[Image: FastMM5 test results on Hyper-V]


@Kas Ob. Everything is fine with your test. I reverted back to 'a'+c+'xxx..' and the results are the same as yours. However, changing 'xxx..' to something longer changes the game. It performs better on larger memory chunks, but in return also uses a lot more RAM.

Btw, now I understand why it was slower on the first couple of runs: it did not reach the barrier.

FMM5 performs more homogeneously, with a seamless transition between the block sizes.

Edited by Attila Kovacs

16 minutes ago, Attila Kovacs said:

FMM5 performs more homogeneously, with a seamless transition between the block sizes.

That is exactly what caught my attention. I did a stress test on one of my servers, and the result was very stable, meaning no peaks in throughput. There weren't those +-15% surges in traffic, and that was impressive. I have never seen my server stress test that stable in throughput and memory usage; the threads had almost the same number of context switches.

 

The doubled speed of ScaleMM on 32-bit came from the optimized move (SSE3) in Optimize.Move.pas; with it excluded, the 32-bit and 64-bit results were very close.

 

Now some more interesting observations:

Including Optimize.Move in BrainMM doubles the speed.

Including Optimize.Move in FastMM5 increases the speed by only 25%.


I ran your console bench with FastMM4, FastMM5, and an optimized Intel TBB for Delphi64 (feel free to use it).

 

The result on VMware with 8 vCPUs, i9 at 5 GHz, Windows Server 2016:

 

FastMM5 is 4x faster than FastMM4; IntelTBB is 5x faster than FastMM5 and 18x faster than FastMM4

 

This new generation of allocators based on a TLS cache is faster and is used in production (I see game engines such as Unreal using TBB by default).

Visual Studio C and C++ have an option to optimize using TBB and IPP.

Furthermore, they are better suited for memory error discovery and are tested for 24/7/365 use.

In my humble opinion Delphi should license TBB from Intel (it has a free OSS license) and port it to CLANG, rewriting the missing $TLS runtime API. The WINAPI headers dependency of msvcrt should be avoided by using the C++Builder winapi 7.0 repository.

This should be used on Win32, Win64, Android, Linux, iOS, and OSX.

Another cool, free C allocator is Microsoft's mimalloc.

(IMHO Delphi 64-bit can have a nice place in cloud and distributed web apps; with a modern allocator it can compete with Rust, Erlang, and Go.)

 

C:\Exes>FastMM5ConsoleTest_F4
Parallel For used : 1479456 ticks
Parallel For used : 1593960 ticks
Parallel For used : 1492162 ticks
Parallel For used : 1516575 ticks
Parallel For used : 1504889 ticks
Parallel For used : 1616684 ticks
Parallel For used : 1694674 ticks
Parallel For used : 1659002 ticks
Parallel For used : 1509797 ticks
Parallel For used : 1623232 ticks
Parallel For used : 1549025 ticks
Parallel For used : 1768947 ticks
Parallel For used : 1860454 ticks
Parallel For used : 1813156 ticks
Parallel For used : 2014587 ticks
Parallel For used : 1896651 ticks
Parallel For used : 1918023 ticks
Parallel For used : 1869937 ticks
Parallel For used : 1832852 ticks
Parallel For used : 1855156 ticks
Done. Press ENTER to exit

 

C:\Exes>FastMM5ConsoleTest_F5 (FastMM_SetOptimizationStrategy(mmosOptimizeForSpeed))
Parallel For used : 429409 ticks
Parallel For used : 428977 ticks
Parallel For used : 439715 ticks
Parallel For used : 431561 ticks
Parallel For used : 441682 ticks
Parallel For used : 448713 ticks
Parallel For used : 457904 ticks
Parallel For used : 451374 ticks
Parallel For used : 420869 ticks
Parallel For used : 433840 ticks
Parallel For used : 428119 ticks
Parallel For used : 426678 ticks
Parallel For used : 431399 ticks
Parallel For used : 432025 ticks
Parallel For used : 429793 ticks
Parallel For used : 420178 ticks
Parallel For used : 422983 ticks
Parallel For used : 433726 ticks
Parallel For used : 426557 ticks
Parallel For used : 418806 ticks
Done. Press ENTER to exit

 

C:\Exes>FastMM5ConsoleTest_Intel
Parallel For used : 85910 ticks
Parallel For used : 82550 ticks
Parallel For used : 84917 ticks
Parallel For used : 81707 ticks
Parallel For used : 81077 ticks
Parallel For used : 80789 ticks
Parallel For used : 81069 ticks
Parallel For used : 81506 ticks
Parallel For used : 85098 ticks
Parallel For used : 84156 ticks
Parallel For used : 84978 ticks
Parallel For used : 81699 ticks
Parallel For used : 84017 ticks
Parallel For used : 79480 ticks
Parallel For used : 80324 ticks
Parallel For used : 80736 ticks
Parallel For used : 83380 ticks
Parallel For used : 84887 ticks
Parallel For used : 78052 ticks
Parallel For used : 82792 ticks
Done. Press ENTER to exit

 

Edited by RDP1974


Thank you to all who have provided feedback on my first test example.

 

Remember that I was asked to show evidence that FastMM5 scales better than FastMM4. I think we agree that this has been shown. The sample was actually not meant to be a full-scale comparison with other memory managers. But of course it was used in that direction (I would have done that as well 🙂).

 

Concerning the difference to ScaleMM, we will certainly go into more detail (I still hope to see similar jumps in x64 as well).

As with a "good Formula 1 car design", we are convinced that we can easily improve step by step while keeping all the features such as "memory leak check", "FullDebugMode" etc.

Concerning IntelTBB, we will have to run some "full-scale comparisons" to see in which real-world cases we have a clear difference (and why).

 

regards Günther

1 hour ago, RDP1974 said:

IntelTBB is 5x faster than FastMM5

Against my better judgement I'm going to bite. Below are screenshots of the "CPU time" and "Peak working set" columns in Task Manager after a single benchmark run of the Fastcode Memory Manager Benchmark & Validation tool. It includes a variety of tests, including replays of memory usage recordings of real world applications:

 

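[Image: FastMM5 benchmark screenshot (CPU time and peak working set)]

[Image: IntelTBB benchmark screenshot (CPU time and peak working set)]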


Work scenarios can differ; a thread pool using the heap will benefit a lot from TBB+IPP.

But, memory apart, I wish Embarcadero would update Delphi and the linker to accommodate modern C libraries ($TLS).

kind regards


It's pointless to say that one memory manager is n times faster than another because usage varies so much. One thing that is surely true is that different applications have different needs.

 

I'm personally pretty sceptical that one single memory manager can perform optimally in a wide range of usage scenarios. Nothing wrong with choosing the MM that best fits each application. 


We're already evaluating it. First impression: Promising!

However, I'm not sure GPLv3 is what he should have used for what he is trying to achieve with the dual license.


Hi !

On my side, I have 2 professional use cases (Delphi and FPC) where we changed the standard memory manager:


First, I fully agree with David Heffernan: we have not managed to find "the" memory manager that meets all our needs.

-> I consider myself a "basic user" on this subject: I plug one in, launch the tests, and take the best according to the memory/speed/context metrics. :)


- First case, where we used FastMM4 in a customized bus-oriented server (GridServer + custom Synapse-based socketing): works very well. (Used primarily for the memory consumption barrier.)

- Second case, a raytracer-like image generator, where we targeted the best thread usage with memory sharing: we currently use ScaleMM2 on the Intel-based backend server (and nothing on the embedded ARM-based one). (Side question: is there a memory manager that performs well on Linux/ARM under FPC?)

 

In the first use case, comparing FastMM4 and FastMM5, I get a slightly better result with FastMM5. I'll dig :)

In the second one, the test results show ScaleMM2 keeping its advantage
(on average 35% faster than FastMM5; again, no tuning).
Note that this result is the same with FastMM4.

 

@Pierre le Riche If you are interested in getting some test code for this "graphics" test, I'll be happy to cooperate.

 

That's it. In any case, *thank you a lot*, Pierre, for your *amazing work* (and @Günther Schoch's company for sponsoring!): FastMM5 is cool and works well as is, with no real need any more to tune on the compile side (this is cool), and the overall compatibility seems to be nice.

 

regards,

Vincent

 

Edited by Vincent Gsell
