Günther Schoch

FastMM5 now released by Pierre le Riche (small background story)


I don't understand the license yet. Do I need a FastMM5 license ($99) for every program I sell?

1 hour ago, Ralf7 said:

Do I need a FastMM5 license ($99) for every program I sell?

 

I think it's pretty clear: You need a license per developer, not per end user:

Quote

Licence

FastMM 5 is dual-licensed. You may choose to use it under the restrictions of the GPL v3 licence at no cost to you, or you may purchase a commercial licence. A commercial licence includes all future updates. The commercial licence pricing is as follows:

Number of developers      Price (USD)
1 developer               $99
2 developers              $189
3 developers              $269
4 developers              $339
5 developers              $399
More than 5 developers    $399 + $50 per developer from the 6th onwards
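For developer counts beyond the table, the price follows the stated formula. A quick C sketch of the pricing arithmetic from the quoted table (the function name is my own invention, not part of FastMM5):

```c
#include <stddef.h>

/* Commercial licence price in USD for a given number of developers,
   per the table quoted above: fixed prices up to 5 developers,
   then $399 + $50 per developer from the 6th onwards. */
int fastmm5_licence_price_usd(int developers)
{
    static const int fixed[] = {0, 99, 189, 269, 339, 399};
    if (developers >= 1 && developers <= 5)
        return fixed[developers];
    return 399 + 50 * (developers - 5);
}
```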

 

 

  • Like 2


Hi Team,

Please pardon my ignorance here; I have never done anything with threading, and I have no idea if any of the 3rd-party stuff I use does. I suspect not, but it is possible, I suppose.

So, to my question:

Given my non-threaded environment, and aside from the debugging aspects (I use EurekaLog), does FastMM5 provide any benefit over the native Delphi (10.3.3) memory manager?

 

Regards & TIA,

Ian

 

 

Guest
1 hour ago, Ian Branch said:

does FastMM5 provide any benefit over the native Delphi (10.3.3) memory manager?

Yes, it does.

 

Quote

In the Fastcode memory manager benchmark tool FastMM 5 scores 15% higher than FastMM 4.992 on the single threaded benchmarks, and 30% higher on the multithreaded benchmarks. (I7-8700K CPU, EnableMMX and AssumeMultithreaded options enabled.)

And FastMM 4.992 is better than the stock (native) MM.

9 hours ago, Ian Branch said:

does FastMM5 provide any benefit over the native Delphi (10.3.3) memory manager?

Impossible to know without benchmarking. Depends on what your application spends its time doing. 

Edited by David Heffernan


If I understand correctly, FastMM5 maintains several arenas instead of the single arena in FastMM4, and tries each of them until it finds one that is not currently locked, so thread contention is less likely to occur.
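The multi-arena try-lock idea described above can be sketched roughly like this in C (all names and constants are hypothetical, not FastMM5's actual structures):

```c
#include <stdatomic.h>

#define ARENA_COUNT 4

/* Hypothetical arena record; the real FastMM5 structures differ. */
typedef struct {
    atomic_int locked;              /* 0 = free, 1 = taken */
    /* ... free lists, block pools ... */
} Arena;

static Arena arenas[ARENA_COUNT];

/* Round-robin try-lock: take the first arena whose lock we can grab
   without blocking; only when all are busy do we go around again. */
static Arena *acquire_arena(unsigned start)
{
    for (;;) {
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            Arena *a = &arenas[(start + i) % ARENA_COUNT];
            if (atomic_exchange_explicit(&a->locked, 1,
                                         memory_order_acquire) == 0)
                return a;           /* got an uncontended arena */
        }
        /* all arenas locked: retry (a real MM might pause/yield here) */
    }
}

static void release_arena(Arena *a)
{
    atomic_store_explicit(&a->locked, 0, memory_order_release);
}
```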


One area where FastMM5 may still have room for improvement is its naive use of "rep movsb", which should rather be a non-temporal SSE2/AVX move for big blocks.
Check the numbers at https://stackoverflow.com/a/43574756/458259 for instance.

 

ScaleMM2 and the FPC heap both use a threadvar arena for small blocks, so they don't need to check for any lock: they are truly lock-free.
But each thread maintains its own small-block arena, so they consume more memory.
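A per-thread small-block arena of the kind ScaleMM2 and the FPC heap use can be sketched like this in C (a simplified bump allocator; all names and sizes are made up, not the actual ScaleMM2/FPC implementation):

```c
#include <stdlib.h>

#define POOL_SIZE (64 * 1024)

/* Per-thread small-block arena: each thread allocates from its own
   pool, so no lock is ever checked.  The cost is memory: every thread
   keeps its own partially-used pool. */
typedef struct {
    unsigned char pool[POOL_SIZE];
    size_t used;
} ThreadArena;

static _Thread_local ThreadArena tls_arena;   /* one arena per thread */

/* Bump allocator for small blocks; falls back to malloc when full. */
static void *small_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;          /* 16-byte alignment */
    if (tls_arena.used + size <= POOL_SIZE) {
        void *p = tls_arena.pool + tls_arena.used;
        tls_arena.used += size;                /* no lock, no contention */
        return p;
    }
    return malloc(size);                       /* overflow path */
}
```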

 

Other memory managers like Intel TBB or jemalloc also take a similar per-thread approach, but consume much more memory.
For instance, TBB is a huge winner in performance, but it consumes up to 60 (sixty!) times more memory! So it is not usable in practice for serious server work - it may help for heavily multithreaded apps, but not for common services.
I tried those other MMs with mORMot and a real, heavily multi-threaded service. Please check the comments at the beginning of https://synopse.info/fossil/artifact/f85c957ff5016106
One big problem with the C memory managers is that they tend to ABORT the process (not a SIGSEGV but a SIGABRT) if there is a dangling pointer - which happens sometimes, and is very difficult to track. This paranoia makes them impossible to use in production: you don't want your service to shut down with no prior notice just because the MM complains about a pointer!

 

Our only problem with the FPC heap is that with long-running servers it tends to fragment the memory and consumes some GB of RAM, whereas a FastMM-like memory manager would have consumed much less.
The FPC heap's memory consumption doesn't leak: it stabilizes after a few days, but it is still higher than FastMM's.
The problem with FastMM (both 4 and 5) is that they don't work on Linux x86_64 with FPC.

This is why I proposed to help Eric with FPC support.

  • Like 2

3 hours ago, Arnaud Bouchez said:

One area where FastMM5 may still have room for improvement is its naive use of "rep movsb", which should rather be a non-temporal SSE2/AVX move for big blocks.

Rep movs is only used for large blocks, where it is the fastest mechanism on modern processors. For smaller blocks SSE2 is used, if available.

 

As an interesting aside, apparently the Linux kernel uses rep movs for all moves. The rationale behind that is to force the CPU manufacturers to optimize rep movs even for small moves. Historically rep movs has been hit and miss from one CPU generation to the next.

 

3 hours ago, Arnaud Bouchez said:

The problem with FastMM (both 4 and 5) is that they don't work on Linux x86_64 with FPC.

I would like to add more platform support, but there has to be sufficient demand to offset the development and maintenance burden. I don't want v5 to become an "ifdef hell" like, to a large extent, v4. v4 supports so many different configurations that it is littered with ifdefs. It makes it really hard to expand functionality because there are so many different combinations that need to be tested.

 

That said, it is on my "to-do" list to add support for most or all of the platforms supported by the latest Delphi versions. I think it would be particularly useful to have the debugging features available on other platforms as well.

  • Like 4
  • Thanks 1

Guest

If I may add my 2 cents:

1) 

4 hours ago, Arnaud Bouchez said:

One area where FastMM5 may still have room for improvement is its naive use of "rep movsb", which should rather be a non-temporal SSE2/AVX move for big blocks.

 

36 minutes ago, Pierre le Riche said:

Rep movs is only used for large blocks, where it is the fastest mechanism on modern processors. For smaller blocks SSE2 is used, if available.

Don't go after the assembly now - it would be a waste of your time. On that matter, here is a good article: http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html

 

2) The slowness, in my experience, comes from two things in this code: branching and cache misses:

a) Too many jumps and loops. This is one function I replaced with the following:

procedure SetMediumBlockHeader_SetSizeAndFlags(APMediumBlock: Pointer; ABlockSize: Integer; ABlockIsFree: Boolean;
  ABlockHasDebugInfo: Boolean); //inline;
var
  LPNextBlock: Pointer;
begin
  PMediumBlockHeader(APMediumBlock)[-1].BlockStatusFlags := CHasDebugInfoFlag * Ord(ABlockHasDebugInfo) + CIsMediumBlockFlag + CBlockIsFreeFlag * Ord(ABlockIsFree);
  LPNextBlock := @PByte(APMediumBlock)[ABlockSize];

  {If the block is free then the size must also be stored just before the header of the next block.}
  //if ABlockIsFree then        // Possibly safe to omit the test: setting the size unconditionally is faster than a compare-and-jump
    PMediumFreeBlockFooter(LPNextBlock)[-1].MediumFreeBlockSize := ABlockSize;

  {Update the flag in the next block header to indicate whether this block is free.  The block size is not stored
    before the header of the next block if this block is in use.}
  PMediumBlockHeader(LPNextBlock)[-1].PreviousBlockIsFree := ABlockIsFree;

  {Store the block size.}
  PMediumBlockHeader(APMediumBlock)[-1].MediumBlockSizeMultiple := ABlockSize shr CMediumBlockAlignmentBits;
end;

[Screenshot: disassembly of SetMediumBlockHeader_SetSizeAndFlags]

The result is around a 0.9% speed gain from this simple change; ideally it should be four inlined functions instead of one.

 

b) The core cause of this wasted time is CPU cache misses. Please read the following for a better understanding (better English than mine): https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/559843

And I genuinely think the speed could easily be doubled if the access were built differently or a new access method were introduced. That was the cause in my opinion, and here is a suggestion as a remedy:

I would like to combine a singly linked list (or even a doubly linked list) with a shadow list for walking. The shadow list doesn't need to be 100% accurate and would be used in very specific methods to decrease access time to the linked list; at the very least, it lets me skip walking the singly linked list in some operations.

PendingFreeList, on the first attempt in FastMM_GetMem_GetMediumBlock, is the most valuable to performance, and yet it is the one taking the most time in Günther's test. So a list (a simple array) could be introduced that is accessed without locking and allowed to be less than 100% accurate or up to date (you might want to test them in parallel first). This might save very valuable time, and such a list can of course be walked many times faster than the linked list; after that, it is up to you to find the best method to lock or not.

 

Food for thought:

Is there a benefit to introducing two wait-free LIFO lists, one for in-use blocks and the other for freeing/pending, with items swapped between those lists (simple arrays)? Just swapping.

For medium blocks: the size is determined, so there could be two lists as above per size. This is a waste of space; is the returned speed worth it? E.g. for a block size of 2880 bytes: 1 MB (1024*1024) / 2880 ≈ 364, and 365*4 (or *8) = 1460 (2920) bytes to hold such a list per megabyte for that block size.
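The two-list swap idea could look roughly like this (a single-threaded C sketch with invented names; a real implementation would need an atomic swap to be wait-free):

```c
#include <stddef.h>

#define LIST_CAP 365   /* e.g. ~1 MiB / 2880-byte medium blocks */

/* Two simple array-backed LIFO lists per size class: one holds blocks
   to hand out, the other blocks pending free.  When the allocation
   list runs dry, the two arrays are swapped in O(1): only the
   pointers move, never the items. */
typedef struct {
    void *items[LIST_CAP];
    size_t count;
} BlockList;

typedef struct {
    BlockList a, b;
    BlockList *in_use;    /* blocks to hand out next */
    BlockList *pending;   /* blocks returned by FreeMem */
} SizeClass;

static void *pop_block(SizeClass *sc)
{
    if (sc->in_use->count == 0) {
        /* swap the two lists: pending frees become the new supply */
        BlockList *t = sc->in_use;
        sc->in_use = sc->pending;
        sc->pending = t;
        if (sc->in_use->count == 0)
            return NULL;          /* both empty: refill from the OS */
    }
    return sc->in_use->items[--sc->in_use->count];
}

static void push_pending(SizeClass *sc, void *block)
{
    sc->pending->items[sc->pending->count++] = block;
}
```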

 

57 minutes ago, Pierre le Riche said:

I don't want v5 to become an "ifdef hell" like, to a large extent, v4.

FastMM5 is one of the most beautiful pieces of code I have ever seen! Neat, tucked-in and clear.

 

As for arenas, I suggest (for the to-do list) modifying the arenas to be dynamic with automatic growth, though this task must come last, if it is ever to be realized. The idea is that any app starts with one arena and increases the number of arenas at runtime based on the contention count, up to (n) the number of CPU cores, or 1.5-2 times that n.

 

Sorry for this long post.

 

 

2 hours ago, Kas Ob. said:

The slowness, in my experience, comes from two things in this code: branching and cache misses

I don't know about the particular piece of code, but I absolutely second that: if you can avoid branching, please, please do so. Unfortunately, neat and clean code (an innocent-looking if or a loop) can easily turn into a performance drain.

3 hours ago, Kas Ob. said:

Don't go after the assembly now, it will be waste of your time, for that matter here a good reading article http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html

The assembly language makes a noticeable difference in performance under x86 due to the limited number of registers and the fact that the 32-bit compiler shuffles variables between the stack and registers more often than is necessary. Under 64-bit there are more registers, and the 64-bit compiler also optimizes a bit better, so assembly language generally provides less of a benefit under 64-bit, but given that it is easy to translate 32-bit assembly to 64-bit I went ahead and translated many of the 32-bit assembly language routines anyway. In places where I need to use instructions for which the language currently has no intrinsic (e.g. bit scan forward "bsf") I don't have much choice in the matter.
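For comparison, C compilers expose bit scan forward through a builtin, so no hand-written assembly is needed there (gcc/clang shown; Delphi currently has no equivalent intrinsic, which is why Pierre resorts to assembly):

```c
#include <stdint.h>

/* Find the index of the first set bit in a bitmap word, as memory
   managers do for free-block and size-class lookups.  On x86 the
   builtin compiles down to a single BSF/TZCNT instruction. */
static int first_set_bit(uint32_t bitmap)
{
    if (bitmap == 0)
        return -1;                 /* no bit set: caller must handle */
    return __builtin_ctz(bitmap);  /* count trailing zeros == bit scan forward */
}
```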

 

3 hours ago, Kas Ob. said:

 

2) The slowness, in my experience, comes from two things in this code: branching and cache misses:

a) Too many jumps and loops. This is one function I replaced with the following:

It is critical that the method you used as an example be inlined. In most of the places it is called from, the arguments are all constants, so the compiler will short-circuit all the branches during inlining and produce close to optimal code (bar some shuffling of values between registers which I could not coax it out of doing). If you take out the "inline" then it becomes very inefficient.

 

I've spent a lot of time looking at the compiled code disassembly, reorganising the Pascal code, and making use of inlining to nudge the compiler towards producing better output. There's no built-in assembler support for some of the platforms, so with the goal of eventually supporting more platforms than Windows it is worth the time spent.

 

3 hours ago, Kas Ob. said:

 

As for arenas, I suggest (for the to-do list) modifying the arenas to be dynamic with automatic growth, though this task must come last, if it is ever to be realized. The idea is that any app starts with one arena and increases the number of arenas at runtime based on the contention count, up to (n) the number of CPU cores, or 1.5-2 times that n.

That is also on the to-do list. The aim is to make everything configurable at runtime, and the number of arenas is perhaps the most important option that currently cannot be tuned at runtime. We have discussed making the number of arenas grow dynamically as needed; however, each arena comes with memory overhead, so at some point you would also want the number of arenas to shrink again when the application no longer needs them. Another option that has been discussed internally is to allow a single dedicated arena per thread, useful in cases where memory usage is not important at all and scaling is everything.

 

Options such as these will probably be worked into the "MemoryManagerOptimizationStrategy" enumeration. At the moment there isn't much difference between the various options, but that will change as features get tweaked and new ones are added.

 

Thank you for the feedback and suggestions.

17 hours ago, Pierre le Riche said:

I've spent a lot of time looking at the compiled code disassembly, reorganising the Pascal code, and making use of inlining to nudge the compiler towards producing better output. There's no built-in assembler support for some of the platforms, so with the goal of eventually supporting more platforms than Windows it is worth the time spent.

FWIW I have done similar, but in many cases just looking at the disassembly does not tell the entire story. Running the code and profiling it (which can be a very tedious process given the many different use cases, etc.) will often tell a different story, and modern hardware architecture just shows you the middle finger.

Especially for routines that get inlined in many different places: although inlining might produce slightly better disassembly, it might be no worse, or even better in some cases, to not inline them if they can be written jump-free, because the code will be smaller and the likelihood of staying in the instruction cache will be higher.

 

But again, we are talking about micro-optimization here that requires a lot of profiling and looking at some CPU metrics.

2 minutes ago, Stefan Glienke said:

FWIW I have done similar, but in many cases just looking at the disassembly does not tell the entire story. Running the code and profiling it (which can be a very tedious process given the many different use cases, etc.) will often tell a different story, and modern hardware architecture just shows you the middle finger.

It's all about keeping dependency chains short and avoiding unpredictable jumps. Modern CPUs are pretty good at dealing with any cruft in-between.

 

Sometimes the profiler will tell you something you've missed, but if you spend enough time tuning code you develop a feel for what works and what doesn't.

 

An example of an unavoidable dependency chain hurting performance is in FreeMem: On average more than a quarter of the CPU time in a FreeMem call goes to the very first line - reading the block header. FastMM needs to know what kind of block it is (if it even is a valid block) before it can do anything. Due to alignment requirements the header is often in a prior cache line, and if the app has been busy manipulating data then it is most likely no longer cached. Going all the way to main memory is expensive and it holds up everything else.
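The layout Pierre describes, a header stored just before the user pointer that FreeMem must read first, can be illustrated with a minimal C sketch (hypothetical helpers and field names, not the real FastMM API):

```c
#include <stdlib.h>
#include <stdint.h>

/* Block header stored immediately BEFORE the user pointer, as FastMM
   does.  FreeMem's first act is reading the header at p[-1] to learn
   the block kind; if that header sits in an evicted cache line, the
   read stalls on a trip to main memory. */
typedef struct {
    uint16_t status_flags;        /* small / medium / large, free bit, ... */
    uint16_t size_multiple;       /* size in 16-byte alignment units */
} BlockHeader;

static void *blk_alloc(size_t size)
{
    BlockHeader *h = malloc(sizeof(BlockHeader) + size);
    if (h == NULL)
        return NULL;
    h->status_flags = 1;                      /* mark "in use" */
    h->size_multiple = (uint16_t)(size / 16);
    return h + 1;                             /* user sees memory after header */
}

static void blk_free(void *p)
{
    BlockHeader *h = (BlockHeader *)p - 1;    /* the header read that stalls */
    h->status_flags = 0;                      /* mark free */
    free(h);
}
```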

Guest
29 minutes ago, Pierre le Riche said:

An example of an unavoidable dependency chain hurting performance is in FreeMem: On average more than a quarter of the CPU time in a FreeMem call goes to the very first line - reading the block header. FastMM needs to know what kind of block it is (if it even is a valid block) before it can do anything. Due to alignment requirements the header is often in a prior cache line, and if the app has been busy manipulating data then it is most likely no longer cached. Going all the way to main memory is expensive and it holds up everything else.

And for exactly that I suggested a shadow list. In this case, due to the alignment, you can find the beginning of such a list from the address (APointer) itself, then check whether it belongs to your arena (if such a check is needed at all), then use the address relative to the arena/VirtualAlloc block address to get an index into a fixed list holding the block information. This can be just a few instructions (I am sure you can make it 3 or fewer), and then you have ditched this cache miss. I am aware of the impact on memory usage, but I think this will still be cheaper than a cache miss.
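The address arithmetic described above relies on each arena chunk being allocated at an aligned address, so the chunk base and the block index fall out of a mask and a shift, with no header read at all (a sketch with hypothetical constants):

```c
#include <stdint.h>
#include <stddef.h>

/* If every arena chunk is allocated at a CHUNK_SIZE-aligned address
   (e.g. via VirtualAlloc), the chunk base and the block index can be
   recovered from the pointer alone, avoiding the header cache miss. */
#define CHUNK_SIZE  (1u << 20)   /* 1 MiB, power of two */
#define BLOCK_SHIFT 6            /* 64-byte blocks in this sketch */

static uintptr_t chunk_base(uintptr_t p)
{
    return p & ~((uintptr_t)CHUNK_SIZE - 1);      /* clear the low bits */
}

static size_t block_index(uintptr_t p)
{
    return (p & (CHUNK_SIZE - 1)) >> BLOCK_SHIFT; /* offset / block size */
}
```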

 

If I ever need to design a memory manager, I will use relative addresses to build static and fixed tables, and even try binary trees to access the assigned size. Each block and every address can then find its extra information in a few instructions and two very small, short lookup operations, if a lookup is needed at all.


I am very happy that FastMM5 is now available, and also with the new licensing scheme. We plan to use it in our heavily threaded 64-bit HTTP/socket servers. While I wouldn't have used it at all if it were GPL-only, having a commercial offering at a reasonable price is actually preferable, because I like the idea of a properly maintained memory manager for our commercial products. There are plenty of free memory managers around for Delphi that are barely maintained.

  • Like 7

