Experience/opinions on FastMM5

David Heffernan · May 7, 2020

1 hour ago, Leif Uneus said:

I still don't know why Embarcadero does not implement the FastCode purepascal Pos for win64.

https://quality.embarcadero.com/browse/RSP-13687

In the example given, the fastcode win64 version is 8 times faster than System.Pos.

8 times faster sounds amazing, if the only thing you ever do is call Pos on data that is already in the cache. I'll bet that for a lot of real world applications you wouldn't see any benefit.

pyscripter · May 7, 2020

2 hours ago, RDP1974 said:

"They" should move if they want to jump on the bandwagon of parallel computing (IMHO, within 5 years dozens or hundreds of CPU cores will be the de facto standard) -> hard to beat Elixir, Erlang, Go or those functional programming languages that offer built-in horizontal and vertical scalability (userland scheduler with lightweight fibers, kernel threads, multiprocessing over CPU hardware cores, machine clustering... without modifying a line of code)

🙂

Funnily enough, some of the most popular languages today (Python, JavaScript, R and Ruby) are single-threaded, and you have to go out of your way to use more than one core.

Edited May 7, 2020 by pyscripter

Guest · May 7, 2020

4 hours ago, Leif Uneus said:

I still don't know why Embarcadero does not implement the FastCode purepascal Pos for win64.

PurePascal or assembly, whatever; the point is that they are the ones responsible for the RTL, and it should be the fastest, especially when that can be done in hours, once and for all. When was the last time the Embarcadero team (sorry, the research and development team) made the right call?

What was the highlight of 2019? Android 64!

Was Android 64 an unanticipated move from Google, or had it been expected for at least 6 years?

3 hours ago, RDP1974 said:

"They" should move if they want to jump on the bandwagon of parallel computing (IMHO, within 5 years dozens or hundreds of CPU cores will be the de facto standard) -> hard to beat Elixir, Erlang, Go or those functional programming languages that offer built-in horizontal and vertical scalability (userland scheduler with lightweight fibers, kernel threads, multiprocessing over CPU hardware cores, machine clustering... without modifying a line of code)

It is already here now, not in 5 years, and it has been for years. Have you read about the most fascinating computer language, Haskell? If you are interested, start here: https://stackoverflow.com/questions/35027952/why-is-haskell-ghc-so-darn-fast

Kas Ob. (Guest) · May 7, 2020

53 minutes ago, pyscripter said:

Funnily enough, some of the most popular languages today (Python, JavaScript, R and Ruby) are single-threaded, and you have to go out of your way to use more than one core.

Nothing funny here: those languages have real teams building the best compilers and working on their RTL and backend. What was Embarcadero doing? Pouring resources into speeding up LiveBindings!!

Seriously, someone should send them a memo or an email explaining that enhancing the compiler, just like enhancing the RTL, would probably speed up LiveBindings too.

And for the sake of bits and bytes, have they heard about this new technology, invented not long ago by two companies, Intel and AMD, called SSE, SSE2, SSE3, SSE4...? Those instructions have only been out there for 14 years. It might be too risky for Embarcadero to use them, as those companies tend to start a technology and then drop it, even though more than 4 generations of CPUs out there support them, so this might make it onto the to-do list for the next 10 years.

Let's hope Embarcadero adds a new sunset skin to the IDE and finishes this amazing technology called LiveBindings; those will make Delphi great again.

RDP1974 · May 7, 2020

Quote

read about the most fascinating computer language Haskell ?

I'm studying and implementing Elixir/PhoenixWeb/Erlang on FreeBSD/Linux. It's simply incredible! From HTTP MVC with routes/controller/ORM to websocket channels, linear scalability up to millions of sockets per single server with microsecond latency and fault tolerance (it's a VM with a user-level scheduler and signaling)... within a handful of lines

(a benchmark shows 100,000 reqs/sec from an MVC/PostgreSQL ORM JSON render on a single server; furthermore, you can change the code inside the VM while it is running, so you can update pieces of the running app without shutting it down)

https://www.phoenixframework.org/

https://elixir-lang.org/

Edited May 7, 2020 by RDP1974

Anders Melander · May 7, 2020

1 hour ago, Kas Ob. said:

Nothing funny here, [yada, yada, yada]

Give it a rest, will you?

Edwin Yip · May 8, 2020

And right after this discussion, Mr. Arnaud Bouchez, author of mORMot, just released a new memory manager for FPC (both Windows and Linux) based on FastMM4!

http://blog.synopse.info/post/2020/05/07/New-Multi-thread-Friendly-Memory-Manager-for-FPC-written-in-x86_64-assembly

I kinda feel that the new release of FastMM5 and the consequent discussions stimulated him to take the challenge ;) Is it so, @Arnaud Bouchez ?

Daniel · May 8, 2020

Folks - please stay on topic. Discussions about faster RTL functions or Haskell do not belong here. We are talking about a special memory manager here, so a general rant about EMBT also does not belong here.

Arnaud Bouchez · May 8, 2020

12 hours ago, Edwin Yip said:

I kinda feel that the new release of FastMM5 and the consequent discussions stimulated him to take the challenge ;) Is it so, @Arnaud Bouchez ?

You are right: FastMM5 challenged me... and since no one responded to my offer about helping it run on FPC/Linux, and also since I wanted something Open Source but not so restrictive, I created https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas which is GPL/LGPL and MPL. So you can use it with closed software.

It uses the same core algorithms as FastMM4. I like it so much, and missed it so much in FPC... 🙂
I was involved in ScaleMM2, and a per-thread arena for small blocks didn't convince me: it tends to consume too much RAM when you have a lot of threads in your process. Note that a threadvar is what the FPC standard MM uses.
I wanted to take the best of FastMM4 (which is very proven, stable and efficient), but drive it a little further in terms of multi-threading and code quality.
FastMM4's asm is 32-bit oriented; its x86_64 version was sometimes not very optimized for this target: just see its abuse of globals, its lack of awareness of micro-op fusion, CPU cache lines and locks, and its sparse use of registers.
Also focusing on a single compiler and a single CPU, with not all the features of FastMM4 in pascal mode, helped fpcx64mm appear in two days only.
Last but not least, I spent a lot of time this last year in x86_64 assembly, so I know which patterns are expected to be faster.

The huge regression test suite of mORMot provides a proven benchmark - much more aggressive and realistic than the microbenchmarks (like string concatenation in threads, or even the FastCode benchmark) on which most other MMs rely for measurement.
When the regression tests run more than twice as fast as with the FPC standard MM on Linux - as @ttomas reported - then we are talking. The suite runs a lot of different scenarios, with more than 43,000,000 individual tests, and several kinds of HTTP/TCP servers on the loopback, running in-memory or SQLite databases, processing JSON everywhere, with multiple client threads stressing it. When I run the tests on my Linux machine, I see only a few (less than a dozen) Linux nanosleep system calls (better than Windows sleep), and less than 2 ms of waiting during 1 minute of heavy tests - and only for Freemem.
I really don't like the microbenchmarks used for testing MM. Like the one published in this forum. For instance IntelTBB is very fast for such benchmarks, but it doesn't release its memory as it should, and it is unusable in practice.

I guess that some user code, not written with performance in mind, and e.g. abusing str := str+'something' patterns, would also run more than twice as fast.
And if your code has to reallocate huge buffers (>256KB) in a loop, using mremap on Linux may give a huge performance boost since no data would be copied at all - Linux mremap() is much better than what Windows or BSD offer! Yes, huge memory blocks are resized by the Linux kernel by reassigning its page mapping (TLB redirection) tables, without copying any memory. No need to use AVX512 if you don't copy anything! And plain SSE2 (with non-temporal mov for big buffers) is good enough to saturate the HW memory bandwidth - and faster than ERMS in practice.
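
To make the mremap point concrete, here is a minimal C sketch (Linux only) of resizing a large block through the kernel instead of doing allocate-copy-free. The helper names large_alloc, large_realloc and large_free are hypothetical and are not taken from fpcx64mm; only mmap, mremap and munmap are real system calls. With MREMAP_MAYMOVE the kernel may move the mapping to a new virtual address, but it does so by updating page mappings rather than copying the data.

```c
#define _GNU_SOURCE            /* exposes mremap() and MREMAP_MAYMOVE */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: get a large anonymous block straight from the kernel. */
static void *large_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: grow or shrink a block obtained from large_alloc().
   Returns the (possibly different) address, or NULL on failure, in which
   case the original mapping is left untouched. */
static void *large_realloc(void *block, size_t old_size, size_t new_size)
{
    assert(block != NULL && old_size > 0 && new_size > 0);
    void *p = mremap(block, old_size, new_size, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: return the block to the kernel. */
static void large_free(void *block, size_t size)
{
    munmap(block, size);
}
```

A caller would replace its pointer with the returned one, as with realloc, and must keep track of the current size itself since munmap and mremap both need it.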

IMHO there was no need to change the data structures like FastMM5 did - I just tuned/fixed most of its predecessor FastMM4's asm, reserved some additional slots for the smaller blocks (<=80 bytes are now triplets), implemented safe and efficient spinning, implemented some internal instrumentation to catch multi-threading bottlenecks, and then Getmem didn't suffer from contention any more!
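
As a rough illustration of spinning with a sleep fallback and a contention counter, here is a C11 sketch. It is not the fpcx64mm implementation: the type spin_lock_t, the SPIN_LIMIT value and the 10 microsecond pause are assumptions chosen only for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SPIN_LIMIT 1000           /* hypothetical tuning value */

typedef struct {
    atomic_bool locked;
    atomic_uint sleep_count;      /* instrumentation: times spinning was not enough */
} spin_lock_t;

static void spin_lock_init(spin_lock_t *l)
{
    assert(l != NULL);
    atomic_init(&l->locked, false);
    atomic_init(&l->sleep_count, 0u);
}

/* Read first, and only attempt the atomic exchange when the lock looks free,
   so waiting threads mostly read the cache line instead of writing to it. */
static bool spin_lock_try(spin_lock_t *l)
{
    return !atomic_load_explicit(&l->locked, memory_order_relaxed) &&
           !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void spin_lock_acquire(spin_lock_t *l)
{
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (spin_lock_try(l))
                return;
        /* Spinning failed: record it, then give up the CPU for a short while. */
        atomic_fetch_add_explicit(&l->sleep_count, 1u, memory_order_relaxed);
        struct timespec pause = { .tv_sec = 0, .tv_nsec = 10000 };
        nanosleep(&pause, NULL);
    }
}

static void spin_lock_release(spin_lock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Reading sleep_count after a stress run gives a crude measure of how often threads had to fall back to sleeping, which is the kind of number that shows whether a lock is a bottleneck.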

I knew that FastMM4 plus some tweaks could be faster than anything else - perhaps even FastMM5.

Edited May 8, 2020 by Arnaud Bouchez

RDP1974 · May 8, 2020

Your talent is fantastic, and so is your code, but let me say a word about "TBB unusable": it's the default optimization option across the whole Visual Studio C compiler and in the main game engines... TBB and IPP are also used in Oracle Database, Adobe, Autodesk...

Edited May 9, 2020 by RDP1974

Edwin Yip · May 8, 2020

@Arnaud Bouchez, you are so damn fast, man! A memory manager in less than 3 days? Unbelievable!

A new MM (fastMM5) for Delphi and a new MM for FPC. Wow! The Pascal community is getting better and better :)

Kas Ob. (Guest) · May 8, 2020

@Arnaud Bouchez That is nice.

I just skimmed the assembly, and if I may, here are 2 suggestions:

1) The part between cmp and the conditional instruction (jumps and CMOVcc)

  lea rdi, [r10 + TMediumBlockInfo.Bins + rsi * 2]
  {Get the free block in rsi}
  mov rsi, TMediumFreeBlock[rdi].NextFreeBlock
  {Remove the first block from the linked list (LIFO)}
  mov rdx, TMediumFreeBlock[rsi].NextFreeBlock
  mov TMediumFreeBlock[rdi].NextFreeBlock, rdx
  mov TMediumFreeBlock[rdx].PreviousFreeBlock, rdi
  {Is this bin now empty?}
  cmp rdi, rdx
  jne @MediumBinNotEmpty

it can be like this

  lea rdi, [r10 + TMediumBlockInfo.Bins + rsi * 2]
  {Get the free block in rsi}
  cmp rdi, rdx	//
  mov rsi, TMediumFreeBlock[rdi].NextFreeBlock
  {Remove the first block from the linked list (LIFO)}
  mov rdx, TMediumFreeBlock[rsi].NextFreeBlock
  mov TMediumFreeBlock[rdi].NextFreeBlock, rdx
  mov TMediumFreeBlock[rdx].PreviousFreeBlock, rdi
  {Is this bin now empty?}
  //cmp rdi, rdx
  jne @MediumBinNotEmpty

2) Use CMOVcc to get rid of a jump

@NoSuitableMediumBlocks:
  {Check the sequential feed medium block pool for space}
  movzx ecx, [rbx].TSmallBlockType.MinimumBlockPoolSize
  mov edi, [r10 + TMediumBlockInfo.SequentialFeedBytesLeft]
  cmp edi, ecx
  jb @AllocateNewSequentialFeed
  {Get the address of the last block that was fed}
  mov rsi, [r10 + TMediumBlockInfo.LastSequentiallyFed]
  {Enough sequential feed space: Will the remainder be usable?}
  movzx ecx, [rbx].TSmallBlockType.OptimalBlockPoolSize
  lea rdx, [rcx + MinimumMediumBlockSize]
  cmp edi, edx
  jb @NotMuchSpace
  mov edi, ecx
@NotMuchSpace:

it could be like this

@NoSuitableMediumBlocks:
  {Check the sequential feed medium block pool for space}
  movzx ecx, [rbx].TSmallBlockType.MinimumBlockPoolSize
  mov edi, [r10 + TMediumBlockInfo.SequentialFeedBytesLeft]
  cmp edi, ecx
  jb @AllocateNewSequentialFeed
  cmp edi, edx  //
  {Get the address of the last block that was fed}
  mov rsi, [r10 + TMediumBlockInfo.LastSequentiallyFed]
  {Enough sequential feed space: Will the remainder be usable?}
  movzx ecx, [rbx].TSmallBlockType.OptimalBlockPoolSize
  lea rdx, [rcx + MinimumMediumBlockSize]
  //cmp edi, edx
  cmovb edi,ecx
  //jb @NotMuchSpace
  //mov edi, ecx
@NotMuchSpace:

Such nano-optimizations have a small impact but are still worthwhile; they might give out-of-order execution a slight speed boost.
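
For readers less used to asm, the second suggestion can be sketched in C. This is only an illustration with hypothetical function and parameter names, not code from fpcx64mm. Since jb jumps over the mov, the move happens when the comparison is above-or-equal, so a conditional move replacing the jb/mov pair has to use that same above-or-equal condition. Whether a compiler actually emits a conditional move for the second form depends on the compiler and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form, mirroring: cmp edi, edx / jb @NotMuchSpace / mov edi, ecx */
static uint32_t feed_size_branch(uint32_t bytes_left, uint32_t optimal,
                                 uint32_t min_medium)
{
    assert(optimal <= UINT32_MAX - min_medium);   /* no wrap-around in the sum */
    uint32_t threshold = optimal + min_medium;    /* lea rdx, [rcx + Minimum...] */
    if (bytes_left >= threshold)                  /* jb skips the mov otherwise  */
        bytes_left = optimal;                     /* mov edi, ecx                */
    return bytes_left;
}

/* Select form: same result, written so that it can compile to a compare
   followed by a conditional move instead of a branch. */
static uint32_t feed_size_select(uint32_t bytes_left, uint32_t optimal,
                                 uint32_t min_medium)
{
    assert(optimal <= UINT32_MAX - min_medium);
    uint32_t threshold = optimal + min_medium;
    return bytes_left >= threshold ? optimal : bytes_left;
}
```

Both functions return the optimal size when enough space is left and the remaining byte count otherwise, so they can be swapped freely and compared in a benchmark.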

Lastly, I really wish you would change it to support Delphi on Windows. If the delay is a problem, then let me bring to your attention the undocumented API NtDelayExecution https://undocumented.ntinternals.net/index.html?page=UserMode%2FUndocumented Functions%2FNT Objects%2FThread%2FNtDelayExecution.html

This function is really great. I'm not saying you should use it, but you can for testing: I use it to hold a bunch of threads and release them at the same absolute time to induce contention, and for relative times the API gets very close to a 100ns delay.

Arnaud Bouchez · May 8, 2020

@Kas Ob.

1) this modified code is not the same as the initial, because rdx is modified in between.

And the current code is better since the CPU will fuse the adjacent cmp + jmp into a single micro-op.

2) It is correct. I will use cmovb here.

Thanks!

3) I would never use an undocumented Windows function in production code.

There is almost no sleep() call in my tests thanks to good spinning.

So it won't make any difference in practice.
And we focus on Linux, not Windows, for our servers - where nanosleep is available.

Talking about 100ns resolution is IMHO unrealistic: I suspect there is a context switch; otherwise, longer spinning or calling ThreadSwitch may be just good enough.

Edited May 8, 2020 by Arnaud Bouchez

Kas Ob. (Guest) · May 8, 2020

1 hour ago, Arnaud Bouchez said:

1) this modified code is not the same as the initial, because rdx is modified in between.

And the current code is better since the CPU will fuse the adjacent cmp + jmp into a single micro-op.

Yes, that is a mistake, but microfusion is overrated in some cases. I would prefer to test them: the behaviour can vary between CPU generations, and here is the twist: they will have a better chance of being microfused if the jump address is aligned, which is not guaranteed in this case.

Edited May 8, 2020 by Guest

Arnaud Bouchez · May 9, 2020

I don't think alignment is involved in triggering microfusion or not.
Alignment is just a way to ensure that the CPU instruction decoder is able to fetch as many opcodes as possible: since the CPU is likely to fetch 16 bytes of opcodes at a time, aligning a jump target to 16 bytes may reduce the number of fetches. It is mostly needed for a loop, and could (much more marginally) be beneficial for regular jumps.

My reference/bible in that matter is https://www.agner.org/optimize/optimizing_assembly.pdf:

Quote

Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context. A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries. In most cases, the effect of code alignment is minimal. My recommendation is to align code only in the most critical cases like critical subroutines and critical innermost loops.

But the only true reference is the clock: as you wrote we need to test/measure, not guess.
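
The fetch-block reasoning in the quote can be checked with a little arithmetic. The C sketch below assumes aligned power-of-two fetch blocks, as the quote describes; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of useful code delivered by the first fetch when execution lands on
   `addr`, assuming the CPU fetches aligned blocks of `block` bytes. */
static unsigned useful_bytes_in_first_fetch(uintptr_t addr, unsigned block)
{
    assert(block != 0 && (block & (block - 1)) == 0);   /* power of two */
    return block - (unsigned)(addr & (block - 1));
}

/* Padding bytes needed to move a label at `addr` up to the next multiple of
   `align`; this is the cache space lost to alignment. */
static unsigned padding_to_align(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);   /* power of two */
    return (unsigned)(-addr & (align - 1));
}
```

For example, a label at address 0x100D leaves only 3 useful bytes in its 16-byte block, and 3 bytes of padding would move it to 0x1010, while a label at 0x1008 is already 8-byte aligned and gets 8 bytes from the first fetch.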

Edited May 9, 2020 by Arnaud Bouchez

Feri · May 12, 2020

Hi All!

I would like to ask if there is an option like in fastmm4 : {$define UseOutputDebugString}

(Set this option to use the Windows API OutputDebugString procedure to output debug strings on startup/shutdown and when errors occur.)

It was useful because, even without FullDebugMode, I still got some information about memory leaks when the program finished.

best regards

feri

Pierre le Riche · May 12, 2020

4 minutes ago, Feri said:

I would like to ask if there is an option like in fastmm4 : {$define UseOutputDebugString}

(Set this option to use the Windows API OutputDebugString procedure to output debug strings on startup/shutdown and when errors occur.)

Hi Feri,

There is the global variable FastMM_OutputDebugStringEvents, which is a set of events for which OutputDebugString will be called. By default only critical events are included, but you can adjust it to fit your needs.

Best regards,

Pierre

Feri · May 12, 2020

1 hour ago, Pierre le Riche said:

Hi Feri,

There is the global variable FastMM_OutputDebugStringEvents, which is a set of events for which OutputDebugString will be called. By default only critical events are included, but you can adjust it to fit your needs.

Best regards,

Pierre

Thank you very much !

best regards feri

Jacek Laskowski · May 13, 2020

@Pierre le Riche

How deep is the call stack in FastMM5 reports? In FastMM4 it was hardcoded to 11, I reported an issue to add a depth setting option, but it was not added.

Primož Gabrijelčič · May 13, 2020

Call stack depth is configurable in FastMM4.

{------------- FullDebugMode/LogLockContention constants---------------}
const
  {The stack trace depth. (Must be an *uneven* number to ensure that the
   Align16Bytes option works in FullDebugMode.)}
  StackTraceDepth = 19;

Pierre le Riche · May 13, 2020

12 minutes ago, Jacek Laskowski said:

How deep is the call stack in FastMM5 reports?

The default values are 19 entries under 32-bit, and 20 entries under 64-bit. (The odd numbers are to ensure that the structure is a multiple of 64 bytes.)

The values are adjustable, but not at runtime.

Jacek Laskowski · May 13, 2020

@Primož Gabrijelčič

I know that, but imho changing the sources is not configuration 🙂
It should be in the file FastMM4Options.inc

Edited May 13, 2020 by Jacek Laskowski

Jacek Laskowski · May 13, 2020

3 minutes ago, Pierre le Riche said:

The default values are 19 entries under 32-bit, and 20 entries under 64-bit. (The odd numbers are to ensure that the structure is a multiple of 64 bytes.)

The values are adjustable, but not at runtime.

Adjustable with source change?

Pierre le Riche · May 13, 2020

10 minutes ago, Jacek Laskowski said:

Adjustable with source change?

Yes, the CFastMM_StackTraceEntryCount constant. If there is a big demand for it I could make it adjustable, but v5 is already approaching double the number of entries of v4 so I reckon it should be sufficient.

Jacek Laskowski · May 13, 2020

8 minutes ago, Pierre le Riche said:

Yes, the CFastMM_StackTraceEntryCount constant. If there is a big demand for it I could make it adjustable, but v5 is already approaching double the number of entries of v4 so I reckon it should be sufficient.

I've got a callstack in FastMM4 set to 25, because that's the value that didn't cut off my log.
