Leif Uneus

Experience/opinions on FastMM5

I tried it in my project: a single-threaded project with high memory consumption. I tried all the optimization options and saw no improvement over FastMM4.

FastMM was a huge improvement over Delphi's memory manager in D2006, for my project, so I'm happy @Pierre le Riche made it available!

 

 

3 minutes ago, Mike Torrettinni said:

I tried it in my project: a single-threaded project with high memory consumption. I tried all the optimization options and saw no improvement over FastMM4.

Exactly what we expect. For single-threaded applications FastMM5 will have no big impact, since FastMM4 was already highly optimized. The big differences really start to show with heavily multi-threaded apps on machines with many CPUs, because the memory manager blocks or serializes the worker threads much less. E.g. using TParallel.For should already show that difference. But we will soon provide some samples to demonstrate it.
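Until those samples are available, a rough way to see the effect is a small allocation stress loop run through TParallel.For. The following is only a minimal sketch, not one of the announced samples: the project name, iteration counts and block sizes are arbitrary assumptions, and a tight GetMem/FreeMem loop is far harsher on the memory manager than a typical application.

```delphi
program MMStressSketch; // hypothetical project name

{$APPTYPE CONSOLE}

uses
  FastMM5, // swap for FastMM4 (or remove) and rebuild to compare
  System.SysUtils,
  System.Diagnostics,
  System.Threading;

const
  OuterIterations = 100000; // assumed value, tune for your machine
  InnerAllocations = 100;   // assumed value

var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  // Each parallel iteration allocates and frees a series of small blocks,
  // so all worker threads hit the memory manager at the same time.
  TParallel.For(0, OuterIterations - 1,
    procedure(I: Integer)
    var
      P: Pointer;
      J: Integer;
    begin
      for J := 1 to InnerAllocations do
      begin
        GetMem(P, 16 + (I + J) mod 1024);
        FreeMem(P);
      end;
    end);
  SW.Stop;
  Writeln(Format('Elapsed: %d ms', [SW.ElapsedMilliseconds]));
end.
```

Running the same build with each memory manager on a multi-core machine gives a first impression; it does not replace a test with the real application.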

Posted (edited)
1 hour ago, Günther Schoch said:

Exactly what we expect. For single-threaded applications FastMM5 will have no big impact, since FastMM4 was already highly optimized. The big differences really start to show with heavily multi-threaded apps on machines with many CPUs, because the memory manager blocks or serializes the worker threads much less. E.g. using TParallel.For should already show that difference. But we will soon provide some samples to demonstrate it.

I would say I'm a Delphi rookie, so talk of multi-threaded apps on machines with many CPUs, serialized worker threads and TParallel is like sci-fi to me. But if the demos are going to be available for single-CPU machines, and the results can be understood without in-depth analysis, I would be happy to run a few and be impressed.

Edited by Mike Torrettinni
level upgrade, novice -> rookie :)

2 hours ago, Kas Ob. said:

is this fair for the one who lend you that hammer ?

That's not really the point. The developers of any code own it. They get to choose how they licence it. They have many options. Basic decency says we respect their choices.

 

I'm confident that every person who is critical of somebody else's choice of licence would flat out reject it if anybody told them how to licence their own software.

13 hours ago, Darian Miller said:

If you distribute applications that include some GPL code, then all the code of your application must be made publicly available.

That's a common misconception: you have to make the source code available to everybody you give the binary to. So, if it's a commercial application, that means you must give your customers the source code of your application as well as the source code of every library you used. On top of that, you cannot restrict how they use that source code, as long as they adhere to the GPL. But you don't need to make the source code publicly available.

Posted (edited)

I hope that, besides the surely never-ending discussion about the licensing, you find some time to have a look at the first sample I added to compare FastMM5 with FastMM4.

The attached PDF will give you more information on the background and motivation.

 

Edited by Günther Schoch

Well, this is a really fascinating technical topic. You can, if you feel you have to, discuss the GPL license in the non-technical area. But please let us stay technical here.

16 minutes ago, dummzeuch said:

That's a common misconception: you have to make the source code available to everybody you give the binary to. So, if it's a commercial application, that means you must give your customers the source code of your application as well as the source code of every library you used. On top of that, you cannot restrict how they use that source code, as long as they adhere to the GPL. But you don't need to make the source code publicly available.

"Everybody you give the binary"... to all your customers is indeed the intent.  And that is considered 'public' to many commercial software vendors.  If you only have 10 customers, then perhaps not.

 

But there are also no restrictions placed on those that receive your code. They can put your code on GitHub and make it freely available to everyone in the world. So if you use GPL code in your commercial app, your code can be deemed to be 'publicly available' by nearly any standard.

3 minutes ago, Darian Miller said:

But there are also no restrictions placed on those that receive your code. 

Wrong. They are restricted by the GPL.


Thanks for clarification.

But NOW, NOW please let us stay on the technical side.

1 minute ago, David Heffernan said:

Wrong. They are restricted by the GPL.

Technically, yes.  If they want to put your code on GitHub for free, they can.   That was the obvious overall intent of that paragraph.


My comment was about the licensing terms, not the product or developer's choice. 

 

I expect that people who see no problem with this kind of licensing scheme will never again complain about taxes or having their rights taken away by so-called "government overreach", since those are really the same as a GPL license you agree to by supporting your government officials. And y'all will defend the government's right to seize your stuff because, well, you agreed to the terms when you voted for the shysters in the first place.

 

 

3 hours ago, David Schwartz said:

My comment was about the licensing terms, not the product or developer's choice. 

Use a different memory manager then. You don't have rights to other people's work. Do I have rights to your work? What on earth are you smoking? 


So, our multithreaded TCP/HTTP event driven services do a lot of string manipulation, copying to/from buffers, converting objects to/from json / xml, etc.

Can I expect FastMM5 to increase the throughput?

11 minutes ago, Lars Fosdal said:

Can I expect FastMM5 to increase the throughput?

Compared with FastMM4 or the standard Delphi memory manager it should show a better result. As for comparisons with other memory managers, I don't want to restart those discussions.

Easiest: Replace the line FastMM4 with FastMM5 and run some load tests.
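In practice the swap is a one-line change in the project file. The sketch below assumes a console project with a hypothetical name; the only point it makes is that the FastMM5 unit takes the place of FastMM4 as the first entry of the uses clause in the .dpr, so that it is installed before anything else allocates memory.

```delphi
program LoadTestSketch; // hypothetical project name

{$APPTYPE CONSOLE}

uses
  FastMM5, // previously FastMM4; keep it as the very first unit in the .dpr
  System.SysUtils;

begin
  // Application code is unchanged; only the memory manager unit was swapped.
  Writeln('Running with FastMM5');
end.
```

Everything after the uses clause stays as it was, which makes it easy to build both variants and run the same load test against each.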

2 hours ago, Lars Fosdal said:

So, our multithreaded TCP/HTTP event driven services do a lot of string manipulation, copying to/from buffers, converting objects to/from json / xml, etc.

Can I expect FastMM5 to increase the throughput?

Another way to improve performance is to design the code to minimise heap allocations. The best way to optimise a block of code is to not execute it at all.
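One common way to cut heap allocations in a long-lived worker is to keep a buffer per worker and only grow it when a request needs more room. The following is a minimal sketch under that assumption; TWorkerBuffer, EnsureSize and HandleRequest are hypothetical names, and the FillChar call merely stands in for real request handling.

```delphi
program ReuseBufferSketch; // hypothetical project name

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical per-worker scratch buffer that is reused across requests.
  TWorkerBuffer = record
    Data: TBytes;
    procedure EnsureSize(ASize: Integer);
  end;

procedure TWorkerBuffer.EnsureSize(ASize: Integer);
begin
  // Reallocate only when the current buffer is too small.
  if Length(Data) < ASize then
    SetLength(Data, ASize);
end;

procedure HandleRequest(var Buf: TWorkerBuffer; PayloadSize: Integer);
begin
  Buf.EnsureSize(PayloadSize);
  FillChar(Buf.Data[0], PayloadSize, 0); // stand-in for the real work
end;

var
  Buf: TWorkerBuffer;
  I: Integer;
begin
  // 1000 simulated requests of varying size share one buffer,
  // so the heap is touched only when the buffer has to grow.
  for I := 1 to 1000 do
    HandleRequest(Buf, 512 + (I mod 8) * 64);
  Writeln('Buffer capacity after all requests: ', Length(Buf.Data));
end.
```

This does not make allocations static, it only stops paying for a new block on every request; whether that matters depends on how much of the run time the memory manager accounts for.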

1 hour ago, Günther Schoch said:

Compared with FastMM4 or the standard Delphi memory manager it should show a better result. As for comparisons with other memory managers, I don't want to restart those discussions.

Easiest: Replace the line FastMM4 with FastMM5 and run some load tests.

Proper load tests are challenging since they require data that makes sense on many levels. I would probably have to try it in a live situation.

5 minutes ago, David Heffernan said:

Another way to improve performance is to design the code to minimise heap allocations. The best way to optimise a block of code is to not execute it at all.

We do use worker threads that live across TCP/HTTP "sessions".  Since the data are so dynamic and variable, it is next to impossible to go fully static on allocations.

Posted (edited)

hello @Pierre le Riche 

 

Thank you for this great piece of code (FastMM5). I have a suggestion to make it quicker.

In my TBB wrapper I replaced FillChar (which is very slow under 64-bit Delphi) with a SIMD version (Intel IPP, AVX-512, etc.).

Further, you are pre-allocating chunks of virtual memory. Perhaps you could keep a quick hash- or binary-tree-based cache of blocks that are already zero-filled, maybe maintained by a background thread running at minimal priority.

Then, when the memory manager serves an allocation, FillChar is not needed because the block is already filled with zeroes.

IMHO this will boost performance in a multi-threaded stress test!

I don't mind the virtually allocated RAM being bigger; the Windows kernel only utilizes what is "really used" (hard for me to explain :-)).

Further, as far as I have read about those new allocators, they pre-allocate RAM in a TLS cache and dispatch it to a thread pool (of course with a big RAM allocation (virtual, so who cares?), but this avoids races and global locking).

(Please excuse me if this information is useless.)

kind regards

Roberto

Edited by RDP1974

On 5/5/2020 at 2:48 PM, RDP1974 said:

In my TBB wrapper I replaced FillChar (which is very slow under 64-bit Delphi) with a SIMD version (Intel IPP, AVX-512, etc.).

This is the kind of thing that really should be in the RTL. It makes little sense for every library that needs a fast FillChar to have its own. Sure the code might be small, but the CPU micro-op cache is small too.

 

There is some justification for having custom Move routines in the memory manager, since there are some assumptions that it can make that a general purpose Move cannot, e.g. buffers are always non-overlapping, always aligned, always a multiple of a certain power of two, etc.
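To illustrate why such assumptions help, here is a deliberately simple sketch of a copy routine that is only valid under them. It is not FastMM's actual Move code; MoveWords64 is a hypothetical name, and the routine assumes non-overlapping, 8-byte aligned buffers whose size is a multiple of 8 bytes.

```delphi
program AlignedMoveSketch; // hypothetical project name

{$APPTYPE CONSOLE}

// Hypothetical routine: copies Count 8-byte words from Source to Dest.
// Because the buffers are assumed not to overlap, to be 8-byte aligned and
// to be a multiple of 8 bytes long, no direction check, no alignment
// prologue and no byte-wise tail handling are needed.
procedure MoveWords64(Source, Dest: PUInt64; Count: NativeInt);
begin
  while Count > 0 do
  begin
    Dest^ := Source^;
    Inc(Source);
    Inc(Dest);
    Dec(Count);
  end;
end;

var
  A, B: array[0..3] of UInt64;
  I: Integer;
begin
  for I := 0 to 3 do
    A[I] := UInt64(I) * 10;
  MoveWords64(@A[0], @B[0], Length(A));
  Writeln(B[3]); // prints 30
end.
```

A general-purpose Move has to handle overlap, odd sizes and unaligned pointers, which is exactly the work a memory manager can skip for its own blocks.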

 

At the moment FastMM just calls FillChar in system.pas for zeroing blocks - except for large blocks obtained directly from the OS (those are guaranteed to be zero already). Apart from some assumptions about alignment there's not much room for optimizations that cannot be done in FillChar.

 

On 5/5/2020 at 2:48 PM, RDP1974 said:

of course with a big RAM allocation (virtual, so who cares?)

It's still an issue under 32-bit, where you're limited to a 4GB address space.

 

While on the topic: Have you run benchmarks on a real-world application to see what difference a faster Move and/or FillChar makes to application throughput? The reason I ask is because in the real-world applications I have tested so far the total time spent in FillChar and Move is typically in the region of 5%, so if you could somehow double the speed (which I doubt you can, given that memory bandwidth is a bottleneck) the best improvement you would see is 2.5%. If there are applications out there that will benefit greatly from a faster FillChar then it is something I would want to pursue further.
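The arithmetic behind that estimate can be written out as follows. This is simply Amdahl's law applied to the figures above; the symbols $T$ (total run time), $f$ (fraction of time spent in FillChar and Move) and $s$ (speedup of those routines) are introduced here for illustration.

```latex
T_{\text{new}} = (1 - f)\,T + \frac{f}{s}\,T,
\qquad f = 0.05,\quad s = 2
\quad\Longrightarrow\quad
T_{\text{new}} = 0.95\,T + 0.025\,T = 0.975\,T
```

So doubling the speed of FillChar and Move saves 2.5% of the run time, and even an infinitely fast version ($s \to \infty$) could save at most 5%.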

Posted (edited)

Hi,

 

https://github.com/RDP1974/Delphi64

 

Look, there I have patched "key" RTL functions with SIMD-enhanced versions from the Intel libraries:

https://github.com/RDP1974/Delphi64/blob/master/RDPSimd64.pas (move, fillchar, pos)

 

So I made a TBB allocator wrapper, a SIMD RTL patch, and an Intel Zlib version for HTTP deflate (5x faster than gzip).

Results are outstanding, tested by "famous" company coders:

A test with Indy, the built-in TCP Delphi library, on I7 cpu, shows an enhancement from 6934.29 ops/sec to 23097.68 ops/sec

Another test with WebBroker http compression, on I7 cpu, shows an enhancement from 147 pages/sec to 722 pages/sec

Another test with DMVC web api, on I9 cpu and windows 2016, simulating with apachebench 10000 requests and 100 users, shows an enhancement from 111 reqs/sec to 6448 reqs/sec

Another test, an ISAPI on I9 cpu and windows 2016, doing in sequence DB query -> dataset of 1500 lines x 10 rows -> serialize to json string -> shrink it with deflate, delivers 2000 http reqs/sec, correctly loading all the cpu cores

 

As far as I have read the TBB code, it seems that the speed is obtained using per-thread TLS (threadvar): when an app thread asks for memory, the allocator provides an already prepared zone that acts as a cache (I'm not sure of this).

 

If you wish, feel free to test my lib and see if the behavior can be reproduced. As far as I have seen, a fast Move, FillChar and Pos (used in a lot of classes) plus a lock-free allocator (without branch jumps etc.) should be enough to get a Win64 speedup.

(Anyway, I agree with you: we should run real-world benchmarks.)

 

Thank you.

Edited by RDP1974

58 minutes ago, RDP1974 said:

As far as I have seen, a fast Move, FillChar and Pos (used in a lot of classes) plus a lock-free allocator (without branch jumps etc.) should be enough to get a Win64 speedup.

 

I still don't know why Embarcadero does not implement the FastCode purepascal Pos for win64. 

 

https://quality.embarcadero.com/browse/RSP-13687

 

In the example given, the fastcode win64 version is 8 times faster than System.Pos.


"They" should move if they want to jump on the bandwagon of parallel computing (IMHO, within 5 years it will be the de facto standard, with dozens or hundreds of CPU cores). It is hard to beat Elixir, Erlang, Go or those functional languages that offer built-in horizontal and vertical scalability (userland scheduler with lightweight fibers, kernel threads, multiprocessing over hardware CPU cores, machine clustering... without modifying a line of code).

🙂

