Leif Uneus Posted April 30, 2020
Newly released FastMM5: https://github.com/pleriche/FastMM5
FastMM is a fast replacement memory manager for Embarcadero Delphi applications that scales well across multiple threads and CPU cores, is not prone to memory fragmentation, and supports shared memory without the use of external .DLL files.
Version 5 is a complete rewrite of FastMM. It is designed from the ground up to simultaneously keep the strengths and address the shortcomings of version 4.992:
- Multithreaded scaling across multiple CPU cores is massively improved, without memory usage blowout. It can be configured to scale close to linearly for any number of CPU cores. In the Fastcode memory manager benchmark tool FastMM 5 scores 15% higher than FastMM 4.992 on the single threaded benchmarks, and 30% higher on the multithreaded benchmarks. (I7-8700K CPU, EnableMMX and AssumeMultithreaded options enabled.)
- It is fully configurable at runtime. There is no need to change conditional defines and recompile to change options. (It is however backward compatible with many of the version 4 conditional defines.)
- Debug mode uses the same debug support library as version 4 (FastMM_FullDebugMode.dll) by default, but custom stack trace routines are also supported. Call FastMM_EnterDebugMode to switch to debug mode ("FullDebugMode") and call FastMM_ExitDebugMode to return to performance mode. Calls may be nested, in which case debug mode will be exited after the last FastMM_ExitDebugMode call.
- Supports 8, 16, 32 or 64 byte alignment of all blocks. Call FastMM_EnterMinimumAddressAlignment to request a minimum block alignment, and FastMM_ExitMinimumAddressAlignment to rescind a prior request. Calls may be nested, in which case the coarsest alignment request will be in effect.
- All event notifications (errors, memory leak messages, etc.) may be routed to the debugger (via OutputDebugString), a log file, the screen or any combination of the three. Messages are built using templates containing mail-merge tokens. Templates may be changed at runtime to facilitate different layouts and/or translation into any language. Templates fully support Unicode, and the log file may be configured to be written in UTF-8 or UTF-16 format, with or without a BOM.
- It may be configured at runtime to favour speed, memory usage efficiency or a blend of the two via the FastMM_SetOptimizationStrategy call.
Experience/opinions welcome ...
Edited May 4, 2020 by MrSpock (tag added)
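The nesting rule for alignment requests quoted above ("the coarsest alignment request will be in effect") can be illustrated with a small model. This is only a sketch in Python, not FastMM5 code: the class name, its methods and the default of 8 bytes are assumptions made for the illustration.

```python
from collections import Counter


class AlignmentRequests:
    """Toy model of nested minimum-alignment requests (hypothetical, not FastMM5's code)."""

    ALLOWED = (8, 16, 32, 64)  # the alignments listed in the announcement

    def __init__(self, default=8):
        # Assumption: with no request active, the default alignment applies.
        self.default = default
        self.active = Counter()  # alignment -> number of outstanding requests

    def enter(self, alignment):
        """Register one more request for at least `alignment` bytes."""
        if alignment not in self.ALLOWED:
            raise ValueError(f"unsupported alignment: {alignment}")
        self.active[alignment] += 1

    def exit(self, alignment):
        """Rescind one prior request for `alignment` bytes."""
        if self.active[alignment] == 0:
            raise RuntimeError("exit without a matching enter")
        self.active[alignment] -= 1

    def effective(self):
        """The coarsest (largest) alignment among outstanding requests wins."""
        live = [a for a, count in self.active.items() if count > 0]
        return max(live + [self.default])


requests = AlignmentRequests()
requests.enter(16)
requests.enter(64)
print(requests.effective())  # 64 while both requests are outstanding
requests.exit(64)
print(requests.effective())  # back to 16 once the 64-byte request is rescinded
```

Nested debug-mode calls can be pictured the same way, with a single counter instead of one counter per alignment.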
David Heffernan Posted April 30, 2020
Would be interested to know whether NUMA memory is handled well.
Arnaud Bouchez Posted April 30, 2020
I didn't see any explicit NUMA support in the source code. I guess the idea is to force the CPU affinity of the process, to avoid NUMA latencies.
David Heffernan Posted May 1, 2020
7 hours ago, Arnaud Bouchez said: I guess the idea is to force the CPU affinity of the process, to avoid NUMA latencies.
Fine if you are multiprocessing, but not if multithreading.
Günther Schoch Posted May 1, 2020
15 hours ago, David Heffernan said: Would be interested to know whether NUMA memory is handled well.
Well, during the design phase of FastMM5 this feature was discussed but not (yet) implemented. The background was:
a) a lot of the software is now running on large AWS nodes or similar virtual servers. There the optimization via NUMA is rather a special case
b) modern processors such as the AMD EPYC https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/ have internal optimization strategies
But we are open to everything that makes the FastMM5 performance significantly better.
regards Günther (Günther Schoch, gs-soft AG = we sponsored FastMM5)
David Heffernan Posted May 1, 2020
15 minutes ago, Günther Schoch said: Well, during the design phase of FastMM5 this feature was discussed but not (yet) implemented. The background was: a) a lot of the software is now running on large AWS nodes or similar virtual servers. There the optimization via NUMA is rather a special case b) modern processors such as the AMD EPYC https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/ have internal optimization strategies But we are open to everything that makes the FastMM5 performance significantly better. regards Günther (Günther Schoch, gs-soft AG = we sponsored FastMM5)
My question was based on my own experience running multithreaded floating point software on NUMA machines, an issue that was live for me maybe three years ago. My problem was that most memory managers allocate out of a shared pool, but cross-node memory access is much more expensive than within-node memory access. I didn't find any Delphi memory managers that were both robust and able to allocate memory local to the node on which the calling thread was executing. IIRC, allocators such as that of TBB and others used in the C++ world were able to do this.
My application is a little different to more mainstream Delphi applications however. I understand that fastmm targets usage with frequent allocation of relatively small objects. In my application I preallocate wherever possible and avoid allocation in any hotspot. So my goal could just be boiled down to achieving affinity to the local node.
In the end I wrote my own memory allocator on top of HeapCreate / HeapAlloc etc. The strategy is that each NUMA node has its own private heap (allocated by a call to HeapCreate). Each allocation is performed on the heap associated with the calling thread's node. I'm not in any way suggesting that such a simple strategy would be appropriate for fastmm.
The interesting thing that I observed is that raw heap allocation / deallocation performance was never a problem for my app, because of the efforts we took to avoid allocation in hotspots. Likewise for thread contention, for the same reasons. The issue was that memory access speed in my app is a key performance factor. And cross-node access has dire performance.
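The one-private-heap-per-node strategy described in this post can be modelled in a few lines. The sketch below is Python and purely illustrative: `NodeHeap` stands in for a heap obtained from HeapCreate, `current_node` is a hypothetical callback that reports the calling thread's NUMA node, and the rule of freeing a block on the heap that owns it is an assumption of the sketch, not something stated in the post.

```python
class NodeHeap:
    """Stand-in for one private heap (one HeapCreate call per node in the post)."""

    def __init__(self, node):
        self.node = node
        self.live = {}      # block handle -> size of the live allocation
        self._next_id = 0

    def alloc(self, size):
        # The handle records the owning node so the block can be freed correctly later.
        block = (self.node, self._next_id)
        self._next_id += 1
        self.live[block] = size
        return block

    def free(self, block):
        del self.live[block]


class NodeLocalAllocator:
    """Serve every allocation from the heap of the calling thread's NUMA node."""

    def __init__(self, node_count, current_node):
        self.heaps = [NodeHeap(n) for n in range(node_count)]
        self.current_node = current_node  # hypothetical: returns the caller's node index

    def alloc(self, size):
        return self.heaps[self.current_node()].alloc(size)

    def free(self, block):
        # Assumption: a block always goes back to the heap that owns it,
        # whichever thread happens to release it.
        node, _ = block
        self.heaps[node].free(block)


node = [0]  # pretend the calling thread runs on node 0
allocator = NodeLocalAllocator(2, lambda: node[0])
a = allocator.alloc(64)
node[0] = 1  # pretend a thread on node 1 now allocates
b = allocator.alloc(128)
print(a[0], b[0])  # the owning nodes of the two blocks
```

On a real system the callback would have to ask the operating system which node the current thread is running on; that part is deliberately left out here.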
Günther Schoch Posted May 1, 2020
7 minutes ago, David Heffernan said: ... I'm not in any way suggesting that such a simple strategy would be appropriate for fastmm....
Very interesting case. If you are interested, I would suggest you drop me an email at "guenther.schoch" at gs-soft.com. I think it makes sense for you and Pierre to discuss that topic in a little more depth.
Pierre le Riche Posted May 1, 2020
Hi all, I've added experimental support for NUMA in a branch (numa_support). The idea is to link both the arenas and threads (for which performance matters) to a "NUMA mask". When scanning the arenas for available blocks it will perform a bitwise "and" between the mask for the arena and the mask for the thread, and if the result is non-zero then the arena is allowed to serve blocks to that thread. In this way you can completely separate the memory pools between threads or groups of threads.
I have not tested how well this works in practice (I don't have a NUMA system on hand), but I believe VirtualAlloc is smart enough to provide memory from the NUMA node closest to the CPU the thread is running on.
I have made it so you can specify a mask by block size. Version 4 is susceptible to cache thrashing when adjacent small blocks share the same cache line and are written to by different CPUs. By using this mechanism that can also be avoided.
Pierre
Edited May 1, 2020 by Pierre le Riche
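The mask test described in this post is a plain bitwise "and". A minimal Python sketch of the idea follows; the function names and the example masks are hypothetical and this is not the code in the numa_support branch.

```python
def arena_may_serve(arena_mask, thread_mask):
    """An arena may serve a thread when the two masks share at least one set bit."""
    return (arena_mask & thread_mask) != 0


def eligible_arenas(arena_masks, thread_mask):
    """Indices of the arenas that are allowed to serve blocks to the given thread."""
    return [
        index
        for index, arena_mask in enumerate(arena_masks)
        if arena_may_serve(arena_mask, thread_mask)
    ]


# Hypothetical layout: bit 0 stands for node 0, bit 1 for node 1.
arena_masks = [0b01, 0b01, 0b10, 0b11]
print(eligible_arenas(arena_masks, 0b01))  # arenas usable by a thread tagged for node 0
print(eligible_arenas(arena_masks, 0b10))  # arenas usable by a thread tagged for node 1
```

Giving two groups of threads masks with no bits in common separates their pools completely, while an arena whose mask has several bits set (the last one above) stays shared between groups.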
David Heffernan Posted May 1, 2020
34 minutes ago, Pierre le Riche said: I believe VirtualAlloc is smart enough to provide memory from the NUMA node closest to the CPU the thread is running on
I would expect so, but I don't know for sure. I do know that when using HeapCreate / HeapAlloc this is true, but that's no use to you. I do have a NUMA machine to hand and would be happy to run some tests on it if that would be useful for you.
Pierre le Riche Posted May 1, 2020
47 minutes ago, David Heffernan said: I would expect so, but I don't know for sure. I do know that when using HeapCreate / HeapAlloc this is true, but that's no use to you.
I have not been able to get official verification of this from a Microsoft website, but according to several other sources Windows uses the memory closest to the CPU that causes the page fault wherever possible. So effectively it does not matter which thread allocated the virtual memory; the thread that touches the page first will determine what memory is used to back it. It certainly would make a lot of sense for it to work that way.
47 minutes ago, David Heffernan said: I do have a NUMA machine to hand and would be happy to run some tests on it if that would be useful for you.
If you have a real-world workload that you could throw at it I would really appreciate the feedback. I don't currently have any benchmarks that I think are suitable. Assuming the behaviour described above is correct and Windows backs a page with memory from the node of the CPU that touched the page first, I expect this new feature not to have a material impact on performance with blocks much larger than 4K, but with smaller blocks there should be a measurable difference.
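The 4K threshold mentioned here follows from simple page arithmetic, sketched below in Python. The sketch assumes 4096-byte pages and the first-touch behaviour described in the post; the function names are made up for the illustration.

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as in the post


def blocks_per_page(block_size):
    """How many back-to-back blocks of this size fit in one page (0 if a block exceeds a page)."""
    return PAGE_SIZE // block_size


def exclusive_pages(block_size):
    """Lower bound on the whole pages lying entirely inside one block, whatever its start offset."""
    return max(0, (block_size - (PAGE_SIZE - 1)) // PAGE_SIZE)


for size in (64, 1024, 8192, 1024 * 1024):
    print(size, blocks_per_page(size), exclusive_pages(size))
```

Many small blocks share a single page, so under first touch they all end up on the node of whichever thread touched that page first, even if other threads own some of the blocks. A block much larger than a page consists almost entirely of pages of its own, which the owning thread touches itself, so separate arenas add little in that case.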
David Heffernan Posted May 1, 2020
1 hour ago, Pierre le Riche said: So effectively it does not matter which thread allocated the virtual memory, the thread that touches the page first will determine what memory is used to back it. It certainly would make a lot of sense for it to work that way.
Yeah, that makes a lot of sense.
1 hour ago, Pierre le Riche said: If you have a real-world workload that you could throw at it I would really appreciate the feedback.
I could do this, but I'd probably need a little guidance on how to integrate fastmm5 with my program, making sure it was configured properly, and used the right branch.
DelphiRio Posted May 1, 2020
13 hours ago, Arnaud Bouchez said: I didn't see any explicit NUMA support in the source code. I guess the idea is to force the CPU affinity of the process, to avoid NUMA latencies.
There is a version of FastMM4 that supports NUMA & FPC. Check here: https://github.com/maximmasiutin/FastMM4-AVX
Darian Miller Posted May 1, 2020
I notice the license has changed to $99/dev. "FastMM 5 is dual-licensed. You may choose to use it under the restrictions of the GPL v3 licence at no cost to you, or you may purchase a commercial licence. A commercial licence includes all future updates."
I assume forks like this one will be excluded from FastMM5+ changes: https://github.com/maximmasiutin/FastMM4-AVX
@Pierre le Riche what happens with Delphi and FastMM in Delphi 10.4 and beyond? Will it be frozen at the current version of FastMM4?
Anders Melander Posted May 1, 2020
7 hours ago, DelphiRio said: There is version of FastMM4 supports NUMA
What makes you think it "supports NUMA"?
Pierre le Riche Posted May 1, 2020
9 hours ago, David Heffernan said: 10 hours ago, Pierre le Riche said: If you have a real-world workload that you could throw at it I would really appreciate the feedback. I could do this, but I'd probably need a little guidance on how to integrate fastmm5 with my program, making sure it was configured properly, and used the right branch.
I've just pushed some more changes to the numa_support branch in the repository. The idea is that you call FastMM_ConfigureAllArenasForNUMA on application startup, and then you have to call FastMM_ConfigureCurrentThreadForNUMA once in all threads - including the main thread.
It should probably not be necessary to require the application to make those calls, since FastMM_ConfigureAllArenasForNUMA could be called automatically when FastMM installs itself, and FastMM_ConfigureCurrentThreadForNUMA when a thread enters GetMem for the first time. However, I didn't want to presume that the NUMA support would necessarily be desired, because it does have a downside: it reduces the number of arenas available to each thread.
I also suggest that you bump the CFastMM_SmallBlockArenaCount and CFastMM_MediumBlockArenaCount values to about half the number of CPU cores.
Thanks for taking a look.
Pierre le Riche Posted May 1, 2020
6 hours ago, Darian Miller said: What happens with Delphi and FastMM in Delphi 10.4 and beyond? Will it be frozen at the current version of FastMM4?
Nothing will change for 10.4. Beyond that there is currently no firm plan in place.
David Schwartz Posted May 1, 2020
7 hours ago, Darian Miller said: I notice the license has changed to $99/dev. "FastMM 5 is dual-licensed. You may choose to use it under the restrictions of the GPL v3 licence at no cost to you, or you may purchase a commercial licence. A commercial licence includes all future updates." I assume forks like this one will be excluded from FastMM5+ changes: https://github.com/maximmasiutin/FastMM4-AVX
What is the practical impact of a GPL V3 license for those of us who don't keep up with such things?
Anders Melander Posted May 2, 2020
1 hour ago, David Schwartz said: What is the practical impact of a GPL V3 license
https://www.google.com/search?q=GPL+V3
Darian Miller Posted May 2, 2020
2 hours ago, David Schwartz said: What is the practical impact of a GPL V3 license for those of us who don't keep up with such things?
If you distribute an application that includes some GPL code, then the source code of the whole application must be made available under the GPL to everyone you distribute it to. Commercial software makers typically stay far away from GPL. Now, if you are actually making money on the software you make and distribute, then it makes sense to pay Pierre for a commercial license, bypassing the GPL issue.
It's a real line in the sand for FastMM5. Perhaps Embarcadero will negotiate with Pierre a nice big fee to get a redistributable commercial licensed version of FastMM5... or they will simply keep shipping Delphi with FastMM4. They certainly won't be shipping a GPL version of FastMM5 with Delphi.
David Schwartz Posted May 2, 2020
Oh, that one. "If I borrow your hammer then I have to give away everything I'll ever build with it in the future for free, even if it cost me a lot of time and money to build." Some folks have a strange notion of what "equity" and "balance" are about.
David Heffernan Posted May 2, 2020
2 hours ago, David Schwartz said: Oh, that one. "If I borrow your hammer then I have to give away everything I'll ever build with it in the future for free, even if it cost me a lot of time and money to build." Some folks have a strange notion of what "equity" and "balance" are about.
Don't use the hammer then. Make your own. Your choice.
Anders Melander Posted May 2, 2020
3 hours ago, David Schwartz said: Some folks have a strange notion of what "equity" and "balance" are about.
So you're complaining that Pierre has enabled us to use FastMM 5 for free and that there are conditions for this use? I think "thank you" would be more appropriate.
Günther Schoch Posted May 2, 2020
4 hours ago, David Schwartz said: Oh, that one. "If I borrow your hammer then I have to give away everything I'll ever build with it in the future for free, even if it cost me a lot of time and money to build."
Hello David, I see your concerns. Pierre le Riche and I discussed the licensing a lot. I tried to explain the background in https://en.delphipraxis.net/topic/2751-fastmm5-now-released-by-pierre-le-riche-small-background-story/
We see 3 groups of "users":
a) The vast majority is fine with FastMM4, as their applications do not suffer from any multi-threading related memory manager problem. That means nobody is forced to switch.
b) Developers with heavily multi-threaded applications consuming a lot of rather expensive CPU. There FastMM5 really helps, and the small amount of money that Pierre is asking for in the form of a dual license (starting at 99$) is nothing compared with other expenses.
c) And there is Embarcadero: as explained in my intro story, a modern memory manager would actually be part of the scope of Delphi (in theory). Pierre solved this problem already once (with FastMM4) for free. This story will not be repeated with FastMM5, as Pierre obviously needs some financial payback to maintain the product.
BTW: When my company started to sponsor the development of FastMM5, I would never have thought that it would pay back that fast. At the beginning of the year we got our first 2 AMD Epyc 64/128 based servers for hosting our Delphi WebServices. Scaling up our services for such platforms was really only possible with FastMM5.
Guest Posted May 2, 2020
19 minutes ago, Günther Schoch said: BTW: When my company started to sponsor the development of FastMM5, I would never have thought that it would pay back that fast. At the beginning of the year we got our first 2 AMD Epyc 64/128 based servers for hosting our Delphi WebServices. Scaling up our services for such platforms was really only possible with FastMM5.
If only Embarcadero had servers that needed to be updated!
@Günther Schoch and @Pierre le Riche thank you for such a contribution to the community!
Guest Posted May 2, 2020
5 hours ago, David Schwartz said: Oh, that one. "If I borrow your hammer then I have to give away everything I'll ever build with it in the future for free, even if it cost me a lot of time and money to build."
You are not just borrowing a hammer! You borrowed a hammer, used it to build a house, then used the hammer itself as the door handle, and now you are trying to sell the house with the hammer embedded in it. Is that fair to the one who lent you the hammer?