Pierre le Riche

Members
  • Content Count

    14
  • Joined

  • Last visited

  • Days Won

    1

Pierre le Riche last won the day on May 6

Pierre le Riche had the most liked content!

Community Reputation

19 Good

  1. Pierre le Riche

    Experience/opinions on FastMM5

    If you have applications that currently use borlndmm.dll, you can simply replace it with one compiled using FastMM5; the source for it is in the "BorlndMM DLL" subfolder. For new applications I recommend using FastMM5 directly: add it as the first unit in your project's dpr file.
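    A minimal sketch of a project file with FastMM5 as the first unit, as described above. All unit and form names other than FastMM5 are placeholders for your own project:

    ```delphi
    program MyApplication;

    { FastMM5 must be the first unit in the uses clause so that it installs
      itself before any other unit allocates memory. The remaining units
      are placeholders for a typical VCL project. }
    uses
      FastMM5,
      Vcl.Forms,
      MainFormUnit in 'MainFormUnit.pas' {MainForm};

    begin
      Application.Initialize;
      Application.CreateForm(TMainForm, MainForm);
      Application.Run;
    end.
    ```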
  2. Pierre le Riche

    Experience/opinions on FastMM5

    I have now added support for this. Previously you needed to execute "Include(FastMM_MessageBoxEvents, mmetUnexpectedMemoryLeakSummary);" to get a leak summary on shutdown, but now if ReportMemoryLeaksOnShutdown = True it will do that automatically. FastMM5 does not support the v4 options file, but it does support the v4 defines if you declare them under Project - Options - Delphi Compiler - Conditional defines. If that is not convenient I recommend you use the equivalent v5 options instead - the v4 conditional defines support is just for backward compatibility.
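    A sketch of the two equivalent ways to enable the shutdown leak summary mentioned above. FastMM_MessageBoxEvents, mmetUnexpectedMemoryLeakSummary and ReportMemoryLeaksOnShutdown are quoted from the post; the leaked TObject is just a deliberate demonstration leak:

    ```delphi
    program LeakReportDemo;

    uses
      FastMM5;

    begin
      // Either of the following enables the leak summary on shutdown.
      // The simple v5 way:
      ReportMemoryLeaksOnShutdown := True;
      // ...or the explicit event-set configuration it maps to:
      Include(FastMM_MessageBoxEvents, mmetUnexpectedMemoryLeakSummary);

      TObject.Create; // deliberately leaked; reported when the program exits
    end.
    ```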
  3. Pierre le Riche

    Experience/opinions on FastMM5

    Yes, the CFastMM_StackTraceEntryCount constant. If there is a big demand for it I could make it adjustable, but v5 is already approaching double the number of entries of v4, so I reckon it should be sufficient.
  4. Pierre le Riche

    Experience/opinions on FastMM5

    The default values are 19 entries under 32-bit, and 20 entries under 64-bit. (The odd-looking numbers are chosen to ensure that the structure is a multiple of 64 bytes.) The values are adjustable, but not at runtime.
  5. Pierre le Riche

    Experience/opinions on FastMM5

    Hi Feri,

    There is the global variable FastMM_OutputDebugStringEvents, which is a set of events for which OutputDebugString will be called. By default only critical events are included, but you can adjust it to fit your needs.

    Best regards,
    Pierre
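    A sketch of adjusting that event set. FastMM_OutputDebugStringEvents is quoted from the post, and mmetUnexpectedMemoryLeakSummary from an earlier one; whether other mmet* members exist with particular names is not confirmed here:

    ```delphi
    uses
      FastMM5;

    begin
      // In addition to the critical events included by default, also send
      // the shutdown leak summary to the debugger via OutputDebugString:
      Include(FastMM_OutputDebugStringEvents, mmetUnexpectedMemoryLeakSummary);
    end.
    ```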
  6. Pierre le Riche

    Experience/opinions on FastMM5

    This is the kind of thing that really should be in the RTL. It makes little sense for every library that needs a fast FillChar to have its own. Sure, the code might be small, but the CPU micro-op cache is small too.

    There is some justification for having custom Move routines in the memory manager, since there are some assumptions it can make that a general-purpose Move cannot, e.g. buffers are always non-overlapping, always aligned, always a multiple of a certain power of two, etc. At the moment FastMM just calls FillChar in system.pas for zeroing blocks - except for large blocks obtained directly from the OS (those are guaranteed to be zero already). Apart from some assumptions about alignment there's not much room for optimizations that cannot be done in FillChar.

    It's still an issue under 32-bit, where you're limited to a 4GB address space.

    While on the topic: have you run benchmarks on a real-world application to see what difference a faster Move and/or FillChar makes to application throughput? I ask because in the real-world applications I have tested so far, the total time spent in FillChar and Move is typically in the region of 5%, so even if you could somehow double their speed (which I doubt you can, given that memory bandwidth is a bottleneck) the best improvement you would see is 2.5%. If there are applications out there that would benefit greatly from a faster FillChar then it is something I would want to pursue further.
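    The 5% / 2.5% reasoning above is an instance of Amdahl's law; a small sketch using the figures from the post:

    ```delphi
    program AmdahlSketch;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils;

    { Overall speedup when a fraction P of total runtime is accelerated
      by a factor S (Amdahl's law). }
    function OverallSpeedup(P, S: Double): Double;
    begin
      Result := 1 / ((1 - P) + P / S);
    end;

    begin
      // 5% of runtime in Move/FillChar, made twice as fast:
      // 1 / (0.95 + 0.025) = ~1.026, i.e. roughly a 2.5% improvement.
      Writeln(Format('%.4f', [OverallSpeedup(0.05, 2.0)]));
    end.
    ```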
  7. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    It's all about keeping dependency chains short and avoiding unpredictable jumps. Modern CPUs are pretty good at dealing with any cruft in-between. Sometimes the profiler will tell you something you've missed, but if you spend enough time tuning code you develop a feel for what works and what doesn't. An example of an unavoidable dependency chain hurting performance is in FreeMem: On average more than a quarter of the CPU time in a FreeMem call goes to the very first line - reading the block header. FastMM needs to know what kind of block it is (if it even is a valid block) before it can do anything. Due to alignment requirements the header is often in a prior cache line, and if the app has been busy manipulating data then it is most likely no longer cached. Going all the way to main memory is expensive and it holds up everything else.
  8. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    The assembly language makes a noticeable difference in performance under x86 due to the limited number of registers and the fact that the 32-bit compiler shuffles variables between the stack and registers more often than is necessary. Under 64-bit there are more registers, and the 64-bit compiler also optimizes a bit better, so assembly language generally provides less of a benefit there, but given that it is easy to translate 32-bit assembly to 64-bit I went ahead and translated many of the 32-bit assembly language routines anyway. In places where I need instructions for which the language currently has no intrinsic (e.g. bit scan forward, "bsf") I don't have much choice in the matter.

    It is critical that the method you used as an example be inlined. In most places it is called from, the arguments are all constants, so the compiler will short-circuit all the branches during inlining and produce close to optimal code (bar some shuffling of values between registers which I could not coax it out of doing). If you take out the "inline" it becomes very inefficient. I've spent a lot of time looking at the compiled code disassembly, reorganising the Pascal code, and making use of inlining to nudge the compiler towards producing better output. There's no built-in assembler support for some of the platforms, so with the goal of eventually supporting more platforms than Windows it is worth the time spent.

    That is also on the to-do list. The aim is to make everything configurable at runtime, and the number of arenas is perhaps the most important option that currently cannot be tuned at runtime. We have discussed making the number of arenas grow dynamically as needed; however, each arena comes with memory overhead, so at some point you would also want the number of arenas to shrink again when the application no longer needs them.

    Another option that has been discussed internally is to allow a single dedicated arena per thread, useful in cases where memory usage is not important at all and scaling is everything. Options such as these will probably be worked into the "MemoryManagerOptimizationStrategy" enumeration. At the moment there isn't much difference between the various options, but that will change as features get tweaked and new ones are added. Thank you for the feedback and suggestions.
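    The constant-argument inlining effect described above can be illustrated with a hypothetical helper (not FastMM's actual code):

    ```delphi
    { When the argument is a compile-time constant at the call site, an
      inlined function like this lets the compiler evaluate the branches
      during inlining and emit only the selected path; without "inline"
      the branches survive into the generated code. Purely illustrative. }
    function BlockSizeClass(ASize: NativeInt): Integer; inline;
    begin
      if ASize <= 64 then
        Result := 0
      else if ASize <= 1024 then
        Result := 1
      else
        Result := 2;
    end;

    // At a call site such as BlockSizeClass(32), the compiler can reduce
    // the entire call to the constant 0.
    ```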
  9. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    Rep movs is only used for large blocks, where it is the fastest mechanism on modern processors. For smaller blocks SSE2 is used, if available. As an interesting aside, apparently the Linux kernel uses rep movs for all moves; the rationale is to force the CPU manufacturers to optimize rep movs even for small moves. Historically rep movs has been hit-and-miss from one CPU generation to the next.

    I would like to add more platform support, but there has to be sufficient demand to offset the development and maintenance burden. I don't want v5 to become an "ifdef hell" like, to a large extent, v4 is. v4 supports so many different configurations that it is littered with ifdefs, which makes it really hard to expand functionality because there are so many combinations that need to be tested. That said, it is on my to-do list to add support for most or all of the platforms supported by the latest Delphi versions. I think it would be particularly useful to have the debugging features available on other platforms as well.
  10. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    Against my better judgement I'm going to bite. Below are screenshots of the "CPU time" and "Peak working set" columns in Task Manager after a single benchmark run of the Fastcode Memory Manager Benchmark & Validation tool. It includes a variety of tests, including replays of memory usage recordings of real world applications:
  11. Pierre le Riche

    Experience/opinions on FastMM5

    Nothing will change for 10.4. Beyond that there is currently no firm plan in place.
  12. Pierre le Riche

    Experience/opinions on FastMM5

    I could do this, but I'd probably need a little guidance on how to integrate fastmm5 with my program, making sure it was configured properly, and used the right branch.

    I've just pushed some more changes to the numa_support branch in the repository. The idea is that you call FastMM_ConfigureAllArenasForNUMA on application startup, and then call FastMM_ConfigureCurrentThreadForNUMA once in every thread - including the main thread. It should probably not be necessary to require the application to make those calls, since FastMM_ConfigureAllArenasForNUMA could be called automatically when FastMM installs itself, and FastMM_ConfigureCurrentThreadForNUMA when a thread enters GetMem for the first time. However, I didn't want to presume that NUMA support would necessarily be desired, because it does have a downside: it reduces the number of arenas available to each thread.

    I also suggest that you bump the CFastMM_SmallBlockArenaCount and CFastMM_MediumBlockArenaCount values to about half the number of CPU cores. Thanks for taking a look.
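    A sketch of the call pattern described above for the numa_support branch. The two FastMM_Configure* names are quoted from the post, but treating them as parameterless is an assumption, and the worker thread class is a placeholder:

    ```delphi
    uses
      System.Classes,
      FastMM5;

    type
      TWorkerThread = class(TThread)
      protected
        procedure Execute; override;
      end;

    procedure TWorkerThread.Execute;
    begin
      // Once per thread, before it starts allocating heavily:
      FastMM_ConfigureCurrentThreadForNUMA;
      // ...thread workload goes here...
    end;

    begin
      // Once at application startup:
      FastMM_ConfigureAllArenasForNUMA;
      // The main thread counts as a thread too:
      FastMM_ConfigureCurrentThreadForNUMA;
      TWorkerThread.Create(False);
    end.
    ```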
  13. Pierre le Riche

    Experience/opinions on FastMM5

    I have not been able to get official verification of this from a Microsoft website, but according to several other sources Windows uses the memory closest to the CPU that caused the page fault wherever possible. So effectively it does not matter which thread allocated the virtual memory; the thread that touches a page first determines which memory is used to back it. It would certainly make a lot of sense for it to work that way.

    If you have a real-world workload that you could throw at it I would really appreciate the feedback. I don't currently have any benchmarks that I think are suitable. Assuming the behaviour described above is correct and Windows backs a page with memory from the node of the CPU that touched it first, I expect this new feature to have no material impact on performance with blocks much larger than 4K, but with smaller blocks there should be a measurable difference.
  14. Pierre le Riche

    Experience/opinions on FastMM5

    Hi all,

    I've added experimental support for NUMA in a branch (numa_support). The idea is to link both the arenas and the threads (for which performance matters) to a "NUMA mask". When scanning the arenas for available blocks, FastMM performs a bitwise "and" between the mask for the arena and the mask for the thread, and if the result is non-zero then the arena is allowed to serve blocks to that thread. In this way you can completely separate the memory pools between threads or groups of threads.

    I have not tested how well this works in practice (I don't have a NUMA system on hand), but I believe VirtualAlloc is smart enough to provide memory from the NUMA node closest to the CPU the thread is running on. I have made it so you can specify a mask by block size.

    Version 4 is susceptible to cache thrashing when adjacent small blocks share the same cache line and are written to by different CPUs. By using this mechanism that can also be avoided.

    Pierre
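    The arena-selection test described above can be sketched as follows; the type and names are illustrative, not FastMM source:

    ```delphi
    type
      TNUMAMask = Cardinal;

    { An arena may serve a thread when the bitwise "and" of the arena's
      NUMA mask and the thread's NUMA mask is non-zero. }
    function ArenaMatchesThread(const AArenaMask, AThreadMask: TNUMAMask): Boolean; inline;
    begin
      Result := (AArenaMask and AThreadMask) <> 0;
    end;

    // Example: an arena bound to node 0 (mask $1) serves a thread whose
    // mask is $1 or $3, but not a thread restricted to node 1 (mask $2).
    ```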