Pierre le Riche

Everything posted by Pierre le Riche

  1. Pierre le Riche

    SetLength TBytes Memory Leak

    If the call was made by a DLL that was unloaded prior to the report being generated it won't be possible to convert the return address to call address after the fact. But I take your point - in most cases it'll work. If it's a big issue for you I can put it on my to-do list. TJclStackInfoList.ValidCallSite does that - it returns the size of the call instruction given a return address.
  2. Pierre le Riche

    SetLength TBytes Memory Leak

    I agree with you, but calculating the call address is only included in the cost when you perform a raw stack trace: Frame based traces (including the CaptureStackBackTrace API call) yield return addresses, so in order to get the call addresses for those you would need to dereference the return pointers (as you do with raw traces) in order to find the start of the call instruction. This would negate much of the performance advantage over raw traces. Before a stack trace address is passed to the JCL for conversion to unit and line number information, 1 is subtracted in order to ensure the address falls inside the call instruction. Consequently the unit/line information corresponds to the call, and the listed address is the return address. Not ideal, but I find that I very rarely actually look at the address so it has never really bothered me.
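The "subtract 1" step above can be sketched in a few lines. This is only an illustration under stated assumptions: the line table, its addresses, and the helper names are hypothetical, and the real lookup is performed by the JCL, not by code like this.

```python
import bisect

# Hypothetical line table: (address where the code for a line starts, line number),
# sorted by address. Line 11 is assumed to be a 5-byte call at 0x1008..0x100C.
LINE_TABLE = [(0x1000, 10), (0x1008, 11), (0x100D, 12), (0x1015, 13)]
STARTS = [addr for addr, _ in LINE_TABLE]


def line_for_address(addr):
    """Return the line whose code range contains addr, or None if addr is too low."""
    i = bisect.bisect_right(STARTS, addr) - 1
    return LINE_TABLE[i][1] if i >= 0 else None


def line_for_return_address(ret_addr):
    """Look up ret_addr - 1 so the address falls inside the call instruction."""
    return line_for_address(ret_addr - 1)


# The call on line 11 ends at 0x100C, so its return address is 0x100D,
# which is already the first byte of line 12.
RETURN_ADDRESS = 0x100D
naive_line = line_for_address(RETURN_ADDRESS)
adjusted_line = line_for_return_address(RETURN_ADDRESS)
```

Without the adjustment the lookup reports the line after the call; with it, the line of the call itself, while the address shown in the report remains the return address.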
  3. Pierre le Riche

    SetLength TBytes Memory Leak

    Hi Stefan,

    The code that validates stack trace entries in FastMM_FullDebugMode.dpr was taken from TJclStackInfoList.ValidCallSite, with some minor modifications. (Credit is given in the comments.) I just checked and it is still the same, functionally, as the latest code in JCLDebug.pas.

    By default FastMM does a "raw" stack trace, meaning it just walks the stack in reverse, testing every dword to see whether it could potentially be the return address for a call instruction. It does this by checking whether the bytes just prior to the potential return address match one of the many potential opcodes for a call instruction. This is not 100% accurate, so you do get some false positives (as you have seen), but it works fairly well in general. The alternative would be to do a frame based stack trace, but then routines that do not set up a stack frame would not be listed at all.

    There's room for optimization here: The stack tracer could perhaps look at the actual address of the call instruction and compare that to the return address of the next call in the call stack. If the variance in code addresses is too large then the prior call is likely a false positive. Perhaps MadExcept does something like this, or some other smart tricks I am not aware of. I'll look around to see if there are ways to cut down on the false positives without impacting performance too much.

    I ran your test case under Delphi 10.4.2 and I got this stack trace:

    0041320A [FastMM5.pas][FastMM5][FastMM_DebugGetMem$qqri][7717]
    00404799 [System.pas][System][@ReallocMem$qqrrpvi][5035]
    00408B4F [System.pas][System][DynArraySetLength$qqrrpvpvipi][36568]
    00408CB6 [System.pas][System][@DynArraySetLength$qqrv][36672]
    00427619
    76F5FA29 [BaseThreadInitThunk]
    77247A4E [RtlGetAppContainerNamedObjectPath]
    77247A1E [RtlGetAppContainerNamedObjectPath]

    As you can see the call to LStrFromPWCharLen isn't there, so I think this confirms that it was just "noise" in the stack trace.

    Best regards,
    Pierre
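To make the raw-trace idea concrete, here is a minimal sketch in Python. It is not the FastMM or JCL code: it recognises only two call encodings (`E8 rel32` and `FF D0..D7`), whereas ValidCallSite checks many more, and the code image, base address, and stack contents are made up for the illustration.

```python
import struct

CODE_BASE = 0x00401000  # hypothetical load address of the code image

# Hypothetical code bytes: nop, call rel32, nop, call eax, ret
CODE = bytes([0x90, 0xE8, 0, 0, 0, 0, 0x90, 0xFF, 0xD0, 0xC3])

# Hypothetical stack contents, one little-endian dword per slot
STACK = struct.pack(
    "<5I", 0x12345678, CODE_BASE + 6, 0x0000002A, CODE_BASE + 9, CODE_BASE + 8
)


def call_size_before(code, base, ret_addr):
    """Return the size of a call instruction ending at ret_addr, or 0 if none matches."""
    off = ret_addr - base
    if off < 0 or off > len(code):
        return 0  # not an address inside the code image
    if off >= 5 and code[off - 5] == 0xE8:
        return 5  # E8 rel32: near relative call
    if off >= 2 and code[off - 2] == 0xFF and 0xD0 <= code[off - 1] <= 0xD7:
        return 2  # FF /2 with a register operand: call through a register
    return 0


def raw_stack_trace(stack, code, base):
    """Test every dword on the stack; keep (return address, call address) candidates."""
    hits = []
    for (value,) in struct.iter_unpack("<I", stack):
        size = call_size_before(code, base, value)
        if size:
            hits.append((value, value - size))
    return hits


trace = raw_stack_trace(STACK, CODE, CODE_BASE)
```

A stale dword left on the stack by an earlier call passes the same test, which is where the false positives come from: the byte check cannot tell a live return address from leftover data that happens to point just past a call.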
  4. In the SetLength call you're calling Integer.ToString and BoolToStr in order to determine the resultant string lengths, and lower down you're calling those again to get the actual strings. I reckon that's where the 30% is going. Even if you fix that I doubt you'll see more than a marginal performance improvement going from PrepareLineForExport to PrepareLineForExport_MOVE. String concatenation is decently implemented in the RTL. (The 32-bit implementation is even in assembly language.)
  5. Pierre le Riche

    Experience/opinions on FastMM5

    If you have applications that are currently using borlndmm.dll you can just replace it with one compiled using FastMM5. The source for it is in the "BorlndMM DLL" subfolder. For new applications I recommend that you use FastMM5 directly - add it as the first unit in your project's dpr file.
  6. Pierre le Riche

    Experience/opinions on FastMM5

    I have now added support for this. Previously you needed to execute "Include(FastMM_MessageBoxEvents, mmetUnexpectedMemoryLeakSummary);" to get a leak summary on shutdown, but now if ReportMemoryLeaksOnShutdown = True it will do that automatically. FastMM5 does not support the v4 options file, but it does support the v4 defines if you declare them in Project - Options - Delphi Compiler - Conditional defines. If that is not convenient I recommend you use the equivalent v5 options instead - the v4 conditional defines support is just for backward compatibility.
  7. Pierre le Riche

    Experience/opinions on FastMM5

    Yes, the CFastMM_StackTraceEntryCount constant. If there is a big demand for it I could make it adjustable, but v5 is already approaching double the number of entries of v4 so I reckon it should be sufficient.
  8. Pierre le Riche

    Experience/opinions on FastMM5

    The default values are 19 entries under 32-bit, and 20 entries under 64-bit. (The unusual numbers are to ensure that the structure is a multiple of 64 bytes.) The values are adjustable, but not at runtime.
  9. Pierre le Riche

    Experience/opinions on FastMM5

    Hi Feri, There is the global variable FastMM_OutputDebugStringEvents, which is a set of events for which OutputDebugString will be called. By default only critical events are included, but you can adjust it to fit your needs. Best regards, Pierre
  10. Pierre le Riche

    Experience/opinions on FastMM5

    This is the kind of thing that really should be in the RTL. It makes little sense for every library that needs a fast FillChar to have its own. Sure the code might be small, but the CPU micro-op cache is small too. There is some justification for having custom Move routines in the memory manager, since there are some assumptions that it can make that a general purpose Move cannot, e.g. buffers are always non-overlapping, always aligned, always a multiple of a certain power of two, etc. At the moment FastMM just calls FillChar in system.pas for zeroing blocks - except for large blocks obtained directly from the OS (those are guaranteed to be zero already). Apart from some assumptions about alignment there's not much room for optimizations that cannot be done in FillChar. It's still an issue under 32-bit, where you're limited to a 4GB address space. While on the topic: Have you run benchmarks on a real-world application to see what difference a faster Move and/or FillChar makes to application throughput? The reason I ask is because in the real-world applications I have tested so far the total time spent in FillChar and Move is typically in the region of 5%, so if you could somehow double the speed (which I doubt you can, given that memory bandwidth is a bottleneck) the best improvement you would see is 2.5%. If there are applications out there that will benefit greatly from a faster FillChar then it is something I would want to pursue further.
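The 2.5% estimate above is Amdahl's-law arithmetic, which can be sketched as follows. The function name is made up for the illustration, and the 5% share is the rough figure quoted in the post, not a measurement made here.

```python
def overall_time_saved(fraction, speedup):
    """Fraction of total run time saved when `fraction` of it runs `speedup` times faster."""
    return fraction * (1.0 - 1.0 / speedup)


# 5% of the run time spent in FillChar and Move, both made twice as fast
saved_when_doubled = overall_time_saved(0.05, 2.0)

# Upper bound: even an arbitrarily fast routine cannot save more than its 5% share
saved_upper_bound = overall_time_saved(0.05, float("inf"))
```

Doubling the speed saves half of the 5% share, and no speedup can save more than the share itself.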
  11. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    It's all about keeping dependency chains short and avoiding unpredictable jumps. Modern CPUs are pretty good at dealing with any cruft in-between. Sometimes the profiler will tell you something you've missed, but if you spend enough time tuning code you develop a feel for what works and what doesn't. An example of an unavoidable dependency chain hurting performance is in FreeMem: On average more than a quarter of the CPU time in a FreeMem call goes to the very first line - reading the block header. FastMM needs to know what kind of block it is (if it even is a valid block) before it can do anything. Due to alignment requirements the header is often in a prior cache line, and if the app has been busy manipulating data then it is most likely no longer cached. Going all the way to main memory is expensive and it holds up everything else.
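The cache line point can be illustrated with simple address arithmetic. This sketch assumes a 64-byte cache line and a header stored immediately before the block data; the 8-byte header size and the addresses are assumptions for the illustration, not FastMM's actual layout.

```python
CACHE_LINE = 64  # assumed cache line size in bytes


def header_shares_line(block_addr, header_size):
    """True if a header stored just before block_addr lies in the same cache line."""
    header_addr = block_addr - header_size
    return header_addr // CACHE_LINE == block_addr // CACHE_LINE


# A block aligned to a 64-byte boundary: its header ends up in the previous line
aligned_block = header_shares_line(0x1000, 8)

# A block that starts 16 bytes into a line: header and data share the line
offset_block = header_shares_line(0x1010, 8)
```

When the block itself starts on a cache line boundary, reading the header touches a different line from the one holding the data the application has been working on.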
  12. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    The assembly language makes a noticeable difference in performance under x86 due to the limited number of registers and the fact that the 32-bit compiler shuffles variables between the stack and registers more often than is necessary. Under 64-bit there are more registers, and the 64-bit compiler also optimizes a bit better, so assembly language generally provides less of a benefit under 64-bit, but given that it is easy to translate 32-bit assembly to 64-bit I went ahead and translated many of the 32-bit assembly language routines anyway. In places where I need to use instructions for which the language currently has no intrinsic (e.g. bit scan forward "bsf") I don't have much choice in the matter. It is critical that the method you used as an example be inlined. In most of the places where it is called the arguments are all constants, so the compiler will short-circuit all the branches in the inlining process and it will produce close to optimal code (bar some shuffling of values between registers which I could not coax it out of doing). If you take out the "inline" then it becomes very inefficient. I've spent a lot of time looking at the compiled code disassembly, reorganising the Pascal code, and making use of inlining to nudge the compiler towards producing better output. There's no built-in assembler support for some of the platforms, so with the goal of eventually supporting more platforms than Windows it is worth the time spent. That is also on the to-do list. The aim is to make everything configurable at runtime, and the number of arenas is perhaps the most important option that currently cannot be tuned at runtime. We have discussed making the number of arenas grow dynamically as needed, however each arena comes with memory overhead, so at some point you would also want the number of arenas to shrink again when the application no longer needs them.

    Another option that has been discussed internally is to allow having a single dedicated arena per thread, useful in cases where memory usage is not important at all and scaling is everything. Options such as these will probably be worked into the "MemoryManagerOptimizationStrategy" enumeration. At the moment there isn't much difference between the various options, but that will change as features get tweaked and new ones are added. Thank you for the feedback and suggestions.
  13. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    Rep movs is only used for large blocks, where it is the fastest mechanism on modern processors. For smaller blocks SSE2 is used, if available. As an interesting aside, apparently the Linux kernel uses rep movs for all moves. The rationale behind that is to force the CPU manufacturers to optimize rep movs even for small moves. Historically rep movs has been hit and miss from one CPU generation to the next. I would like to add more platform support, but there has to be sufficient demand to offset the development and maintenance burden. I don't want v5 to become an "ifdef hell" like, to a large extent, v4. v4 supports so many different configurations that it is littered with ifdefs. It makes it really hard to expand functionality because there are so many different combinations that need to be tested. That said, it is on my "to-do" list to add support for most or all of the platforms supported by the latest Delphi versions. I think it would be particularly useful to have the debugging features available on other platforms as well.
  14. Pierre le Riche

    FastMM5 now released by Pierre le Riche (small background story)

    Against my better judgement I'm going to bite. Below are screenshots of the "CPU time" and "Peak working set" columns in Task Manager after a single benchmark run of the Fastcode Memory Manager Benchmark & Validation tool. It includes a variety of tests, including replays of memory usage recordings of real world applications:
  15. Pierre le Riche

    Experience/opinions on FastMM5

    Nothing will change for 10.4. Beyond that there is currently no firm plan in place.
  16. Pierre le Riche

    Experience/opinions on FastMM5

    I could do this, but I'd probably need a little guidance on how to integrate fastmm5 with my program, making sure it was configured properly, and used the right branch. I've just pushed some more changes to the numa_support branch in the repository. The idea is that you call FastMM_ConfigureAllArenasForNUMA on application startup, and then you have to call FastMM_ConfigureCurrentThreadForNUMA once in all threads - including the main thread. It should probably not be necessary to require the application to make those calls, since FastMM_ConfigureAllArenasForNUMA could be called automatically when FastMM installs itself, and FastMM_ConfigureCurrentThreadForNUMA when a thread enters GetMem for the first time. However, I didn't want to presume that the NUMA support would necessarily be desired, because it does have a downside and that is that it reduces the number of arenas available to each thread. I also suggest that you bump the CFastMM_SmallBlockArenaCount and CFastMM_MediumBlockArenaCount values to about half the number of CPU cores. Thanks for taking a look.
  17. Pierre le Riche

    Experience/opinions on FastMM5

    I have not been able to get official verification of this from a Microsoft website, but according to several other sources Windows uses the memory closest to the CPU that causes the page fault wherever possible. So effectively it does not matter which thread allocated the virtual memory, the thread that touches the page first will determine what memory is used to back it. It certainly would make a lot of sense for it to work that way. If you have a real-world workload that you could throw at it I would really appreciate the feedback. I don't currently have any benchmarks that I think are suitable. Assuming the behaviour described above is correct and Windows backs a page with memory from the CPU that touched the page first I expect this new feature not to have a material impact on performance with blocks much larger than 4K, but with smaller blocks there should be a measurable difference.
  18. Pierre le Riche

    Experience/opinions on FastMM5

    Hi all, I've added experimental support for NUMA in a branch (numa_support). The idea is to link both the arenas and threads (for which performance matters) to a "NUMA mask". When scanning the arenas for available blocks it will perform a bitwise "and" between the mask for the arena and the mask for the thread, and if the result is non-zero then the arena is allowed to serve blocks to that thread. In this way you can completely separate the memory pools between threads or groups of threads. I have not tested how well this works in practice (I don't have a NUMA system on hand), but I believe VirtualAlloc is smart enough to provide memory from the NUMA node closest to the CPU the thread is running on. I have made it so you can specify a mask by block size. Version 4 is susceptible to cache thrashing when adjacent small blocks share the same cache line and are written to by different CPUs. By using this mechanism that can also be avoided. Pierre
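The mask test described above amounts to a bitwise "and" per arena. The sketch below shows only that selection rule; the function name, the one-bit-per-node mask layout, and the four-arena setup are assumptions for the illustration and are not taken from the numa_support branch.

```python
def eligible_arenas(thread_mask, arena_masks):
    """Indexes of the arenas allowed to serve a thread: mask AND mask must be non-zero."""
    return [i for i, arena_mask in enumerate(arena_masks) if arena_mask & thread_mask]


# Hypothetical setup: one bit per NUMA node, two arenas tied to each of two nodes
ARENA_MASKS = [0b01, 0b01, 0b10, 0b10]

node0_arenas = eligible_arenas(0b01, ARENA_MASKS)  # thread linked to node 0
node1_arenas = eligible_arenas(0b10, ARENA_MASKS)  # thread linked to node 1
any_arenas = eligible_arenas(0b11, ARENA_MASKS)    # thread allowed on either node
```

With disjoint masks the two groups of threads never share an arena, which is the complete separation of memory pools the post describes; it also shows the cost, since each thread sees only a subset of the arenas.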