Jud 1 Posted November 12, 2020 Here is the problem (a bit of a long message): I'm writing a threaded program to run on workstations with dual Xeons. Each of the Xeons has eight hyperthreaded cores. (I've read https://www.delphipraxis.net/113427-beginthreadaffinity-setthreadaffinity.html but I don't think it helps me.) Dual-Xeon systems have Non-Uniform Memory Access (NUMA): each Xeon has direct access to its own memory and can also access memory on the other Xeon, but it takes a long time for a thread running on one Xeon to access memory on the other. The bottleneck in the program is accessing memory. I have the program set up so I can adjust the number of threads it runs. As I test from 1 up to 16 threads, performance improves. Task Manager / Performance / Resource Monitor shows that all 16 threads are running on one Xeon. But if I go to 17 threads or more, performance drops (compared to 16 threads on one Xeon). What must be happening is that threads 17 and above are executing on the second Xeon while their memory is on the first Xeon. That's the problem. I've tried to set the affinity of each thread to a particular CPU, hoping it would use that CPU for its memory, but setting the affinity inside a thread isn't working at all (even with only one thread, it runs on all CPUs). What I have right now is that I limit an instance of the program to 16 threads and set its affinity to run on the first Xeon. Then I start a second instance of the program; it checks whether the first instance is running, and if so, it sets its affinity to run on the second Xeon. Then I merge the results later (which isn't ideal). This is working, with double the performance of running on one Xeon, but it is no different from running on two computers, each with one Xeon. So is there a way to get one instance to use all 32 available CPUs with full performance?
(If I could assign each thread and its memory to a particular CPU, that should fix the problem, but I haven't been able to do that.)
Arnaud Bouchez 407 Posted November 12, 2020 You would need two memory managers, one per CPU socket, which is not possible with the Delphi RTL yet. You may try a memory manager other than FastMM4, one that uses a per-thread arena for small blocks. There are some around for Delphi; just google for them.
David Heffernan 2345 Posted November 12, 2020 Yeah, NUMA requires a completely different memory allocation strategy. When I faced this problem I concluded that there was no memory manager available that could do what I needed. So I wrote my own that is essentially a wrapper around the Windows heap manager. The trick is to have different heaps for each NUMA node.
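A minimal sketch of the "different heaps for each NUMA node" idea, not David's actual memory manager, which the thread does not show. It assumes a recent Delphi on Windows. `NumaHighestNode` is a local import of the real kernel32 export `GetNumaHighestNodeNumber` under a hypothetical name; `NodeAlloc`, `NodeFree`, `CreateNodeHeaps`, and `DestroyNodeHeaps` are hypothetical helpers. A Windows heap is not bound to a node by itself, so the sketch relies on the further assumption that each heap is used only by threads kept on that node's processors, so that its pages are first touched there.

```pascal
program PerNodeHeaps;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Local import under a hypothetical name, so the sketch does not depend on
// how a given Delphi version declares this kernel32 function.
function NumaHighestNode(out HighestNode: Cardinal): BOOL; stdcall;
  external kernel32 name 'GetNumaHighestNodeNumber';

var
  NodeHeaps: array of THandle; // one private Windows heap per NUMA node

procedure CreateNodeHeaps;
var
  Highest, Node: Cardinal;
begin
  if not NumaHighestNode(Highest) then
    RaiseLastOSError;
  SetLength(NodeHeaps, Highest + 1);
  for Node := 0 to Highest do
  begin
    // Initial and maximum size 0: a growable, serialized heap.
    NodeHeaps[Node] := HeapCreate(0, 0, 0);
    if NodeHeaps[Node] = 0 then
      RaiseLastOSError;
  end;
end;

procedure DestroyNodeHeaps;
var
  I: Integer;
begin
  for I := 0 to High(NodeHeaps) do
    HeapDestroy(NodeHeaps[I]);
  NodeHeaps := nil;
end;

// Hypothetical helpers: the caller passes the node its thread is kept on.
function NodeAlloc(Node: Cardinal; Size: NativeUInt): Pointer;
begin
  Result := HeapAlloc(NodeHeaps[Node], 0, Size);
  if Result = nil then
    raise EOutOfMemory.Create('NodeAlloc failed');
end;

procedure NodeFree(Node: Cardinal; P: Pointer);
begin
  if not HeapFree(NodeHeaps[Node], 0, P) then
    RaiseLastOSError;
end;

var
  Block: Pointer;
begin
  CreateNodeHeaps;
  try
    Writeln('NUMA nodes: ', Length(NodeHeaps));
    Block := NodeAlloc(0, 1024 * 1024); // 1 MB from node 0's heap
    FillChar(Block^, 1024 * 1024, 0);   // first touch of the new pages
    NodeFree(0, Block);
  finally
    DestroyNodeHeaps;
  end;
end.
```

The point of the design is only the separation: memory handed out for node N never shares heap segments with memory used by threads on another node.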
Jud 1 Posted November 12, 2020 Thank you both for the replies. I don't have the knowledge to write a memory manager, so I'll stick with what I'm doing now. Delphi is a little behind with support for 64-bit systems with large memories and/or multiple CPU chips. BTW, the dual-Xeon systems do well on multi-threaded, computation-intensive tasks, but on memory-intensive tasks they behave more like two independent Xeons.
David Heffernan 2345 Posted November 12, 2020 Just remembered, doesn't FastMM5 have some features to support NUMA? Worth a look.
Jud 1 Posted November 12, 2020 I didn't know about FastMM5 but it sounds good!
Anders Melander 1783 Posted November 12, 2020 1 hour ago, David Heffernan said: doesn't FastMM5 have some features to support NUMA? No. You discussed it here:
Jud 1 Posted November 13, 2020 (edited) Well, I downloaded FastMM5, put it first in my USES, did a build, and I get the error message: FastMM cannot be installed, because the default memory manager has already been used to allocate memory. I exited Delphi, restarted it, did a build, and got the same thing. ADDED: Right after I wrote that I realized that it should go in USES in the DPROJ file instead of the DPR file. That got rid of the error message. But I did a test on a quad-core i7, so 8 threads, and it is no faster. Edited November 13, 2020 by Jud
Jud 1 Posted November 13, 2020 I have no problem assigning affinity for an entire program, but I can't get it to work on individual threads on a dual-Xeon system. Is there a way to make that work?
FredS 138 Posted November 13, 2020 (edited) 51 minutes ago, Jud said: it should go in USES in the DPROJ I added in my DPR right above madExcept (v5) and it worked fine. Did you use the NUMA Branch? Edited November 13, 2020 by FredS
FredS 138 Posted November 13, 2020 12 minutes ago, Jud said: but I can't get it to work on individual threads I don't have a dual-Xeon system available to me, but I've tested this in an elevated app and it works fine: https://community.idera.com/developer-tools/programming-languages/f/delphi-language/72615/linux-equivalent-for-setthreadidealprocessor-and-setthreadaffinitymask-in-delphi
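For readers following the per-thread affinity question, here is a minimal sketch of one way to do it with the Win32 `SetThreadAffinityMask` call; it is not the code from the linked post. It assumes a recent Delphi on Windows and a machine with at most 64 logical processors (a single processor group). `NumaHighestNode` and `NumaNodeMask` are local imports of the real kernel32 exports `GetNumaHighestNodeNumber` and `GetNumaNodeProcessorMask` under hypothetical names, and `TWorkerInfo` is a hypothetical record. Each worker pins itself as its first action, before it allocates or touches any data.

```pascal
program PinThreadsPerNode;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Local imports under hypothetical names, so the sketch does not depend on
// how a given Delphi version declares these kernel32 functions.
function NumaHighestNode(out HighestNode: Cardinal): BOOL; stdcall;
  external kernel32 name 'GetNumaHighestNodeNumber';
function NumaNodeMask(Node: Byte; out Mask: UInt64): BOOL; stdcall;
  external kernel32 name 'GetNumaNodeProcessorMask';

type
  PWorkerInfo = ^TWorkerInfo;
  TWorkerInfo = record
    Mask: NativeUInt;  // logical processors this thread may run on
    PinnedOk: Boolean; // filled in by the worker
  end;

function Worker(Parameter: Pointer): Integer;
var
  Info: PWorkerInfo;
begin
  Info := PWorkerInfo(Parameter);
  // Pin the calling thread first, before it allocates or touches its data.
  // SetThreadAffinityMask returns the previous mask, or 0 on failure.
  Info^.PinnedOk := SetThreadAffinityMask(GetCurrentThread, Info^.Mask) <> 0;
  // ... allocate and process this thread's data here ...
  Result := 0;
end;

var
  Highest, Node: Cardinal;
  Mask: UInt64;
  Infos: array of TWorkerInfo;
  Handles: array of THandle;
  Id: TThreadID;
begin
  if not NumaHighestNode(Highest) then
    RaiseLastOSError;
  // Sized once up front, so the pointers passed to the workers stay valid.
  SetLength(Infos, Highest + 1);
  SetLength(Handles, Highest + 1);
  // One worker per NUMA node here; a real program would start several.
  for Node := 0 to Highest do
  begin
    if not NumaNodeMask(Byte(Node), Mask) then
      RaiseLastOSError;
    Infos[Node].Mask := NativeUInt(Mask);
    Handles[Node] := BeginThread(nil, 0, Worker, @Infos[Node], 0, Id);
    if Handles[Node] = 0 then
      RaiseLastOSError;
  end;
  for Node := 0 to Highest do
  begin
    WaitForSingleObject(Handles[Node], INFINITE);
    CloseHandle(Handles[Node]);
    Writeln('node ', Node, '  mask $', IntToHex(Int64(Infos[Node].Mask), 16),
      '  pinned: ', Infos[Node].PinnedOk);
  end;
end.
```

Asking Windows for each node's processor mask avoids hard-coding which logical processors belong to which Xeon. Pinning alone does not move memory that was already allocated elsewhere, which is why the call comes first in the worker.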
Jud 1 Posted November 13, 2020 12 minutes ago, FredS said: I added in my DPR right above madExcept (v5) and it worked fine. Did you use the NUMA Branch? I don't know how to do that - I haven't found good documentation on FastMM5. It didn't make a difference on the memory-intensive program on a single-i7 machine.
FredS 138 Posted November 13, 2020 56 minutes ago, Jud said: I don't know how to do that - I haven't found good documentation on FastMM5. It didn't make a difference on the memory-intensive program on a single-i7 machine. Check out the NUMA Support Branch. Add this to the DPR uses clause: `FastMM5 in '<path>\FastMM5\FastMM5.pas',`. Then call `FastMM_ConfigureAllArenasForNUMA` in the DPR after BEGIN.
David Heffernan 2345 Posted November 13, 2020 You aren't going to get anywhere by just adding a new library, compiling, running, and hoping for magic. It's going to require an in-depth understanding of how to configure and use any library and how to adapt your own code.
Jud 1 Posted November 13, 2020 Yes, but the only documentation I've found about it is: https://github.com/pleriche/FastMM5 I have no idea of how to use it or any of the details of what a memory manager does. I think it should be invisible. ---------- Does anyone know if OmniThreadLibrary will help with this?
David Heffernan 2345 Posted November 13, 2020 2 hours ago, Jud said: I think it should be invisible. That would be nice. But in my experience, developing a program to work well with NUMA requires detailed knowledge from the programmer. I personally don't think there is any way round that.
Guest Posted November 13, 2020 I did two projects that used NUMA without depending on a memory manager of the kind David and Arnaud suggested. Your mention of "computation-intensive tasks" caught my eye, and I think my case applies to yours, but it still depends on the data types and structures involved, so I will explain what you can do where it is possible. This can be a relatively easy task if you have no strings to manipulate, and a long, hard one if you do. The idea is simple to understand but hard to apply, as it requires many design changes. The details: you should prevent any thread from using memory allocated by a different thread. For example, when your main thread creates a TThread, the TThread object itself is allocated by the memory manager and will most likely belong to the processor the main thread runs on, so the new thread pays the price on every access to its own TThread fields. So the first step is to drop TThread and use BeginThread instead: pass it a simple pointer, and let the newly created thread allocate the memory for a record/structure that needs to be filled. The creator (the caller of BeginThread) then uses that record to pass information and data to the thread; by "pass" I mean copy/move everything into memory allocated by the thread and identified in that record. After starting, your thread allocates memory using VirtualAlloc or VirtualAllocExNuma. In my experience the OS helps greatly here and uses the memory closest to that thread, as long as you allocate huge chunks (1 MB or more). Once the memory is allocated and the data has been moved into it, start to process/compute. It doesn't need to be one chunk; you can make many allocation requests, just keep them big and issued only by the thread that will use them.
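A minimal sketch of the BeginThread plus VirtualAllocExNuma pattern described above, under stated assumptions; it is not the poster's code. It assumes a recent Delphi on Windows. `NumaHighestNode` and `AllocOnNode` are local imports of the real kernel32 exports `GetNumaHighestNodeNumber` and `VirtualAllocExNuma` under hypothetical names, and `TJob` is a hypothetical record. As simplifications, the small job record is owned by the main thread (the worker reads it once and writes the result once), the worker generates its input in place instead of receiving a copy, and the worker is not pinned to its node's processors, which a real program would also want.

```pascal
program NumaWorkerSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Local imports under hypothetical names (the kernel32 exports are real).
function NumaHighestNode(out HighestNode: Cardinal): BOOL; stdcall;
  external kernel32 name 'GetNumaHighestNodeNumber';
function AllocOnNode(Process: THandle; Address: Pointer; Size: NativeUInt;
  AllocationType, Protect, PreferredNode: DWORD): Pointer; stdcall;
  external kernel32 name 'VirtualAllocExNuma';

type
  PJob = ^TJob;
  TJob = record        // small and flat: no managed fields
    Node: DWORD;       // NUMA node this worker should allocate on
    Count: NativeInt;  // number of Doubles to process
    Sum: Double;       // result, written once by the worker
    Ok: Boolean;
  end;

function Worker(Parameter: Pointer): Integer;
var
  Job: PJob;
  Data, P: PDouble;
  I: NativeInt;
  Total: Double;
begin
  Job := PJob(Parameter);
  Job^.Ok := False;
  Total := 0;
  // The worker requests its own large block, preferring its node.
  Data := AllocOnNode(GetCurrentProcess, nil, Job^.Count * SizeOf(Double),
    MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE, Job^.Node);
  if Data = nil then
    Exit(1);
  try
    // Stand-in for "copy the input into the thread's own memory".
    P := Data;
    for I := 0 to Job^.Count - 1 do
    begin
      P^ := I;
      Inc(P);
    end;
    // Compute using only the local block and a local accumulator.
    P := Data;
    for I := 0 to Job^.Count - 1 do
    begin
      Total := Total + P^;
      Inc(P);
    end;
  finally
    VirtualFree(Data, 0, MEM_RELEASE);
  end;
  Job^.Sum := Total;
  Job^.Ok := True;
  Result := 0;
end;

var
  Highest, Node: Cardinal;
  Jobs: array of TJob;
  Handles: array of THandle;
  Id: TThreadID;
begin
  if not NumaHighestNode(Highest) then
    RaiseLastOSError;
  SetLength(Jobs, Highest + 1);
  SetLength(Handles, Highest + 1);
  for Node := 0 to Highest do
  begin
    Jobs[Node].Node := Node;
    Jobs[Node].Count := 1000000;
    Handles[Node] := BeginThread(nil, 0, Worker, @Jobs[Node], 0, Id);
    if Handles[Node] = 0 then
      RaiseLastOSError;
  end;
  for Node := 0 to Highest do
  begin
    WaitForSingleObject(Handles[Node], INFINITE);
    CloseHandle(Handles[Node]);
    if Jobs[Node].Ok then
      Writeln('node ', Node, ': sum = ', Jobs[Node].Sum:0:0)
    else
      Writeln('node ', Node, ': allocation failed');
  end;
end.
```

Passing the preferred node explicitly makes the placement request visible in the code; plain VirtualAlloc, as the post describes, leaves that choice to the OS.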
Keep in mind that you can't use managed types such as strings or TBytes, and the same goes for any class instance. Try to minimize or eliminate any dependency of the data in the newly allocated memory on anything outside it. If you need classes, create record alternatives and put them there too; you get the idea. If needed, reserve parts of the allocated memory as records for the output, so the result also lives in that memory. It worked for me: I had hundreds of millions of floats and integers to process, so my controller thread created the threads and then distributed the data across them. I had a few short strings to filter on, based on the calculations, so it wasn't a one-to-one copy but a simple conversion from a record with strings to a record without any. I used records with fixed-size short string fields (in the few cases where I needed to compare and filter, I used a simple array [0..16] of bytes in which the first byte is the length of the content). This eliminates the need to handle strings. The first thing that comes to mind is that this wastes memory; on the contrary, this design used less memory, as a string itself needs 12 or more bytes of metadata on top of its content, which occupies at least 16 bytes (per the memory manager's design).
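The fixed-size string field described above can be sketched as follows. This is an illustration under assumptions, not the poster's code: `TFixedStr`, `TFlatItem`, `ToFixed`, and `SameFixed` are hypothetical names, the content is assumed to be ASCII, and longer input is truncated to 16 characters.

```pascal
program FixedStringRecord;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  MaxFixedLen = 16;

type
  // Byte 0 holds the length, bytes 1..16 the content (as in the post).
  TFixedStr = array[0..MaxFixedLen] of Byte;

  // Hypothetical flat record: no managed fields, so it can be copied with
  // Move and stored in memory obtained from VirtualAlloc.
  TFlatItem = record
    Value: Double;
    Id: Integer;
    Name: TFixedStr;
  end;

// Convert a Delphi string to the fixed form. Assumes ASCII content; longer
// input is truncated to MaxFixedLen characters.
procedure ToFixed(const S: string; out F: TFixedStr);
var
  I, Len: Integer;
begin
  FillChar(F, SizeOf(F), 0);
  Len := Length(S);
  if Len > MaxFixedLen then
    Len := MaxFixedLen;
  F[0] := Byte(Len);
  for I := 1 to Len do
    F[I] := Byte(Ord(S[I]));
end;

// Equal when the lengths match and the content bytes match.
function SameFixed(const A, B: TFixedStr): Boolean;
begin
  Result := (A[0] = B[0]) and CompareMem(@A[1], @B[1], A[0]);
end;

var
  A, B, C: TFlatItem;
begin
  A.Value := 1.5; A.Id := 1; ToFixed('alpha', A.Name);
  B.Value := 2.5; B.Id := 2; ToFixed('alpha', B.Name);
  C.Value := 3.5; C.Id := 3; ToFixed('beta', C.Name);
  Writeln('A.Name = B.Name: ', SameFixed(A.Name, B.Name));
  Writeln('A.Name = C.Name: ', SameFixed(A.Name, C.Name));
  Writeln('SizeOf(TFlatItem) = ', SizeOf(TFlatItem));
end.
```

Because the record holds no reference-counted fields, filtering and comparing on `Name` never calls into the memory manager, which is the property the post relies on.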