Jud

Threads on dual-Xeon system

Here is the problem (a bit of a long message) - I'm writing a threaded program to run on workstations with dual Xeons.  Each of the Xeons has eight hyperthreaded cores.  

(I've read https://www.delphipraxis.net/113427-beginthreadaffinity-setthreadaffinity.html but I don't think it helps me.)

 

Dual-Xeon systems have Non-Uniform Memory Access (NUMA): each Xeon has direct access to its own memory and can also access the memory attached to the other Xeon, but it takes much longer for a thread running on one Xeon to access memory on the other.  The bottleneck in the program is accessing memory.

 

I have the program set up so I can adjust the number of threads it runs.  As I test from 1 up to 16 threads, performance improves.  Task Manager / Performance / Resource Monitor shows that all 16 threads are running on one Xeon.  But if I go to 17 threads or more, performance drops (compared to 16 threads on one Xeon).  What must be happening is that threads 17 and above are executing on the second Xeon while their memory is on the first Xeon.

 

That's the problem.

 

I've tried to set the affinity for each thread to a particular CPU and hope it uses that CPU for its memory, but setting the affinity inside a thread isn't working at all (even with only one thread, it runs on all CPUs.)  

Right now I limit each instance of the program to 16 threads and set its affinity to the first Xeon.  Then I start a second instance of the program; it checks whether the first instance is running and, if so, sets its affinity to the second Xeon.  Then I merge the results later (which isn't ideal).

 

This is working, with double the performance of running on one Xeon, but it is no different from running on two computers, each with one Xeon.

 

So is there a way to get one instance to use all 32 available CPUs with full performance?

 

(If I could assign each thread and its memory to a particular CPU, that should fix the problem, but I haven't been able to do that.)

 

 


You would need two memory managers, one per CPU socket....
Which is not possible with the Delphi RTL yet.

You may try a memory manager other than FastMM4, one that uses a per-thread arena for small blocks. There are some around for Delphi - just google for it.


Yeah, NUMA requires a completely different memory allocation strategy. When I faced this problem I concluded that there was no memory manager available that could do what I needed. So I wrote my own that is essentially a wrapper around the Windows heap manager. The trick is to have different heaps for each NUMA node. 
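A rough sketch of the "one heap per NUMA node" idea follows. It is not the poster's actual memory manager, only an illustration under stated assumptions: `NodeHeaps`, `CreateNodeHeaps`, `NodeAlloc`, `NodeFree` and `DestroyNodeHeaps` are hypothetical names, and since `HeapCreate` has no node parameter, the sketch assumes that Windows backs a heap's pages with memory near the thread that first touches them. That only helps if each heap is used exclusively by threads pinned to its node.

```delphi
program NodeHeapsSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Imported explicitly in case the RTL in use does not declare it.
function GetNumaHighestNodeNumber(out HighestNodeNumber: Cardinal): BOOL;
  stdcall; external kernel32 name 'GetNumaHighestNodeNumber';

var
  NodeHeaps: array of THandle; // hypothetical: one private heap per NUMA node

procedure CreateNodeHeaps;
var
  Highest: Cardinal;
  I: Integer;
begin
  if not GetNumaHighestNodeNumber(Highest) then
    RaiseLastOSError;
  SetLength(NodeHeaps, Integer(Highest) + 1);
  for I := 0 to High(NodeHeaps) do
  begin
    // Growable private heap with default options (serialized access).
    NodeHeaps[I] := HeapCreate(0, 0, 0);
    if NodeHeaps[I] = 0 then
      RaiseLastOSError;
  end;
end;

// Assumption: only threads pinned to Node call this with that node number,
// so the pages of that heap are first touched from that node.
function NodeAlloc(Node: Integer; Size: NativeUInt): Pointer;
begin
  Result := HeapAlloc(NodeHeaps[Node], 0, Size);
end;

procedure NodeFree(Node: Integer; P: Pointer);
begin
  HeapFree(NodeHeaps[Node], 0, P);
end;

procedure DestroyNodeHeaps;
var
  I: Integer;
begin
  for I := 0 to High(NodeHeaps) do
    HeapDestroy(NodeHeaps[I]);
  NodeHeaps := nil;
end;

var
  P: Pointer;
begin
  CreateNodeHeaps;
  P := NodeAlloc(0, 4096); // node 0 always exists
  if P <> nil then
    NodeFree(0, P);
  DestroyNodeHeaps;
end.
```

A real replacement memory manager would also have to route `GetMem`/`FreeMem` through these heaps and remember which heap each block came from; that part is deliberately left out here.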


Thank you both for the replies.  I don't have the knowledge to write a memory manager, so I'll stick with what I'm doing now.

 

Delphi is a little behind with support for 64-bit systems with large memories and/or multiple CPU chips.

 

BTW, the dual-Xeon systems do well on multi-threaded computation-intensive tasks, but if the task is memory-intensive, the system behaves more like two independent Xeons.


I didn't know about FastMM5 but it sounds good!


Well, I downloaded FastMM5, put it first in my USES, did a build, and I get the error message:

 

FastMM cannot be installed, because the default memory manager has already been used to allocate memory.  I exited Delphi, restarted it, did a build, got the same thing.

 

ADDED: Right after I wrote that I realized that it should go in USES in the DPROJ file instead of the DPR file.  That got rid of the error message.  But I did a test on a quad-core i7, so 8 threads, and it is no faster.

Edited by Jud


I have no problem assigning affinity for an entire program, but I can't get it to work on individual threads on a dual-Xeon system.  Is there a way to make that work?
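For reference, a minimal sketch of per-thread pinning with the Windows API is shown below. It assumes the machine has at most 64 logical CPUs (a single processor group) and that the process affinity mask still includes both Xeons; if the whole program has been restricted to one Xeon, a thread cannot be moved to the other. `PinCurrentThreadToNode` and `TNodeWorker` are hypothetical names used only for illustration.

```delphi
program PinThreadToNode;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes;

// Imported explicitly in case the RTL in use does not declare it.
function GetNumaNodeProcessorMask(Node: Byte; out ProcessorMask: UInt64): BOOL;
  stdcall; external kernel32 name 'GetNumaNodeProcessorMask';

// Hypothetical helper: restricts the CALLING thread to the logical CPUs of
// one NUMA node. It has to be called from inside the thread itself, because
// GetCurrentThread returns a pseudo-handle for whichever thread calls it.
function PinCurrentThreadToNode(Node: Byte): Boolean;
var
  NodeMask: UInt64;
begin
  Result := False;
  if not GetNumaNodeProcessorMask(Node, NodeMask) then
    Exit;
  if NodeMask = 0 then
    Exit; // the node does not exist or has no CPUs in this group
  // SetThreadAffinityMask returns the previous mask, or 0 on failure.
  Result := SetThreadAffinityMask(GetCurrentThread, NativeUInt(NodeMask)) <> 0;
end;

type
  TNodeWorker = class(TThread)
  private
    FNode: Byte;
  protected
    procedure Execute; override;
  public
    constructor Create(ANode: Byte);
  end;

constructor TNodeWorker.Create(ANode: Byte);
begin
  inherited Create(True); // created suspended; started explicitly below
  FNode := ANode;
end;

procedure TNodeWorker.Execute;
begin
  // Pin first, before allocating or touching any of this thread's data.
  if not PinCurrentThreadToNode(FNode) then
    Exit;
  // ... allocate and process this thread's data here ...
end;

var
  Workers: array[0..1] of TNodeWorker;
  I: Integer;
begin
  // One worker per node; on a single-node machine the second simply exits.
  for I := Low(Workers) to High(Workers) do
  begin
    Workers[I] := TNodeWorker.Create(I);
    Workers[I].Start;
  end;
  for I := Low(Workers) to High(Workers) do
  begin
    Workers[I].WaitFor;
    Workers[I].Free;
  end;
end.
```

Pinning alone does not move memory that was already allocated elsewhere, so it only pays off if each thread also allocates and first touches its own data after it has been pinned.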

51 minutes ago, Jud said:

it should go in USES in the DPROJ

I added it in my DPR right above madExcept (v5) and it worked fine.

 

Did you use the NUMA Branch?

 

 

Edited by FredS

12 minutes ago, FredS said:

I added it in my DPR right above madExcept (v5) and it worked fine.

 

Did you use the NUMA Branch?

 

 

 

I don't know how to do that - I haven't found good documentation on FastMM5.  It didn't make a difference on the memory-intensive program on a single-i7 machine.

56 minutes ago, Jud said:

 

I don't know how to do that - I haven't found good documentation on FastMM5.  It didn't make a difference on the memory-intensive program on a single-i7 machine.

  • Check out the NUMA Support Branch
  • Add to the DPR Uses:  FastMM5 in '<path>\FastMM5\FastMM5.pas',
  • Call  `FastMM_ConfigureAllArenasForNUMA` in the DPR after BEGIN
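Put together, the three steps above might look roughly like the project file below. The program name is hypothetical, `<path>` is left as given in the steps, and the call is written without arguments only because that is how it appears above; check the branch source for the exact declaration.

```delphi
program MyNumaApp; // hypothetical project name

{$APPTYPE CONSOLE}

uses
  // FastMM5 has to be the first unit in the DPR uses clause, otherwise the
  // default memory manager has already allocated memory when it installs.
  FastMM5 in '<path>\FastMM5\FastMM5.pas',
  System.SysUtils;

begin
  // Available in the NUMA support branch only, per the steps above.
  FastMM_ConfigureAllArenasForNUMA;
  // ... normal application start-up follows ...
end.
```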


You aren't going to get anywhere by just adding a new library, compiling, running, and hoping for magic. It's going to require an in-depth understanding of how to configure and use any library and how to adapt your own code.

 

 


Yes, but the only documentation I've found about it is: https://github.com/pleriche/FastMM5

 

I have no idea of how to use it or any of the details of what a memory manager does.  I think it should be invisible.

 

----------

Does anyone know if OmniThreadLibrary will help with this?

2 hours ago, Jud said:

I think it should be invisible.

That would be nice. But in my experience, developing a program to work well with NUMA requires detailed knowledge from the programmer. I personally don't think there is any way round that. 


I did two projects that utilized NUMA without depending on a memory manager like the ones David and Arnaud suggested. Your mention of "computation-intensive tasks" caught my eye, and I think your case is similar to mine, but it still depends on the data types and structures, so I will explain what you can do if that is possible.

 

This can be a relatively easy task only if you don't have strings to manipulate; if you do, it will be a hard and long task.

 

The idea is simple to understand, but hard to apply as it requires many design changes.

The details: you should prevent any thread from using memory allocated by a different thread. For example, when your main thread creates a TThread, the TThread object itself is allocated by the MM and will most likely sit in memory belonging to the processor the main thread runs on, so the new thread pays the price on every access to its own TThread fields. So the first step is to drop TThread and use BeginThread instead. Pass it a simple pointer and let the newly created thread allocate memory for a record/structure that needs to be filled. The creator (the caller of BeginThread) then uses that record to pass information and data to the thread; by "pass" I mean copy/move everything into the memory allocated by the thread and identified in that record.
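The hand-off described here could be sketched as follows. This is only an illustration, not the code from those projects: `TWorkBlock`, `THandshake`, `Worker` and `RunWorkerSum` are hypothetical names, and the two events are just one possible way to let the creator wait until the worker has allocated its block. The worker zero-fills the block before publishing it, so that every page is first touched by the worker rather than by the creator.

```delphi
program BeginThreadSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

const
  MaxValues = 1 shl 20; // 8 MB of Doubles, so the block is one large chunk

type
  // Plain record without managed types; lives in memory the worker allocates.
  PWorkBlock = ^TWorkBlock;
  TWorkBlock = record
    Count: Integer;
    Sum: Double; // the output is kept in the worker's memory as well
    Values: array[0..MaxValues - 1] of Double;
  end;

  // Small record owned by the creator; each side touches it only a few times.
  PHandshake = ^THandshake;
  THandshake = record
    Block: PWorkBlock;  // published by the worker
    Allocated: THandle; // event: the worker has allocated Block
    Filled: THandle;    // event: the creator has copied the input into Block
  end;

function Worker(Parameter: Pointer): Integer;
var
  H: PHandshake;
  B: PWorkBlock;
  I: Integer;
begin
  Result := 0;
  H := Parameter;
  B := VirtualAlloc(nil, SizeOf(TWorkBlock), MEM_RESERVE or MEM_COMMIT,
    PAGE_READWRITE);
  if B <> nil then
    FillChar(B^, SizeOf(TWorkBlock), 0); // the worker touches every page first
  H^.Block := B;
  SetEvent(H^.Allocated);
  if B = nil then
    Exit;
  WaitForSingleObject(H^.Filled, INFINITE);
  for I := 0 to B^.Count - 1 do
    B^.Sum := B^.Sum + B^.Values[I];
end;

// Count must not exceed MaxValues.
function RunWorkerSum(Count: Integer): Double;
var
  H: THandshake;
  ThreadHandle: THandle;
  ThreadId: TThreadID;
  I: Integer;
begin
  Result := 0;
  H.Block := nil;
  H.Allocated := CreateEvent(nil, True, False, nil);
  H.Filled := CreateEvent(nil, True, False, nil);
  ThreadHandle := BeginThread(nil, 0, @Worker, @H, 0, ThreadId);
  if ThreadHandle <> 0 then
  begin
    WaitForSingleObject(H.Allocated, INFINITE);
    if H.Block <> nil then
    begin
      // "Pass" the data by copying it into the worker's own memory.
      H.Block^.Count := Count;
      for I := 0 to Count - 1 do
        H.Block^.Values[I] := 1.0;
      SetEvent(H.Filled);
    end;
    WaitForSingleObject(ThreadHandle, INFINITE);
    if H.Block <> nil then
    begin
      Result := H.Block^.Sum;
      VirtualFree(H.Block, 0, MEM_RELEASE);
    end;
    CloseHandle(ThreadHandle);
  end;
  CloseHandle(H.Allocated);
  CloseHandle(H.Filled);
end;

begin
  Writeln(RunWorkerSum(1000):0:1);
end.
```

In a real program the worker would also pin itself to a CPU or node before allocating; that step is omitted here to keep the hand-off itself visible.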

 

After starting, your thread allocates memory using VirtualAlloc or VirtualAllocExNuma ... In my experience the OS helps greatly here and uses the memory closest to that thread, as long as you allocate huge chunks (1 MB or more). Once the memory is allocated and the data has been moved into it, start to process/compute. It doesn't need to be one chunk; you can make many allocation requests, just keep them big and requested only by the same thread.
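As an illustration of the explicit variant, the sketch below asks Windows for a large block on the NUMA node the calling thread is currently running on. `AllocOnCurrentNode` is a hypothetical helper; the three kernel32 functions are imported explicitly in case the RTL in use does not declare them. The sketch assumes a single processor group and that the thread has already been pinned, since otherwise it may migrate right after the node is queried.

```delphi
program NumaAllocSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

function GetCurrentProcessorNumber: DWORD;
  stdcall; external kernel32 name 'GetCurrentProcessorNumber';

function GetNumaProcessorNode(Processor: Byte; out NodeNumber: Byte): BOOL;
  stdcall; external kernel32 name 'GetNumaProcessorNode';

function VirtualAllocExNuma(hProcess: THandle; lpAddress: Pointer;
  dwSize: NativeUInt; flAllocationType, flProtect, nndPreferred: DWORD): Pointer;
  stdcall; external kernel32 name 'VirtualAllocExNuma';

// Hypothetical helper: reserves and commits Size bytes with the node of the
// calling thread's current CPU as the preferred node. Returns nil on failure.
function AllocOnCurrentNode(Size: NativeUInt): Pointer;
var
  Node: Byte;
begin
  Result := nil;
  if not GetNumaProcessorNode(Byte(GetCurrentProcessorNumber), Node) then
    Exit;
  if Node = $FF then
    Exit; // the API reports $FF when the processor is unknown
  Result := VirtualAllocExNuma(GetCurrentProcess, nil, Size,
    MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE, Node);
end;

const
  ChunkSize = 4 * 1024 * 1024; // a large chunk, as the post recommends

var
  P: Pointer;
begin
  P := AllocOnCurrentNode(ChunkSize);
  if P <> nil then
  begin
    FillChar(P^, ChunkSize, 0); // touch the pages from the owning thread
    // ... copy the input in and process it here ...
    VirtualFree(P, 0, MEM_RELEASE);
  end;
end.
```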

 

Keep in mind you can't use any managed types like strings or TBytes..., and this goes for any class instance too. Try to minimize or eliminate any dependency of the data in the newly allocated memory on anything outside it. If you need classes, create record alternatives and put them there as well; you get the idea.

 

If needed, reserve parts of that allocated memory as records for the output, so the result also lives in that allocated memory.

 

It worked for me, as I had hundreds of millions of floats and integers to process. My controller thread created the threads, then distributed the data across them. I had a few short strings to filter on based on the calculations, so it wasn't just a one-to-one copy but a simple conversion from a record with strings to a record without any. I used records with fixed short string fields (in the few cases where I needed to compare and filter, I used a simple array [0..16] of Byte where the first byte is the length of the content). This eliminates the need to handle strings. The first thing that comes to mind is that this wastes memory; on the contrary, this design used less memory, as a string itself needs +12 bytes of metadata plus the length of the data, which is 16 at least (per the MM design).
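A small sketch of such a string-free record is given below. `TFixedStr16`, `TPlainItem`, `SetFixed` and `SameFixed` are hypothetical names, and the other fields are placeholders. The unused bytes are zeroed so that two values can be compared with a single `CompareMem` call.

```delphi
program FixedStringSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Byte 0 holds the length, bytes 1..16 hold the content.
  TFixedStr16 = array[0..16] of Byte;

  // Record without managed types, safe to place in raw allocated memory.
  TPlainItem = record
    Value: Double;
    Id: Integer;
    Tag: TFixedStr16;
  end;

// Copies at most 16 characters of S into Dest and zero-pads the rest.
procedure SetFixed(var Dest: TFixedStr16; const S: AnsiString);
var
  Len: Integer;
begin
  Len := Length(S);
  if Len > 16 then
    Len := 16;
  FillChar(Dest, SizeOf(Dest), 0);
  Dest[0] := Len;
  if Len > 0 then
    Move(S[1], Dest[1], Len);
end;

// Valid as a whole-array compare because SetFixed zero-pads unused bytes.
function SameFixed(const A, B: TFixedStr16): Boolean;
begin
  Result := CompareMem(@A, @B, SizeOf(TFixedStr16));
end;

var
  X, Y: TPlainItem;
begin
  SetFixed(X.Tag, 'alpha');
  SetFixed(Y.Tag, 'alpha');
  if SameFixed(X.Tag, Y.Tag) then
    Writeln('tags match')
  else
    Writeln('tags differ');
end.
```

Delphi's built-in `string[16]` short string type has the same layout (a length byte followed by 16 characters) and could be used instead of the hand-rolled array.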

 

 
