Jud

Parallel for and CPU with Performance and Efficient cores


Recent Intel processors (12th and 13th generations) have Performance cores and Efficient cores. How does Delphi (11.3 in particular) handle this with the parallel for? Does it assign tasks only to the Performance cores, or to all cores?


You can try to call this:

function uCoreId: uint64; register;
asm
  //from 10 gen it reads the IA32_TSC_AUX MSR which should theoretically
  //indicate the CORE (THREAD in the case of HyperThread processors) in which rdpid runs
  rdpid RAX;
end;

This function returns the ID of the core (meaning the CORE THREAD) in which rdpid runs. It works from Intel 10th generation CPUs on.

 

The first Core Thread is numbered 0 (zero).

 

You will see that the Efficient cores will sometimes be used too.

This is because the Thread Director allocates threads based on various factors. The distribution is not predictable.

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.
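As a minimal sketch of that affinity-mask approach (Win64; the `P_CORE_MASK` value below is an assumption that the P-cores and their hyper-threads occupy logical processors 0..15 — the actual layout is CPU-specific and should be verified first, e.g. via GetLogicalProcessorInformationEx):

```delphi
uses
  Winapi.Windows;

// Restrict the whole process to the assumed Performance cores.
// WARNING: the mask value is a placeholder; which logical processors
// are P-cores differs per CPU model and must be checked beforehand.
procedure RestrictToPerformanceCores;
const
  P_CORE_MASK = $FFFF; // logical processors 0..15 (assumed 8 P-cores x 2 threads)
begin
  if not SetProcessAffinityMask(GetCurrentProcess, P_CORE_MASK) then
    RaiseLastOSError;
end;
```

Note that this affects every thread of the process, including those created later by the parallel-for thread pool.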

 

Bye

 

P.S.: this is for Win64 programs.

Edited by DelphiUdIT

20 hours ago, dwrbudr said:

Isn't that an OS job to do?

I don't know how it is handled, which is why I was asking.

5 hours ago, DelphiUdIT said:

You can try to call this:


function uCoreId: uint64; register;
asm
  //from 10 gen it reads the IA32_TSC_AUX MSR which should theoretically
  //indicate the CORE (THREAD in the case of HyperThread processors) in which rdpid runs
  rdpid RAX;
end;

This function returns the ID of the core (meaning the CORE THREAD) in which rdpid runs. It works from Intel 10th generation CPUs on.

 

The first Core Thread is numbered 0 (zero).

 

You will see that the Efficient cores will sometimes be used too.

This is because the Thread Director allocates threads based on various factors. The distribution is not predictable.

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.

 

P.S.: this is for WIN64 program.

 

That is what I'm using. I've done some testing with the parallel for, for different numbers of threads, say 1..16, 1..20, 1..100, 1..500, etc. All I measured was the time to complete the run, with each task being the same size. Running 20 threads on a CPU with 8 Performance cores and 4 Efficient cores, it seems to be assigning the tasks across all cores. When the number of tasks gets into the hundreds, it still assigns tasks to the Efficient cores, but the Performance cores are given new tasks as they finish while the Efficient cores are still running. So when the number of tasks is in the hundreds (or more), it seems to naturally balance the load among the cores.
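A minimal sketch of that kind of measurement (assuming System.Threading's TParallel.For; the loop body is a placeholder same-sized dummy workload, not the real task):

```delphi
uses
  System.SysUtils, System.Diagnostics, System.Threading;

// Time how long it takes to complete TaskCount equal-sized tasks.
procedure TimeParallelRun(TaskCount: Integer);
var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  TParallel.&For(1, TaskCount,
    procedure(I: Integer)
    var
      K: Integer;
      X: Double;
    begin
      X := 0;
      for K := 1 to 10000000 do  // same-sized dummy workload per task
        X := X + Sqrt(K);
    end);
  SW.Stop;
  Writeln(Format('%d tasks: %d ms', [TaskCount, SW.ElapsedMilliseconds]));
end;
```

Comparing the elapsed time for, say, 16 vs. 20 vs. 100 tasks is what reveals whether the E-cores contribute.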

 

10 hours ago, DelphiUdIT said:

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.

There are usually some tasks that do not need to run on the performance cores, so setting the affinity mask for the whole program may not be the best strategy, even though it's the easiest way. But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

 

Of course, this is currently the only way to do that for threads generated by parallel for.

Edited by dummzeuch


I have never used the affinity mask, because my experience is that the Thread Director does a good job. But as said, if you want maximum performance, one should use it.

 

There are other ways too to get maximum power from the system (e.g. if one has an additional graphics card): using OpenCL.

 

One can start from here: https://github.com/LUXOPHIA/OpenCL

 

Bye

 

P.S.: I never used that repository; normally I use OpenCL through other advanced libraries (machine vision).


" How does Delphi (11.3 in particular) handle this with the parallel for? " Hopefully not at all. It is not its job.

 

The CPU cores are a shared resource; other applications and services use them too. So it is up to the operating system to somehow measure the load of all cores over some time, and make scheduling decisions based on that. The only factors (aside from the affinity mask) that influence the decisions of the scheduler are the priority of the process and its threads.

As for affinity masks, these would probably also not be the right tool as soon as CPUs with 32 or 64 cores (or more) become more widespread in the near future, due to the existence of processor groups:

https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups

How do these groups handle different kinds of cores?
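For reference, the group layout can at least be queried. A sketch using the Win7+ processor-group APIs (on older Delphi versions the two kernel32 functions may need to be declared manually):

```delphi
uses
  System.SysUtils, Winapi.Windows;

// Print how many processor groups exist and how many logical
// processors each group contains. Affinity masks are per-group
// (GROUP_AFFINITY), so a single 64-bit process mask cannot span groups.
procedure DumpProcessorGroups;
var
  G: Word;
begin
  for G := 0 to GetActiveProcessorGroupCount - 1 do
    Writeln(Format('Group %d: %d logical processors',
      [G, GetActiveProcessorCount(G)]));
end;
```

Whether and how Windows distributes P- and E-cores across such groups is exactly the open question.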

 

In my opinion, trying to measure CPU load from inside an application, and trying to second-guess the scheduler is not the right approach.

Edited by thatlr

52 minutes ago, thatlr said:

........

As for affinity masks, these would probably also not be the right tool as soon as CPUs with 32 or 64 cores (or more) become more widespread in the near future, due to the existence of processor groups:

https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups

How do these groups handle different kinds of cores?

In my opinion, trying to measure CPU load from inside an application, and trying to second-guess the scheduler is not the right approach.

..........

Using "priorities" to manage threads and their (hypothetical) distribution between cores is not simple and as mentioned it is not the right tool.

For now, multiple processors, processor groups and NUMA nodes are outside Delphi's handling, or better, Delphi is not optimized for them.

Memory and all shared and non-shared resources are handled "differently" on these systems.

 

Those who have programmed on multiprocessor systems (such as Xeon platforms) have experienced this.

 

Bye

On 6/24/2023 at 9:01 AM, dummzeuch said:

But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

Seems unlikely that MS would break its system because some people write crap programs.

4 hours ago, David Heffernan said:

Seems unlikely that MS would break its system because some people write crap programs.

They will just introduce a new API and deprecate the old one. Happened before, will happen again.

Edited by dummzeuch

1 hour ago, dummzeuch said:

They will just introduce a new API and deprecate the old one. Happened before, will happen again.

They tend not to break things. Also, ignoring affinities would utterly break a lot of software, and break some software that sets affinities correctly. So, no, this isn't a risk.

Edited by David Heffernan

On 6/24/2023 at 4:01 AM, dummzeuch said:

There are usually some tasks that do not need to run on the performance cores, so setting the affinity mask for the whole program may not be the best strategy, even though it's the easiest way. But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

 

Of course this is currently the only way to do that for threads generated using parallel for.

 

Well, I need all of the power I have available.

 

BUT - I realized a fallacy in my thinking and analysis. I assumed that if I ran 16 threads with the parallel for loop, they would all be on Performance cores, and that if I ran 20 threads, the extra 4 would go to the Efficient cores. But after more experimentation, parallel for seems to put the threads on any core. I tried timing how long each thread took to run, and with, say, 20 threads there wasn't that much difference. And running 20 threads with parallel for (with 8 P-cores and 4 E-cores), the performance was 9-10% better than running 16 threads. So it is using the E-cores too. And running 100 or so threads pretty much evens out the difference between the E- and P-cores (because the ones on the P-cores finish sooner and get assigned another thread).
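One way to observe that directly is to count which logical processor each iteration runs on. A sketch reusing the rdpid-based `uCoreId` function posted earlier in the thread (Win64, Intel 10th gen or newer assumed; the iteration body only records the core ID):

```delphi
uses
  System.SysUtils, System.SyncObjs, System.Threading;

var
  CoreHits: array[0..63] of Integer; // hit count per logical processor

procedure CountCoreUsage;
var
  I: Integer;
begin
  TParallel.&For(1, 100,
    procedure(N: Integer)
    begin
      // uCoreId is the rdpid-based function from earlier in this thread
      TInterlocked.Increment(CoreHits[uCoreId and 63]);
    end);
  for I := 0 to 63 do
    if CoreHits[I] > 0 then
      Writeln(Format('logical processor %d: %d iterations', [I, CoreHits[I]]));
end;
```

Note that an iteration can migrate between cores mid-run, so this only samples where each iteration happened to be when rdpid executed.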


Sub-scheduling by the application (even just determining how many threads to run to reach 100% CPU load) would have to deal with all the topics mentioned here https://learn.microsoft.com/en-us/windows/win32/procthread/scheduling as well as Hyper-Threading, Turbo Boost and Core Parking (https://learn.microsoft.com/en-us/windows-server/administration/performance-tuning/hardware/power/power-performance-tuning, "Why is Windows using only even-numbered processors?"). And now also cores with different performance characteristics, which is a new topic for x86 and therefore for Windows.

 

All in all, that is why I wrote that no application-level "parallel for" and no naively implemented thread pool should try to deal with all this. Other programs can change their CPU demand at any time, meaning the overall situation is highly dynamic. Taking all these factors into account is very hard, and should be left to the operating system.

