Jud

Parallel for and CPU with Performance and Efficient cores


Recent Intel processors (12th and 13th generations) have Performance cores and Efficient cores. How does Delphi (11.3 in particular) handle this with the parallel for? Does it assign tasks only to the Performance cores, or to all cores?


You can try to call this:

function uCoreId: uint64; register;
asm
  //from 10 gen it reads the IA32_TSC_AUX MSR which should theoretically
  //indicate the CORE (THREAD in the case of HyperThread processors) in which rdpid runs
  rdpid RAX;
end;

This function returns the ID of the core (meaning the CORE THREAD) in which rdpid runs. It works from Intel 10th generation CPUs on.

 

The first Core Thread is numbered 0 (zero).

 

You will see that the Efficient cores will sometimes be used too.

This is because the Thread Director allocates threads based on various factors. The distribution is not predictable.

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.
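As a minimal sketch of that affinity-mask approach (Win64; the `P_CORE_MASK` value below is an assumption that the P-cores and their hyper-threads occupy logical processors 0..15 — the actual layout is CPU-specific and should be verified first, e.g. via GetLogicalProcessorInformationEx):

```delphi
uses
  Winapi.Windows;

// Restrict the whole process to the assumed Performance cores.
// WARNING: the mask value is a placeholder; which logical processors
// are P-cores differs per CPU model and must be checked beforehand.
procedure RestrictToPerformanceCores;
const
  P_CORE_MASK = $FFFF; // logical processors 0..15 (assumed 8 P-cores x 2 threads)
begin
  if not SetProcessAffinityMask(GetCurrentProcess, P_CORE_MASK) then
    RaiseLastOSError;
end;
```

Note that this affects every thread of the process, including those created later by the parallel-for thread pool.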

 

Bye

 

P.S.: this is for Win64 programs.

Edited by DelphiUdIT

20 hours ago, dwrbudr said:

Isn't that an OS job to do?

I don't know how it is handled, which is why I was asking.

5 hours ago, DelphiUdIT said:

You can try to call this:


function uCoreId: uint64; register;
asm
  //from 10 gen it reads the IA32_TSC_AUX MSR which should theoretically
  //indicate the CORE (THREAD in the case of HyperThread processors) in which rdpid runs
  rdpid RAX;
end;

This function returns the ID of the core (meaning the CORE THREAD) in which rdpid runs. It works from Intel 10th generation CPUs on.

 

The first Core Thread is numbered 0 (zero).

 

You will see that the Efficient cores will sometimes be used too.

This is because the Thread Director allocates threads based on various factors. The distribution is not predictable.

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.

 

P.S.: this is for WIN64 program.

 

That is what I'm using. I've done some testing with the parallel for, for different numbers of threads, say 1..16, 1..20, 1..100, 1..500, etc. All I measured was the time to complete the run, with each task being the same size. Running 20 threads on a CPU with 8 Performance cores and 4 Efficient cores, it seems to be assigning the tasks across all cores. When the number of tasks gets into the hundreds, it still assigns tasks to the Efficient cores, but the Performance cores are given new tasks as they finish while the Efficient cores are still running. So when the number of tasks is in the hundreds (or more), it seems to naturally balance the load among the cores.
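A minimal sketch of that kind of measurement (assuming System.Threading's TParallel.For; the loop body is a placeholder same-sized dummy workload, not the real task):

```delphi
uses
  System.SysUtils, System.Diagnostics, System.Threading;

// Time how long it takes to complete TaskCount equal-sized tasks.
procedure TimeParallelRun(TaskCount: Integer);
var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  TParallel.&For(1, TaskCount,
    procedure(I: Integer)
    var
      K: Integer;
      X: Double;
    begin
      X := 0;
      for K := 1 to 10000000 do  // same-sized dummy workload per task
        X := X + Sqrt(K);
    end);
  SW.Stop;
  Writeln(Format('%d tasks: %d ms', [TaskCount, SW.ElapsedMilliseconds]));
end;
```

Comparing the elapsed time for, say, 16 vs. 20 vs. 100 tasks is what reveals whether the E-cores contribute.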

 

10 hours ago, DelphiUdIT said:

If you want to avoid using the efficient cores you have to use the affinity mask (for the whole program) and select only the performance cores.

There are usually some tasks that do not need to run on the performance cores, so setting the affinity mask for the whole program may not be the best strategy, even though it's the easiest way. But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

 

Of course, this is currently the only way to do that for threads generated by parallel for.

Edited by dummzeuch


I have never used the affinity mask, because my experience is that the Thread Director does a good job. But as said, if you want maximum performance, one should use it.

 

There are other ways too to get maximum power from the system (e.g. if one has an additional graphics card): using OpenCL.

 

One can start from here: https://github.com/LUXOPHIA/OpenCL

 

Bye

 

P.S.: I never used that repository; normally I use OpenCL through other advanced libraries (machine vision).


" How does Delphi (11.3 in particular) handle this with the parallel for? " Hopefully not at all. It is not its job.

 

The CPU cores are a shared resource; other applications and services use them too. So it is up to the operating system to somehow measure the load of all cores over some time, and make scheduling decisions based on that. The only factors (aside from the affinity mask) that influence the decisions of the scheduler are the priority of the process and its threads.

As for affinity masks, these would probably also not be the right tool as soon as CPUs with 32 or 64 cores (or more) become more widespread in the near future, due to the existence of processor groups:

https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups

How do these groups handle different kinds of cores?
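For reference, the group layout can at least be queried. A sketch using the Win7+ processor-group APIs (on older Delphi versions the two kernel32 functions may need to be declared manually):

```delphi
uses
  System.SysUtils, Winapi.Windows;

// Print how many processor groups exist and how many logical
// processors each group contains. Affinity masks are per-group
// (GROUP_AFFINITY), so a single 64-bit process mask cannot span groups.
procedure DumpProcessorGroups;
var
  G: Word;
begin
  for G := 0 to GetActiveProcessorGroupCount - 1 do
    Writeln(Format('Group %d: %d logical processors',
      [G, GetActiveProcessorCount(G)]));
end;
```

Whether and how Windows distributes P- and E-cores across such groups is exactly the open question.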

 

In my opinion, trying to measure CPU load from inside an application, and trying to second-guess the scheduler is not the right approach.

Edited by thatlr

52 minutes ago, thatlr said:

........

As for affinity masks, these would probably also not be the right tool as soon as CPUs with 32 or 64 cores (or more) become more widespread in the near future, due to the existence of processor groups:

https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups

How do these groups handle different kinds of cores?

In my opinion, trying to measure CPU load from inside an application, and trying to second-guess the scheduler is not the right approach.

..........

Using "priorities" to manage threads and their (hypothetical) distribution between cores is not simple and as mentioned it is not the right tool.

For now, multiple processors, processor groups and NUMA nodes are outside Delphi's handling, or better, Delphi is not optimized for them.

Memory and all shared and non-shared resources are handled "differently" on these systems.

 

Those who have programmed on multiprocessor systems (such as Xeon platforms) have experienced this.

 

Bye

On 6/24/2023 at 9:01 AM, dummzeuch said:

But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

Seems unlikely that MS would break its system because some people write crap programs.

4 hours ago, David Heffernan said:

Seems unlikely that MS would break its system because some people write crap programs.

They will just introduce a new API and deprecate the old one. Happened before, will happen again.

Edited by dummzeuch

1 hour ago, dummzeuch said:

They will just introduce a new API and deprecate the old one. Happened before, will happen again.

They tend not to break things. Also, ignoring affinities would utterly break a lot of software, and break some software that sets affinities correctly. So, no, this isn't a risk.

Edited by David Heffernan

On 6/24/2023 at 4:01 AM, dummzeuch said:

There are usually some tasks that do not need to run on the performance cores, so setting the affinity mask for the whole program may not be the best strategy, even though it's the easiest way. But I'm sure that sooner or later Windows will start ignoring those masks because everybody sets them.

 

Of course this is currently the only way to do that for threads generated using parallel for.

 

Well, I need all of the power I have available.

 

BUT - I realized a fallacy in my thinking and analysis. I assumed that if I ran 16 threads with the parallel for loop, they would all be on Performance cores, and that if I ran 20 threads, the extra 4 would go to the Efficient cores. But after more experimentation, parallel for seems to put the threads on any core. I tried timing how long each thread took to run, and with, say, 20 threads there wasn't that much difference. And running 20 threads with parallel for (with 8 P-cores and 4 E-cores), the performance was 9-10% better than running 16 threads. So it is using the E-cores too. And running 100 or so threads pretty much evens out the difference between the E- and P-cores (because the ones on the P-cores finish sooner and get assigned another thread).
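One way to observe that directly is to count which logical processor each iteration runs on. A sketch reusing the rdpid-based `uCoreId` function posted earlier in the thread (Win64, Intel 10th gen or newer assumed; the iteration body only records the core ID):

```delphi
uses
  System.SysUtils, System.SyncObjs, System.Threading;

var
  CoreHits: array[0..63] of Integer; // hit count per logical processor

procedure CountCoreUsage;
var
  I: Integer;
begin
  TParallel.&For(1, 100,
    procedure(N: Integer)
    begin
      // uCoreId is the rdpid-based function from earlier in this thread
      TInterlocked.Increment(CoreHits[uCoreId and 63]);
    end);
  for I := 0 to 63 do
    if CoreHits[I] > 0 then
      Writeln(Format('logical processor %d: %d iterations', [I, CoreHits[I]]));
end;
```

Note that an iteration can migrate between cores mid-run, so this only samples where each iteration happened to be when rdpid executed.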


Sub-scheduling by the application (even just determining how many threads to run to reach 100% CPU load) would have to deal with all the topics mentioned here https://learn.microsoft.com/en-us/windows/win32/procthread/scheduling as well as Hyper-Threading, Turbo Boost and Core Parking (https://learn.microsoft.com/en-us/windows-server/administration/performance-tuning/hardware/power/power-performance-tuning, "Why is Windows using only even-numbered processors?"). And now also cores with different performance characteristics, which is a new topic for x86 and therefore for Windows.

 

All in all, that is why I wrote that no application-level "parallel for" and no naively implemented thread pool should try to deal with all this. Other programs can change their CPU demand at any time, meaning the overall situation is highly dynamic. Taking all these factors into account is very hard, and should be left to the operating system.

