stephane

Parallel.ForEach is really slow

Recommended Posts

Hello,

 

I am using Parallel.ForEach in my project, but it doesn't speed up the processing compared to the single-threaded approach.

 

So I ran the "58_ForVsForEach" demo with CLoopCount set to 2 billion: "Parallel.ForEach" is more than 10 times slower than the plain "for" loop, while "Parallel.For" is the fastest approach:

[screenshot: 58_ForVsForEach benchmark timings]

 

I would have expected "Parallel.ForEach" to be comparable to "Parallel.For" in terms of speed. Am I missing something obvious? 

 

If this is of any help, I am using Delphi 12.1 on Windows 10 with a 4-core/8-thread processor. I also tried on another computer and got the same kind of results.

 

Thanks in advance for your help.

56 minutes ago, stephane said:

I am using Parallel.ForEach in my project and it didn't speed up the process compared to the monothread approach.

It is hard to diagnose this issue without knowing anything about your test setup, how many threads you are running, what the threads are doing, etc. But in general, simply throwing threads at a problem doesn't guarantee a speedup; if anything, having too many threads can actually slow it down. So you have to make sure your thread usage is optimal for the task, has a good balance between work time and context switches, etc.

Posted (edited)

I can absolutely repro - all my 20 logical cores (i5-13600k) go to 100% for 10 seconds.

Running it through SamplingProfiler right now to check.

 

Edit: Okay, either this has regressed at some point after the demo was originally built, or it was overlooked that there is no real workload inside the delegate, so the test just measures the huge overhead from the RTL and interlocked operations.

It's just spending a huge amount of time wrapping and unwrapping the integer in TOmniValue, and sharing the counter across all threads causes a complete bus lock on every iteration due to the use of DSiInterlockedExchangeAdd64 (*).

 

(*) I wrote "bus lock", which is not entirely true, so before anyone chimes in quoting the Intel manual about the lock prefix not automatically causing a bus lock: you are correct. What we have here is one variable shared across all the cores, so its cache line has to travel back and forth between the CPU caches and RAM.
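The usual fix for this pattern (a minimal sketch, not the demo's actual code; it uses the RTL's System.Threading rather than OTL, and the names are mine) is to give each worker a thread-private accumulator and pay the interlocked cost only once per worker instead of once per iteration:

```pascal
uses
  System.SysUtils, System.Threading, System.SyncObjs;

function CountInParallel(const loopCount: Int64; numWorkers: Integer): Int64;
var
  total: Int64;
  tasks: TArray<ITask>;
  w: Integer;
begin
  total := 0;
  SetLength(tasks, numWorkers);
  for w := 0 to numWorkers - 1 do
    tasks[w] := TTask.Run(
      procedure
      var
        i, localSum: Int64;
      begin
        localSum := 0;
        for i := 1 to loopCount div numWorkers do
          Inc(localSum);                    // thread-private: no shared cache line
        TInterlocked.Add(total, localSum);  // one contended add per worker
      end);
  TTask.WaitForAll(tasks);
  Result := total;
end;
```

With this shape the shared variable is touched numWorkers times instead of loopCount times, so the cache-line ping-pong described above disappears from the hot loop.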

 

This code as is would be a nice worst-case example for Primoz' book about what can potentially go wrong when doing parallel programming.

However: keep in mind that we don't have any real workload here, which would most likely change the picture, as the workload usually takes the majority of the processing time, not the parallel foreach plumbing.

 

P.S. Among the aforementioned things, it looks like the OTL code (particularly TOmniValue) suffers from some overhead caused by inlining. For example: because TOmniValue.AsInt64 as well as TOmniValue.TryCastToInt64 are inlined, the temporary Variant variable needed for the otvVariant case is spilled into the location where the inlining happens. But in our case we never have to deal with a Variant at all, so all the extra code the compiler produces is pure overhead. And because the getter for AsInt64 is used twice, the compiler repeats the inlined code twice and produces two Variant variables, which it finalizes using System.FinalizeArray. A lot of record initialization and finalization is also happening, which I assume (did not look closer) is caused by TOmniValue - potentially also because of some temporary compiler-generated variables.

 

Here is the drilldown of the interesting pieces when running in SamplingProfiler:

 

[screenshot: SamplingProfiler drilldown]

Edited by Stefan Glienke


Short answer: Use Parallel.For.

 

Long answer: Parallel.ForEach was written not for parallel processing of integers but for parallel processing of any weird kind of data. Because of that it is overly complex and not a good fit for simple (for i in range) cases. It is usually about 10x slower than the PPL's TParallel.For. That is why I later wrote Parallel.For, which works only on integer ranges and is on par with the PPL implementation.
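For reference, the two OTL call shapes look roughly like this (a sketch from memory; the exact Execute signatures may vary between OTL versions, and DoWork is a placeholder for the real per-item work):

```pascal
uses
  OtlParallel, OtlCommon;

// ForEach: each value arrives wrapped in a TOmniValue (or via the generic
// ForEach<Integer>), which adds wrapping/unwrapping overhead per item
Parallel.ForEach(1, CLoopCount).Execute(
  procedure(const value: TOmniValue)
  begin
    DoWork(value.AsInt64);
  end);

// For: integer ranges only, no TOmniValue, on par with PPL's TParallel.For
Parallel.For(1, CLoopCount).Execute(
  procedure(value: integer)
  begin
    DoWork(value);
  end);
```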



Thank you guys for your answers. I will try to rewrite my code to use Parallel.For instead of Parallel.ForEach and hopefully I will get much better performance.

12 minutes ago, stephane said:

Thank you guys for your answers. I will try to rewrite my code to use Parallel.For instead of Parallel.ForEach and hopefully I will get much better performance.

Please let us know how it goes. 

For me it was just a two-line change... removing two pieces of code, "const" and "<Integer>".


It did not make things faster, but I'll split the workload into chunks before processing them in parallel.
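That chunking idea can be sketched like this (a hypothetical example, assuming OTL's Parallel.For; ProcessItem is a placeholder for the per-item work and ChunkSize needs tuning to the actual workload):

```pascal
uses
  OtlParallel;

const
  ChunkSize = 10000;

// parallelize over chunks instead of single items, so each task body
// carries enough work to amortize the per-iteration scheduling overhead
procedure ProcessInChunks(totalItems: Integer);
var
  numChunks: Integer;
begin
  numChunks := (totalItems + ChunkSize - 1) div ChunkSize;
  Parallel.For(0, numChunks - 1).Execute(
    procedure(chunk: integer)
    var
      i, first, last: Integer;
    begin
      first := chunk * ChunkSize;
      last := first + ChunkSize - 1;
      if last >= totalItems then
        last := totalItems - 1;   // clamp the final, possibly short, chunk
      for i := first to last do
        ProcessItem(i);
    end);
end;
```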

 

-Tee-

