Parallel.For optimization

stephane · July 2, 2024

Hello,

In a VCL application I am trying to optimize a single-threaded task that performs many complex geometric calculations and takes around 2 minutes and 20 seconds to execute. It seems like a good candidate for a multithreaded strategy. My computer has 8 cores and 16 threads, but I am using only 8 threads for now.

Here is the code implementing the Parallel.For loop:

  var lNumTasks := 8;
  SetLength(lVCalculBuckets, lNumTasks);
  
  Parallel.For<TObject>(lShadingStepListAsObjects.ToArray)
    .NoWait
    .NumTasks(lNumTasks)
    .OnStop(Parallel.CompleteQueue(lResults))
    .Initialize(procInitMultiThread)
    .Finalize(procFinalizeMultiThread)
    .Execute(procExecuteMultiThread);

procInitMultiThread and procFinalizeMultiThread create and free the entries of lVCalculBuckets, which holds one copy of our working object per thread:

    procedure TMyClass.procInitMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
    var lVCalcul : TVCalcul;
    begin
      // Copy data
      lVCalcul := TVCalcul.Create(nil);
      lVCalcul.CopyLight(Self.VCalcul);
      lVCalculBuckets[aTaskIndex] := lVCalcul;
    end;

    procedure TMyClass.procFinalizeMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
    var lVCalcul : TVCalcul;
    begin
      // Delete copied data and clear the slot so that no dangling reference remains
      lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
      lVCalculBuckets[aTaskIndex] := nil;
      lVCalcul.Free;
    end;

procExecuteMultiThread just performs the calculations and posts the results back to the calling thread so that they can be displayed in the VCL interface:

    procedure TMyClass.procExecuteMultiThread(aTaskIndex: Integer; var aValue: TObject);
    var lVCalcul : TVCalcul;
        lRes: TStepRes;
    begin
      // Retrieve data
      lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
      if Assigned(lVCalcul) then
      begin
        // Calculate factors
        lRes := TShadingStepRes(aValue);
        lVCalcul.CalculateFactors(lRes.Height, lRes.Width);

        // Post results
        lRes.FillResFromVCalcul(lVCalcul);
        lResults.Add(TOmniValue.CastFrom<TStepRes>(lRes));
      end;
    end;

This implementation now runs in about 1 minute 50 seconds, which is faster than the single-threaded version but far from the gains I expected. I tried simplifying the code by removing the "Post results" part, thinking that it was causing synchronization delays, but that had no effect.

Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution.

Yet I have no idea why, because the calculation part itself contains no synchronization code that I am aware of.

If any of you could point me in the right direction to debug this further, it would be much appreciated.

Anders Melander · July 2, 2024

3 hours ago, stephane said:

Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution.

Yet I have no idea why because in the calculation part itself there isn't any synchronization code that I am aware of.

I don't know how SamplingProfiler works but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.

stephane · July 3, 2024

Thanks a lot for the hint. I found a way to display the callers, and it seems that many of the calls come from the memory manager.

I am not sure how to take it from here, though.

Der schöne Günther · July 3, 2024

It seems like you are constantly resizing arrays (or creating new ones).

I am not sure how much you can tweak/optimize the memory manager that ships with Delphi, but you might want to investigate other memory managers, like this fork of FastMM4:

maximmasiutin/FastMM4-AVX: FastMM4 memory manager for Delphi and FreePascal (free pascal/Lazarus). A fork with improved synchronization between the threads that gives performance benefits on thread-heavy applications. Proper synchronization techniques are used depending on context and availability, i.e. umonitor/umwait, spin-wait loops, SwitchToThread, critical sections, etc (github.com)

Dave Novo · July 3, 2024

There is also FastMM5. However, I would suggest looking at your code and figuring out a way to allocate required memory (even if you have to overallocate) and minimize/eliminate heap memory allocations during the threading code.
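A minimal sketch of that advice, assuming the hot path needs a temporary array whose size has a known upper bound: each worker gets a scratch buffer that is allocated once (as the Initialize callback would do) and then reused for every step, so the calculation itself never calls the memory manager. The names TWorkerScratch, InitScratch and SumOfSquares and the bound cMaxPoints are hypothetical and not taken from the original code; the loop body is only a stand-in for the real geometry.

```pascal
program PreallocSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical per-worker scratch space; not part of the original code.
  TWorkerScratch = record
    Buffer: array of Double;  // allocated once, reused for every step
    Used: Integer;            // number of valid entries in Buffer
  end;

const
  cNumWorkers = 8;     // matches lNumTasks in the question
  cMaxPoints  = 4096;  // assumed upper bound; overallocating is fine

var
  // One slot per worker, indexed by task index, so nothing is shared.
  gScratch: array[0..cNumWorkers - 1] of TWorkerScratch;

// Would be called from the Initialize callback: the only heap allocation.
procedure InitScratch(aTaskIndex: Integer);
begin
  SetLength(gScratch[aTaskIndex].Buffer, cMaxPoints);
  gScratch[aTaskIndex].Used := 0;
end;

// Would be called from the Execute callback: no SetLength, no temporaries.
function SumOfSquares(aTaskIndex, aCount: Integer): Double;
var
  i: Integer;
begin
  Assert(aCount <= cMaxPoints, 'scratch buffer too small');
  gScratch[aTaskIndex].Used := aCount;  // logical resize only, no reallocation
  for i := 0 to aCount - 1 do
    gScratch[aTaskIndex].Buffer[i] := i * i;
  Result := 0;
  for i := 0 to aCount - 1 do
    Result := Result + gScratch[aTaskIndex].Buffer[i];
end;

begin
  InitScratch(0);
  WriteLn(SumOfSquares(0, 4):0:0);  // 0 + 1 + 4 + 9 = 14
  WriteLn(SumOfSquares(0, 3):0:0);  // same buffer reused: 0 + 1 + 4 = 5
end.
```

The same idea applies to any object, list or string that is created and destroyed inside the per-step calculation: move its creation into the per-worker setup and reset it between steps instead of freeing it.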

Anders Melander · July 4, 2024

6 hours ago, Dave Novo said:

However, I would suggest looking at your code and figuring out a way to allocate required memory (even if you have to overallocate) and minimize/eliminate heap memory allocations during the threading code.

Exactly.

The memory manager is not the problem. If anything, a better performing memory manager will just make it harder to locate and fix the problem. The goal is to not be reliant on a faster memory manager (and FWIW, FastMM4 isn't slow).

Tommi Prami · July 4, 2024

On 7/2/2024 at 10:23 PM, Anders Melander said:

I don't know how SamplingProfiler works but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.

For better results, run the measured process over and over again while profiling.

The results get more accurate with every sample you collect.

I've even run it with Monte Carlo enabled (multithreaded app) for 30 minutes and then checked the results.

stephane · July 5, 2024

Thank you guys for your hints. After investigating further, it does indeed seem that there are some memory allocation issues. I have started fixing them, and both the single-threaded and the multithreaded versions are now faster. I'll report back here when I am done with this process.

Edited July 5, 2024 by stephane
