stephane

Parallel.For optimization


Hello,

 

In a VCL application, I am trying to optimize a single-threaded task that performs many complex geometric calculations and takes around 2 minutes and 20 seconds to execute. It seems like a good candidate for multithreading. My computer has 8 cores and 16 threads, but I am only using 8 threads for now.

 

Here is the code implementing the Parallel.For loop:

    var lNumTasks := 8;
    SetLength(lVCalculBuckets, lNumTasks);

    Parallel.For<TObject>(lShadingStepListAsObjects.ToArray)
      .NoWait
      .NumTasks(lNumTasks)
      .OnStop(Parallel.CompleteQueue(lResults))
      .Initialize(procInitMultiThread)
      .Finalize(procFinalizeMultiThread)
      .Execute(procExecuteMultiThread);

procInitMultiThread and procFinalizeMultiThread create and free the entries of lVCalculBuckets, which holds one copy of our working objects per thread:

    procedure TMyClass.procInitMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
    var lVCalcul : TVCalcul;
    begin
      // Copy data
      lVCalcul := TVCalcul.Create(nil);
      lVCalcul.CopyLight(Self.VCalcul);
      lVCalculBuckets[aTaskIndex] := lVCalcul;
    end;

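    procedure TMyClass.procFinalizeMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
    var lVCalcul : TVCalcul;
    begin
      // Delete copied data
      lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
      // Clear the slot as well; FreeAndNil on the local variable alone
      // would leave a dangling reference in lVCalculBuckets
      lVCalculBuckets[aTaskIndex] := nil;
      lVCalcul.Free;
    end;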

procExecuteMultiThread just performs the calculations and posts the results back to the calling thread so that they can be displayed in the VCL interface:

    procedure TMyClass.procExecuteMultiThread(aTaskIndex: Integer; var aValue: TObject);
    var lVCalcul : TVCalcul;
        lRes: TStepRes;
    begin
      // Retrieve data
      lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
      if Assigned(lVCalcul) then
      begin
        // Calculate factors
        lRes := TShadingStepRes(aValue);
        lVCalcul.CalculateFactors(lRes.Height, lRes.Width);

        // Post results
        lRes.FillResFromVCalcul(lVCalcul);
        lResults.Add(TOmniValue.CastFrom<TStepRes>(lRes));
      end;
    end;

This implementation now runs in about 1 minute and 50 seconds, which is faster than the single-threaded version but far from the gain I expected. I tried simplifying the code by removing the "Post results" part, thinking that it was causing synchronization delays, but that had no effect.

 

Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution:

[Screenshot: SamplingProfiler results for a worker thread]

 

Yet I have no idea why because in the calculation part itself there isn't any synchronization code that I am aware of.

 

If any of you would be able to point me in the right direction to further debug this, it would be much appreciated.

3 hours ago, stephane said:

Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution:

[Screenshot: SamplingProfiler results for a worker thread]

 

Yet I have no idea why because in the calculation part itself there isn't any synchronization code that I am aware of.

I don't know how SamplingProfiler works but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.


Thanks a lot for the hint. I found a way to display the caller, and it seems that many of the calls come from the memory manager:

[Screenshot: SamplingProfiler callers of NtDelayExecution]

 

Not sure how to take it from here though.


There is also FastMM5. However, I would suggest looking at your code and figuring out a way to allocate required memory (even if you have to overallocate) and minimize/eliminate heap memory allocations during the threading code.
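A minimal sketch of that idea, kept independent of OmniThreadLibrary so it can be compiled as a plain console program. All names here (TScratch, ReserveScratch, SumSquaresAllocating, SumSquaresPreallocated) are hypothetical and only stand in for whatever temporary storage the real calculation uses. The first function allocates a temporary array on every call; the second reuses a buffer that was reserved once, the way a per-task buffer could be set up in the Initialize step next to lVCalculBuckets.

```pascal
program PreallocSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TDoubleArray = array of Double;

  // Hypothetical per-task scratch space: each worker would own one instance,
  // so no locking is needed and no allocation happens inside the hot loop.
  TScratch = record
    Values: TDoubleArray;
  end;

// Grow the buffer if needed; never shrink it, so repeated calls are cheap.
procedure ReserveScratch(var aScratch: TScratch; aCapacity: Integer);
begin
  if Length(aScratch.Values) < aCapacity then
    SetLength(aScratch.Values, aCapacity);
end;

// Allocating variant: one heap allocation and one release per call.
// With several threads, these calls all go through the shared memory manager.
function SumSquaresAllocating(aCount: Integer): Double;
var
  lTmp: TDoubleArray;
  i: Integer;
begin
  SetLength(lTmp, aCount);
  for i := 0 to aCount - 1 do
    lTmp[i] := i * i;
  Result := 0;
  for i := 0 to aCount - 1 do
    Result := Result + lTmp[i];
end;

// Preallocated variant: reuses the caller's scratch buffer, so the memory
// manager is only touched when the buffer has to grow.
function SumSquaresPreallocated(var aScratch: TScratch; aCount: Integer): Double;
var
  i: Integer;
begin
  ReserveScratch(aScratch, aCount);
  for i := 0 to aCount - 1 do
    aScratch.Values[i] := i * i;
  Result := 0;
  for i := 0 to aCount - 1 do
    Result := Result + aScratch.Values[i];
end;

var
  lScratch: TScratch;
  lStep: Integer;
  lSumA, lSumB: Double;
begin
  // Overallocate once, up front (the equivalent of the Initialize step).
  ReserveScratch(lScratch, 1024);

  lSumA := 0;
  lSumB := 0;
  for lStep := 1 to 1000 do
  begin
    lSumA := lSumA + SumSquaresAllocating(lStep);
    lSumB := lSumB + SumSquaresPreallocated(lScratch, lStep);
  end;

  // Both variants compute the same values; only the allocation pattern differs.
  Writeln('Results match: ', lSumA = lSumB);
end.
```

Whether this helps in the real code depends on where the allocations actually come from (dynamic arrays, strings, temporary objects), which the profiler call stacks should show.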

6 hours ago, Dave Novo said:

However, I would suggest looking at your code and figuring out a way to allocate required memory (even if you have to overallocate) and minimize/eliminate heap memory allocations during the threading code.

Exactly.

The memory manager is not the problem. If anything, a better performing memory manager will just make it harder to locate and fix the problem. The goal is to not be reliant on a faster memory manager (and FWIW, FastMM4 isn't slow).

On 7/2/2024 at 10:23 PM, Anders Melander said:

I don't know how SamplingProfiler works but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.

For better results, run the measured process over and over again while profiling.

The results get more accurate with every sample you collect.

I have even run a multithreaded app with Monte Carlo enabled for 30 minutes before checking the results.


Thank you guys for your hints. After investigating further, it does indeed seem that there are some memory allocation issues. I have started fixing them, and both the single-threaded and the multithreaded versions are now faster. I'll report back here when I am done with this process.
