Hello,
In a VCL application, I am trying to optimize a single-threaded task that performs many complex geometric calculations and takes around 2 minutes and 20 seconds to run. It seems like a good candidate for multithreading. My computer has 8 cores and 16 threads, but for now I am using only 8 threads.
Here is the code implementing the Parallel.For loop:
var lNumTasks := 8;
SetLength(lVCalculBuckets, lNumTasks);
Parallel.For<TObject>(lShadingStepListAsObjects.ToArray)
  .NoWait
  .NumTasks(lNumTasks)
  .OnStop(Parallel.CompleteQueue(lResults))
  .Initialize(procInitMultiThread)
  .Finalize(procFinalizeMultiThread)
  .Execute(procExecuteMultiThread);
procInitMultiThread and procFinalizeMultiThread create and free the entries of lVCalculBuckets, which holds one copy of our working objects per thread:
procedure TMyClass.procInitMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
var
  lVCalcul: TVCalcul;
begin
  // Copy data
  lVCalcul := TVCalcul.Create(nil);
  lVCalcul.CopyLight(Self.VCalcul);
  lVCalculBuckets[aTaskIndex] := lVCalcul;
end;
procedure TMyClass.procFinalizeMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
var
  lVCalcul: TVCalcul;
begin
  // Delete copied data
  lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
  // Clear the slot too, so it does not keep a dangling reference
  lVCalculBuckets[aTaskIndex] := nil;
  FreeAndNil(lVCalcul);
end;
procExecuteMultiThread just performs the calculations and posts the results back to the calling thread so that they can be displayed in the VCL interface:
procedure TMyClass.procExecuteMultiThread(aTaskIndex: Integer; var aValue: TObject);
var
  lVCalcul: TVCalcul;
  lRes: TStepRes;
begin
  // Retrieve data
  lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
  if Assigned(lVCalcul) then
  begin
    // Calculate factors
    lRes := TShadingStepRes(aValue);
    lVCalcul.CalculateFactors(lRes.Height, lRes.Width);
    // Post results
    lRes.FillResFromVCalcul(lVCalcul);
    lResults.Add(TOmniValue.CastFrom<TStepRes>(lRes));
  end;
end;
This implementation now runs in about 1 minute 50 seconds, which is faster than the single-threaded version but far from the gain I expected. I tried simplifying the code by removing the "Post results" part, thinking it was causing synchronization delays, but that had no effect.
Running the application in SamplingProfiler and profiling a worker thread shows that 80% of that thread's time is spent in NtDelayExecution.
Yet I have no idea why, because as far as I am aware the calculation itself contains no synchronization code.
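Since NtDelayExecution is the native call that the Win32 Sleep function ends up in, one rough cross-check would be to time the work call itself per worker and compare it with the time the same call takes when run alone. The sketch below is only an illustration of that measurement, not the real code: DoWork is a hypothetical stand-in for lVCalcul.CalculateFactors, and cNumTasks, gElapsedMs, RunWorker and MakeWorker are names made up for this example. It uses plain TThread rather than Parallel.For to stay self-contained.

```pascal
program PerTaskTiming;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Diagnostics;

const
  cNumTasks = 8; // same worker count as in the question

var
  // One slot per worker, so no locking is needed while accumulating.
  gElapsedMs: array[0..cNumTasks - 1] of Int64;

// Hypothetical stand-in for the real calculation (assumption, not the real code).
procedure DoWork;
var
  i: Integer;
  x: Double;
begin
  x := 0;
  for i := 1 to 5000000 do
    x := x + Sqrt(i);
  if x < 0 then
    Writeln(x); // never true; only makes sure the result is used
end;

procedure RunWorker(aTaskIndex: Integer);
var
  lWatch: TStopwatch;
begin
  // Wall-clock time of the call, including any time spent sleeping inside it.
  lWatch := TStopwatch.StartNew;
  DoWork;
  lWatch.Stop;
  Inc(gElapsedMs[aTaskIndex], lWatch.ElapsedMilliseconds);
end;

function MakeWorker(aTaskIndex: Integer): TThread;
begin
  // aTaskIndex is captured per call, so each thread gets its own index.
  Result := TThread.CreateAnonymousThread(
    procedure
    begin
      RunWorker(aTaskIndex);
    end);
  Result.FreeOnTerminate := False; // we wait for and free the threads ourselves
end;

var
  lThreads: array[0..cNumTasks - 1] of TThread;
  lWall: TStopwatch;
  i: Integer;
begin
  lWall := TStopwatch.StartNew;
  for i := 0 to cNumTasks - 1 do
  begin
    lThreads[i] := MakeWorker(i);
    lThreads[i].Start;
  end;
  for i := 0 to cNumTasks - 1 do
  begin
    lThreads[i].WaitFor;
    lThreads[i].Free;
  end;
  lWall.Stop;

  for i := 0 to cNumTasks - 1 do
    Writeln(Format('task %d: %d ms inside DoWork', [i, gElapsedMs[i]]));
  Writeln(Format('wall clock: %d ms', [lWall.ElapsedMilliseconds]));
end.
```

If the per-call time grows noticeably when 8 workers run compared with a single worker, the waiting would seem to happen inside the calculation itself (for example in something it calls indirectly) rather than in the Parallel.For scheduling or in posting results.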
If anyone could point me in the right direction to debug this further, it would be much appreciated.