  1. Thank you all for your hints. After investigating further, it does seem that there are memory allocation issues. I have started fixing them, and both the single-threaded and the multithreaded versions are now faster. I'll report back here when I am done with this process.

  2. Hello,


    In a VCL application, I am trying to optimize a single-threaded task that performs many complex geometric calculations and takes around 2 minutes and 20 seconds to execute. It seems like a good candidate for a multithreaded strategy. My computer has 8 cores and 16 threads, but for now I am using only 8 threads.


    Here is the code implementing the Parallel.For loop:

      var lNumTasks := 8;
      SetLength(lVCalculBuckets, lNumTasks);
      Parallel.For<TObject> (lShadingStepListAsObjects.ToArray)
              .Execute (

    procInitMultiThread and procFinalizeMultiThread create and free the entries of lVCalculBuckets, which holds one copy of our working objects per thread:

        procedure TMyClass.procInitMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
        var lVCalcul : TVCalcul;
          // Copy data
          lVCalcul := TVCalcul.Create(nil);
          lVCalculBuckets[aTaskIndex] := lVCalcul;
        procedure TMyClass.procFinalizeMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
        var lVCalcul : TVCalcul;
          // Delete copied data
          lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);

    procExecuteMultiThread just performs the calculations and posts the results back to the calling thread so that they can be displayed in the VCL interface:

        procedure TMyClass.procExecuteMultiThread(aTaskIndex: Integer; var aValue: TObject);
        var lVCalcul : TVCalcul;
            lRes: TStepRes;
          // Retrieve data
          lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
          if Assigned(lVCalcul) then
            // Calculate factors
            lRes := TShadingStepRes(aValue);
            lVCalcul.CalculateFactors(lRes.Height, lRes.Width);
            // Post results

    This implementation now runs in about 1 minute and 50 seconds, which is faster than the single-threaded version but far from the gains I expected. I tried simplifying the code by removing the "Post results" part, thinking that it was causing synchronization delays, but that had no effect.


    Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution.



    Yet I have no idea why, because the calculation part itself contains no synchronization code that I am aware of.


    If any of you could point me in the right direction to debug this further, it would be much appreciated.
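
    The later follow-up in this thread traces the slowdown to memory allocation. As far as I know, Delphi's default memory manager (derived from FastMM4) can back off with Sleep calls when several threads contend for the same block size, which would show up as NtDelayExecution in a worker thread even without any explicit synchronization code. Below is a minimal sketch for checking that effect in isolation. It is not taken from the original project: the program name, the constants and AllocateInLoop are hypothetical, and it uses TParallel.For from System.Threading rather than OmniThreadLibrary to stay self-contained.

    ```delphi
    program AllocContentionSketch;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.Diagnostics, System.Threading;

    const
      CTasks = 8;               // same number of workers as in the post
      CAllocsPerTask = 2000000; // hypothetical workload size

    // Hypothetical stand-in for a calculation that creates and frees
    // many small temporary objects.
    procedure AllocateInLoop;
    var
      i: Integer;
      lTemp: TObject;
    begin
      for i := 1 to CAllocsPerTask do
      begin
        lTemp := TObject.Create;
        lTemp.Free;
      end;
    end;

    var
      lWatch: TStopwatch;
      lTask: Integer;
    begin
      // Total work executed sequentially on the main thread.
      lWatch := TStopwatch.StartNew;
      for lTask := 1 to CTasks do
        AllocateInLoop;
      Writeln('Sequential: ', lWatch.ElapsedMilliseconds, ' ms');

      // Same total work, one AllocateInLoop call per parallel iteration.
      lWatch := TStopwatch.StartNew;
      TParallel.For(1, CTasks,
        procedure(aIndex: Integer)
        begin
          AllocateInLoop;
        end);
      Writeln('Parallel:   ', lWatch.ElapsedMilliseconds, ' ms');
    end.
    ```

    If the parallel run is not clearly faster than the sequential one, the allocations are probably being serialized inside the memory manager. In that case, reusing objects across iterations or preallocating buffers per thread is likely to help more than adding threads.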

  3. Hello,


    I am using Parallel.ForEach in my project, and it did not speed up the process compared to the single-threaded approach.


    So I ran the test "58_ForVsForEach" with CLoopCount set to 2 billion. "Parallel.ForEach" is more than 10 times slower than the plain "for" loop, while "Parallel.For" is the fastest approach.



    I would have expected "Parallel.ForEach" to be comparable to "Parallel.For" in terms of speed. Am I missing something obvious? 


    In case it helps, I am using Delphi 12.1 on Windows 10 with a 4-core/8-thread processor. I also tried on another computer and got similar results.


    Thanks in advance for your help.
