stephane - Posted July 2

Hello,

In a VCL application I am trying to optimize a single-threaded task that performs many complex geometric calculations and takes around 2 minutes and 20 seconds to execute. It seems like a good candidate for a multithreaded strategy. My computer has 8 cores and 16 threads, but I am only using 8 threads for now.

Here is the code implementing the Parallel.For loop:

```pascal
var lNumTasks := 8;
SetLength(lVCalculBuckets, lNumTasks);
Parallel.For<TObject>(lShadingStepListAsObjects.ToArray)
  .NoWait
  .NumTasks(lNumTasks)
  .OnStop(Parallel.CompleteQueue(lResults))
  .Initialize(procInitMultiThread)
  .Finalize(procFinalizeMultiThread)
  .Execute(procExecuteMultiThread);
```

procInitMultiThread and procFinalizeMultiThread create and free the entries of lVCalculBuckets, which holds one copy of our working objects per thread:

```pascal
procedure TMyClass.procInitMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
var
  lVCalcul: TVCalcul;
begin
  // Copy data
  lVCalcul := TVCalcul.Create(nil);
  lVCalcul.CopyLight(Self.VCalcul);
  lVCalculBuckets[aTaskIndex] := lVCalcul;
end;

procedure TMyClass.procFinalizeMultiThread(aTaskIndex, aFromIndex, aToIndex: Integer);
var
  lVCalcul: TVCalcul;
begin
  // Delete copied data
  lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
  FreeAndNil(lVCalcul);
end;
```

procExecuteMultiThread just performs the calculations and posts the results back to the calling thread so that they can be displayed in the VCL interface:

```pascal
procedure TMyClass.procExecuteMultiThread(aTaskIndex: Integer; var aValue: TObject);
var
  lVCalcul: TVCalcul;
  lRes: TStepRes;
begin
  // Retrieve data
  lVCalcul := TVCalcul(lVCalculBuckets[aTaskIndex]);
  if Assigned(lVCalcul) then
  begin
    // Calculate factors
    lRes := TShadingStepRes(aValue);
    lVCalcul.CalculateFactors(lRes.Height, lRes.Width);

    // Post results
    lRes.FillResFromVCalcul(lVCalcul);
    lResults.Add(TOmniValue.CastFrom<TStepRes>(lRes));
  end;
end;
```

This implementation runs in about 1 minute 50 seconds, which is faster than the single-threaded version but far from the gains I expected.

I tried simplifying the code by removing the "Post results" part, thinking that it was causing synchronization delays, but that had no effect.

Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution:

[profiler screenshot not preserved]

Yet I have no idea why, because the calculation part itself contains no synchronization code that I am aware of.

If any of you could point me in the right direction to debug this further, it would be much appreciated.

Anders Melander - Posted July 2

> 3 hours ago, stephane said:
> Running the application inside SamplingProfiler and profiling a worker thread shows that 80% of the time spent by this thread is in NtDelayExecution: Yet I have no idea why because in the calculation part itself there isn't any synchronization code that I am aware of.

I don't know how SamplingProfiler works, but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.

stephane - Posted July 3

Thanks a lot for the hint. I found a way to display the callers, and it seems that many of the calls come from the memory manager:

[profiler screenshot not preserved]

I am not sure how to take it from here, though.

Der schöne Günther - Posted July 3

It seems like you are constantly resizing arrays (or creating new ones). I am not sure how much you can tweak or optimize the memory manager that ships with Delphi, but you might want to investigate other memory managers, like this fork of FastMM4:

maximmasiutin/FastMM4-AVX (github.com): FastMM4 memory manager for Delphi and FreePascal (Free Pascal/Lazarus). A fork with improved synchronization between the threads that gives performance benefits on thread-heavy applications. Proper synchronization techniques are used depending on context and availability, i.e. umonitor/umwait, spin-wait loops, SwitchToThread, critical sections, etc.

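To make the "constantly resizing arrays" pattern concrete, here is a minimal console sketch. It is purely illustrative: stephane's actual calculation code is not shown in the thread, so `BuildByAppending`, `BuildPresized` and `cCount` are hypothetical names. The first function grows a dynamic array one element at a time, so every iteration calls into the memory manager; the second sizes the array once and then only writes to it.

```pascal
program ResizeSketch;

{$APPTYPE CONSOLE}

const
  cCount = 100000; // hypothetical element count, for illustration only

// Grows the array by one element per iteration. Every SetLength is a
// call into the memory manager and may relocate the whole block.
function BuildByAppending: TArray<Double>;
var
  i: Integer;
begin
  Result := nil;
  for i := 0 to cCount - 1 do
  begin
    SetLength(Result, Length(Result) + 1);
    Result[i] := i * 0.5;
  end;
end;

// Sizes the array once, then only writes into it.
function BuildPresized: TArray<Double>;
var
  i: Integer;
begin
  SetLength(Result, cCount);
  for i := 0 to cCount - 1 do
    Result[i] := i * 0.5;
end;

var
  lAppended, lPresized: TArray<Double>;
begin
  lAppended := BuildByAppending;
  lPresized := BuildPresized;
  // Both approaches produce the same data; only the allocation pattern differs.
  Assert(Length(lAppended) = Length(lPresized));
  Assert(lAppended[cCount - 1] = lPresized[cCount - 1]);
  Writeln('Both arrays hold ', Length(lAppended), ' elements');
end.
```

When several worker threads run the first pattern at the same time, they all compete for the same memory manager, which would be consistent with the waiting seen in the profile.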
Dave Novo - Posted July 3

There is also FastMM5. However, I would suggest looking at your code and figuring out a way to allocate the required memory up front (even if you have to overallocate) and to minimize or eliminate heap allocations in the threaded code.

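One way to read this suggestion, sketched under stated assumptions: give each task a scratch object whose buffer is allocated once (deliberately oversized) and reused for every step, so the hot loop never touches the heap. `TWorkerScratch`, the capacity of 1024 and the loop counts below are hypothetical stand-ins, not part of stephane's code; in the Parallel.For setup from the first post, such an object would be created in the Initialize handler and freed in the Finalize handler.

```pascal
program PreallocSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical per-task scratch object; stands in for whatever
  // temporary arrays the real calculation builds for each step.
  TWorkerScratch = class
  private
    FValues: TArray<Double>; // allocated once, reused for every step
    FCount: Integer;         // number of slots currently in use
  public
    constructor Create(aCapacity: Integer);
    procedure Reset;
    procedure Add(aValue: Double);
    function Sum: Double;
  end;

constructor TWorkerScratch.Create(aCapacity: Integer);
begin
  inherited Create;
  if aCapacity < 1 then
    aCapacity := 1;
  SetLength(FValues, aCapacity); // deliberate over-allocation up front
end;

procedure TWorkerScratch.Reset;
begin
  FCount := 0; // keeps the buffer; no call into the memory manager
end;

procedure TWorkerScratch.Add(aValue: Double);
begin
  // Grow geometrically, and only if the initial estimate was too small.
  if FCount = Length(FValues) then
    SetLength(FValues, Length(FValues) * 2);
  FValues[FCount] := aValue;
  Inc(FCount);
end;

function TWorkerScratch.Sum: Double;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to FCount - 1 do
    Result := Result + FValues[i];
end;

var
  lScratch: TWorkerScratch;
  lStep, i: Integer;
  lTotal: Double;
begin
  lScratch := TWorkerScratch.Create(1024);
  try
    lTotal := 0;
    for lStep := 1 to 1000 do
    begin
      lScratch.Reset; // reuse the buffer instead of reallocating per step
      for i := 1 to 1000 do
        lScratch.Add(i * 0.5);
      lTotal := lTotal + lScratch.Sum;
    end;
    Writeln(lTotal:0:1);
  finally
    lScratch.Free;
  end;
end.
```

Because each task owns its scratch object, no locking is needed around it, and after the first step the loop performs no allocations at all as long as the capacity estimate holds.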
Anders Melander - Posted July 4

> 6 hours ago, Dave Novo said:
> However, I would suggest looking at your code and figuring out a way to allocate required memory (even if you have to overallocate) and minimize/eliminate heap memory allocations during the threading code.

Exactly. The memory manager is not the problem. If anything, a better-performing memory manager will just make it harder to locate and fix the problem.

The goal is to not be reliant on a faster memory manager (and FWIW, FastMM4 isn't slow).

Tommi Prami - Posted July 4

> On 7/2/2024 at 10:23 PM, Anders Melander said:
> I don't know how SamplingProfiler works but I would think that it should be able to show you the call stack leading to NtDelayExecution. That should tell you from where and why it's being called.

For better results, run the measured process over and over again while profiling; the profile gets more accurate with every sample collected. I've even run with Monte Carlo enabled (on a multithreaded app) for 30 minutes before checking the results.

stephane - Posted July 5 (edited)

Thank you all for your hints. After investigating further, it does indeed seem that there are memory allocation issues. I have started fixing them, and both the single-threaded and the multithreaded versions are now faster. I'll report back here when I am done with this process.