GeMeMe 0 Posted February 22, 2020 Using Delphi 10.3.2 I've encountered the following problem (specifically on Windows, not sure about other platforms): After using TParallel or TTask, not all threads in the thread pool return to idle as expected. I've prepared the following simple example: procedure TForm.Button1Click(Sender: TObject); var SW:TStopWatch; procedure Load; begin TParallel.For(0, 99999999, procedure(i: Integer) var T:Single; begin T:=Sin(i/PI); end); end; begin Memo1.Lines.Add('PPL Test ---------------'); Memo1.Lines.Add('Before: '+TThreadPoolStats.Current.Formatted); SW:=TStopWatch.StartNew; Load; Memo1.Lines.Add('Finished in '+SW.Elapsed.ToString); Memo1.Lines.Add('After: '+TThreadPoolStats.Current.Formatted); Memo1.Lines.Add('------------------------'); end; Repeatedly pressing the button results in the following output: PPL Test --------------- Before: 0 Worker Threads (56 .. 1400) / 0 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0 Finished in 00:00:09.8423686 After: 56 Worker Threads (56 .. 1400) / 43 Idle / 0 Retired / 0 Suspended / Avg. CPU = 30 / Curr. CPU = 100 ------------------------ PPL Test --------------- Before: 56 Worker Threads (56 .. 1400) / 43 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 0 Finished in 00:00:04.7841342 After: 65 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 43 / Curr. CPU = 50 ------------------------ PPL Test --------------- Before: 65 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 42 / Curr. CPU = 0 Finished in 00:00:04.7090980 After: 74 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 42 / Curr. CPU = 48 ------------------------ PPL Test --------------- Before: 74 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 38 / Curr. CPU = 0 Finished in 00:00:02.3904253 After: 74 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 57 / Curr. CPU = 98 ------------------------ PPL Test --------------- Before: 74 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 51 / Curr. CPU = 0 Finished in 00:00:05.2865123 After: 84 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 44 / Curr. CPU = 41 ------------------------ PPL Test --------------- Before: 84 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 0 Finished in 00:00:06.8032435 After: 96 Worker Threads (56 .. 1400) / 64 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 32 ------------------------ PPL Test --------------- Before: 96 Worker Threads (56 .. 1400) / 64 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0 Finished in 00:00:07.7921758 After: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 27 / Curr. CPU = 23 ------------------------ PPL Test --------------- Before: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 20 / Curr. CPU = 0 Finished in 00:00:02.3559000 After: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 48 / Curr. CPU = 100 ------------------------ PPL Test --------------- Before: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 48 / Curr. CPU = 0 Finished in 00:00:06.6491982 After: 122 Worker Threads (56 .. 1400) / 87 Idle / 0 Retired / 0 Suspended / Avg. CPU = 33 / Curr. CPU = 100 ------------------------ Even after the first run, only 43 threads in the pool seem to have returned to idle. Instead, with almost every further run, the total number of worker threads is increased to make up the difference. Despite this, the total CPU usage is going down slowly. If I wait for the inactive threads to be automatically retired and run the test again, I get something like this: PPL Test --------------- Before: 0 Worker Threads (56 .. 1400) / -36 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0 Finished in 00:00:09.5530676 After: 56 Worker Threads (56 .. 1400) / 20 Idle / 0 Retired / 0 Suspended / Avg. CPU = 30 / Curr. CPU = 100 ------------------------ A negative number of idle threads? And after the run, only 20 threads are stated as idle again. Further runs result in the behavior described above. If someone could shed some light on all this for me, I would highly appreciate it! Am I doing something wrong here? What do I need to do to get 100% CPU usage on every run, without the thread pool ballooning? Thanks! Share this post Link to post
Stefan Glienke 2002 Posted February 22, 2020 (edited) There is a bug somewhere within the thread pool - I have experienced a similar behavior where the thread pool creates more and more threads and they somehow leak somewhere and only exit when the application closes but are not reused to execute tasks. If I remember correctly it has been reported before and is being worked on - I just cannot find the exact number on QP. Edited February 22, 2020 by Stefan Glienke Share this post Link to post
FredS 138 Posted February 22, 2020 (edited) 8 hours ago, GeMeMe said: Finished in 00:00:09.5530676 Then there is that undocumented 'Feature' where after you wait for termination your elapsed time goes from ~1s to ~81s: PPL Test --------------- Before: Worker: 0, (Min: 4..Max: 100), Idle: -9, Retired: 0, Suspended: 0, CPU(Avg): 5, CPU: 2 Finished in 00:00:01.6734632 After: Worker: 4, (Min: 4..Max: 100), Idle: -5, Retired: 0, Suspended: 0, CPU(Avg): 20, CPU: 100 ------------------------ After waiting for threads to terminate: PPL Test --------------- Before: Worker: 4, (Min: 4..Max: 100), Idle: -5, Retired: 0, Suspended: 0, CPU(Avg): 1, CPU: 1 Finished in 00:00:30.7872728 After: Worker: 4, (Min: 4..Max: 100), Idle: -3, Retired: 0, Suspended: 0, CPU(Avg): 20, CPU: 50 ------------------------ PPL Test --------------- Before: Worker: 4, (Min: 4..Max: 100), Idle: -3, Retired: 0, Suspended: 0, CPU(Avg): 1, CPU: 0 Finished in 00:01:21.6815195 After: Worker: 4, (Min: 4..Max: 100), Idle: -2, Retired: 1, Suspended: 0, CPU(Avg): 20, CPU: 100 ------------------------ I ran this in Berlin and it appears to work fine. But I made changes to System.Threading, both to achieve near 100% CPU usage as well as applying some of the patches submitted by those reporting to QC.. Then there is the issue that in Win64 this takes over 37 seconds, RSP-24570 is for CBuilder but its the same in Berlin and Rio 10.3.3 Edited February 22, 2020 by FredS Added Berlin findings, RSP-24570 Share this post Link to post
GeMeMe 0 Posted February 23, 2020 16 hours ago, FredS said: But I made changes to System.Threading, both to achieve near 100% CPU usage as well as applying some of the patches submitted by those reporting to QC.. Could you share these changes/patches or point me to the relevant website, please? Share this post Link to post
FredS 138 Posted February 23, 2020 6 hours ago, GeMeMe said: Could you share Knew that was coming 🙂 This is a monumental task because some patches made it into Tokyo and I merged them while another I applied. I can't exactly publish System.Threading which alters some things dramatically. Will attach the RSP-11267 patch since it was applied over the unaltered Berlin Threading unit. Will add my notes so you get an idea of the changes and limitations. Will add the simple changes which alter the constants used. Not sure how I can deal with some of the core changes though.. Tokyo never made it onto my development machine, that merge was simply an easy way to create an intermediate file that should have been used to update Rio. But in Rio System.Threading was altered so that 1.5 years after its original release we still have to deal with posts like this one.. needless to say I never made that merge.. interface {$REGION 'History'} // earlier - Modified as per RSP-11267 // 09-Jan-2019 - Changed MonitorThreadDelay to 100 to be more responsive, if we are taking a clients system // to 95% CPU or higher shouldn't we be able to adjust that without a long pause // - Changed NumCPUUsageSamples to NumCPUUsageSamplesShort (1000ms) and NumCPUAvgSamples (5000ms) // - Changed Retirement Delay to 500ms Idle already can take 80-320 seconds // - Allow Parking/Suspending of unlimited threads, this is key when another application starts to // chew up resources while tasks have already been spawned // - Fixed CPU usage, was turned off after the last Queued task began work // - Fixed a silly timeout patch that should have used the NoWorker flag instead // Be Aware this may not be a perfect solution if more than one pool is created. // An Assertion has been added to highlight this possible issue // 10-Jan-2019 - Added RunningThreadCount // - Calling CreateMonitorThread always now; We want a Singleton MonitorThead regardless of the amount of Tasks requested // - Added Output2Debug function // 20-Jan-2019 - Merged non Tokyo specific changes from Tokyo // - There are Tokyo specific changes that must be done in a new File {$ENDREGION} {$I ..\..\MatchDelphi.inc} {$TYPEDADDRESS OFF} // required for Resource String Pointers {$SCOPEDENUMS ON} {-$DEFINE OUTPUT_DEBUG_STRING} // Activate to show Output2Debug Messages {$IFDEF DEBUG} {$MESSAGE HINT 'Altered System.Threading tested for Single/Default TThreadPool Only'} {$ENDIF} /// <summary-Fred> /// Changed MonitorThreadDelay to 100 to be more responsive, /// if we are taking a clients system to 95% CPU /// or higher shouldn't we be able to adjust that /// </summary> MonitorThreadDelay = 100 {500}; /// <summary-Fred> /// We keep two Averages now, one of 5 secs for standard output as before /// and another for 1 sec used to calculate Suspensions/Parking /// </summary> NumCPUUsageShortSamples = 1000 div MonitorThreadDelay; NumCPUAvgSamples = 5 * NumCPUUsageShortSamples; /// <summary-Fred> /// SuspendInterval is the value used to limit the number of threads that /// can be Parked, IMO this should reflect MonitorThreadDelay because that /// is the time it takes to recalculate the CPU Averages /// </summary> SuspendInterval = MonitorThreadDelay; // Interval to use for suspending work in worker threads /// <summary-Fred> /// Twice MonitorThreadDelay seems to work well /// </summary> SuspendTime = MonitorThreadDelay * 2; // Time to spend in SuspendWork; RetirementDelay = 5000; // Delay interval for retiring threads and finally the output of the original test: Altered Threading Berlin x64: Before: Worker: 0, Min: 4, Max: 100, Idle: 0, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 After: Worker: 4, Min: 4, Max: 100, Idle: 3, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:37.3397033 Before: Worker: 4, Min: 4, Max: 100, Idle: 4, Retired: 0, Suspended: 0, CPU(Avg): 7, CPU: 3 After: Worker: 5, Min: 4, Max: 100, Idle: 4, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:36.4708368 Before: Worker: 5, Min: 4, Max: 100, Idle: 5, Retired: 0, Suspended: 0, CPU(Avg): 7, CPU: 0 After: Worker: 6, Min: 4, Max: 100, Idle: 5, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:36.2496669 Before: Worker: 6, Min: 4, Max: 100, Idle: 6, Retired: 0, Suspended: 0, CPU(Avg): 35, CPU: 0 After: Worker: 7, Min: 4, Max: 100, Idle: 6, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:36.2084714 Before: Worker: 7, Min: 4, Max: 100, Idle: 7, Retired: 0, Suspended: 0, CPU(Avg): 7, CPU: 0 After: Worker: 8, Min: 4, Max: 100, Idle: 7, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:36.8463683 Before: Worker: 0, Min: 4, Max: 100, Idle: 0, Retired: 0, Suspended: 0, CPU(Avg): 3, CPU: 0 After: Worker: 4, Min: 4, Max: 100, Idle: 3, Retired: 0, Suspended: 1, CPU(Avg): 100, CPU: 100 Finished in 00:00:37.0737835 1 - RSP-11267.patch 2 Share this post Link to post
pyscripter 689 Posted February 24, 2020 For me the following change to System.Threading appears to resolve the issue: function TThreadPool.ShouldGrowPool: Boolean; begin Result := {(FWorkerThreadCount < FMinLimitWorkerThreadCount) and }(FIdleWorkerThreadCount < FQueuedRequestCount) and (FWorkerThreadCount < Self.FMaxLimitWorkerThreadCount); end; The condition FWorkerThreadCount < FMinLimitWorkerThreadCount does not appear to make sense. A slightly modified Load function: procedure Load; Var Sum : Double; begin Sum := 0; Monitor := TObject.Create; try TParallel.For(0, 99999999, procedure(i: Integer) var T:double; begin T:=Sin(i/PI); TMonitor.Enter(Monitor); Sum := Sum + T; TMonitor.Exit(Monitor); end); Writeln('Sum = ', Sum); finally Monitor.Free; end; end; produced the following results for multiple runs: PPL Test --------------- Before: Worker: 0, (Min: 8..Max: 200), Idle: 0, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 Sum = 1.37111034277419E+0000 Finished in 00:00:03.8990212 After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 ------------------------ PPL Test --------------- Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 Sum = 1.37111034276774E+0000 Finished in 00:00:03.8069356 After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 ------------------------ PPL Test --------------- Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 Sum = 1.37111034277633E+0000 Finished in 00:00:03.8392113 After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 ------------------------ PPL Test --------------- Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 Sum = 1.37111034277271E+0000 Finished in 00:00:03.8729582 After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0 ------------------------ Share this post Link to post
Anders Melander 1782 Posted February 24, 2020 9 minutes ago, pyscripter said: A slightly modified Load function Is that change relevant to your patch? It's hard to compare the results when more than one factor has changed. Share this post Link to post
pyscripter 689 Posted February 24, 2020 (edited) No, The change was done to make sure that Parallel.For works OK and produces the same result. I am puzzled though by the reported CPU usages. The Task Monitor shows that CPU usage is close to 100%. Edited February 24, 2020 by pyscripter Share this post Link to post
Anders Melander 1782 Posted February 24, 2020 I see. But I'm guessing the introduction of a lock, even if it's a spin lock, might influence the thread schedule. IMO an interlocked increment would have been better for the verification part. Share this post Link to post
FredS 138 Posted February 24, 2020 (edited) 46 minutes ago, pyscripter said: following results for multiple runs CPU/Avg should be showing 100% else the lock is most likely throttling that or your implementation of formatted is off. 46 minutes ago, pyscripter said: The condition FWorkerThreadCount < FMinLimitWorkerThreadCount does not appear to make sense. I believe the point is to spawn MinLimitWorkerThreadCount ASAP and to make those available regardless of CPU load. Also the last [two] tests must be after the pool has cleared, usually 10 secs. Should have been included with the original code: //MMWIN:CLASSCOPY unit _MM_Copy_Buffer_; interface type TThreadPoolStatsHelper = record helper for TThreadPoolStats function Formatted: string; end; implementation function TThreadPoolStatsHelper.Formatted: string; begin Result := Format('Worker: %2d, Min: %2d, Max: %2d, Idle: %2d, Retired: %2d, Suspended: %2d, CPU(Avg): %3d, CPU: %3d', [self.WorkerThreadCount, self.MinLimitWorkerThreadCount, self.MaxLimitWorkerThreadCount, self.IdleWorkerThreadCount, self.RetiredWorkerThreadCount, self.ThreadSuspended, self.AverageCPUUsage, self.CurrentCPUUsage]); end; end. Edited February 24, 2020 by FredS added TThreadPoolStatsHelper Share this post Link to post
pyscripter 689 Posted February 24, 2020 (edited) @Freds I see. There are two ways to grow the pool. One by GrowWorkerPool and another by TThreadPool.TThreadPoolMonitor.GrowThreadPoolIfStarved, so you create the ThreadPoolMonitor after the initial threads are created and then it undertakes the growing or the Pool. Actually the problem with CPU usage was related to this method: procedure TThreadPool.GrowWorkerPool; begin if ShouldGrowPool then begin TMonitor.Enter(FQueue); try if ShouldGrowPool then if FRetiredWorkerThreadCount = 0 then CreateWorkerThread else ResurrectRetiredThread; finally TMonitor.Exit(FQueue); end; end else CreateMonitorThread; end; CreateMonitorThread was never called. If you change the above to: procedure TThreadPool.GrowWorkerPool; begin if ShouldGrowPool then begin TMonitor.Enter(FQueue); try if ShouldGrowPool then if FRetiredWorkerThreadCount = 0 then CreateWorkerThread else ResurrectRetiredThread; finally TMonitor.Exit(FQueue); end; if FMonitorThreadStatus = [] then CreateMonitorThread; end else CreateMonitorThread; end; then CPU usage is reported OK. Edited February 24, 2020 by pyscripter Share this post Link to post
FredS 138 Posted February 24, 2020 14 minutes ago, pyscripter said: CPU usage is reported OK There are multiple issues, first and foremost is the response time for spawning threads. Then there are all those 'undocumented Features' new and old.. Remove the TMonitor in your Load and get rid of the Sum because the only reason to call that is to tax the system. Maybe switch to x64 and run that again, at least the time will increase in x64 and allow for more threads to be added. Don't forget to wait for them to idle out ~10 seconds then run two more times to see if it breaks with negative numbers, which should also give you increased times. See the negative idle in Rio 10.3.3: Share this post Link to post
pyscripter 689 Posted May 5, 2020 I finally got to the bottom of this issue and I have a simple fix. See https://quality.embarcadero.com/browse/RSP-28200 Please test and vote for it. 1 Share this post Link to post
Anders Melander 1782 Posted May 5, 2020 38 minutes ago, pyscripter said: I have a simple fix It seems you have just reverted the two sections of code to the way they where in Delphi 10.2, but do we know why they were changed in the first place. Your fix might be reintroducing another bug. I've not analyzed the internals of the thread pool but your change, and the rationale behind it, "feels" more correct than the original code. Share this post Link to post
pyscripter 689 Posted May 5, 2020 6 minutes ago, Anders Melander said: It seems you have just reverted the two sections of code to the way they where in Delphi 10.2 I was not aware of that. It is certainly wrong now. Share this post Link to post
Anders Melander 1782 Posted May 5, 2020 4 minutes ago, pyscripter said: It is certainly wrong now. or "more wrong" 🙂 Share this post Link to post