Jump to content
GeMeMe

Threads in PPL thread pool not returning to idle as expected

Recommended Posts

Using Delphi 10.3.2 I've encountered the following problem (specifically on Windows, not sure about other platforms):

After using TParallel or TTask, not all threads in the thread pool return to idle as expected. I've prepared the following simple example:

procedure TForm.Button1Click(Sender: TObject);
var
  SW:TStopWatch;

  procedure Load;
  begin
    TParallel.For(0, 99999999, procedure(i: Integer)
    var
      T:Single;
    begin
      T:=Sin(i/PI);
    end);
  end;

begin
  Memo1.Lines.Add('PPL Test ---------------');
  Memo1.Lines.Add('Before: '+TThreadPoolStats.Current.Formatted);
  SW:=TStopWatch.StartNew;
  Load;
  Memo1.Lines.Add('Finished in '+SW.Elapsed.ToString);
  Memo1.Lines.Add('After: '+TThreadPoolStats.Current.Formatted);
  Memo1.Lines.Add('------------------------');
end;

Repeatedly pressing the button results in the following output:

PPL Test ---------------
Before: 0 Worker Threads (56 .. 1400) / 0 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0
Finished in 00:00:09.8423686
After: 56 Worker Threads (56 .. 1400) / 43 Idle / 0 Retired / 0 Suspended / Avg. CPU = 30 / Curr. CPU = 100
------------------------
PPL Test ---------------
Before: 56 Worker Threads (56 .. 1400) / 43 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 0
Finished in 00:00:04.7841342
After: 65 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 43 / Curr. CPU = 50
------------------------
PPL Test ---------------
Before: 65 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 42 / Curr. CPU = 0
Finished in 00:00:04.7090980
After: 74 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 42 / Curr. CPU = 48
------------------------
PPL Test ---------------
Before: 74 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 38 / Curr. CPU = 0
Finished in 00:00:02.3904253
After: 74 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 57 / Curr. CPU = 98
------------------------
PPL Test ---------------
Before: 74 Worker Threads (56 .. 1400) / 50 Idle / 0 Retired / 0 Suspended / Avg. CPU = 51 / Curr. CPU = 0
Finished in 00:00:05.2865123
After: 84 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 44 / Curr. CPU = 41
------------------------
PPL Test ---------------
Before: 84 Worker Threads (56 .. 1400) / 55 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 0
Finished in 00:00:06.8032435
After: 96 Worker Threads (56 .. 1400) / 64 Idle / 0 Retired / 0 Suspended / Avg. CPU = 36 / Curr. CPU = 32
------------------------
PPL Test ---------------
Before: 96 Worker Threads (56 .. 1400) / 64 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0
Finished in 00:00:07.7921758
After: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 27 / Curr. CPU = 23
------------------------
PPL Test ---------------
Before: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 20 / Curr. CPU = 0
Finished in 00:00:02.3559000
After: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 48 / Curr. CPU = 100
------------------------
PPL Test ---------------
Before: 111 Worker Threads (56 .. 1400) / 77 Idle / 0 Retired / 0 Suspended / Avg. CPU = 48 / Curr. CPU = 0
Finished in 00:00:06.6491982
After: 122 Worker Threads (56 .. 1400) / 87 Idle / 0 Retired / 0 Suspended / Avg. CPU = 33 / Curr. CPU = 100
------------------------

Even after the first run, only 43 threads in the pool seem to have returned to idle. Instead, with almost every further run, the total number of worker threads is increased to make up the difference. Despite this, the total CPU usage is going down slowly.

If I wait for the inactive threads to be automatically retired and run the test again, I get something like this:

PPL Test ---------------
Before: 0 Worker Threads (56 .. 1400) / -36 Idle / 0 Retired / 0 Suspended / Avg. CPU = 0 / Curr. CPU = 0
Finished in 00:00:09.5530676
After: 56 Worker Threads (56 .. 1400) / 20 Idle / 0 Retired / 0 Suspended / Avg. CPU = 30 / Curr. CPU = 100
------------------------

A negative number of idle threads? And after the run, only 20 threads are stated as idle again. Further runs result in the behavior described above.

If someone could shed some light on all this for me, I would highly appreciate it! Am I doing something wrong here? What do I need to do to get 100% CPU usage on every run, without the thread pool ballooning?

Thanks!

Share this post


Link to post

There is a bug somewhere within the thread pool - I have experienced a similar behavior where the thread pool creates more and more threads and they somehow leak somewhere and only exit when the application closes but are not reused to execute tasks. If I remember correctly it has been reported before and is being worked on - I just cannot find the exact number on QP.

Edited by Stefan Glienke

Share this post


Link to post
8 hours ago, GeMeMe said:

Finished in 00:00:09.5530676

Then there is that undocumented 'Feature' where after you wait for termination your elapsed time goes from ~1s to ~81s:

PPL Test ---------------
Before: Worker: 0, (Min: 4..Max: 100), Idle: -9, Retired: 0, Suspended: 0, CPU(Avg): 5, CPU: 2
Finished in 00:00:01.6734632
After:  Worker: 4, (Min: 4..Max: 100), Idle: -5, Retired: 0, Suspended: 0, CPU(Avg): 20, CPU: 100
------------------------

After waiting for threads to terminate:


PPL Test ---------------
Before: Worker: 4, (Min: 4..Max: 100), Idle: -5, Retired: 0, Suspended: 0, CPU(Avg): 1, CPU: 1
Finished in 00:00:30.7872728
After:  Worker: 4, (Min: 4..Max: 100), Idle: -3, Retired: 0, Suspended: 0, CPU(Avg): 20, CPU: 50
------------------------
PPL Test ---------------
Before: Worker: 4, (Min: 4..Max: 100), Idle: -3, Retired: 0, Suspended: 0, CPU(Avg): 1, CPU: 0
Finished in 00:01:21.6815195
After:  Worker: 4, (Min: 4..Max: 100), Idle: -2, Retired: 1, Suspended: 0, CPU(Avg): 20, CPU: 100
------------------------

I ran this in Berlin and it appears to work fine.

But I made changes to System.Threading, both to achieve near 100% CPU usage as well as applying some of the patches submitted by those reporting to QC..

 

Then there is the issue that in Win64 this takes over 37 seconds, RSP-24570 is for CBuilder but its the same in Berlin and Rio 10.3.3

Edited by FredS
Added Berlin findings, RSP-24570

Share this post


Link to post
16 hours ago, FredS said:

But I made changes to System.Threading, both to achieve near 100% CPU usage as well as applying some of the patches submitted by those reporting to QC..

Could you share these changes/patches or point me to the relevant website, please?

Share this post


Link to post
6 hours ago, GeMeMe said:

Could you share

 Knew that was coming 🙂

 

This is a monumental task because some patches made it into Tokyo and I merged them while another I applied.
I can't exactly publish System.Threading which alters some things dramatically.

 

Will attach the RSP-11267 patch since it was applied over the unaltered Berlin Threading unit.
Will add my notes so you get an idea of the changes and limitations. Will add the simple changes which alter the constants used.
Not sure how I can deal with some of the core changes though..

 

Tokyo never made it onto my development machine, that merge was simply an easy way to create an intermediate file that should have been used to update Rio.

But in Rio System.Threading was altered so that 1.5 years after its original release we still have to deal with posts like this one.. needless to say I never made that merge..

 


interface
{$REGION 'History'}
//  earlier     - Modified as per RSP-11267
//  09-Jan-2019 - Changed MonitorThreadDelay to 100 to be more responsive, if we are taking a clients system
//                to 95% CPU or higher shouldn't we be able to adjust that without a long pause
//   		      	- Changed NumCPUUsageSamples to NumCPUUsageSamplesShort (1000ms) and NumCPUAvgSamples (5000ms)
//              - Changed Retirement Delay to 500ms Idle already can take 80-320 seconds
//              - Allow Parking/Suspending of unlimited threads, this is key when another application starts to
//                chew up resources while tasks have already been spawned
//              - Fixed CPU usage, was turned off after the last Queued task began work
//              - Fixed a silly timeout patch that should have used the NoWorker flag instead
//                Be Aware this may not be a perfect solution if more than one pool is created.
//                An Assertion has been added to highlight this possible issue
//  10-Jan-2019 - Added RunningThreadCount
//              - Calling CreateMonitorThread always now; We want a Singleton MonitorThead regardless of the amount of Tasks requested
//              - Added Output2Debug function
//  20-Jan-2019 - Merged non Tokyo specific changes from Tokyo
//              - There are Tokyo specific changes that must be done in a new File
{$ENDREGION}
{$I ..\..\MatchDelphi.inc}
{$TYPEDADDRESS OFF} // required for Resource String Pointers
{$SCOPEDENUMS ON}
{-$DEFINE OUTPUT_DEBUG_STRING} // Activate to show Output2Debug Messages
{$IFDEF DEBUG}
  {$MESSAGE HINT 'Altered System.Threading tested for Single/Default TThreadPool Only'}
{$ENDIF}
    /// <summary-Fred>
    ///   Changed MonitorThreadDelay to 100 to be more responsive,
    ///   if we are taking a clients system to 95% CPU
    ///   or higher shouldn't we be able to adjust that
    /// </summary>
    MonitorThreadDelay = 100 {500};
    /// <summary-Fred>
    ///   We keep two Averages now, one of 5 secs for standard output as before
    ///   and another for 1 sec used to calculate Suspensions/Parking
    /// </summary>
    NumCPUUsageShortSamples = 1000 div MonitorThreadDelay;
    NumCPUAvgSamples        = 5 *  NumCPUUsageShortSamples;
    /// <summary-Fred>
    ///   SuspendInterval is the value used to limit the number of threads that
    ///   can be Parked, IMO this should reflect MonitorThreadDelay because that
    ///   is the time it takes to recalculate the CPU Averages
    /// </summary>
    SuspendInterval = MonitorThreadDelay; // Interval to use for suspending work in worker threads
    /// <summary-Fred>
    ///   Twice MonitorThreadDelay seems to work well
    /// </summary>
    SuspendTime = MonitorThreadDelay * 2; // Time to spend in SuspendWork;
    RetirementDelay = 5000; // Delay interval for retiring threads

and finally the output of the original test:

Altered Threading Berlin x64:

Before: Worker:  0, Min:  4, Max: 100, Idle:  0, Retired:  0, Suspended:  0, CPU(Avg):   0, CPU:   0
After:  Worker:  4, Min:  4, Max: 100, Idle:  3, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:37.3397033

Before: Worker:  4, Min:  4, Max: 100, Idle:  4, Retired:  0, Suspended:  0, CPU(Avg):   7, CPU:   3
After:  Worker:  5, Min:  4, Max: 100, Idle:  4, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:36.4708368

Before: Worker:  5, Min:  4, Max: 100, Idle:  5, Retired:  0, Suspended:  0, CPU(Avg):   7, CPU:   0
After:  Worker:  6, Min:  4, Max: 100, Idle:  5, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:36.2496669

Before: Worker:  6, Min:  4, Max: 100, Idle:  6, Retired:  0, Suspended:  0, CPU(Avg):  35, CPU:   0
After:  Worker:  7, Min:  4, Max: 100, Idle:  6, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:36.2084714

Before: Worker:  7, Min:  4, Max: 100, Idle:  7, Retired:  0, Suspended:  0, CPU(Avg):   7, CPU:   0
After:  Worker:  8, Min:  4, Max: 100, Idle:  7, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:36.8463683

Before: Worker:  0, Min:  4, Max: 100, Idle:  0, Retired:  0, Suspended:  0, CPU(Avg):   3, CPU:   0
After:  Worker:  4, Min:  4, Max: 100, Idle:  3, Retired:  0, Suspended:  1, CPU(Avg): 100, CPU: 100
Finished in 00:00:37.0737835

 

1 - RSP-11267.patch

  • Thanks 2

Share this post


Link to post

For me the following change to System.Threading appears to resolve the issue:

 

function TThreadPool.ShouldGrowPool: Boolean;
begin
  Result := {(FWorkerThreadCount < FMinLimitWorkerThreadCount) and }(FIdleWorkerThreadCount < FQueuedRequestCount) and
    (FWorkerThreadCount < Self.FMaxLimitWorkerThreadCount);
end;

The condition FWorkerThreadCount < FMinLimitWorkerThreadCount does not appear to make sense.

 

A slightly modified Load function:

 

  procedure Load;
  Var
    Sum : Double;

  begin
    Sum := 0;
    Monitor := TObject.Create;
    try
      TParallel.For(0, 99999999, procedure(i: Integer)
      var
        T:double;
      begin
        T:=Sin(i/PI);
        TMonitor.Enter(Monitor);
        Sum := Sum + T;
        TMonitor.Exit(Monitor);
      end);
      Writeln('Sum = ', Sum);
    finally
      Monitor.Free;
    end;
  end;

produced the following results for multiple runs:

 

PPL Test ---------------
Before: Worker: 0, (Min: 8..Max: 200), Idle: 0, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
Sum =  1.37111034277419E+0000
Finished in 00:00:03.8990212
After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
------------------------
PPL Test ---------------
Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
Sum =  1.37111034276774E+0000
Finished in 00:00:03.8069356
After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
------------------------
PPL Test ---------------
Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
Sum =  1.37111034277633E+0000
Finished in 00:00:03.8392113
After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
------------------------
PPL Test ---------------
Before: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
Sum =  1.37111034277271E+0000
Finished in 00:00:03.8729582
After: Worker: 17, (Min: 8..Max: 200), Idle: 17, Retired: 0, Suspended: 0, CPU(Avg): 0, CPU: 0
------------------------

 

Share this post


Link to post
9 minutes ago, pyscripter said:

A slightly modified Load function

Is that change relevant to your patch?

It's hard to compare the results when more than one factor has changed.

Share this post


Link to post

No, The change was done to make sure that Parallel.For works OK and produces the same result.  

I am puzzled though by the reported CPU usages.  The Task Monitor shows that CPU usage is close to 100%.

Edited by pyscripter

Share this post


Link to post

I see. But I'm guessing the introduction of a lock, even if it's a spin lock, might influence the thread schedule.

IMO an interlocked increment would have been better for the verification part.

Share this post


Link to post
46 minutes ago, pyscripter said:

following results for multiple runs

CPU/Avg should be showing 100% else the lock is most likely throttling that or your implementation of formatted is off.

 

46 minutes ago, pyscripter said:

The condition FWorkerThreadCount < FMinLimitWorkerThreadCount does not appear to make sense.

I believe the point is to spawn MinLimitWorkerThreadCount ASAP and to make those available regardless of CPU load.

 

Also the last [two] tests must be after the pool has cleared, usually 10 secs.

 

Should have been included with the original code:

//MMWIN:CLASSCOPY
unit _MM_Copy_Buffer_;

interface

type
  TThreadPoolStatsHelper = record helper for TThreadPoolStats
    function Formatted: string;
  end;


implementation


function TThreadPoolStatsHelper.Formatted: string;
begin
  Result := Format('Worker: %2d, Min: %2d, Max: %2d, Idle: %2d, Retired: %2d, Suspended: %2d, CPU(Avg): %3d, CPU: %3d',
                   [self.WorkerThreadCount,
                    self.MinLimitWorkerThreadCount, self.MaxLimitWorkerThreadCount,
                    self.IdleWorkerThreadCount, self.RetiredWorkerThreadCount, self.ThreadSuspended,
                    self.AverageCPUUsage, self.CurrentCPUUsage]);
end;

end.

 

Edited by FredS
added TThreadPoolStatsHelper

Share this post


Link to post

@Freds

I see.  There are two ways to grow the pool.  One by GrowWorkerPool and another by TThreadPool.TThreadPoolMonitor.GrowThreadPoolIfStarved, so you create the ThreadPoolMonitor after the initial threads are created and then it undertakes the growing or the Pool.

 

 

Actually the problem with CPU usage was related to this method:

procedure TThreadPool.GrowWorkerPool;
begin
  if ShouldGrowPool then
  begin
    TMonitor.Enter(FQueue);
    try
      if ShouldGrowPool then
        if FRetiredWorkerThreadCount = 0 then
          CreateWorkerThread
        else
          ResurrectRetiredThread;
    finally
      TMonitor.Exit(FQueue);
    end;
  end else
    CreateMonitorThread;
end;

CreateMonitorThread was never called.  

 

If you change the above to:

 

procedure TThreadPool.GrowWorkerPool;
begin
  if ShouldGrowPool then
  begin
    TMonitor.Enter(FQueue);
    try
      if ShouldGrowPool then
        if FRetiredWorkerThreadCount = 0 then
          CreateWorkerThread
        else
          ResurrectRetiredThread;
    finally
      TMonitor.Exit(FQueue);
    end;
    if FMonitorThreadStatus = [] then
      CreateMonitorThread;
  end else
    CreateMonitorThread;
end;

then CPU usage is reported OK.

Edited by pyscripter

Share this post


Link to post
14 minutes ago, pyscripter said:

CPU usage is reported OK

 

There are multiple issues, first and foremost is the response time for spawning threads.

Then there are all those 'undocumented Features' new and old..

 

Remove the TMonitor in your Load and get rid of the Sum because the only reason to call that is to tax the system.
Maybe switch to x64 and run that again, at least the time will increase in x64 and allow for more threads to be added.

 

Don't forget to wait for them to idle out ~10 seconds then run two more times to see if it breaks with negative numbers, which should also give you increased times.

 

See the negative idle in Rio 10.3.3:

 

Share this post


Link to post
38 minutes ago, pyscripter said:

I have a simple fix

It seems you have just reverted the two sections of code to the way they where in Delphi 10.2, but do we know why they were changed in the first place. Your fix might be reintroducing another bug.

 

I've not analyzed the internals of the thread pool but your change, and the rationale behind it, "feels" more correct than the original code.

Share this post


Link to post
6 minutes ago, Anders Melander said:

It seems you have just reverted the two sections of code to the way they where in Delphi 10.2

I was not aware of that.  It is certainly wrong now.

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×