Jump to content
hsvandrew

Changes in Parallel Library

Recommended Posts

After much pain and cost and time wasted, I wanted to leave this note for others

TTask, Parallel Library -- DO NOT USE

My advice to all Delphi developers: stick with OmniThread or write your own ThreadPool using TThread's. 

The Parallel Task Library (TTask, Parallel For loops etc) are all based on completely flawed code that still isn't fixed in Rio

 

You'll invest many hours moving code across to TTask and then have to deal with all the core fundamental issues which you may not notice under normal conditions.

 

These include:

Flawed logic that runs only a few threads at a time (worse on a virtual machine)

Flaws which sometimes stop tasks from starting at all

Faults that, due to point 1 could result in a complete bottleneck of your app if you nest TTasks which may be hard to detect inside other functions

Memory leaks on Linux n 10.2.x

 

Hope this helps someone. We have got to the point where we try to avoid all new EMB built libraries where possible when a third party solution exists as we know it will be supported, tested and properly written.

Edited by hsvandrew
  • Thanks 1

Share this post


Link to post

Please allow me to disagree.  I have made non-trivial use of System.Threading in PyScripter (used by thousands of users)  without any problems. The main issue I know of with the library is RSP-11267.   You can see a fix in that report, but even without the fix you can work around the issue by using a different mechanism for signalling task to stop  As for Parallel.For I have not used it extensively, but you can see some code of mine in this StackOverflow question that I think is now used in a commercial product.  An issue with thread creation by Parallel.For (RSP-21177) has been fixed in Rio.

 

Advantages of System.Threading:

  •  By far the easiest way to use tasks (as opposed to threads) in your delphi application
  • Nice interface using anonymous methods 
  • Avoids third-party dependencies
  •  It provides usage statistics that can be used for diagnostic purposes
  •  Nowadays, it provides good performance out of the box, but you can still fine-tune the way it works if you are prepared to dig-in
  •  It uses state-of-the-art synchronization techniques
  •  @Primož Gabrijelčič can correct me on this, but by reading his book "Delphi High Performance", I think, that depending on the type of use, is competitive if not faster than his own OmniThreadLibrary

The Parallel library, like many other new features of Delphi (e.g. Generics) got a bad reputation, due to the many bugs in the early days.  Unfortunately. trust is easy to lose and very hard to rebuild.  But the library does deserves some love from Embarcadero (fixing remaining issues) and Delphi users.   

  • Like 2
  • Thanks 2

Share this post


Link to post

I make a simply use of this library an TTask, and I don't see any problem with it (for sure not full tested), but reading your point "The first task takes forever to start (over a second)" , I would like to know if somebody else can confirm this, because I experimented delays at start, but it did not relate to TTask, because they are tasks that connect to remote servers and I thought that was the problem, but if the TTask start takes a second, it is possible that there is a problem with TTask. I would like to know if in this sense it can be a better to use other alternative like OmniThreadLibrary.

 

Have a good year¡

Share this post


Link to post

Make a small testcase to demo the bug so others can confirm it or point to error in code.

  • Like 3

Share this post


Link to post
Quote

"The first task takes forever to start (over a second)"

Of course not.

However parallel programming is not a piece of cake.   There are so many ways you can shoot yourself in the foot and then blame the library.

Edited by pyscripter
  • Like 2

Share this post


Link to post
1 hour ago, pyscripter said:

Of course not.

However parallel programming is not a piece of cake.   There are so many ways you can shoot yourself in the foot and then blame the library.

Ok I'm prepared to concede this problem isn't happening anymore on 10.2.3 (some of these results came from earlier in 2018) - it is possible this is a linux only problem as that was the platform where this issue was experienced

Edited by hsvandrew

Share this post


Link to post

I'm going to pause this rant for now, perhaps these issues have been resolved and its just the tickets in quality control haven't been closed. 

We spend a lot of time on this in January and it failed miserably (especially for Linux) so we gave up.

It appears it might be ok now on 10.2.3/10.3 I will do some more testing and advise

  • Like 1

Share this post


Link to post

The biggest problem with PPL (apart from number of terrifying bugs that were only recently fixed) is the annoying and uncontrollable behavior of its thread pool. It will start threads as it seems appropriate, sometimes seemingly without considering min/max concurrent tasks configuration (of which the 'min' part is actually not, as per source, see below).

 

// Returns false if attempting to set a value < 0 or > MaxWorkerThreads. The actual number of pool
// threads could be less than this value depending on actual demand. Setting this to too few threads could
// be less than optimal resource utilization.
function SetMinWorkerThreads(Value: Integer): Boolean;

(BTW, the comment suggests that MinWorkerThreads can be set to the same value as MaxWorkerThreads. This is not the case. It must be less than that.)

 

"Annoying and uncontrollable behavior": Sometimes you schedule N threads, then wait, then schedule M (M > N, but M < MaxWorkerThreads) threads, but only N would start. I can repeatedly reproduce this in Tokyo (and Berlin, and Seattle ...). The problem was mostly, but not completely, fixed in Rio. I noticed some problems there, too, but at the moment I don't yet know how to reproduce the problem from scratch.

 

  • Like 1

Share this post


Link to post
13 hours ago, hsvandrew said:

I'm going to pause this rant for now, perhaps these issues have been resolved and its just the tickets in quality control haven't been closed. 

We spend a lot of time on this in January and it failed miserably (especially for Linux) so we gave up.

It appears it might be ok now on 10.2.3/10.3 I will do some more testing and advise

Please amend the title. The title should IMO reflect the contents of the post, which "-------" does not.

  • Like 3

Share this post


Link to post



Quote

 

"Annoying and uncontrollable behavior"

 

@Primož GabrijelčičAt least in Rio it seems to work OK:

program Project2;
{$APPTYPE CONSOLE}
uses
  WinApi.Windows,
  System.Diagnostics,
  System.Threading;

procedure TestParallel;
Var
  TaskArray:  array [1..100] of ITask;
begin
  for var I := Low(TaskArray) to High(TaskArray) do
    TaskArray[I] := TTask.Create(procedure
    begin
      Sleep(100);
    end).Start;
  TTask.WaitForAll(TaskArray);
end;

begin
  var SW := TStopWatch.StartNew;
  //TThreadPool.Default.SetMinWorkerThreads(50);
  TestParallel;
  WriteLn(SW.ElapsedMilliseconds);
  WriteLn;
  Var TPS := TThreadPoolStats.Default;
  WriteLn('WorkerThreadCount: ', TPS.WorkerThreadCount);
  WriteLn('MinLimitWorkerThreadCount: ', TPS.MinLimitWorkerThreadCount);
  WriteLn('MaxLimitWorkerThreadCount: ', TPS.MaxLimitWorkerThreadCount);
  ReadLn;
end.

Output on an 8 processor machine:

1370

WorkerThreadCount: 9
MinLimitWorkerThreadCount: 8
MaxLimitWorkerThreadCount: 200

Output with TThreadPool.Default.SetMinWorkerThreads(50);

 

233

WorkerThreadCount: 50
MinLimitWorkerThreadCount: 50
MaxLimitWorkerThreadCount: 200

With TParallel.&For it works differently and the number of threads created relate to the number of processors.

 

 

Edited by pyscripter
Parallel.For
  • Like 2

Share this post


Link to post

Here's my test with your code to give a version w/ 10.2.

 

402

WorkerThreadCount: 32
MinLimitWorkerThreadCount: 32
MaxLimitWorkerThreadCount: 800

w/ TThreadPool.Default.SetMinWorkerThreads(50); 

214

WorkerThreadCount: 50
MinLimitWorkerThreadCount: 50
MaxLimitWorkerThreadCount: 800

So all works well either.

Share this post


Link to post

@farcodev32 logical processors, Wow!  What is your CPU?  How many cores does it have?   It appears to be super fast.

Edited by pyscripter

Share this post


Link to post

@pyscripter a good old Threadripper 1950x :)   16 cores / 32 threads overclocked @ 3.7Ghz.   I upgraded it from my "old" 3930K Intel in August 2017.

 

Share this post


Link to post
On 1/2/2019 at 7:45 PM, pyscripter said:

in Rio it seems to work OK

 

Modified your code to get a better look at the difference.

 

Both Executed in Berlin 10.1.2
Ideally #1 should reach a higher WorkerThreadCount during its two minutes of execution but still the same with Rio.

#3 may be an issue (also in Rio), imagine running a pile of low intensity tasks first then executing something that loads Data Modules, Databases and whatnot.. starting at the same count you left off could cause out of memory issues unless dealt with (not tested).

#4 IMO is handled better in Berlin, the AverageCPUUsage delegates the Pool to drop below MinLimitWorkerThreadCount.

 

** RIO System.Threading unit, modified to compile in Berlin

  #1 Slow increase, never reached full ActiveTask Count
    00:02:13.8790372
    AverageCPUUsage: 35
    WorkerThreadCount: 34
    MinLimitWorkerThreadCount: 4

  #2 Increased to 58 immediately, holds appropriate AverageCPUUsage    
    00:01:11.9777749
    AverageCPUUsage: 90
    WorkerThreadCount: 58
    MinLimitWorkerThreadCount: 50

  #3 Starts off with #2 WorkerThreadCount and holds appropriate AverageCPUUsage     
    00:01:11.8997475
    AverageCPUUsage: 89
    WorkerThreadCount: 58
    MinLimitWorkerThreadCount: 4

  #4 Does NOT hold appropriate AverageCPUUsage    
    00:01:35.0015869
    AverageCPUUsage: 100
    WorkerThreadCount: 99
    MinLimitWorkerThreadCount: 99

 

** Berlin System.Threading unit, modified with RSP fixes only

 

  #1 Slow increase, never reached full ActiveTask Count
    00:02:13.9110588
    AverageCPUUsage: 34
    WorkerThreadCount: 34
    MinLimitWorkerThreadCount: 4

  #2 Increased to 58 immediately, holds appropriate AverageCPUUsage
    00:01:11.4247095
    AverageCPUUsage: 89
    WorkerThreadCount: 59
    MinLimitWorkerThreadCount: 50
    
  #3 Starts off with #2 WorkerThreadCount and holds appropriate AverageCPUUsage
    00:01:11.6894449
    AverageCPUUsage: 88
    WorkerThreadCount: 59
    MinLimitWorkerThreadCount: 4
 
  #4 Drops appropriately below MinLimitWorkerThreadCount to keep AverageCPUUsage
    00:01:21.0234632
    AverageCPUUsage: 88
    WorkerThreadCount: 78
    MinLimitWorkerThreadCount: 99

 

program Project2;
{$APPTYPE CONSOLE}
uses
  WinApi.Windows,
  System.SysUtils,
  System.Diagnostics,
  System.Threading,
  System.Classes,
  System.TimeSpan;

var
  SW : TStopWatch;
  ActiveTasksTicks : Int64;

const
    SleepMs     = 100;
    SleepCycles = 333;
    HighPrime   = 100 * 1000;

function IsPrime(const Value: Integer): Boolean;
{$REGION 'History'}
//  18-Jul-2018 - From a Parallel Task example
{$ENDREGION}
var Test, k: Integer;
begin
  if Value <= 3 then IsPrime := Value > 1
  else if ((Value mod 2) = 0) or ((Value mod 3) = 0) then IsPrime := False
  else begin
    IsPrime := True;
    k := Trunc(Sqrt(Value));
    Test := 5;
    while Test <= k do begin
      if ((Value mod Test) = 0) or ((Value mod (Test + 2)) = 0) then begin
        IsPrime := False;
        break; { jump out of the for loop }
      end;
      Test := Test + 6;
    end;
  end;
end;

procedure TestParallel;
var
  TaskArray:  array [1..100] of ITask;
  Ticks : Int64;
  i : integer;
begin
  for I := Low(TaskArray) to High(TaskArray) do
    TaskArray[I] := TTask.Create(procedure
    var
      s, p: integer;
      Succeeded : Boolean;
      {* GetActiveTasks *}
      function GetActiveTasks: integer;
      var i : integer;
      begin
        Result := 0;
        for i := Low(TaskArray) to High(TaskArray) do
           if (TaskArray[i].Status = System.Threading.TTaskStatus.Running) then begin
             Inc(Result);
           end;
        end;
    begin
      for s := 1 to SleepCycles do begin
         Ticks := ActiveTasksTicks;
         if (Ticks + TTimeSpan.TicksPerSecond) < SW.ElapsedTicks then begin
           System.AtomicCmpExchange(ActiveTasksTicks, SW.ElapsedTicks, Ticks, Succeeded);
           if Succeeded then begin
             Write(#13 + Format('ActiveTasks: %3.d,  CPU: %3.d%%', [GetActiveTasks, TThreadPoolStats.Current.CurrentCPUUsage]));
           end;
         end;
         Sleep(SleepMs);
         for p := 1 to HighPrime do isPrime(p);
      end
    end).Start;
  TTask.WaitForAll(TaskArray);
end;

procedure TestTasks(AMinWorkerThreads : integer);
var
  TPS : TThreadPoolStats;
begin
  ActiveTasksTicks := 0;
  SW := TStopWatch.StartNew;
  TThreadPool.Default.SetMinWorkerThreads(AMinWorkerThreads);
  TestParallel;
  Write(#13 + StringOfChar(' ', 72));
  WriteLn(#13 + SW.Elapsed.ToString);
  TPS := TThreadPoolStats.Default;
  WriteLn('AverageCPUUsage: ', TPS.AverageCPUUsage);
  WriteLn('WorkerThreadCount: ', TPS.WorkerThreadCount);
  WriteLn('MinLimitWorkerThreadCount: ', TPS.MinLimitWorkerThreadCount);
  WriteLn;
end;

begin
  TestTasks(TThread.ProcessorCount);
  TestTasks(50);
  TestTasks(TThread.ProcessorCount);
  TestTasks(99);
  ReadLn;
end.

 

Share this post


Link to post

The optimal thread pool configuration for IO intensive task is of course different that that for computational intensive tasks.  The first case calls for many more threads than the second, since the tasks are in waiting state most of the time.  System.Threading allows to have separate thread pools for different sets of tasks.

Share this post


Link to post

No offense but I cannot take people serious that test any threading library code by putting sleep into their thread execution to simulate any real workload...

Edited by Stefan Glienke
  • Like 1

Share this post


Link to post

//Notice: I have just renamed this thread to a more meaningful subject.

  • Like 2

Share this post


Link to post
5 minutes ago, Stefan Glienke said:

No offense but I cannot take people serious that test any threading library code by putting sleep into their thread execution to simulate any real workload...

Because?

 

Unless the thread pool manager is measuring CPU load, the result should be the same. And if the thread pool manager does that, it is easy to replace the code with a busy loop. As far as my bug hunting in Delphi's TThreadPool went, replacing Sleep(2000) with a 2 second busy loop never changed anything.

Share this post


Link to post

Because Sleep "suspends the execution of the current thread" (see MSDN) opposed to some "real" work. That might or might not affect significantly the behavior of the threading library itself depending on where it has glitches or not.

Share this post


Link to post

@Stefan GlienkeIt does depend on what you are testing.  If one claims that it takes > 1s for the first task to execute, it is sufficient to show it is not true.   Also the work load on many real-life multi-threaded applications is not dissimilar.   Say you fire 100 HTTP requests and you wait for the results to come, or you do asynchronous IO with memory mapped files (using overlapped events).

 

I like this use of language:

  • No doubt … followed by a debatable statement
  • It is clear that … followed by an unclear statement
  • No offence …. followed by an offensive statement
  • etc.
Edited by pyscripter
  • Thanks 1
  • Haha 1

Share this post


Link to post

My point was that countering that there is "uncontrollable behavior" sometimes with showing that there is nothing wrong when firing off a bunch of sleep threads is useless. A thread that does nothing but sleep suspends and gives control back to the OS scheduler which knows not to wake up that thread until time is over. You can see that in the numbers you posted.

 

Also when doing IO heavy stuff you hardly want to put that into CPU threads but use I/O completion ports.

Edited by Stefan Glienke

Share this post


Link to post

"Might or might not". 

 

Sleep is a good tool to demonstrate a bug. But I agree with you - if it works with Sleep, it still may not work with a high CPU load code. So proving with "Sleep" that the library works is not good enough. OTOH, proving with a high CPU load code that it works is also not good enough. 

 

"Unit tests can only be used to prove that a parallel code is NOT working."

- me

 

(BTW, I know that we agree here 🙂 I just wanted to state that testing parallel code with Sleep is a perfectly good tool. It is just not good enough.)

  • Like 1

Share this post


Link to post

Alternatives to sleep,could be to have a class that creates large arrays of random values, and do Bubble sort on them, optionally logging the progress to disk 😛

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×