dummzeuch

Odd timing when using multiple threads


Posted (edited)

The following is about a 32 bit Windows console application.

 

I'm trying to improve performance for analyzing a huge Mono8 bitmap, stored in memory as an array of bytes (no TBitmap involved).

For each pixel the algorithm calculates the brightness for the pixel itself as well as a few surrounding pixels and writes the result to an array of byte.

 

Doing this single threaded takes about 1300 ms.

 

Since this can be done by multiple threads completely independently and without the risk of race conditions, I have split the work into n work packages, each processing some lines of the bitmap and each processed by a separate thread. The threads have been created in advance in a thread pool. Since there is no risk of race conditions, I don't use any synchronization or locking mechanism during the processing.

 

Synchronization is only necessary for assigning a work package to a thread and for signalling that a work package has been processed.

This is the relevant code of the thread's execute method and the SetNext method called by the main thread to assign a work package to it:

procedure TWorkerThread.Execute;
begin
  inherited Execute;
  while not Terminated do begin
    FNewPackageEvent.WaitFor(500);
    FNewPackageEvent.ResetEvent;
    if Terminated then
      Exit; //==>
    FCritSect.Enter;
    try
      if Assigned(FData) and Assigned(FWorkCall) then
        FWorkCall(FData);
    finally
      FCritSect.Leave;
    end;
  end;
end;

procedure TWorkerThread.SetNext(_WorkCall: TWorkPackageCall; _Data: Pointer);
begin
  FCritSect.Enter;
  try
    if Assigned(FData) or Assigned(FWorkCall) then
      raise ESigException.Create('Programmer error: Package is already assigned.');
    FData := _Data;
    FWorkCall := _WorkCall;
    FNewPackageEvent.SetEvent;
  finally
    FCritSect.Leave;
  end;
end;

FNewPackageEvent is a TEvent and FCritSect is a critical section, each unique to the TWorkerThread instance.

A work package consists of a data pointer and a procedure pointer (not method pointer) to call.
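In Delphi terms, such a work package call could be declared like this (a sketch; the type name is hypothetical, matching the `FWorkCall` field above):

```pascal
type
  // A plain procedure pointer, not a method pointer (no "of object"),
  // receiving the work package's data pointer.
  TWorkPackageCall = procedure(_Data: Pointer);
```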

 

When a work package has finished processing, a counter variable gets decremented like this:

InterlockedDecrement(FCounter^);

The main thread also processes one work package (it's also counted as one of the threads below) and then waits for the others to finish:

  while WorkPackageCounter > 0 do
    Sleep(WorkPackageCounter);

(FCounter above is a pointer to WorkPackageCounter.) Since every thread processes only one work package, the counter starts at the number of threads.
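The dispatching side is not shown above; a minimal sketch of what it might look like (all names here are hypothetical, assuming ProcessPackage is a plain procedure matching the work package call type):

```pascal
// Hypothetical dispatch code in the main thread: each worker gets exactly
// one package, the main thread processes the last one itself.
procedure TBitmapAnalyzer.DistributeWork;
var
  i: Integer;
begin
  WorkPackageCounter := ThreadCount;
  // hand one package to each worker thread ...
  for i := 0 to ThreadCount - 2 do
    FThreads[i].SetNext(ProcessPackage, @FPackages[i]);
  // ... process the remaining package in the main thread ...
  ProcessPackage(@FPackages[ThreadCount - 1]);
  // ... and wait for the workers to finish
  while WorkPackageCounter > 0 do
    Sleep(WorkPackageCounter);
end;
```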

 

This works fine but I get some rather odd timing:

 

Average time on 1 calls using 1 threads 1278 [ms]
Average time on 1 calls using 2 threads 877 [ms]    <------
Average time on 1 calls using 3 threads 1627 [ms]  <------
Average time on 1 calls using 4 threads 1580 [ms]
Average time on 1 calls using 5 threads 1511 [ms]
Average time on 1 calls using 6 threads 1167 [ms]
Average time on 1 calls using 7 threads 1438 [ms]
Average time on 1 calls using 8 threads 1036 [ms]
Average time on 1 calls using 9 threads 957 [ms]
Average time on 1 calls using 10 threads 847 [ms]
Average time on 1 calls using 11 threads 958 [ms]
Average time on 1 calls using 12 threads 843 [ms]
Average time on 1 calls using 13 threads 821 [ms]
Average time on 1 calls using 14 threads 715 [ms]
Average time on 1 calls using 15 threads 799 [ms]
Average time on 1 calls using 16 threads 647 [ms]
Average time on 1 calls using 17 threads 693 [ms]
Average time on 1 calls using 18 threads 656 [ms]
Average time on 1 calls using 19 threads 525 [ms]
Average time on 1 calls using 20 threads 613 [ms]

 

As you can see, distributing the work to 2 threads reduces the processing time considerably, though not quite by half because of some overhead, which is what I would have expected.

 

But distributing it to 3 threads all of a sudden takes more time than the single threaded approach.

 

When adding more threads the processing time goes down again until it more or less levels off at around 15 threads.

 

This timing is from only a single call, but I also tried it with 100 calls and the average was about the same.

 

The CPU is an Intel Xeon with 4 cores + Hyperthreading, which gives me 8 logical processor cores.

 

Can anybody give me a hint what to look for to explain the oddity of 3 threads taking longer than 2 threads?

Edited by dummzeuch

Posted (edited)

Hmm, doesn't

FNewPackageEvent.WaitFor(500);

wait for the event for at most 500 ms and then continue anyway, even if it is still non-signaled?

You also didn't show the main code that calls the SetNext method. I see a possibility that a SetNext call could happen while a thread is still processing its task, so the critical section waits for it to finish.

Edited by Fr0sT.Brutal

2 minutes ago, Fr0sT.Brutal said:

wait for the event for at most 500 ms and then continue anyway, even if it is still non-signaled?

Yes, but then it doesn't do anything because supposedly FData = nil. Instead it terminates the thread (I'm guessing).

This isn't how I would design a work queue but I don't think the queue's the problem.

 

@dummzeuch Are the threads reading from memory in the same cache line as other threads are writing to? Or are the threads writing to memory in the same cache line?

If so you are probably suffering from false sharing and/or cache-line ping pong.

Posted (edited)
1 hour ago, Fr0sT.Brutal said:

Hmm, doesn't


FNewPackageEvent.WaitFor(500);

wait for the event for at most 500 ms and then continue anyway, even if it is still non-signaled?

Yes, it does indeed. But since ...

Assigned(FData) and Assigned(FWorkCall)

... will be False if no work package has been assigned, it will simply do nothing and then return to the wait state.

1 hour ago, Fr0sT.Brutal said:

You also didn't show the main code that calls the SetNext method. I see a possibility that a SetNext call could happen while a thread is still processing its task, so the critical section waits for it to finish.

Good point, but since the thread pool is specific to the class doing the processing, that can't be the case. All threads are idle when the bitmap is passed into the method and will be idle again once the processing has finished.

 

I found something, though: I was using a single dynamic array to hold some intermediate results, which was initialized within the work packages. Replacing it with a static array of the required length removed the timing oddity. Now two threads consistently take significantly longer than one thread, and the processing time goes down from there.

 

This is kind of consistent but not really useful. 😞

Edited by dummzeuch


After replacing the dynamic array with a static array (see above), the timing now looks like this:

 

 10 calls using  1 threads:  1158 ms (TotalTime [ms]:  1140 SingleTimes [ms]:  1140)
 10 calls using  2 threads:  1412 ms (TotalTime [ms]:  2872 SingleTimes [ms]:  1437  1435)
 10 calls using  3 threads:  1302 ms (TotalTime [ms]:  2501 SingleTimes [ms]:  1055  1060   386)
 10 calls using  4 threads:   752 ms (TotalTime [ms]:  2657 SingleTimes [ms]:   503   650   767   737)
 10 calls using  5 threads:   670 ms (TotalTime [ms]:  2832 SingleTimes [ms]:   364   627   668   582   591)
 10 calls using  6 threads:   643 ms (TotalTime [ms]:  3195 SingleTimes [ms]:   316   528   553   501   635   662)
 10 calls using  7 threads:   503 ms (TotalTime [ms]:  2973 SingleTimes [ms]:   459   466   360   473   493   404   318)
 10 calls using  8 threads:   520 ms (TotalTime [ms]:  3460 SingleTimes [ms]:   344   364   356   470   498   450   486   492)
 10 calls using  9 threads:   526 ms (TotalTime [ms]:  3854 SingleTimes [ms]:   316   322   427   428   440   498   517   456   450)
 10 calls using 10 threads:   417 ms (TotalTime [ms]:  3238 SingleTimes [ms]:   228   325   338   392   274   386   361   337   324   273)

 

(TotalTime is the sum of the processing time for all threads. SingleTimes are the processing times for each thread.)

Posted (edited)
9 minutes ago, Anders Melander said:

Are the threads reading from memory in the same cache line as other threads are writing to? Or are the threads writing to memory in the same cache line?

If so you are probably suffering from false sharing and/or cache-line ping pong.

My knowledge of processor architecture is insufficient to answer this question.

 

The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.

Edited by dummzeuch

1 hour ago, dummzeuch said:

My knowledge of processor architecture is insufficient to answer this question.

It's a good thing we have Google then. There are lots of (relatively) easy-to-understand explanations of these topics.

 

1 hour ago, dummzeuch said:

The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.

The data being manipulated just needs to be in different cache lines. The size of a cache line depends on the processor. Again: Google.

3 hours ago, Anders Melander said:

The size of a cache line depends on the processor.

The size can be retrieved from System.TMonitor.CacheLineSize.
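A minimal console program to print that value (assuming a Delphi version where TMonitor exposes this property, as stated above):

```pascal
program ShowCacheLineSize;
{$APPTYPE CONSOLE}
begin
  // The RTL determines the cache line size of the CPU at startup
  Writeln('Cache line size: ', TMonitor.CacheLineSize, ' bytes');
end.
```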

15 hours ago, dummzeuch said:
15 hours ago, Anders Melander said:

Are the threads reading from memory in the same cache line as other threads are writing to? Or are the threads writing to memory in the same cache line?

If so you are probably suffering from false sharing and/or cache-line ping pong.

My knowledge of processor architecture is insufficient to answer this question.

 

The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.

Which turned out to be the problem. The work memory area of each work package was declared like this:

  TWorkPackage = record
    FCounter: PInt32;
    FIsDone: LongBool;
    FScanLine0: PByte;
    FBytesPerLine: Integer;
    FWidth: Integer;
    FHeight: Integer;
    FTop: Integer;
    FBottom: Integer;
    FPixelValues: TMedianValuesArr;
    FMedianPixelArr: TMedianPixelArr;
    FArr: TBrightnessMatrix;
    FSum: int64;
    FStopwatch: TStopwatch;
  end;

And these were stored in a dynamic array, so they were all located within the same memory area. FMedianPixelArr is constantly being written to in all threads. If I understand the cache line stuff correctly, because these records were all stored within a memory block smaller than a CPU cache line (apparently at most 256 bytes on current CPUs), every write by one of the threads invalidated the cached data for all the other threads.

 

For testing this, I have now simply increased the size of the record by 256 bytes by adding an array [0..255] of byte to it so each of them is larger than the maximum possible cache line.

 

Here is the timing after this change:

 

  1 calls using  1 threads:  1143 ms (TotalTime [ms]:  1125 SingleTimes [ms]:  1125)
  1 calls using  2 threads:   607 ms (TotalTime [ms]:  1124 SingleTimes [ms]:   565   559)
  1 calls using  3 threads:   451 ms (TotalTime [ms]:  1186 SingleTimes [ms]:   395   393   398)
  1 calls using  4 threads:   368 ms (TotalTime [ms]:  1222 SingleTimes [ms]:   308   298   312   304)
  1 calls using  5 threads:   367 ms (TotalTime [ms]:  1357 SingleTimes [ms]:   244   306   303   242   262)
  1 calls using  6 threads:   344 ms (TotalTime [ms]:  1533 SingleTimes [ms]:   228   259   274   271   245   256)
  1 calls using  7 threads:   296 ms (TotalTime [ms]:  1636 SingleTimes [ms]:   222   246   227   219   236   250   236)
  1 calls using  8 threads:   304 ms (TotalTime [ms]:  1806 SingleTimes [ms]:   226   222   224   225   227   225   224   233)
  1 calls using  9 threads:   322 ms (TotalTime [ms]:  1909 SingleTimes [ms]:   230   226   198   238   220   197   199   195   206)
  1 calls using 10 threads:   326 ms (TotalTime [ms]:  2009 SingleTimes [ms]:   174   216   172   208   176   236   232   178   213   204)
  1 calls using 11 threads:   295 ms (TotalTime [ms]:  2140 SingleTimes [ms]:   232   222   221   165   221   171   201   162   158   181   206)
  1 calls using 12 threads:   273 ms (TotalTime [ms]:  2494 SingleTimes [ms]:   212   208   224   210   227   210   212   212   198   202   198  

 

Which is exactly what I would have expected: the total processing time is reduced roughly in inverse proportion to the number of threads, until the overhead of splitting the work reduces the potential gain of using more threads.

 

Hm, looking at the single-times values: given that with each additional thread the number of lines each thread needs to process decreases, I wonder why these times don't get any lower than around 200 ms.

 

Anyway, thanks @Anders Melander you seem to have nailed the problem.

 

1 hour ago, dummzeuch said:

I wonder why these times don't get any lower than around 200 ms

Probably because the threads compete for a free core? If there are more threads than cores, they are not executed one after the other on a core but are interrupted to give other threads a chance to run. A thread may even migrate between several cores before it finishes.

 

Given the timing for the 1-thread case, the theoretical minimum for 12 threads should be less than 100 ms, which is far away from the actual numbers.

Posted (edited)
1 hour ago, Uwe Raabe said:

Probably because the threads compete for a free core? If there are more threads than cores, they are not executed one after the other on a core.

Of course: There are only 8 virtual processors on my computer, so the maximum number of threads that can run in parallel is 8. I should have thought of that.

Edited by dummzeuch


You can also get the line size with sysinternals coreinfo, output looks like this:

Logical Processor to Cache Map:
**----------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**----------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**----------  Unified Cache       0, Level 2,  512 KB, Assoc   8, LineSize  64
************  Unified Cache       1, Level 3,   32 MB, Assoc  16, LineSize  64
--**--------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**--------  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**--------  Unified Cache       2, Level 2,  512 KB, Assoc   8, LineSize  64

 

7 hours ago, dummzeuch said:

For testing this, I have now simply increased the size of the record by 256 bytes by adding an array [0..255] of byte to it so each of them is larger than the maximum possible cache line.

A perfectly valid solution. You could also do it like this:

TPackageItem = record
  WorkPackage: TWorkPackage;
  Padding: array[0..256 - SizeOf(TWorkPackage) - 1] of Byte;
end;

 

I don't know how you distribute the rows being processed by each thread but another thing to try is to ensure that each thread will be working on a contiguous set of rows.

For example for 10 rows and 3 threads the rows should be divided something like this:

AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
BBBBBBBBBB
BBBBBBBBBB
BBBBBBBBBB
CCCCCCCCCC
CCCCCCCCCC
CCCCCCCCCC
CCCCCCCCCC

and not like this:

AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
AAAAAAAAAA

This will improve cache locality.

 

Another thing that will improve cache locality is working on only one row at a time. For example, if you are averaging 3x3 pixels, then I'm guessing that you are reading pixels from 3 different rows for each pixel. If the size of a row (or even 2 rows) is greater than the cache, then this will thrash the cache.

To avoid this you can first process the bitmap horizontally, transpose the bitmap and then process it horizontally again. One might think that a transpose is too expensive but it's actually fairly cheap (if you use a good blocked transpose algorithm) and the gain from not thrashing the cache makes it worth it.
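Such a blocked transpose for a Mono8 bitmap might be sketched like this (a simplification; the name, the stride handling, and the BlockSize value are assumptions to be tuned for the actual cache; Min is from System.Math):

```pascal
uses
  System.Math;

// Transpose an 8-bit bitmap block by block so that both source reads and
// destination writes stay within cache-sized tiles.
procedure BlockedTranspose(Src, Dst: PByte; Width, Height, SrcStride, DstStride: Integer);
const
  BlockSize = 64; // tile edge length in pixels; tune to the cache
var
  bx, by, x, y, xEnd, yEnd: Integer;
begin
  by := 0;
  while by < Height do begin
    bx := 0;
    while bx < Width do begin
      yEnd := Min(by + BlockSize, Height);
      xEnd := Min(bx + BlockSize, Width);
      // transpose one tile: source pixel (x, y) ends up at (y, x)
      for y := by to yEnd - 1 do
        for x := bx to xEnd - 1 do
          Dst[x * DstStride + y] := Src[y * SrcStride + x];
      Inc(bx, BlockSize);
    end;
    Inc(by, BlockSize);
  end;
end;
```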

Posted (edited)
2 hours ago, Anders Melander said:

You could also do it like this:

Interesting approach.

2 hours ago, Anders Melander said:

ensure that each thread will be working on a contiguous set of rows

I already do that.

2 hours ago, Anders Melander said:

Another thing that will improve cache locality is only working on one row at a time. For example if you are averaging 3x3 pixels then I'm guessing that you are reading pixels from 3 different rows for each pixel. If the size of a row (or even 2 rows) is greater than the cache then this will trash the cache.

Hm, yes. I'll have to think about this.

 

So far I am quite satisfied with reducing the analysis time for the pictures by a factor of 8 (on my computer). Now I need to find out how many logical processors are available on the computers on which this program will actually be used. Since these are probably >2 years old, they might need updating anyway. The program itself is now at 60% processor usage, up from 17% when it was single threaded, so there are probably some other parts that could be improved, which might be easier to achieve.

Edited by dummzeuch

3 hours ago, dummzeuch said:

Now I need to find out how many logical processors are available on the computers on which this program will actually be used.

TThread.ProcessorCount seems to be what you are looking for:

    /// <summary>
    ///    The number of processor cores on which this application is running. This will include virtual
    ///    "Hyper-threading" cores on many modern Intel CPUs. It is ultimately based on what the underlying
    ///    operating system reports.
    /// </summary>
    class property ProcessorCount: Integer read FProcessorCount;

 

11 hours ago, Uwe Raabe said:

TThread.ProcessorCount seems to be what you are looking for:


    /// <summary>
    ///    The number of processor cores on which this application is running. This will include virtual
    ///    "Hyper-threading" cores on many modern Intel CPUs. It is ultimately based on what the underlying
    ///    operating system reports.
    /// </summary>
    class property ProcessorCount: Integer read FProcessorCount;

 

I've taken the low-tech approach: I asked my colleagues what type of processor their computers have.

1 hour ago, dummzeuch said:

I asked my colleagues what type of processor their computers have.

Don't forget to tell them to notify you when their hardware changes someday.

Posted (edited)
6 hours ago, Uwe Raabe said:

Don't forget to tell them to notify you when their hardware changes someday.

Oh, you were referring to how the program decides how many threads to use? I have already implemented that to use as many threads as there are logical processors, determined using GetSystemInfo. I was more worried about whether their current computers are good enough to run the program or whether they need an upgrade. If their hardware changes in the future, it will likely become faster anyway.
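A sketch of determining the thread count via GetSystemInfo as described (the function name is illustrative; GetSystemInfo and TSystemInfo are from Winapi.Windows):

```pascal
uses
  Winapi.Windows;

// Returns the number of logical processors reported by Windows
function LogicalProcessorCount: Integer;
var
  SysInfo: TSystemInfo;
begin
  GetSystemInfo(SysInfo);
  Result := SysInfo.dwNumberOfProcessors;
end;
```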

Edited by dummzeuch
5 hours ago, dummzeuch said:

Oh, you were referring to how the program decides how many threads to use? I have already implemented that to use as many threads as there are logical processors, determined using GetSystemInfo. I was more worried about whether their current computers are good enough to run the program or whether they need an upgrade. If their hardware changes in the future, it will likely become faster anyway.

Remember that in the new Intel hybrid architectures (such as Alder Lake) not all cores are usable, or rather not all cores have the same processing "power" (e.g. P-cores vs. E-cores).

