dummzeuch Posted May 12, 2022

The following is about a 32 bit Windows console application.

I'm trying to improve performance for analyzing a huge Mono8 bitmap, stored in memory as an array of bytes (no TBitmap involved). For each pixel the algorithm calculates the brightness of the pixel itself as well as of a few surrounding pixels and writes the result to an array of byte. Doing this single threaded takes about 1300 ms.

Since this can be done by multiple threads completely independently without the risk of race conditions, I have split the work into n work packages, each processing some lines of the bitmap and each processed by a separate thread. The threads have been created in advance in a thread pool. Since there is no risk of race conditions I don't use any synchronization or locking mechanism during the processing. Synchronization is only necessary for assigning a work package to a thread and for signalling that a work package has been processed.

This is the relevant code of the thread's Execute method and the SetNext method called by the main thread to assign a work package to it:

procedure TWorkerThread.Execute;
begin
  inherited Execute;
  while not Terminated do begin
    FNewPackageEvent.WaitFor(500);
    FNewPackageEvent.ResetEvent;
    if Terminated then
      Exit; //==>
    FCritSect.Enter;
    try
      if Assigned(FData) and Assigned(FWorkCall) then begin
        FWorkCall(FData);
        // clear the package so a later WaitFor timeout does not run it again
        FData := nil;
        FWorkCall := nil;
      end;
    finally
      FCritSect.Leave;
    end;
  end;
end;

procedure TWorkerThread.SetNext(_WorkCall: TWorkPackageCall; _Data: Pointer);
begin
  FCritSect.Enter;
  try
    if Assigned(FData) or Assigned(FWorkCall) then
      raise ESigException.Create('Programmer error: Package is already assigned.');
    FData := _Data;
    FWorkCall := _WorkCall;
    FNewPackageEvent.SetEvent;
  finally
    FCritSect.Leave;
  end;
end;

FNewPackageEvent is a TEvent and FCritSect is a critical section, each unique to the TWorkerThread instance. A work package consists of a data pointer and a procedure pointer (not a method pointer) to call.

When a work package has finished processing, a counter variable gets decremented like this:

InterlockedDecrement(FCounter^);

The main thread also processes one work package (it's also counted as one of the threads below) and then waits for the others to finish:

while WorkPackageCounter > 0 do
  Sleep(WorkPackageCounter);

(FCounter above is a pointer to WorkPackageCounter.) Since every thread only processes one work package, the counter starts as the number of threads.
This works fine but I get some rather odd timing:

Average time on 1 calls using  1 threads 1278 [ms]
Average time on 1 calls using  2 threads  877 [ms] <------
Average time on 1 calls using  3 threads 1627 [ms] <------
Average time on 1 calls using  4 threads 1580 [ms]
Average time on 1 calls using  5 threads 1511 [ms]
Average time on 1 calls using  6 threads 1167 [ms]
Average time on 1 calls using  7 threads 1438 [ms]
Average time on 1 calls using  8 threads 1036 [ms]
Average time on 1 calls using  9 threads  957 [ms]
Average time on 1 calls using 10 threads  847 [ms]
Average time on 1 calls using 11 threads  958 [ms]
Average time on 1 calls using 12 threads  843 [ms]
Average time on 1 calls using 13 threads  821 [ms]
Average time on 1 calls using 14 threads  715 [ms]
Average time on 1 calls using 15 threads  799 [ms]
Average time on 1 calls using 16 threads  647 [ms]
Average time on 1 calls using 17 threads  693 [ms]
Average time on 1 calls using 18 threads  656 [ms]
Average time on 1 calls using 19 threads  525 [ms]
Average time on 1 calls using 20 threads  613 [ms]

As you can see, distributing the work to 2 threads nearly halves the processing time but adds some overhead, which is what I would have expected. But distributing it to 3 threads all of a sudden takes more time than the single threaded approach. When adding more threads the processing time goes down again until it reaches some kind of minimum with 15 threads.

This timing is from only a single call, but I also tried it with 100 calls and the average was about the same.

The CPU is an Intel Xeon with 4 cores + Hyperthreading, which gives me 8 logical processor cores.

Can anybody give me a hint what to look for to explain the oddity of 3 threads taking longer than 2 threads?
Fr0sT.Brutal Posted May 12, 2022

Hmm, doesn't FNewPackageEvent.WaitFor(500); wait for the event for at most 500 ms and then continue anyway, even if it's still non-signaled?

You also didn't show the main code that calls the SetNext method; as it is, I see the possibility that a SetNext call could happen while a thread is still processing its task, so the critical section waits for it to finish.
Anders Melander Posted May 12, 2022

2 minutes ago, Fr0sT.Brutal said:
    wait for the event for at most 500 ms and then continue anyway, even if it's still non-signaled?

Yes, but it doesn't do anything because supposedly FData=nil. Instead it terminates the thread (I'm guessing). This isn't how I would design a work queue, but I don't think the queue's the problem.

@dummzeuch Are threads reading from memory in the same cache line as other threads are writing to - or are threads writing to memory in the same cache line? If so you are probably suffering from false sharing and/or cache-line ping-pong.
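For readers unfamiliar with the terms: false sharing occurs when threads write to different variables that happen to live in the same cache line, so every write forces the other cores to re-fetch that line. A minimal illustrative sketch, not from the code in this thread; the 64-byte line size is an assumption:

type
  // Both counters fit in one 64-byte cache line: thread A writing CounterA
  // repeatedly invalidates thread B's cached copy of CounterB, and vice versa,
  // even though no data is logically shared.
  TFalseSharedCounters = record
    CounterA: Int64; // written only by thread A
    CounterB: Int64; // written only by thread B
  end;

  // Padding each counter out to a full cache line removes the contention.
  TPaddedCounter = record
    Value: Int64;
    Padding: array[0..63 - SizeOf(Int64)] of Byte; // assumes 64-byte lines
  end;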
dummzeuch Posted May 12, 2022

1 hour ago, Fr0sT.Brutal said:
    Hmm, doesn't FNewPackageEvent.WaitFor(500); wait for the event for at most 500 ms and then continue anyway, even if it's still non-signaled?

Yes, it does indeed. But since ... Assigned(FData) and Assigned(FWorkCall) ... will be False if no work package has been assigned, it will simply do nothing and then return into the wait state.

1 hour ago, Fr0sT.Brutal said:
    You also didn't show the main code that calls the SetNext method; as it is, I see the possibility that a SetNext call could happen while a thread is still processing its task, so the critical section waits for it to finish.

Good point, but since the thread pool is specific to the class doing the processing, that can't be the case. All threads are idle when the bitmap is passed into the method and will be idle again once the processing has finished.

I found something though: I was using a single dynamic array to hold some intermediate results, which was initialized within the work packages. Replacing it with a static array of the required length removed the timing oddity. Now two threads consistently take significantly longer than one thread, and the processing time goes down from there on. This is kind of consistent but not really useful. 😞
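The array in question isn't shown, but the relevant difference can be sketched (type names and sizes here are hypothetical): a dynamic array field holds only a pointer and its elements live in a separate heap block handed out by the memory manager, so the blocks used by several threads can end up adjacent in memory, while a static array is embedded directly in its owning record or stack frame.

type
  // Dynamic: the variable is a pointer; the data lives elsewhere on the heap.
  TIntermediateDyn = array of Byte;

const
  MaxIntermediate = 25; // hypothetical upper bound, e.g. a 5x5 neighbourhood

type
  // Static: the data is part of the enclosing record or stack frame,
  // no separate heap allocation takes place.
  TIntermediateStat = array[0..MaxIntermediate - 1] of Byte;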
dummzeuch Posted May 12, 2022

After replacing the dynamic array with a static array (see above), the timing now looks like this:

10 calls using  1 threads: 1158 ms (TotalTime [ms]: 1140 SingleTimes [ms]: 1140)
10 calls using  2 threads: 1412 ms (TotalTime [ms]: 2872 SingleTimes [ms]: 1437 1435)
10 calls using  3 threads: 1302 ms (TotalTime [ms]: 2501 SingleTimes [ms]: 1055 1060 386)
10 calls using  4 threads:  752 ms (TotalTime [ms]: 2657 SingleTimes [ms]: 503 650 767 737)
10 calls using  5 threads:  670 ms (TotalTime [ms]: 2832 SingleTimes [ms]: 364 627 668 582 591)
10 calls using  6 threads:  643 ms (TotalTime [ms]: 3195 SingleTimes [ms]: 316 528 553 501 635 662)
10 calls using  7 threads:  503 ms (TotalTime [ms]: 2973 SingleTimes [ms]: 459 466 360 473 493 404 318)
10 calls using  8 threads:  520 ms (TotalTime [ms]: 3460 SingleTimes [ms]: 344 364 356 470 498 450 486 492)
10 calls using  9 threads:  526 ms (TotalTime [ms]: 3854 SingleTimes [ms]: 316 322 427 428 440 498 517 456 450)
10 calls using 10 threads:  417 ms (TotalTime [ms]: 3238 SingleTimes [ms]: 228 325 338 392 274 386 361 337 324 273)

(TotalTime is the sum of the processing times of all threads. SingleTimes are the processing times for each individual thread.)
dummzeuch Posted May 12, 2022

9 minutes ago, Anders Melander said:
    Are threads reading from memory in the same cache line as other threads are writing to - or are threads writing to memory in the same cache line? If so you are probably suffering from false sharing and/or cache-line ping-pong.

My knowledge of processor architecture is insufficient to answer this question. The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.
Anders Melander Posted May 12, 2022

1 hour ago, dummzeuch said:
    My knowledge of processor architecture is insufficient to answer this question.

It's a good thing we have Google then. There are lots of (relatively) easy to understand explanations of these topics.

1 hour ago, dummzeuch said:
    The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.

The data being manipulated just needs to be in different cache lines. The size of a cache line depends on the processor. Again: Google.
Uwe Raabe Posted May 12, 2022

3 hours ago, Anders Melander said:
    The size of a cache line depends on the processor.

The size can be retrieved from System.TMonitor.CacheLineSize.
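A minimal usage sketch; TMonitor lives in the System unit, so no uses clause is needed:

program ShowCacheLineSize;

{$APPTYPE CONSOLE}

begin
  // Reports the cache line size of the CPU the process is running on.
  Writeln('Cache line size: ', TMonitor.CacheLineSize, ' bytes');
end.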
dummzeuch Posted May 13, 2022

15 hours ago, dummzeuch said:
    15 hours ago, Anders Melander said:
        Are threads reading from memory in the same cache line as other threads are writing to - or are threads writing to memory in the same cache line? If so you are probably suffering from false sharing and/or cache-line ping-pong.

    My knowledge of processor architecture is insufficient to answer this question. The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.

Which turned out to be the problem. The work memory area of each work package was declared like this:

TWorkPackage = record
  FCounter: PInt32;
  FIsDone: LongBool;
  FScanLine0: PByte;
  FBytesPerLine: Integer;
  FWidth: Integer;
  FHeight: Integer;
  FTop: Integer;
  FBottom: Integer;
  FPixelValues: TMedianValuesArr;
  FMedianPixelArr: TMedianPixelArr;
  FArr: TBrightnessMatrix;
  FSum: Int64;
  FStopwatch: TStopwatch;
end;

And these were stored in a dynamic array, so they were all located within the same memory area. FMedianPixelArr is constantly being written to in all threads. If I understand the cache line stuff correctly, the records were packed together so densely that a single CPU cache line (apparently 256 bytes at most with current CPUs) could span parts of several records, which caused the cached line to become invalid for all threads every time one of the threads wrote to this area.

For testing this, I have now simply increased the size of the record by 256 bytes by adding an array[0..255] of byte to it, so each of them is larger than the maximum possible cache line. Here is the timing after this change:

1 calls using  1 threads: 1143 ms (TotalTime [ms]: 1125 SingleTimes [ms]: 1125)
1 calls using  2 threads:  607 ms (TotalTime [ms]: 1124 SingleTimes [ms]: 565 559)
1 calls using  3 threads:  451 ms (TotalTime [ms]: 1186 SingleTimes [ms]: 395 393 398)
1 calls using  4 threads:  368 ms (TotalTime [ms]: 1222 SingleTimes [ms]: 308 298 312 304)
1 calls using  5 threads:  367 ms (TotalTime [ms]: 1357 SingleTimes [ms]: 244 306 303 242 262)
1 calls using  6 threads:  344 ms (TotalTime [ms]: 1533 SingleTimes [ms]: 228 259 274 271 245 256)
1 calls using  7 threads:  296 ms (TotalTime [ms]: 1636 SingleTimes [ms]: 222 246 227 219 236 250 236)
1 calls using  8 threads:  304 ms (TotalTime [ms]: 1806 SingleTimes [ms]: 226 222 224 225 227 225 224 233)
1 calls using  9 threads:  322 ms (TotalTime [ms]: 1909 SingleTimes [ms]: 230 226 198 238 220 197 199 195 206)
1 calls using 10 threads:  326 ms (TotalTime [ms]: 2009 SingleTimes [ms]: 174 216 172 208 176 236 232 178 213 204)
1 calls using 11 threads:  295 ms (TotalTime [ms]: 2140 SingleTimes [ms]: 232 222 221 165 221 171 201 162 158 181 206)
1 calls using 12 threads:  273 ms (TotalTime [ms]: 2494 SingleTimes [ms]: 212 208 224 210 227 210 212 212 198 202 198

Which is exactly what I would have expected: The total processing time is reduced roughly in inverse proportion to the number of threads, until the overhead of splitting the work reduces the potential gain of using more threads.

Hm, looking at the single-time values: Given that with each additional thread the number of lines each thread needs to process decreases, I wonder why these times don't get any lower than around 200 ms.

Anyway, thanks @Anders Melander, you seem to have nailed the problem.
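A sketch of the test change described above (only the added field is shown; the name FPadding is mine):

type
  TWorkPackage = record
    // ... fields as posted above ...
    FPadding: array[0..255] of Byte; // crude test padding: each record now
                                     // exceeds any plausible cache line, so the
                                     // hot fields of different packages no
                                     // longer share a line
  end;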
Uwe Raabe Posted May 13, 2022

1 hour ago, dummzeuch said:
    I wonder why these times don't get any lower than around 200 ms

Probably because the threads compete for a free core? If there are more threads than cores, they are not executed one after the other, each on its own core, but are interrupted to give other threads a chance to run. It is even possible that a thread migrates across several cores before it finishes. Given the timing for the 1 thread case, the theoretical minimum for 12 threads should be less than 100 ms (1125 ms / 12 ≈ 94 ms), which is far away from the actual numbers.
dummzeuch Posted May 13, 2022

1 hour ago, Uwe Raabe said:
    Probably because the threads compete for a free core? If there are more threads than cores, they are not executed one after the other, each on its own core.

Of course: There are only 8 virtual processors on my computer, so the maximum number of threads that can run in parallel is 8. I should have thought of that.
SwiftExpat Posted May 13, 2022

You can also get the line size with Sysinternals Coreinfo; the output looks like this:

Logical Processor to Cache Map:
**----------  Data Cache         0, Level 1,   32 KB, Assoc  8, LineSize 64
**----------  Instruction Cache  0, Level 1,   32 KB, Assoc  8, LineSize 64
**----------  Unified Cache      0, Level 2,  512 KB, Assoc  8, LineSize 64
************  Unified Cache      1, Level 3,   32 MB, Assoc 16, LineSize 64
--**--------  Data Cache         1, Level 1,   32 KB, Assoc  8, LineSize 64
--**--------  Instruction Cache  1, Level 1,   32 KB, Assoc  8, LineSize 64
--**--------  Unified Cache      2, Level 2,  512 KB, Assoc  8, LineSize 64
Anders Melander Posted May 13, 2022

7 hours ago, dummzeuch said:
    For testing this, I have now simply increased the size of the record by 256 bytes by adding an array[0..255] of byte to it, so each of them is larger than the maximum possible cache line.

A perfectly valid solution. You could also do it like this:

TPackageItem = record
  WorkPackage: TWorkPackage;
  Padding: array[0..256 - SizeOf(TWorkPackage) - 1] of byte;
end;

I don't know how you distribute the rows being processed by each thread, but another thing to try is to ensure that each thread will be working on a contiguous set of rows. For example, for 10 rows and 3 threads the rows should be divided something like this:

AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
BBBBBBBBBB
BBBBBBBBBB
BBBBBBBBBB
CCCCCCCCCC
CCCCCCCCCC
CCCCCCCCCC
CCCCCCCCCC

and not like this:

AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
CCCCCCCCCC

This will improve cache locality. (A code sketch of such a partitioning follows after this post.)

Another thing that will improve cache locality is only working on one row at a time. For example, if you are averaging 3x3 pixels then I'm guessing that you are reading pixels from 3 different rows for each pixel. If the size of a row (or even 2 rows) is greater than the cache then this will thrash the cache. To avoid this you can first process the bitmap horizontally, transpose the bitmap, and then process it horizontally again. One might think that a transpose is too expensive, but it's actually fairly cheap (if you use a good blocked transpose algorithm) and the gain from not thrashing the cache makes it worth it.
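A minimal sketch of the contiguous partitioning suggested above. The FTop/FBottom fields are taken from the TWorkPackage record posted earlier; the helper itself is hypothetical:

// Fill the FTop/FBottom fields so each thread gets one contiguous block of
// rows, with the remainder spread one row each over the first few threads.
procedure AssignRowRanges(var Packages: array of TWorkPackage; Height: Integer);
var
  i, RowsPerThread, Remainder, Top: Integer;
begin
  RowsPerThread := Height div Length(Packages);
  Remainder := Height mod Length(Packages);
  Top := 0;
  for i := 0 to High(Packages) do begin
    Packages[i].FTop := Top;
    Packages[i].FBottom := Top + RowsPerThread - 1;
    if i < Remainder then
      Inc(Packages[i].FBottom); // one extra row for the first Remainder threads
    Top := Packages[i].FBottom + 1;
  end;
end;

And a sketch of a blocked transpose for a Mono8 bitmap. This assumes the stride equals the width (no row padding); the block size of 64 is a typical choice to keep the source and destination blocks cache-resident:

uses
  Math; // for Min

// Transpose Src (Height rows of Width bytes) into Dst (Width rows of Height
// bytes), walking the bitmap in small square blocks so that both the rows
// being read and the rows being written stay in the cache per block.
procedure TransposeMono8(Src, Dst: PByte; Width, Height: Integer);
const
  BlockSize = 64;
var
  BlockX, BlockY, x, y: Integer;
begin
  BlockY := 0;
  while BlockY < Height do begin
    BlockX := 0;
    while BlockX < Width do begin
      for y := BlockY to Min(BlockY + BlockSize, Height) - 1 do
        for x := BlockX to Min(BlockX + BlockSize, Width) - 1 do
          Dst[x * Height + y] := Src[y * Width + x];
      Inc(BlockX, BlockSize);
    end;
    Inc(BlockY, BlockSize);
  end;
end;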
dummzeuch Posted May 13, 2022

2 hours ago, Anders Melander said:
    You could also do it like this:

Interesting approach.

2 hours ago, Anders Melander said:
    ensure that each thread will be working on a contiguous set of rows

I already do that.

2 hours ago, Anders Melander said:
    Another thing that will improve cache locality is only working on one row at a time. For example, if you are averaging 3x3 pixels then I'm guessing that you are reading pixels from 3 different rows for each pixel. If the size of a row (or even 2 rows) is greater than the cache then this will thrash the cache.

Hm, yes. I'll have to think about this. So far I am quite satisfied with reducing the analysis time for the pictures by a factor of 8 (on my computer).

Now I need to find out how many logical processors are available on the computers on which this program will actually be used. Since these are probably >2 years old, they might need updating anyway.

The program itself now has 60% processor usage, up from 17% (when it was single threaded), so there are probably some other parts which could be improved, which might be easier to achieve.
Uwe Raabe Posted May 13, 2022

3 hours ago, dummzeuch said:
    Now I need to find out how many logical processors are available on the computers on which this program will actually be used.

TThread.ProcessorCount seems to be what you are looking for:

/// <summary>
///    The number of processor cores on which this application is running. This will include virtual
///    "Hyper-threading" cores on many modern Intel CPUs. It is ultimately based on what the underlying
///    operating system reports.
/// </summary>
class property ProcessorCount: Integer read FProcessorCount;
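A minimal usage sketch (TThread lives in System.Classes):

program ShowProcessorCount;

{$APPTYPE CONSOLE}

uses
  System.Classes;

begin
  // Reports logical processors, including hyper-threaded cores.
  Writeln('Logical processors: ', TThread.ProcessorCount);
end.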
dummzeuch Posted May 14, 2022

11 hours ago, Uwe Raabe said:
    TThread.ProcessorCount seems to be what you are looking for:

    /// <summary>
    ///    The number of processor cores on which this application is running. This will include virtual
    ///    "Hyper-threading" cores on many modern Intel CPUs. It is ultimately based on what the underlying
    ///    operating system reports.
    /// </summary>
    class property ProcessorCount: Integer read FProcessorCount;

I've taken the low-tech approach: I asked my colleagues what type of processor their computers have.
Uwe Raabe Posted May 14, 2022

1 hour ago, dummzeuch said:
    I asked my colleagues what type of processor their computers have.

Don't forget to tell them to notify you when their hardware changes someday.
dummzeuch Posted May 14, 2022

6 hours ago, Uwe Raabe said:
    Don't forget to tell them to notify you when their hardware changes someday.

Oh, you were referring to how the program decides how many threads to use? I have already implemented that to use as many threads as there are logical processors, determined using GetSystemInfo. I was more worried about whether their current computers are good enough to run the program or whether they need an upgrade. If their hardware changes in the future, it will likely become faster anyway.
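For reference, a minimal sketch of determining the logical processor count via the Windows API mentioned above:

uses
  Winapi.Windows;

// Returns the number of logical processors the OS reports for this system.
function LogicalProcessorCount: Integer;
var
  SysInfo: TSystemInfo;
begin
  GetSystemInfo(SysInfo);
  Result := Integer(SysInfo.dwNumberOfProcessors);
end;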
DelphiUdIT Posted May 14, 2022

5 hours ago, dummzeuch said:
    Oh, you were referring to how the program decides how many threads to use? I have already implemented that to use as many threads as there are logical processors, determined using GetSystemInfo. I was more worried about whether their current computers are good enough to run the program or whether they need an upgrade. If their hardware changes in the future, it will likely become faster anyway.

Remember that in the new Intel hybrid architectures (such as Alder Lake) not all cores are usable, or rather, not all cores have the same processing "power" (e.g. P-cores and E-cores).