
dummzeuch

Members
  • Content Count

    1879
  • Days Won

    66

dummzeuch last won the day on May 4

dummzeuch had the most liked content!

Community Reputation

1081 Excellent

Technical Information

  • Delphi-Version
    Delphi 2007

Recent Profile Visitors

5196 profile views
  1. Oh, you were referring to how the program decides how many threads to use? I have already implemented that: it uses as many threads as there are logical processors, determined using GetSystemInfo. I was more worried about whether their current computers are good enough to run the program or whether they need an upgrade. If their hardware changes in the future, it will likely become faster anyway.
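For readers who have not used GetSystemInfo before, here is a minimal sketch of reading the logical processor count as described above. The program name and variable name are made up for illustration; only GetSystemInfo and the dwNumberOfProcessors field come from the Windows API.

```pascal
program LogicalProcessorCount;

{$APPTYPE CONSOLE}

uses
  Windows; // Delphi 2007 unit name; XE2 and later call it Winapi.Windows

var
  SysInfo: TSystemInfo; // hypothetical variable name
begin
  // GetSystemInfo fills in a SYSTEM_INFO record. dwNumberOfProcessors is the
  // number of logical processors in the current processor group.
  GetSystemInfo(SysInfo);
  WriteLn('Logical processors: ', SysInfo.dwNumberOfProcessors);
end.
```

The value could then be used as the number of work packages, assuming one package per thread as in the posts below.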
  2. I've taken the low-tech approach: I asked my colleagues what type of processor their computers have.
  3. Interesting approach. I already do that. Hm, yes. I'll have to think about this. So far I am quite satisfied with reducing the analyzing time for the pictures by a factor of 8 (on my computer). Now I need to find out how many logical processors are available on the computers on which this program will actually be used. Since these are probably >2 years old, they might need updating anyway. The program itself now has 60% processor usage, up from 17% (when it was single threaded), so there are probably some other parts which could be improved which might be easier to achieve.
  4. Of course: There are only 8 virtual processors on my computer, so the maximum number of threads that can run in parallel is 8. I should have thought of that.
  5. My knowledge of processor architecture is insufficient to answer this question. The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap. Which turned out to be the problem. The work memory area of each work package was declared like this:

         TWorkPackage = record
           FCounter: PInt32;
           FIsDone: LongBool;
           FScanLine0: PByte;
           FBytesPerLine: Integer;
           FWidth: Integer;
           FHeight: Integer;
           FTop: Integer;
           FBottom: Integer;
           FPixelValues: TMedianValuesArr;
           FMedianPixelArr: TMedianPixelArr;
           FArr: TBrightnessMatrix;
           FSum: int64;
           FStopwatch: TStopwatch;
         end;

     And these were stored in a dynamic array, so they were all located within the same memory area. FMedianPixelArr is constantly being written to in all threads. If I understand the cache line stuff correctly, the fact that these records were all stored within a memory block smaller than the CPU cache line (apparently 256 bytes at most with current CPUs) caused the cache to become invalid for all threads every time one of the threads wrote to this area. To test this, I have now simply increased the size of the record by 256 bytes by adding an array [0..255] of byte to it, so each of them is larger than the maximum possible cache line.

     Here is the timing after this change:

         1 calls using 1 threads: 1143 ms (TotalTime [ms]: 1125 SingleTimes [ms]: 1125)
         1 calls using 2 threads: 607 ms (TotalTime [ms]: 1124 SingleTimes [ms]: 565 559)
         1 calls using 3 threads: 451 ms (TotalTime [ms]: 1186 SingleTimes [ms]: 395 393 398)
         1 calls using 4 threads: 368 ms (TotalTime [ms]: 1222 SingleTimes [ms]: 308 298 312 304)
         1 calls using 5 threads: 367 ms (TotalTime [ms]: 1357 SingleTimes [ms]: 244 306 303 242 262)
         1 calls using 6 threads: 344 ms (TotalTime [ms]: 1533 SingleTimes [ms]: 228 259 274 271 245 256)
         1 calls using 7 threads: 296 ms (TotalTime [ms]: 1636 SingleTimes [ms]: 222 246 227 219 236 250 236)
         1 calls using 8 threads: 304 ms (TotalTime [ms]: 1806 SingleTimes [ms]: 226 222 224 225 227 225 224 233)
         1 calls using 9 threads: 322 ms (TotalTime [ms]: 1909 SingleTimes [ms]: 230 226 198 238 220 197 199 195 206)
         1 calls using 10 threads: 326 ms (TotalTime [ms]: 2009 SingleTimes [ms]: 174 216 172 208 176 236 232 178 213 204)
         1 calls using 11 threads: 295 ms (TotalTime [ms]: 2140 SingleTimes [ms]: 232 222 221 165 221 171 201 162 158 181 206)
         1 calls using 12 threads: 273 ms (TotalTime [ms]: 2494 SingleTimes [ms]: 212 208 224 210 227 210 212 212 198 202 198

     Which is exactly what I would have expected: The total processing time decreases roughly in inverse proportion to the number of threads until the overhead of splitting the work reduces the potential gain of using more threads. Hm, looking at the single times values: Given that with each additional thread the number of lines each thread needs to process decreases, I wonder why these times don't get any lower than around 200 ms. Anyway, thanks @Anders Melander, you seem to have nailed the problem.
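The padding trick described in this post can be sketched roughly as follows. This is not the original code: TPaddedWorkPackage is a hypothetical, cut-down stand-in for TWorkPackage, and FPadding and PaddingSize are names invented for the example. The 256 bytes are taken from the post; as far as I know, 64 bytes is the usual cache line size on current x86 CPUs, so 256 is on the generous side.

```pascal
program PaddedWorkPackages;

{$APPTYPE CONSOLE}

const
  // Taken from the post. Assumption: the cache line is no larger than this.
  PaddingSize = 256;

type
  // Hypothetical, cut-down stand-in for the TWorkPackage record from the post.
  TPaddedWorkPackage = record
    FSum: Int64;
    FTop: Integer;
    FBottom: Integer;
    // Never read or written. It only keeps the frequently written fields of
    // neighbouring array elements at least PaddingSize bytes apart.
    FPadding: array[0..PaddingSize - 1] of Byte;
  end;

var
  Packages: array of TPaddedWorkPackage;
begin
  // The elements of a dynamic array are stored back to back, so the distance
  // between two neighbouring packages equals the record size.
  SetLength(Packages, 8);
  WriteLn('Packages: ', Length(Packages));
  WriteLn('Bytes per package, including padding: ', SizeOf(TPaddedWorkPackage));
end.
```

Placing the padding at the end of the record means that a write to one package's fields is always separated from the next package's fields by memory that no thread touches.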
  6. My knowledge of processor architecture is insufficient to answer this question. The work packages are advanced records in a dynamic array, so their data is located within the same memory area on the heap.
  7. After replacing the dynamic array with a static array (see above), the timing now looks like this:

         10 calls using 1 threads: 1158 ms (TotalTime [ms]: 1140 SingleTimes [ms]: 1140)
         10 calls using 2 threads: 1412 ms (TotalTime [ms]: 2872 SingleTimes [ms]: 1437 1435)
         10 calls using 3 threads: 1302 ms (TotalTime [ms]: 2501 SingleTimes [ms]: 1055 1060 386)
         10 calls using 4 threads: 752 ms (TotalTime [ms]: 2657 SingleTimes [ms]: 503 650 767 737)
         10 calls using 5 threads: 670 ms (TotalTime [ms]: 2832 SingleTimes [ms]: 364 627 668 582 591)
         10 calls using 6 threads: 643 ms (TotalTime [ms]: 3195 SingleTimes [ms]: 316 528 553 501 635 662)
         10 calls using 7 threads: 503 ms (TotalTime [ms]: 2973 SingleTimes [ms]: 459 466 360 473 493 404 318)
         10 calls using 8 threads: 520 ms (TotalTime [ms]: 3460 SingleTimes [ms]: 344 364 356 470 498 450 486 492)
         10 calls using 9 threads: 526 ms (TotalTime [ms]: 3854 SingleTimes [ms]: 316 322 427 428 440 498 517 456 450)
         10 calls using 10 threads: 417 ms (TotalTime [ms]: 3238 SingleTimes [ms]: 228 325 338 392 274 386 361 337 324 273)

     (TotalTime is the sum of the processing time for all threads. SingleTimes are the processing times for each thread.)
  8. Yes, it does indeed. But since ... Assigned(FData) and Assigned(FWorkCall) ... will be False if no work package has been assigned, it will simply do nothing and then return to the wait state. Good point, but since the thread pool is specific to the class doing the processing, that can't be the case. All threads are idle when a bitmap is passed into the method and will be idle again once the processing has finished. I found something though: I was using a single dynamic array to hold some intermediate results, which was initialized within the work packages. Replacing it with a static array of the required length removed the timing oddity. Now two threads consistently take significantly longer than one thread and the processing time goes down from there on. This is kind of consistent but not really useful. 😞
  9. The following is about a 32 bit Windows console application. I'm trying to improve performance for analyzing a huge Mono8 bitmap, stored in memory as an array of bytes (no TBitmap involved). For each pixel the algorithm calculates the brightness for the pixel itself as well as a few surrounding pixels and writes the result to an array of byte. Doing this single threaded takes about 1300 ms. Since this can be done by multiple threads completely independently without the risk of race conditions, I have split the work into n work packages, each processing some lines of the bitmap and each processed by a separate thread. The threads have been created in advance in a thread pool. Since there is no risk of race conditions, I don't use any synchronization or locking mechanism during the processing. Synchronization is only necessary for assigning a work package to a thread and for signalling that a work package has been processed. This is the relevant code of the thread's Execute method and the SetNext method called by the main thread to assign a work package to it:

         procedure TWorkerThread.Execute;
         begin
           inherited Execute;
           while not Terminated do begin
             FNewPackageEvent.WaitFor(500);
             FNewPackageEvent.ResetEvent;
             if Terminated then
               Exit; //==>
             FCritSect.Enter;
             try
               if Assigned(FData) and Assigned(FWorkCall) then
                 FWorkCall(FData);
             finally
               FCritSect.Leave;
             end;
           end;
         end;

         procedure TWorkerThread.SetNext(_WorkCall: TWorkPackageCall; _Data: Pointer);
         begin
           FCritSect.Enter;
           try
             if Assigned(FData) or Assigned(@FWorkCall) then
               raise ESigException.Create('Programmer error: Package is already assigned.');
             FData := _Data;
             FWorkCall := _WorkCall;
             FNewPackageEvent.SetEvent;
           finally
             FCritSect.Leave;
           end;
         end;

     FNewPackageEvent is a TEvent and FCritSect is a critical section, each unique to the TWorkerThread instance. A work package consists of a data pointer and a procedure pointer (not method pointer) to call.

     When a work package has finished processing, a counter variable gets decremented like this:

         InterlockedDecrement(FCounter^);

     The main thread also processes one work package (it's also counted as one of the threads below) and then waits for the others to finish:

         while WorkPackageCounter > 0 do
           Sleep(WorkPackageCounter);

     (FCounter above is a pointer to WorkPackageCounter). Since every thread only processes one work package, the counter starts as the number of threads. This works fine but I get some rather odd timing:

         Average time on 1 calls using 1 threads 1278 [ms]
         Average time on 1 calls using 2 threads 877 [ms] <------
         Average time on 1 calls using 3 threads 1627 [ms] <------
         Average time on 1 calls using 4 threads 1580 [ms]
         Average time on 1 calls using 5 threads 1511 [ms]
         Average time on 1 calls using 6 threads 1167 [ms]
         Average time on 1 calls using 7 threads 1438 [ms]
         Average time on 1 calls using 8 threads 1036 [ms]
         Average time on 1 calls using 9 threads 957 [ms]
         Average time on 1 calls using 10 threads 847 [ms]
         Average time on 1 calls using 11 threads 958 [ms]
         Average time on 1 calls using 12 threads 843 [ms]
         Average time on 1 calls using 13 threads 821 [ms]
         Average time on 1 calls using 14 threads 715 [ms]
         Average time on 1 calls using 15 threads 799 [ms]
         Average time on 1 calls using 16 threads 647 [ms]
         Average time on 1 calls using 17 threads 693 [ms]
         Average time on 1 calls using 18 threads 656 [ms]
         Average time on 1 calls using 19 threads 525 [ms]
         Average time on 1 calls using 20 threads 613 [ms]

     As you can see, distributing the work to 2 threads nearly halves the processing time, but adds some overhead, which is what I would have expected. But distributing it to 3 threads all of a sudden takes more time than the single threaded approach. When adding more threads the processing time goes down again until it reaches some kind of minimum with 15 threads.

     This timing is from only a single call, but I also tried it with 100 calls and the average was about the same. The CPU is an Intel Xeon with 4 cores + Hyperthreading, which gives me 8 logical processor cores. Can anybody give me a hint what to look for to explain the oddity of 3 threads taking longer than 2 threads?
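The post does not show how the bitmap lines are divided among the n work packages, so the following is only one plausible way to do it, not the original implementation. TLineRange, TLineRangeArr and SplitIntoRanges are hypothetical names; Top and Bottom are assumed to be inclusive line indexes.

```pascal
program SplitLines;

{$APPTYPE CONSOLE}

type
  // Hypothetical minimal work package: first and last line, both inclusive.
  TLineRange = record
    Top: Integer;
    Bottom: Integer;
  end;
  TLineRangeArr = array of TLineRange;

// Splits the lines 0 .. _Height - 1 into _Count consecutive ranges whose
// sizes differ by at most one line. Assumes 0 < _Count <= _Height.
function SplitIntoRanges(_Height, _Count: Integer): TLineRangeArr;
var
  i: Integer;
  LinesPerPackage: Integer;
  Remainder: Integer;
  NextTop: Integer;
begin
  SetLength(Result, _Count);
  LinesPerPackage := _Height div _Count;
  Remainder := _Height mod _Count;
  NextTop := 0;
  for i := 0 to _Count - 1 do begin
    Result[i].Top := NextTop;
    // The first Remainder packages get one extra line each.
    if i < Remainder then
      Inc(NextTop, LinesPerPackage + 1)
    else
      Inc(NextTop, LinesPerPackage);
    Result[i].Bottom := NextTop - 1;
  end;
end;

var
  Ranges: TLineRangeArr;
  i: Integer;
begin
  Ranges := SplitIntoRanges(1000, 3);
  for i := 0 to High(Ranges) do
    WriteLn('Package ', i, ': lines ', Ranges[i].Top, ' to ', Ranges[i].Bottom);
end.
```

For a 1000 line bitmap and 3 packages this prints the ranges 0 to 333, 334 to 666 and 667 to 999. Each thread writes only the output bytes for its own lines, which is why the processing itself needs no locking.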
  10. In case anybody wants to test it: This problem should be fixed in revision #3843
  11. Wow, that's great support from Alex-T over on GitHub. He is currently helping me to fix those pesky compile problems, so I will be able to update SynRegExp with the original RegExpr unit.
  12. The current RegExpr unit on GitHub fixes that problem, but unfortunately it no longer compiles with pre-Unicode Delphi versions because somebody added a {$define Unicode} to it two years ago.
  13. dummzeuch

    Some grep related problems

    I just tried to reproduce your problems with Delphi 2007 (I haven't got Delphi XE4 on this computer) but hiding the history list worked fine even when restarting the IDE multiple times, and I also didn't get the access violation when searching with the options you listed. So, without getting any more specific information on when this happens, there is nothing I can do. For the AV you could try to compile a new DLL and debug the issue yourself.
  14. dummzeuch

    GExperts Replace Components..

    The bug in the Open Tools API still exists in Delphi 11.1 + April 2022 patch. I just re-enabled the code and got an access violation trying to replace a TAdoTable with some fields with a TAdoQuery. You might want to vote for this issue: https://quality.embarcadero.com/browse/RSP-25645
  15. dummzeuch

    New GExperts under Delphi 7

    Verified and committed in revision #3842. Thanks again.