TParallelArray Sort Performance...

Anders Melander · September 13, 2024

Wouldn't it make sense to do a CLFLUSH before the sort so it doesn't benefit from all the data already being in the cache?

procedure FlushCache(Data: Pointer; Size: Integer);
const
  CACHE_LINE_SIZE = 64;
asm
@NextBlock:
  CLFLUSH  [Data + Size]
  SUB      Size,CACHE_LINE_SIZE
  JGE      @NextBlock
end;

Anders Melander · September 13, 2024

4 minutes ago, Stefan Glienke said:

Looks like I am getting twice the speed at worst and 50 times at best.

Impressive.

Eric Grange · September 13, 2024

Hmm, definitely something odd. I used the version from https://bitbucket.org/sglienke/spring4d/src/master/, maybe it's off in some way?
On an i7-1165G7 (Win64, optimization on, stack frames off), Spring is only marginally ahead of the RTL. I get these timings:

Fill array (double), elements count: 500000
Start sorting ...
RTL TArray.Sort (ms.): 63
Spring TArray.Sort (ms.): 56

Fill array (double), elements count: 5000000
Start sorting ...
RTL TArray.Sort (ms.): 596
Spring TArray.Sort (ms.): 574

Fill array (double), elements count: 100000000
Start sorting ...
RTL TArray.Sort (ms.): 14311
Spring TArray.Sort (ms.): 13708

A non-generic quicksort is about 30% faster than Spring in all 3 tests.

Stefan Glienke · September 13, 2024

Did you precompile Spring or are you compiling directly from the sources? If you precompile (recommended), then you need to precompile with release settings, of course. If I had to bet I would say you still have range checking enabled (RTL has that disabled, so does your code).

Also, tbh, I don't care about a 30% improvement of an algorithm handwritten specifically for one type over the generic one. Yes, the compiler could do a better job with more aggressive inlining and such, but nowadays I am already happy if it generates working code.

23 minutes ago, Anders Melander said:

Wouldn't it make sense to do a CLFLUSH before the sort so it doesn't benefit from all the data already being in the cache?

procedure FlushCache(Data: Pointer; Size: Integer);
const
  CACHE_LINE_SIZE = 64;
asm
@NextBlock:
  CLFLUSH  [Data + Size]
  SUB      Size,CACHE_LINE_SIZE
  JGE      @NextBlock
end;

Apart from the code not compiling, I would say it depends on what you want to benchmark. How does sort perform on completely cold data? Say, some data that you loaded an hour ago, after which you did completely different things so it no longer resides in the cache, and now you want the fastest sorting possible. Then yes. Otherwise I would say that you are typically sorting data that is already in cache, because you previously filled the array or list with it, or processed it right before the sort operation.

Some additional reading on that "clear the LLC for some benchmark" topic: https://stackoverflow.com/a/49077734/587106
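For a cold-cache measurement without inline assembly, one rough alternative is to walk a scratch buffer larger than the last-level cache right before the timed sort. This is a minimal sketch of that different technique, not a fixed version of the CLFLUSH routine quoted above. The name `EvictCache` and the 64 MiB buffer size are assumptions for illustration, and eviction is likely rather than guaranteed.

```pascal
program EvictCacheDemo;
{$APPTYPE CONSOLE}

const
  CACHE_LINE_SIZE = 64;
  // Assumption: 64 MiB is larger than the last-level cache of the test machine.
  EVICT_BYTES = 64 * 1024 * 1024;

// Hypothetical helper: writes and reads one byte per cache line of a large
// scratch buffer, so that data cached earlier is likely displaced.
// Returns a checksum of the bytes written so the loop has an observable result.
function EvictCache: Integer;
var
  Scratch: array of Byte;
  Offset: Integer;
begin
  Result := 0;
  SetLength(Scratch, EVICT_BYTES);
  Offset := 0;
  while Offset < EVICT_BYTES do
  begin
    Scratch[Offset] := (Offset div CACHE_LINE_SIZE) and $FF;
    Inc(Result, Scratch[Offset]);
    Inc(Offset, CACHE_LINE_SIZE);
  end;
end;

begin
  // Call right before the timed sort when cold-cache numbers are wanted.
  Writeln('Checksum: ', EvictCache);
end.
```

Whether to call it at all follows the point made above: it only makes sense when the scenario being measured really is a sort over data that has fallen out of the cache.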

Edited September 13, 2024 by Stefan Glienke

Eric Grange · September 13, 2024

> If I had to bet I would say you still have range checking enabled (RTL has that disabled, so does your code)

Ah, no, it was on. I had spotted some "{$IFDEF RANGECHECKS_ON}" in the code and assumed Spring would control range checking itself, disabling it in sections where that is safe (because they are asserted, use explicit loops on count/length, are unit tested, etc., as for a sort) 😞

However, looking more closely at the Spring source, that doesn't seem to be the case (for instance, Vector<T> GetItem would lose range checking globally as well), so timings with range checking off would be a bit of an oddball scenario for Spring.
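For reference, the common Delphi idiom for switching range checking off around one routine and then restoring the project setting looks roughly like the sketch below. The symbol `RANGECHECKS_ON` is the one mentioned above; whether Spring uses it in exactly this way is not confirmed here, and `SumNoRangeCheck` is a hypothetical routine used only for illustration.

```pascal
program RangeCheckDemo;
{$APPTYPE CONSOLE}

// Remember whether range checking was on, then switch it off locally.
{$IFOPT R+}
  {$DEFINE RANGECHECKS_ON}
  {$R-}
{$ENDIF}
function SumNoRangeCheck(const Values: array of Double): Double;
var
  i: Integer;
begin
  Result := 0;
  // The loop bounds come from the array itself, so the indexing is safe here.
  for i := Low(Values) to High(Values) do
    Result := Result + Values[i];
end;
// Restore the previous setting so the code that follows keeps its checks.
{$IFDEF RANGECHECKS_ON}
  {$R+}
  {$UNDEF RANGECHECKS_ON}
{$ENDIF}

begin
  Writeln(SumNoRangeCheck([1.0, 2.0, 3.5]):0:1);
end.
```

With this pattern the hot routine is compiled without range checks regardless of the project options, while everything after the restore block follows the project setting again.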

> Looks like I am getting twice the speed at worst and 50 times at best.

That sounds impressive!

DelphiUdIT · September 13, 2024

2 hours ago, Stefan Glienke said:

To me they are not - if you have some backend server code that is already doing multiple things in parallel it won't get any better if you then do parallel sort.

Hmm ... during my tests the maximum CPU load is 4% ... I don't think any other normal load can change the timing in a significant way.

Of course it depends on where you need this work done. In my applications I will not be able to use parallel algorithms, as already discussed in other posts, because of the heavy load (85% of CPU). But I have never used any sort algorithm in my life, because normally I have at most a few hundred elements and I use plain shifting (memcopy) when inserting into the array.

Bye

Stefan Glienke · February 27

Just a few new numbers of a not yet released parallel pdq sort - using the benchmark code from this comment earlier in this thread

Fill array (double), elements count: 500000
Start sorting ...
RTL TArray.Sort (ms.): 47
RTL TParallelArray.Sort (ms.): 35
Spring TArray.Sort (ms.): 12
Spring TArray.Sort_Parallel (ms.): 3

Fill array (double), elements count: 5000000
Start sorting ...
RTL TArray.Sort (ms.): 551
RTL TParallelArray.Sort (ms.): 128
Spring TArray.Sort (ms.): 136
Spring TArray.Sort_Parallel (ms.): 64

Fill array (double), elements count: 100000000
Start sorting ...
RTL TArray.Sort (ms.): 12724
RTL TParallelArray.Sort (ms.): 1884
Spring TArray.Sort (ms.): 3035
Spring TArray.Sort_Parallel (ms.): 675

Again - these numbers are fluctuating a bit because the benchmark is a "run once" benchmark and it depends on the current CPU state etc - also I did not tweak the threshold and CPU count yet - simply calling TTask.Run from System.Threading to fork some slices into parallel execution.

But overall it does not look too bad, does it?
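On the "run once" caveat: one way to reduce the fluctuation is to repeat the sort on a fresh copy of the same input and report the best time. This is a minimal sketch of that idea, not the benchmark code used for the numbers above. `BestSortTimeMs` is a hypothetical helper, the run count and seed are arbitrary, and it times the RTL `TArray.Sort` only.

```pascal
program SortBenchDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, System.Generics.Collections;

// Hypothetical helper: sorts a fresh copy of Source in each run and
// returns the fastest elapsed time in milliseconds.
function BestSortTimeMs(const Source: TArray<Double>; Runs: Integer): Int64;
var
  Work: TArray<Double>;
  Watch: TStopwatch;
  Run: Integer;
begin
  Result := High(Int64);
  for Run := 1 to Runs do
  begin
    Work := Copy(Source); // unsorted copy, so every run does the same work
    Watch := TStopwatch.StartNew;
    TArray.Sort<Double>(Work);
    Watch.Stop;
    if Watch.ElapsedMilliseconds < Result then
      Result := Watch.ElapsedMilliseconds;
  end;
end;

var
  Data: TArray<Double>;
  i: Integer;
begin
  RandSeed := 42; // fixed seed so every execution sorts the same input
  SetLength(Data, 500000);
  for i := 0 to High(Data) do
    Data[i] := Random;
  Writeln('RTL TArray.Sort, best of 5 (ms.): ', BestSortTimeMs(Data, 5));
end.
```

Note that copying the input immediately before each run also leaves part of it in the cache, which ties back to the cold-versus-warm data discussion earlier in the thread.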

Edited February 27 by Stefan Glienke

Vincent Parrett · February 28

@Stefan Glienke Looking forward to your talk at Delphi Summit 😃
