Jump to content
Steve Maughan

TParallelArray Sort Performance...

Recommended Posts

Wouldn't it make sense to do a CLFLUSH  before the sort so it doesn't benefit from all the data already being in the cache?

 

procedure FlushCache(Data: Pointer; Size: Integer);
const
  CACHE_LINE_SIZE = 64;
asm
@NextBlock:
  CLFLUSH  [Data + Size]
  SUB      Size,CACHE_LINE_SIZE
  JGE      @NextBlock
end;

 

Share this post


Link to post
4 minutes ago, Stefan Glienke said:

Looks like I am getting twice the speed at worst and 50 times at best.

Impressive.

Share this post


Link to post

Hmm, definitely something odd. I used the version from https://bitbucket.org/sglienke/spring4d/src/master/ maybe it's off in some way ?
Here is on i7-1165G7, Win64, optimization on, stack frames off, Spring is only marginally ahead of the RTL, I get these timings

Fill array (double), elements count: 500000
Start sorting ...
RTL TArray.Sort (ms.): 63
Spring TArray.Sort (ms.): 56

Fill array (double), elements count: 5000000
Start sorting ...
RTL TArray.Sort (ms.): 596
Spring TArray.Sort (ms.): 574

Fill array (double), elements count: 100000000
Start sorting ...
RTL TArray.Sort (ms.): 14311
Spring TArray.Sort (ms.): 13708

Quicksort non-generic is about 30% faster than Spring in all 3 tests.

Share this post


Link to post

Did you precompile Spring or are you compiling directly from the sources? If you precompile (recommended) then you need to precompile with release settings of course. If I had to bet I would say you still have range checking enabled (RTL has that disabled, so does your code)

 

Also tbh I don't care about a 30% improvement of a handwritten algo specifically for one type over the generic one. Yes, compiler could do a better job with more aggressively inlining and stuff but nowadays I am happy already if it generates working code.

 

23 minutes ago, Anders Melander said:

Wouldn't it make sense to do a CLFLUSH  before the sort so it doesn't benefit from all the data already being in the cache?

 


procedure FlushCache(Data: Pointer; Size: Integer);
const
  CACHE_LINE_SIZE = 64;
asm
@NextBlock:
  CLFLUSH  [Data + Size]
  SUB      Size,CACHE_LINE_SIZE
  JGE      @NextBlock
end;

 

Apart from the code not compiling I would say it depends on what you want to benchmark - how sort performs on completely cold data? Like some data that you loaded an hour ago, then did completely different things so it does not reside in the cache anymore and now you want to have the fastest sorting possible. Then yes, otherwise I would say that usually you typically are sorting data that is already in cache because you previously filled the array or list with it, or processed it followed by the sort operation.

 

Some additional read on that "clear the LLC for some benchmark" topic: https://stackoverflow.com/a/49077734/587106

Edited by Stefan Glienke

Share this post


Link to post

If I had to bet I would say you still have range checking enabled (RTL has that disabled, so does your code)


Ah, no, it was on. I has spotted some "{$IFDEF RANGECHECKS_ON}" in the code and assumed Spring would control range checking, disabling it in sections where it's safe (because asserted, explicit loops on count/length, unit tested, etc. such as for a sort) 😞

However looking more closely at the Spring source that doesn't seem to be the case (f.i. Vector<T> GetItem would lose range checking globally as well), so timings with range checking off would be a bit of an oddball scenario for Spring.


> Looks like I am getting twice the speed at worst and 50 times at best.

 

That sounds impressive!

 

Share this post


Link to post
2 hours ago, Stefan Glienke said:

To me they are not - if you have some backend server code that is already doing multiple things in parallel it won't get any better if you then do parallel sort.

Uhmm ... during my tests maximum charge of CPU is 4% ... I don't think that any other normal load can change the timing in impressive way.

 

Of course depends where you need these works. In my applications I will not able to use parallel algo, like already discuss in other posts cause heavy load (85% of CPU). but I never used in my life any sort algo, 'cause normally I had only maximum of some hundred of elements and I use normal shitfing (memcopy) in entrance of array.

 

Bye

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×