Anders Melander · Posted September 13

Wouldn't it make sense to do a CLFLUSH before the sort so it doesn't benefit from all the data already being in the cache?

procedure FlushCache(Data: Pointer; Size: Integer);
const
  CACHE_LINE_SIZE = 64;
asm
@NextBlock:
  CLFLUSH [Data + Size]
  SUB     Size, CACHE_LINE_SIZE
  JGE     @NextBlock
end;
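As Stefan notes further down, the snippet above does not compile as posted. For illustration, a variant that should assemble as a pure Win64 asm routine (Data arriving in RCX, Size in RDX) could look like the following; it is an untested sketch and assumes Size is a positive multiple of the cache-line size:

{$IFDEF CPUX64}
procedure FlushCache(Data: Pointer; Size: NativeInt);
const
  CACHE_LINE_SIZE = 64;
asm
  // Win64 convention: Data = RCX, Size = RDX
@NextBlock:
  SUB     RDX, CACHE_LINE_SIZE      // step back one cache line
  JL      @Done                     // stop once the offset goes negative
  CLFLUSH BYTE PTR [RCX + RDX]      // evict the line at Data + offset
  JMP     @NextBlock
@Done:
end;
{$ENDIF}

CLFLUSH evicts the whole cache line containing the given address, so Data itself does not need to be cache-line aligned.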
Anders Melander · Posted September 13

4 minutes ago, Stefan Glienke said:
    Looks like I am getting twice the speed at worst and 50 times at best.

Impressive.
Eric Grange · Posted September 13

Hmm, definitely something odd. I used the version from https://bitbucket.org/sglienke/spring4d/src/master/ so maybe it's off in some way?

Here are the results on an i7-1165G7, Win64, optimization on, stack frames off. Spring is only marginally ahead of the RTL:

Fill array (double), elements count: 500000
Start sorting ...
RTL    TArray.Sort (ms.): 63
Spring TArray.Sort (ms.): 56

Fill array (double), elements count: 5000000
Start sorting ...
RTL    TArray.Sort (ms.): 596
Spring TArray.Sort (ms.): 574

Fill array (double), elements count: 100000000
Start sorting ...
RTL    TArray.Sort (ms.): 14311
Spring TArray.Sort (ms.): 13708

A non-generic quicksort is about 30% faster than Spring in all three tests.
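Eric doesn't post his non-generic quicksort, but for context, a hand-written quicksort specialized for Double (no generic comparer indirection, which is typically where such a version gains) looks roughly like this sketch:

procedure QuickSortDouble(var A: array of Double; L, R: Integer);
var
  I, J: Integer;
  Pivot, Tmp: Double;
begin
  while L < R do
  begin
    I := L;
    J := R;
    Pivot := A[(L + R) div 2];
    repeat
      while A[I] < Pivot do Inc(I);
      while A[J] > Pivot do Dec(J);
      if I <= J then
      begin
        Tmp := A[I]; A[I] := A[J]; A[J] := Tmp;
        Inc(I); Dec(J);
      end;
    until I > J;
    // recurse into the smaller partition, loop on the larger one
    if J - L < R - I then
    begin
      if L < J then QuickSortDouble(A, L, J);
      L := I;
    end
    else
    begin
      if I < R then QuickSortDouble(A, I, R);
      R := J;
    end;
  end;
end;

// usage: if Length(Data) > 1 then QuickSortDouble(Data, 0, High(Data));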
Stefan Glienke · Posted September 13 (edited)

Did you precompile Spring or are you compiling directly from the sources? If you precompile (recommended) then you need to precompile with release settings, of course. If I had to bet, I would say you still have range checking enabled (the RTL has it disabled, and so does your code).

Also, to be honest, I don't care about a 30% improvement of an algorithm handwritten specifically for one type over the generic one. Yes, the compiler could do a better job with more aggressive inlining and such, but nowadays I am already happy if it generates working code.

23 minutes ago, Anders Melander said:
    Wouldn't it make sense to do a CLFLUSH before the sort so it doesn't benefit from all the data already being in the cache?

    procedure FlushCache(Data: Pointer; Size: Integer);
    const
      CACHE_LINE_SIZE = 64;
    asm
    @NextBlock:
      CLFLUSH [Data + Size]
      SUB     Size, CACHE_LINE_SIZE
      JGE     @NextBlock
    end;

Apart from the code not compiling, I would say it depends on what you want to benchmark: how sort performs on completely cold data? As in, data you loaded an hour ago, then did completely different things with, so it no longer resides in the cache, and now you want the fastest sorting possible. If so, then yes. Otherwise I would say you typically sort data that is already in cache, because you previously filled the array or list with it, or processed it, and then followed that with the sort operation.

Some additional reading on the "clear the LLC for a benchmark" topic: https://stackoverflow.com/a/49077734/587106

Edited September 13 by Stefan Glienke
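To make the warm-vs-cold distinction concrete, a minimal timing harness might look like the sketch below (assuming a console application; FlushCache refers to a routine along the lines of the sketch earlier in the thread, and the element count is arbitrary):

uses
  System.SysUtils, System.Diagnostics, System.Generics.Collections;

procedure BenchSort(WarmCache: Boolean);
var
  Data: TArray<Double>;
  I: Integer;
  SW: TStopwatch;
begin
  SetLength(Data, 5000000);
  for I := 0 to High(Data) do
    Data[I] := Random;                 // filling the array also warms the cache

  if not WarmCache then                // cold scenario: evict the data first
    FlushCache(@Data[0], Length(Data) * SizeOf(Double));

  SW := TStopwatch.StartNew;
  TArray.Sort<Double>(Data);
  Writeln(Format('Sorted %d doubles in %d ms (warm cache: %s)',
    [Length(Data), SW.ElapsedMilliseconds, BoolToStr(WarmCache, True)]));
end;

Note that 5 million doubles is already around 40 MB, larger than most last-level caches, so for big inputs most of the data is cold either way and the flush tends to change little.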
Eric Grange · Posted September 13

> If I had to bet I would say you still have range checking enabled (RTL has that disabled, so does your code)

Ah, no, it was on. I had spotted some "{$IFDEF RANGECHECKS_ON}" in the code and assumed Spring would control range checking itself, disabling it in sections where it's safe (because asserted, explicit loops on count/length, unit tested, etc., such as for a sort) 😞

However, looking more closely at the Spring source, that doesn't seem to be the case (for instance, Vector<T> GetItem would lose range checking globally as well), so timings with range checking off would be a bit of an oddball scenario for Spring.

> Looks like I am getting twice the speed at worst and 50 times at best.

That sounds impressive!
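For reference, the usual Delphi idiom for turning range checking off in one hot routine while restoring the project-wide setting afterwards looks like this (illustrative sketch only; the symbol name is made up and this is not a claim about how Spring's sources are organized):

{$IFOPT R+}{$DEFINE RANGECHECKS_WERE_ON}{$ENDIF}
{$R-} // range checking off for this routine only

procedure SortHotPath(var A: array of Double);
begin
  // index arithmetic here is bounded by the loop conditions,
  // so dropping the per-access range check is safe by construction
end;

{$IFDEF RANGECHECKS_WERE_ON}{$R+}{$ENDIF} // restore the caller's setting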
DelphiUdIT · Posted September 13

2 hours ago, Stefan Glienke said:
    To me they are not - if you have some backend server code that is already doing multiple things in parallel it won't get any better if you then do parallel sort.

Uhmm... during my tests the maximum CPU load was 4%; I don't think any other normal load would change the timings in an impressive way. Of course, it depends on where you need this work done. In my own applications I would not be able to use a parallel algorithm, as already discussed in other posts, because of the heavy load (85% of CPU). But I have never used a sort algorithm in my life, because I normally have at most a few hundred elements and I just shift them (memcopy) as they are inserted into the array.

Bye
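A minimal sketch of that shift-on-insert approach, keeping a small array ordered with a block move instead of ever sorting it; the names and the growth policy are illustrative:

procedure InsertSorted(var A: TArray<Double>; var Count: Integer; const Value: Double);
var
  I: Integer;
begin
  // linear scan for the insertion point; fine for a few hundred elements
  I := 0;
  while (I < Count) and (A[I] < Value) do
    Inc(I);
  if Count = Length(A) then
    SetLength(A, Length(A) * 2 + 4);   // grow when full
  if I < Count then
    // shift the tail up by one slot (System.Move handles the overlap)
    Move(A[I], A[I + 1], (Count - I) * SizeOf(Double));
  A[I] := Value;
  Inc(Count);
end;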