
This is quite interesting. I have not looked into how it is implemented, but the speedup claims are pretty substantial.

https://github.com/intel/x86-simd-sort

Sorting is rarely the real bottleneck in any code, but faster is still faster 🙂
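
As a rough idea of how it is meant to be called (a minimal sketch; I am going by the project's README here, so treat the header path and the avx512_qsort signature as assumptions rather than verified API):

    // Minimal sketch of the AVX-512 in-place sort from x86-simd-sort.
    // Header path and function name are assumed from the repository README.
    #include <cstdint>
    #include <vector>
    #include "src/avx512-32bit-qsort.hpp"

    int main()
    {
        std::vector<int32_t> data = {42, 7, 13, 99, 1, 256, 8};
        // Sorts the buffer in place using AVX-512 instructions;
        // the CPU running this must actually support AVX-512.
        avx512_qsort<int32_t>(data.data(), static_cast<int64_t>(data.size()));
        return 0;
    }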

 

-Tee-


Quite interesting, but I think it's not complete. Support for AVX2 should be fully available.

Intel's new processors (12th, 13th and 14th gen.) don't support AVX-512 (and it seems they won't in the future either).

 

In Intel's professional line (like some of the Xeon family), only a subset of the AVX-512 instructions is supported.

 

The AVX2 instructions are more widely supported.
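
Which is why anything built on top of it would have to dispatch at runtime anyway. A minimal sketch of such a check using the GCC/Clang built-ins (the three sort functions here are hypothetical placeholders, not part of the Intel library):

    #include <cstddef>

    // Hypothetical implementations of the same sort for different ISA levels.
    void sort_avx512(float* p, std::size_t n);
    void sort_avx2(float* p, std::size_t n);
    void sort_scalar(float* p, std::size_t n);

    // Pick the best path the current CPU actually supports.
    void sort_dispatch(float* p, std::size_t n)
    {
        if (__builtin_cpu_supports("avx512f"))
            sort_avx512(p, n);   // e.g. server Xeons, not 12th-14th gen desktop parts
        else if (__builtin_cpu_supports("avx2"))
            sort_avx2(p, n);     // widely supported since Haswell / Excavator
        else
            sort_scalar(p, n);
    }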

 

Bye


A serious question because many sort benchmarks are so obsessed with sorting large arrays/vectors of just integers and floats:

Do people in reality really have these kinds of data that need to be sorted?

7 minutes ago, Stefan Glienke said:

A serious question because many sort benchmarks are so obsessed with sorting large arrays/vectors of just integers and floats:

Do people in reality really have these kinds of data that need to be sorted?

I'm no expert on that, but from what I have seen of development "out there" I don't think there is a great need for it. Perhaps for certain specific purposes or for some algorithms it may be necessary. The development, even if it serves a niche, is still going ahead.
What I was focusing on is that it is of little use to provide only partial "optimizations", because they are too dependent on the hardware ... unless such optimizations are to be used only and exclusively on specific hardware (if I mainly have to target Intel CPUs I certainly wouldn't use AVX-512, but if I only target AMD, maybe yes).

5 hours ago, Stefan Glienke said:

A serious question because many sort benchmarks are so obsessed with sorting large arrays/vectors of just integers and floats:

Do people in reality really have these kinds of data that need to be sorted?

Benchmarks are benchmarks; they usually compare the capability of a certain piece of hardware, or of a system as a whole, to solve a task. That's what they do, and speed-up is the point, because slower is easy to achieve in general.

 

The granularity of data is growing, and working with time series is getting more important than ever before, since hardware has become fast enough.

 

I also find it hard to think of an application for just sorting key figures/values in an array as an independent dimension, not connected to descriptive values. Maybe my brain is too strongly tied to non-math database thinking, and I'm no technician or mathematician. Those people do make use of such data structures.

 

Sorting values should result in faster lookup times, with the sorted data acting as an index, and in being able to extract ranges from a series of values, for example. No idea why someone would need that, but as mentioned shortly before ...
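
To make that concrete (a plain illustration of my own, nothing to do with the Intel library): once a series of values is sorted, a range query becomes two binary searches:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<double> values = {7.5, 1.2, 3.3, 9.9, 4.4, 2.1, 8.8};
        std::sort(values.begin(), values.end());   // the sorted data acts as its own index

        // Extract the range [3.0, 9.0) with two O(log n) binary searches.
        auto lo = std::lower_bound(values.begin(), values.end(), 3.0);
        auto hi = std::lower_bound(values.begin(), values.end(), 9.0);
        for (auto it = lo; it != hi; ++it)
            std::printf("%g\n", *it);   // prints 3.3 4.4 7.5 8.8
        return 0;
    }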

 

In general we are used to, me at least, knowing what we are looking for. If values are just retrieved, searching and creating ranges can help to find out what should be looked up. Integer values (plain ones) as well as floating-point values can act as describing dimensions too. In technical applications, or in the analysis and planning of production-line data, it worked that way. In a business application almost everything that can be related to a surrogate id is more or less discrete, and searching starts with specifying describing values retrieved from the describing dimensions assumed to have an impact. To find the impacting/influencing dimensions, the measured values have to be analyzed first, values grouped, and so on.

 

One of my former business partners simplified optimization by first removing criteria that had no impact on the optimization of the product mix in production, instead of considering a fixed set.

 

Richard Werner (a German economist) used pretty much the same trick to remove the impact of the interest rate on the demand for credit in business, for example, unless the plug is pulled by the central bank. It's no surprise, because a "line"/company providing goods in the classical mid-European industry model used the interest rate only after approx. 35(40) to 50 years, to decide whether the line should keep producing and just repair the broken machines or build a new plant (reproduction of the line). Since the mid-1980s, even in Austria, everything has been pushed in the direction of the "Anglo-Saxon" industrial/business model. Those companies simply rebuilt the plant after 45 years and never really bothered with this situation. The myth of the impact of the interest rate, and the belief concerning investment decisions, is still alive and kept alive.

 

The same myths exist in many production planning systems that consider a wide variety of influential factors/criteria retrieved from the past, i.e. from so-called experience. That's why so many optimization models rely on, let's say, about 40 describing dimensions/criteria/influential factors taken from a pool of maybe 2000 potential ones, while in current production, in the liquid phase of a steel plant, only 3 or 4 really drive the optimization software's decision and the result of the planning and optimization run. This requires analyzing the values/key figures and the describing technical dimensions first, and that analysis should finish quickly, because today a production-plan run happens in de facto real time (about 0.5 s or less) and finding the influential factors should also not take long. Twenty years ago such a run was allowed to take 20 minutes.

 

15 hours ago, Stefan Glienke said:

A serious question because many sort benchmarks are so obsessed with sorting large arrays/vectors of just integers and floats:

Do people in reality really have these kinds of data that need to be sorted?

We write software that does scientific data analysis. We regularly have to sort arrays of millions of elements of data. But I recognize we are in the minority of developers.

16 hours ago, Stefan Glienke said:

Do people in reality really have these kinds of data that need to be sorted?

In fact, most people really do, though indirectly, have this kind of data sorted 😉

DB keys

1 hour ago, Fr0sT.Brutal said:

In fact, most people really do, though indirectly, have this kind of data sorted 😉

DB keys

"Indirectly" does not help for this algorithm, because it directly sorts a vector of integers or floats - and DB keys are organized in a different layout.

 

2 hours ago, Dave Novo said:

We write software that does scientific data analysis. We regularly have to sort arrays of millions of elements of data. But I recognize we are in the minority of developers.

Again, the question was not about sorting millions of elements of data but strictly about sorting arrays of integers or floats.


In statistics and/or data analysis to calculate the median you have to sort the data, at least I always have. When plotting data, say in an xy plot, sometimes it's better to sort the data by x first. So, in technical fields it's not that uncommon.
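
For the xy case, a tiny sketch of what I mean (plain std::sort over pairs; the data here is just for illustration):

    #include <algorithm>
    #include <utility>
    #include <vector>

    int main()
    {
        // (x, y) samples collected in arbitrary order.
        std::vector<std::pair<double, double>> xy = {
            {3.0, 9.0}, {1.0, 1.0}, {2.0, 4.0}
        };
        // Sort by x so the plotted line runs left to right.
        std::sort(xy.begin(), xy.end(),
                  [](const auto& a, const auto& b) { return a.first < b.first; });
        return 0;
    }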

36 minutes ago, DJSox said:

In statistics and/or data analysis to calculate the median you have to sort the data, at least I always have.

This sounds like a really interesting numerical problem! I suspect that in practical terms, sorting is going to be the winner though.


Correct, we sort arrays of integers and floats all the time, both double and single, and we have to process these millions of values as well. By "process" I mean perform mathematical operations on them. For us the Dew Research library mtxVec has proven invaluable.

On 11/2/2023 at 4:36 PM, Stefan Glienke said:

A serious question because many sort benchmarks are so obsessed with sorting large arrays/vectors of just integers and floats:

Do people in reality really have these kinds of data that need to be sorted?

Good point.

In the real-life apps I've been working on, sorting has almost always been of strings. Then a quite small portion has been lists of objects or records.


-Tee-


I'm not sure about exactly this library, but some predecessors were indeed able to sort a pointer list where the "key" is a 32/64-bit float/int. The key can be specified as an "offset" from the starting address -> this allows sorting pointers to records/classes ... at least that is the way they do it in "Fast Quicksort Implementation Using AVX Instructions".
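
Roughly what that "key as an offset" idea looks like in code (my own plain illustration with std::sort, not the SIMD version from that paper; the record type and names are made up):

    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct Measurement {
        int    id;
        double value;   // the field we want to sort by
    };

    // Sort a list of record pointers by a double found 'keyOffset' bytes
    // from the start of each record - the "key as an offset" approach.
    void sortByKeyOffset(std::vector<Measurement*>& items, std::size_t keyOffset)
    {
        std::sort(items.begin(), items.end(),
                  [keyOffset](const Measurement* a, const Measurement* b) {
                      double ka, kb;
                      std::memcpy(&ka, reinterpret_cast<const char*>(a) + keyOffset, sizeof ka);
                      std::memcpy(&kb, reinterpret_cast<const char*>(b) + keyOffset, sizeof kb);
                      return ka < kb;
                  });
    }

    // Usage: sortByKeyOffset(list, offsetof(Measurement, value));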

 

The AVX-512 definitions in the library actually use template arrays, so I guess that means you can sort records too, right, not only double/float ... ?

 

On 11/3/2023 at 1:23 PM, DJSox said:

In statistics and/or data analysis to calculate the median you have to sort the data, at least I always have. When plotting data, say in an xy plot, sometimes it's better to sort the data by x first. So, in technical fields it's not that uncommon.

This problem is solved by an algorithm that extracts the kth largest element (not only the median) -> basically it does the partitioning of quicksort but does not recurse into the "unnecessary" half. That in fact allows picking the median in (roughly) O(n) time ...
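
In C++ that is essentially what std::nth_element gives you (a quick sketch; it assumes a non-empty input and uses one common convention for even-length arrays):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Median via quickselect-style partial partitioning instead of a full sort:
    // average O(n) rather than O(n log n). Assumes v is non-empty.
    double median(std::vector<double> v)
    {
        const std::size_t mid = v.size() / 2;
        std::nth_element(v.begin(), v.begin() + mid, v.end());
        if (v.size() % 2 == 1)
            return v[mid];                       // odd count: exact middle element
        // Even count: average the middle element with the largest value below it.
        const double upper = v[mid];
        const double lower = *std::max_element(v.begin(), v.begin() + mid);
        return (lower + upper) / 2.0;
    }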

 

And if you want to go fancy you can do a rolling median over an array of doubles. The algorithm can be found here:
https://github.com/mikerabat/mrmath/blob/master/RollingMedMean.pas
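
That unit is Pascal; just to show the idea of a window that stays sorted while values slide through it, here is a simple O(n·w) sketch (the mrmath implementation is more elaborate than this):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Rolling median over a fixed window: keep the current window contents
    // sorted, inserting the incoming value and removing the outgoing one.
    std::vector<double> rollingMedian(const std::vector<double>& data, std::size_t window)
    {
        std::vector<double> result;
        std::vector<double> win;   // window contents, kept sorted
        for (std::size_t i = 0; i < data.size(); ++i) {
            win.insert(std::upper_bound(win.begin(), win.end(), data[i]), data[i]);
            if (i >= window)       // drop the element that just left the window
                win.erase(std::lower_bound(win.begin(), win.end(), data[i - window]));
            const std::size_t n = win.size();
            result.push_back(n % 2 == 1 ? win[n / 2]
                                        : (win[n / 2 - 1] + win[n / 2]) / 2.0);
        }
        return result;
    }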

 

 

