Mike Torrettinni

Fast lookup tables - TArray.BinarySearch vs Dictionary vs binary search

Recommended Posts

Good point - I will go even further and say that all the lookup methods should return the index rather than a specific property value, which a) improves performance, b) makes them better comparable, and c) makes the code more flexible.

Most of them then turn into easily inlinable one-liners.
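For illustration, something along these lines (just a rough sketch, not code from the attached project; the TDataLine name and its ID field are only assumptions about the benchmark data):

function FindIndexByID(const Items: TArray<TDataLine>; AID: Integer): Integer;
var
  L, R, M: Integer;
begin
  L := 0;
  R := High(Items);
  while L <= R do
  begin
    M := (L + R) shr 1;            // midpoint
    if Items[M].ID < AID then
      L := M + 1
    else if Items[M].ID > AID then
      R := M - 1
    else
      Exit(M);                     // found: return the index
  end;
  Result := -1;                    // not found
end;

The caller then reads whichever property it needs, e.g. if Idx >= 0 then S := Items[Idx].Name, so the same search routine serves every kind of lookup.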

5 hours ago, balabuev said:

Here is my try. I've changed a lot in the existing code (the comparers and equality comparers mentioned earlier, etc.). I've also replaced the string return type of the functions with PString, because otherwise string handling takes too much time and the comparison of the algorithms becomes uninteresting.

I've also added my own minimalistic (not universal) implementation of hash maps.

 

mapstest.zip

 

 

Some very impressive numbers, x32 and x64!

  • Like 1

Posted (edited)

FWIW, the approach of using objects as bucket items for a hash map is a no-go in real code: your memory will be scattered all over the place, plus there is the per-object memory overhead. In a microbenchmark you get away with it because there isn't much other memory around.

On my i7 the TNameMap already drops behind the RTL dict at around 50000k items. The same seems to be the case for the TSynDict, which drops even further behind.
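The usual alternative is to keep all entries in one contiguous dynamic array of records and chain them by index, so a lookup stays within a single allocation instead of hopping between heap objects. A rough structural sketch only (the names are made up, this is not code from the thread):

type
  THashEntry = record
    Hash: Cardinal;
    Next: Integer;               // index of the next entry in the bucket chain, -1 = end
    Key: string;
    Value: Integer;
  end;

  TCompactMap = record
    Buckets: TArray<Integer>;    // each slot is an index into Entries, -1 = empty
    Entries: TArray<THashEntry>; // all entries live in one contiguous block
    Count: Integer;
  end;

Growing such a map means growing one array; there is no per-item object header and the entries stay close together in memory.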

Edited by Stefan Glienke

Posted (edited)

@balabuev Is this something you came up with on the spot, are you using something similar elsewhere, or is it from another library? It's not licensed, right - can I use it in commercial software?

Edited by Mike Torrettinni

Posted (edited)
5 hours ago, Mike Torrettinni said:

Is this something you came up with on the spot, are you using something similar elsewhere, or is it from another library? It's not licensed, right - can I use it in commercial software?

 

No library used, just the simplest possible implementation of a hash map. Use it as you need.

 

7 hours ago, Stefan Glienke said:

the approach of using objects as bucket items for a hash map is a no-go in real code

 

This is the first time I've ever heard of this, and it seems very strange to me. Also, a single hash map can be organized on top of the TDataLine user objects without any additional objects (by adding Next and Hash fields directly to TDataLine). Two maps, however, would need another pair of such fields, which is not as graceful, but it is also possible if we are talking about the extreme case.
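Roughly what I mean (a sketch only; the actual TDataLine is in the attached zip, and the MapHash/MapNext field names here are made up for illustration):

type
  TDataLine = class
  public
    ID: Integer;
    Name: string;
    // intrusive map fields: the object itself is the bucket item
    MapHash: Cardinal;
    MapNext: TDataLine;          // next object in the same bucket chain
  end;

function FindInBucket(Head: TDataLine; AHash: Cardinal;
  const AName: string): TDataLine;
begin
  Result := Head;
  while (Result <> nil) and
        ((Result.MapHash <> AHash) or (Result.Name <> AName)) do
    Result := Result.MapNext;
end;

A second map over the same objects would need a second pair of such fields, which is the ungraceful part.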

 

7 hours ago, Stefan Glienke said:

On my i7 the TNameMap already drops behind the RTL dict at around 50000k items.

 

I've only tested with 100k items. I can see the effect. However, it's not clear to me why this happens.

 

 

 

Edited by balabuev
  • Thanks 1

On 3/6/2021 at 2:32 PM, Mike Torrettinni said:

I'm refactoring the usage of various different look-up tables and I'm trying to come up with one fast look-up table design. Most of them have one thing in common: search by integer (ID) and search by string (Name).

All data is sorted.

 

So, I benchmarked TArray.BinarySearch, TDictionary and a custom binary search.

Even though a flaw in the design of TArray.BinarySearch was pointed out (see https://en.delphipraxis.net/topic/4575-why-is-tarraybinarysearch-slower-than-normal-binary-search-function/), I still wanted to compare the results:

 

For search by integer, the custom binary search seems to be a little faster than TDictionary, until the tables get large:

[image: benchmark results - search by integer]

 

For search by string, the custom binary search wins by an even bigger margin, again as long as the tables are not large:

[image: benchmark results - search by string]

 

Interestingly, for small tables (10 records) a sequential search is faster than TDictionary.

 

I also tried TGpStringHash (by Primoz Gabrielcic), which is very fast, but unfortunately it has no implementation for integers.

TSynDictionary (from mORMot) is also very fast, but I don't use mORMot and its license is not friendly for my commercial project.

 

Are there any other fast search implementations, not too specialized and ready to use out of the box, that I could try?

 

 

 

 

Take a look at the Binary Array Set - a rare but very useful data structure.

  • Thanks 1

On 5/2/2021 at 4:39 AM, julkas said:

Take a look - Binary Array Set. Rare, but very useful data structure.

That is interesting. Do you use it, and for which cases do you prefer it?

 

I can't find a Delphi implementation. Do you know whether one exists?

22 hours ago, Mike Torrettinni said:

That is interesting. Do you use it, and for which cases do you prefer it?

 

I can't find a Delphi implementation. Do you know whether one exists?

There is no ready-to-use Pascal implementation 😞. You can start from / take ideas from existing C++ and Java code and create a good Object Pascal class for an ordered set or order-statistics set (without remove/delete 😞; the array bisection algorithm may be helpful).
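As a starting point for the bisection part, a minimal lower-bound search over a sorted array could look like this (just a sketch, not taken from any of the C++/Java implementations):

function LowerBound(const A: TArray<Integer>; AValue: Integer): Integer;
var
  L, R, M: Integer;
begin
  L := 0;
  R := Length(A);                // search the half-open range [L, R)
  while L < R do
  begin
    M := (L + R) shr 1;
    if A[M] < AValue then
      L := M + 1
    else
      R := M;
  end;
  Result := L;                   // first index with A[Result] >= AValue, or Length(A)
end;

An insert-only ordered set is then just "find the position with LowerBound, check for an exact match, otherwise insert there".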

2 hours ago, julkas said:

There is no ready-to-use Pascal implementation 😞. You can start from / take ideas from existing C++ and Java code and create a good Object Pascal class for an ordered set or order-statistics set (without remove/delete 😞; the array bisection algorithm may be helpful).

And then, after you have spent quite some time implementing this without defects, you may find that it is not any faster than the ready-to-use collection types, or that the actual performance improvement is completely negligible.


I must say, when I'm in the phase of getting stuff to work, I throw TDictionaries all over the place. It is fast to code and it's legible. Sometimes I pair them with a TList or TArray for sorted output or some such. Of course, if it's just 10 digits, it's an array from the start. Just make it readable and working!

 

Then, when the project has been/is being deployed to some alpha/beta users and is stable enough (so as not to confuse the testers or myself), I do some extensive monitoring. That shows where the points of interest are - for example, to ditch a dict and use an array, or to delve deep into the book for inspiration.

 

What I mean to say is: use the most "logical" way of storing data in the beginning. Then focus on the parts of the system where it really pays off. That way, you will have legible code in most places and "tuned" code (more complex to read, more sensitive to maintain) where it really matters.

 

Just my $0.05.

Edited by Dany Marmur
  • Like 2


@Dany Marmur Agreed - it's the job of runtime library developers to get the most out of their data structures and algorithms, so the users of those libraries don't have to worry 99.9% of the time and can just choose a reasonable type/algorithm for the job.

 

Funnily enough, I am currently in the process of doing exactly that for my library, and I can tell you that for the RTL this is unfortunately not the case.

  • Like 4


An impressive number of implemented collections, for sure, but it's only compatible with FreePascal, and from a quick look you won't easily make that code Delphi compatible.

But thanks for mentioning it - I will certainly run some benchmarks to compare.

11 hours ago, Stefan Glienke said:

An impressive number of implemented collections, for sure, but it's only compatible with FreePascal, and from a quick look you won't easily make that code Delphi compatible.

But thanks for mentioning it - I will certainly run some benchmarks to compare.

FPC benchmarks

On 3/18/2021 at 2:19 AM, balabuev said:

Here is my try. I've changed a lot in the existing code (the comparers and equality comparers mentioned earlier, etc.). I've also replaced the string return type of the functions with PString, because otherwise string handling takes too much time and the comparison of the algorithms becomes uninteresting.

I've also added my own minimalistic (not universal) implementation of hash maps.

 

mapstest.zip

 

 

Hi. I've downloaded and looked at your NameMapUnit unit and have the following comments:

1. Firstly, great speed, so thanks and well done.

2. I suggest adding an "InitialCapacity" parameter to your TMapBase constructor.

3. It seems to me that your TMapBase.Rehash method, which concurrently removes and reinserts items into FItems, risks some duplication. (OK, on further testing this doesn't seem to be a problem.) Anyhow, there may be a marginal improvement by reinserting the existing TMapItem instances instead of recreating them.
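What I have in mind is roughly the following shape (a hypothetical sketch, not your actual code from mapstest.zip; a TMapItem class with Hash and Next fields is assumed): during a grow, walk the old buckets and relink the existing instances into the new bucket array instead of destroying and recreating them.

type
  TMapItem = class               // assumed shape, for illustration only
  public
    Hash: Cardinal;
    Next: TMapItem;
  end;

procedure RehashReusingItems(var Buckets: TArray<TMapItem>; NewSize: Integer);
var
  OldBuckets: TArray<TMapItem>;
  Item, NextItem: TMapItem;
  I, Slot: Integer;
begin
  OldBuckets := Buckets;
  Buckets := nil;
  SetLength(Buckets, NewSize);        // fresh bucket array, all slots nil
  for I := 0 to High(OldBuckets) do
  begin
    Item := OldBuckets[I];
    while Item <> nil do
    begin
      NextItem := Item.Next;
      Slot := Item.Hash mod Cardinal(NewSize);
      Item.Next := Buckets[Slot];     // reuse the same instance, just relink it
      Buckets[Slot] := Item;
      Item := NextItem;
    end;
  end;
end;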

Cheers.

Edited by angusj

2 hours ago, angusj said:

Anyhow, there may be a marginal improvement by reinserting the existing TMapItem instances instead of recreating them

I tried to keep the code reasonably simple. Yes, sure, some improvements are possible. Instead of objects we could at least use records for the map items, allocated with New/Dispose (or, even better, with GetMem/FreeMem), as @Stefan Glienke mentioned before (if I've understood him correctly). We could also go further and try to minimize heap allocations, use some sort of map item pool, etc. And we could use an initial capacity, as you propose.
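For example (just a sketch of that direction, not code from the demo):

type
  PMapItem = ^TMapItem;
  TMapItem = record
    Hash: Cardinal;
    Key: string;                 // managed field
    Next: PMapItem;
  end;

function NewItem(AHash: Cardinal; const AKey: string; ANext: PMapItem): PMapItem;
begin
  New(Result);                   // New also initializes the managed Key field
  Result^.Hash := AHash;
  Result^.Key := AKey;
  Result^.Next := ANext;
end;

procedure FreeItem(Item: PMapItem);
begin
  Dispose(Item);                 // Dispose finalizes Key before freeing the memory
end;

With GetMem/FreeMem instead, the managed Key field would need explicit Initialize/Finalize calls (or be replaced by an unmanaged type), which avoids a little overhead but is easier to get wrong.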

 

But this would complicate the demo code and thus hide the main idea it tries to show. Moreover, map filling is not critical for this particular test - we measure only the search speed.

 

 

3 hours ago, balabuev said:

But this would complicate the demo code,

Noted. Cheers, and thanks again.

