
New hasher in town, to test and benchmark:
https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

 

Murmur and xxHash are left far behind it in terms of speed, and I guess in terms of collisions too... 15 GB/s on my Core i3, on both Win32 and Win64.

 

The smallest lengths of 0-15 bytes are hashed without any branch, 16-128 bytes need no loop, and 129+ bytes are hashed 128 bytes per iteration.
Also note its anti-DoS ability, thanks to a random seed picked at process startup.
So it was especially tuned for a hashmap/dictionary.

Edited by Arnaud Bouchez
With a hashmap, the search time in this case drops to roughly 1/N of a linear scan, where N is the size of the table.

As long as the load factor stays below roughly 0.7 the table works well; beyond that, performance degrades.

In such cases, the table size is increased and all data is moved to the new table.


This takes extra time and also means that while the rehash is running, your system cannot serve clients.
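As a minimal sketch of that growth trigger (the function name and the exact threshold are illustrative, not from any particular library):

```pascal
// Hypothetical helper: decide whether a hash table must grow before
// the next insertion, using the ~0.7 load-factor threshold above.
function NeedsGrow(Count, BucketCount: Cardinal): Boolean;
const
  MaxLoad = 0.7; // beyond this, chains/probes get long
begin
  Result := (Count + 1) > MaxLoad * BucketCount;
end;
```

When this returns True, the implementation allocates a larger bucket array and re-inserts every entry, which is exactly the pause described above.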

In my opinion, up to about 16 elements it is better to use a plain unsorted list and linear search.
Beyond that it makes sense to use either a TList sorted on each key insertion (~ N * log2(N) for quicksort), or a tree, where search and insertion with order maintenance cost an inexpensive log2(N).

 

Advantages of hashmaps.
Search time for a hashmap consists of: hash generation + 1 to m equality comparisons, where m is the length of the collision chain.


An ideal hash function:
1. Is fast.
2. Produces no collisions (does not generate the same hash for different keys).

 

I have compared many hash functions in my time and found that even the simplest ones give good results on real data.

I tested them on an address database with about a million entries and on a client base with about 500,000 entries.
The algorithms included MD5, SHA-2, SHA-3, CRC32, and a multiplicative hash.

 

function HashMultiplicative(const key: PByte; Size: Cardinal): Cardinal;
var
  i: Cardinal;
  p: PByte;
begin
  {$Q-} // the hash relies on Cardinal wrap-around, so disable overflow checks
  Result := 5381;                // Bernstein (djb2) initial value
  p := key;
  for i := 1 to Size do
  begin
    Result := 33 * Result + p^;  // hash := hash * 33 + next byte
    Inc(p);
  end;
end;
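For example, assuming the function above is in scope, hashing a key and reducing it to a bucket index might look like this (the table size N is a hypothetical value for illustration):

```pascal
// Usage sketch: hash an AnsiString and map it to a bucket index.
const
  N = 1009; // hypothetical prime table size
var
  s: AnsiString;
  Index: Cardinal;
begin
  s := 'John Smith, 221B Baker Street';
  Index := HashMultiplicative(PByte(PAnsiChar(s)), Length(s)) mod N;
end;
```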

 

For strings, a function that takes string length in characters into account when calculating the hash is useful.
For floating point numbers, it is better to use an algorithm that hashes the exponent and the mantissa of the number separately.
If it's an object or a record, sometimes it makes sense to write your own function that hashes selected key fields of the object.
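A sketch of that last idea, reusing the multiplicative function above on selected key fields of a record (the record type and the way the two field hashes are combined are my own illustration, not a standard recipe):

```pascal
// Hypothetical record: only Id and Name are key fields; Balance is not.
type
  TCustomer = record
    Id: Integer;
    Name: string;
    Balance: Double; // deliberately excluded from the hash
  end;

function HashCustomer(const C: TCustomer): Cardinal;
begin
  {$Q-} // same wrap-around arithmetic as HashMultiplicative
  Result := HashMultiplicative(PByte(@C.Id), SizeOf(C.Id));
  if C.Name <> '' then
    Result := 33 * Result + HashMultiplicative(
      PByte(Pointer(C.Name)), Length(C.Name) * SizeOf(Char));
end;
```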

 

I like hash table implementations with chains.

I usually know the approximate number of objects in the system and can immediately set the required input table size N.
This will require a bit more memory, but you know right away what the speedup will be and there will be no time wasted on dynamically increasing the table.

There is no point in increasing the size of the table much more than N.


Since the bucket index is computed as:
index_input := hash mod N;
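Putting the pieces together, a chained-bucket lookup is a short loop; this is a sketch with hypothetical structures, assuming the HashMultiplicative function from earlier in the thread:

```pascal
// Sketch of a lookup in a hash table with chains.
type
  PEntry = ^TEntry;
  TEntry = record
    Key: string;
    Value: Pointer;
    Next: PEntry; // collision chain within one bucket
  end;

function Find(const Buckets: array of PEntry; const Key: string): PEntry;
var
  e: PEntry;
begin
  // index := hash mod N, as above
  e := Buckets[HashMultiplicative(PByte(Pointer(Key)),
         Length(Key) * SizeOf(Char)) mod Cardinal(Length(Buckets))];
  // 1 to m equality comparisons, m being the chain length
  while (e <> nil) and (e^.Key <> Key) do
    e := e^.Next;
  Result := e; // nil if not found
end;
```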

 

Don't choose a table size that is a power of two; it increases the crowding of data in the table.

For many hashing algorithms, choosing a size that is not a prime number increases the probability of collisions and degrades the distribution.
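A small helper for picking such a size (a naive trial-division sketch of my own, adequate for realistic table sizes, not an optimized primality test):

```pascal
// Hypothetical helper: round a requested size up to the next prime,
// so that "hash mod N" spreads entries more evenly.
function NextPrime(N: Cardinal): Cardinal;

  function IsPrime(X: Cardinal): Boolean;
  var
    d: Cardinal;
  begin
    if X < 2 then Exit(False);
    d := 2;
    while d * d <= X do
    begin
      if X mod d = 0 then Exit(False);
      Inc(d);
    end;
    Result := True;
  end;

begin
  Result := N;
  while not IsPrime(Result) do
    Inc(Result);
end;
```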

 

