Tommi Prami

String comparison in HashTable


Yellow,

 

Watched this video last night and it had some interesting things in it:  

 

[Attached screenshot: the C++ snippet from the video, checking the first character before calling strncmp]

 

I did not test this in Delphi yet in any way, or figure out how to write it so that it behaves the same way as the C++, but this trick gave quite a performance boost.

 

Also, I've never looked into "perfect hash tables" before, and there were quite interesting tricks there, especially when you have a suitable dataset to use them on.

 

-Tee-

3 hours ago, David Heffernan said:

What seems missing to me is how to restrict the input data to be just valid TypeScript keywords.

AFAIU he solves just the task of keyword lookup, not the general data.

31 minutes ago, Fr0sT.Brutal said:

AFAIU he solves just the task of keyword lookup, not the general data.

The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword?

 

Perfect hashes are cool and all, but I wonder if this is the best example. 

12 minutes ago, David Heffernan said:

The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword?

Well, I just watched it super quickly with rewind, but my guess is that he is optimizing a parser or something similar: the software already has a token that is known to be an identifier, and he just wants to quickly check if it's a keyword.

 

I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.
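A minimal sketch of that idea in C (hypothetical keyword table, not the code from the video): store each keyword's length next to it and reject candidates by length before ever paying for a strncmp call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table with precomputed lengths. */
typedef struct {
    const char *text;
    size_t      len;
} Keyword;

static const Keyword kKeywords[] = {
    { "if", 2 }, { "in", 2 }, { "int", 3 }, { "return", 6 }, { "while", 5 },
};

static int is_keyword(const char *tok, size_t n) {
    for (size_t i = 0; i < sizeof kKeywords / sizeof kKeywords[0]; i++) {
        if (kKeywords[i].len != n)
            continue;                       /* cheap length filter, no call made */
        if (strncmp(kKeywords[i].text, tok, n) == 0)
            return 1;                       /* full compare only on equal length */
    }
    return 0;
}
```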

 

Edited by Fr0sT.Brutal

59 minutes ago, Fr0sT.Brutal said:

I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.

He tried that too.

1 hour ago, Fr0sT.Brutal said:

known to be an identifier and he just wants to quickly check if it's a keyword.

Perfect hash only works if you can restrict your input domain, or am I wrong?

Edited by David Heffernan


This is a highly specialized subject, and neither my job nor my personal interests have gotten me all that involved in this area of computer science. So, as a layman, if I understand this correctly, he'll have to go back to the drawing board as soon as he discovers that characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?

10 minutes ago, Sherlock said:

This is a highly specialized subject, and neither my job nor my personal interests have gotten me all that involved in this area of computer science. So, as a layman, if I understand this correctly, he'll have to go back to the drawing board as soon as he discovers that characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?

He's probably using UTF-8 

54 minutes ago, David Heffernan said:

Perfect hash only works if you can restrict your input domain, or am I wrong?

Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function (*1), it should work with any input string. The important part is that the strings he is searching for are known beforehand, so he can fine-tune the hash function.
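A toy sketch in C of that scheme (hand-picked index function over a made-up keyword set, not the video's code): the index function may send any input somewhere, and the final strcmp against the candidate keyword is what makes arbitrary input safe.

```c
#include <string.h>

/* Toy fixed keyword set, made up for this sketch. */
static const char *kKw[] = { "if", "for", "while", "return" };

/* Hand-picked index, collision-free on kKw only: first char modulo 8.
 * 'i' % 8 == 1, 'f' % 8 == 6, 'w' % 8 == 7, 'r' % 8 == 2. */
static const signed char kSlot[8] = { -1, 0, 3, -1, -1, -1, 1, 2 };

/* Returns the keyword index, or -1. The final strcmp is the verification
 * step that keeps arbitrary (non-keyword) input from matching. */
static int kw_lookup(const char *tok) {
    int s = kSlot[(unsigned char)tok[0] % 8];
    if (s < 0) return -1;
    return strcmp(tok, kKw[s]) == 0 ? s : -1;
}
```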

 

(*1: At least he was still doing that when I stopped watching at about 3/4 of the video.)


I am not sure what y'all are discussing here, but the snippet that Tommi posted is about checking the first character before even calling strncmp (video at around 15:15).

It's pretty obvious why that is faster - if the characters don't match, you get rid of the function call.

 

The algo does not need to restrict anything - the hash table contains all keywords, and the input is any string that occurs in the source code. If it's a keyword, it gets found; if it's not, then it won't be found, simple as that.
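A minimal C sketch of that first-character pre-check (illustrative, not the code from the screenshot): the inline comparison usually fails first, so the strncmp call is only made when a match is plausible.

```c
#include <stddef.h>
#include <string.h>

/* First-character pre-check before the full compare. */
static int matches_keyword(const char *tok, size_t n,
                           const char *kw, size_t kwlen) {
    if (n != kwlen) return 0;
    if (tok[0] != kw[0]) return 0;      /* usually fails here, no call made */
    return strncmp(tok, kw, n) == 0;    /* full compare only when plausible */
}
```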

Edited by Stefan Glienke
3 hours ago, dummzeuch said:

Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function (*1), it should work with any input string.

This is the part I didn't watch, but yeah, that makes so much more sense now! 


1) TL;DR

 

Do not try to use those tricks in anything close to a common-use hash table, e.g. your own data library.

 

So much micro-benchmarking for no benefit in practice: all this is not applicable to a common-use library.
The Rust RTL has already been optimized and scrutinized by a lot of people and production workloads.

I would never try to apply what he found for his own particular case to any hash table implementation.

 

2) hash function

 

From profiling on real workloads, with a good enough hash function, there are only a few collisions.

Of course, FNV is a slow function, with a bad collision rate.

With a good hash function, e.g. our AesNiHash32 from https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.core.asmx64.inc#L6500 which is inspired by the one in Go RTL, comparing the first char is enough to reject most collisions, due to its almost-random hash spreading.

Then, the idea of "tweaking the hash function" is just a pure waste of computer resources for a common-use library, once you have a realistic hash table with thousands (or millions) of items.
Of course, this guy wants to hash a table of a few elements, which are known in advance, so it is not a problem for him.

So in the end, no hash function at all was the fastest. Of course.
But then it is not a hash table any more - it is a dedicated algorithm for a specific use-case.

 

3) security

 

Only using the first and last characters is an awful assumption for a hash process in a common library. It may work for his own dataset, but it is a very unsafe practice.
This is the 101 of hash-table security: don't make the hash guessable, or you expose yourself to hash flooding: http://ocert.org/advisories/ocert-2012-001.html
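For illustration, the usual mitigation is to fold a per-process random seed into the hash, so an attacker cannot precompute colliding keys. A hedged sketch using FNV-1a (the seed value below is a stand-in; real code would draw it from a CSPRNG at startup):

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a with a seed folded into the offset basis, so hash values are
 * not predictable across processes (the usual anti-flooding measure). */
static uint32_t fnv1a_seeded(const char *s, size_t n, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;    /* FNV offset basis, perturbed */
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}
```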

 

4) one known algorithm for such a fixed keyword lookup

 

The purpose of this video is to quickly find a value within a fixed list of keywords.

 

And from what I have seen in practice, some algorithms would perform better because they don't involve a huge hash table and don't pollute the CPU cache.
For instance, this code is used on billions of computers, on billions of datasets, and works very well in practice:
https://sqlite.org/src/file?name=src/tokenize.c&ci=trunk

The code is generated by https://sqlite.org/src/file?name=tool/mkkeywordhash.c&ci=trunk

An extract is:
 

/* Check to see if z[0..n-1] is a keyword. If it is, write the
** parser symbol code for that keyword into *pType.  Always
** return the integer n (the length of the token). */
static int keywordCode(const char *z, int n, int *pType){
  int i, j;
  const char *zKW;
  if( n>=2 ){
    i = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n*1) % 127;
    for(i=((int)aKWHash[i])-1; i>=0; i=((int)aKWNext[i])-1){
      if( aKWLen[i]!=n ) continue;
      zKW = &zKWText[aKWOffset[i]];
      if( (z[0]&~0x20)!=zKW[0] ) continue;
      if( (z[1]&~0x20)!=zKW[1] ) continue;
      j = 2;
      while( j<n && (z[j]&~0x20)==zKW[j] ){ j++; }
      if( j<n ) continue;
      *pType = aKWCode[i];
      break;
    }
  }
  return n;
}

Its purpose was to reduce the code size, but in practice it also reduces CPU cache pollution and tends to be very fast, thanks to a 128-byte hash table.
This code is close to what the video proposes - just even more optimized.
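As a hedged illustration of the bucket line inside keywordCode above (using tolower as a simplified stand-in for SQLite's real charMap), the computation is case-insensitive and always lands in the 127-entry table:

```c
#include <ctype.h>

/* Simplified stand-in for SQLite's charMap: case folding only. */
static int charMap(char c) { return tolower((unsigned char)c); }

/* Same bucket computation as the first line inside keywordCode. */
static int kw_bucket(const char *z, int n) {
    return ((charMap(z[0]) * 4) ^ (charMap(z[n - 1]) * 3) ^ (n * 1)) % 127;
}
```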

Edited by Arnaud Bouchez

On 7/21/2023 at 3:13 PM, David Heffernan said:

He's probably using UTF-8 

From the asm shown in the video, it seems to compare WORD PTR characters, so it is likely to be UTF-16.


I watched the video in jumps, but here is what I understood:

- The guy needs a specialized function to check whether a word belongs to a very small set of predefined words

- These words are Latin

- He uses a "hash" as int32 = s[0], s[1], s[-2], s[-1] (inputs are ASCII/UTF-8, but the algo may also work with UTF-16 with only the first and last chars used as the hash) (hmm, but then he'd get an array of 2 billion pointers...)

- The hash is assumed to be unique for his set (he just takes wordlist[key] without any sub-arrays of items sharing the same hash)

- And he also checks the first char before running the full compare
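A sketch of that packing in C (my reading of the scheme, with hypothetical bounds handling): the first two and last two bytes go into one 32-bit value, and since arbitrary strings can collide on those four bytes, the full compare from the last bullet stays necessary.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack first two and last two bytes of the token into a 32-bit "hash".
 * Assumes n >= 2; shorter tokens would need a special case. */
static uint32_t corner_hash(const char *s, size_t n) {
    return ((uint32_t)(unsigned char)s[0]     << 24) |
           ((uint32_t)(unsigned char)s[1]     << 16) |
           ((uint32_t)(unsigned char)s[n - 2] <<  8) |
            (uint32_t)(unsigned char)s[n - 1];
}
```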

Edited by Fr0sT.Brutal
