Tommi Prami

String comparison in HashTable


Yellow,

 

Watched this video last night and it had some interesting things in it:  

 

[Attached screenshot: the C++ snippet from the video, checking the first character before calling strncmp]

 

I did not test this in Delphi yet in any way, or figure out how to write it so that it behaves the same way as the C++, but this trick gave quite a performance boost.

 

Also, I've never looked into "perfect hash tables" before, and there were quite interesting tricks there, especially when you have a suitable dataset to use them on.

 

-Tee-

3 hours ago, David Heffernan said:

What seems missing to me is how to restrict the input data to be just valid TypeScript keywords.

AFAIU he solves just the task of keyword lookup, not the general data.

31 minutes ago, Fr0sT.Brutal said:

AFAIU he solves just the task of keyword lookup, not the general data.

The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword?

 

Perfect hashes are cool and all, but I wonder if this is the best example. 

12 minutes ago, David Heffernan said:

The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword?

Well, I just watched it super quickly with rewind, but my guess is that he is optimizing a parser or something similar: the software already has a token that is known to be an identifier, and he just wants to quickly check if it's a keyword.

 

I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.
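A minimal sketch of that idea in C (hypothetical keyword table, not the code from the video): store each keyword's length next to it and reject candidates by length before ever paying for a strncmp call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table with precomputed lengths. */
typedef struct {
    const char *text;
    size_t      len;
} Keyword;

static const Keyword kKeywords[] = {
    { "if", 2 }, { "in", 2 }, { "int", 3 }, { "return", 6 }, { "while", 5 },
};

static int is_keyword(const char *tok, size_t n) {
    for (size_t i = 0; i < sizeof kKeywords / sizeof kKeywords[0]; i++) {
        if (kKeywords[i].len != n)
            continue;                       /* cheap length filter, no call made */
        if (strncmp(kKeywords[i].text, tok, n) == 0)
            return 1;                       /* full compare only on equal length */
    }
    return 0;
}
```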

 

Edited by Fr0sT.Brutal

59 minutes ago, Fr0sT.Brutal said:

I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.

He tried that too.

1 hour ago, Fr0sT.Brutal said:

known to be an identifier and he just wants to quickly check if it's a keyword.

Perfect hash only works if you can restrict your input domain, or am I wrong?

Edited by David Heffernan


This is a highly specialized subject, and neither my job nor my personal interests have gotten me all that involved in this area of computer science. So, as a layman, if I understand this correctly, he'll have to go back to the drawing board as soon as he discovers that characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?

10 minutes ago, Sherlock said:

This is a highly specialized subject, and neither my job nor my personal interests have gotten me all that involved in this area of computer science. So, as a layman, if I understand this correctly, he'll have to go back to the drawing board as soon as he discovers that characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?

He's probably using UTF-8 

54 minutes ago, David Heffernan said:

Perfect hash only works if you can restrict your input domain, or am I wrong?

Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function (*1), it should work with any input string. The important part is that the strings he is searching for are known beforehand, so he can fine-tune the hash function.
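A toy sketch in C of that scheme (hand-picked index function over a made-up keyword set, not the video's code): the index function may send any input somewhere, and the final strcmp against the candidate keyword is what makes arbitrary input safe.

```c
#include <string.h>

/* Toy fixed keyword set, made up for this sketch. */
static const char *kKw[] = { "if", "for", "while", "return" };

/* Hand-picked index, collision-free on kKw only: first char modulo 8.
 * 'i' % 8 == 1, 'f' % 8 == 6, 'w' % 8 == 7, 'r' % 8 == 2. */
static const signed char kSlot[8] = { -1, 0, 3, -1, -1, -1, 1, 2 };

/* Returns the keyword index, or -1. The final strcmp is the verification
 * step that keeps arbitrary (non-keyword) input from matching. */
static int kw_lookup(const char *tok) {
    int s = kSlot[(unsigned char)tok[0] % 8];
    if (s < 0) return -1;
    return strcmp(tok, kKw[s]) == 0 ? s : -1;
}
```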

 

(*1: At least he was still doing that when I stopped watching at about 3/4 of the video.)


I am not sure what y'all are discussing here, but the snippet that Tommi posted is about checking the first character before even calling strncmp (video at around 15:15).

It's pretty obvious why that is faster - if the characters don't match, you get rid of the function call.

 

The algo does not need to restrict anything - the hash table contains all keywords, and the input is any string that occurs in the source code. If it's a keyword, it gets found; if it's not, then it won't be found, simple as that.
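A minimal C sketch of that first-character pre-check (illustrative, not the code from the screenshot): the inline comparison usually fails first, so the strncmp call is only made when a match is plausible.

```c
#include <stddef.h>
#include <string.h>

/* First-character pre-check before the full compare. */
static int matches_keyword(const char *tok, size_t n,
                           const char *kw, size_t kwlen) {
    if (n != kwlen) return 0;
    if (tok[0] != kw[0]) return 0;      /* usually fails here, no call made */
    return strncmp(tok, kw, n) == 0;    /* full compare only when plausible */
}
```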

Edited by Stefan Glienke
3 hours ago, dummzeuch said:

Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function (*1), it should work with any input string.

This is the part I didn't watch, but yeah, that makes so much more sense now! 


1) TL;DR

 

Do not try to use those tricks in anything close to a common-use hash table, e.g. your own data library.

 

So much micro-benchmarking for no benefit in practice: all this is not applicable to a common-use library.
The Rust RTL has already been optimized and scrutinized by a lot of people and production workloads.

I would never try to apply what he found for his own particular case to any hash table implementation.

 

2) hash function

 

From profiling on real workloads, with a good enough hash function, there are only a few collisions.

Of course, FNV is a slow function, with a bad collision rate.

With a good hash function, e.g. our AesNiHash32 from https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.core.asmx64.inc#L6500 which is inspired by the one in Go RTL, comparing the first char is enough to reject most collisions, due to its almost-random hash spreading.

Then, the idea of "tweaking the hash function" is just a pure waste of computer resources for a common-use library, once you have a realistic hash table with thousands (or millions) of items.
Of course, this guy wants to hash a table of a few elements, which are known in advance, so it is not a problem for him.

So in the end, no hash function at all was the fastest. Of course.
But then it is not a hash table any more - it is a dedicated algorithm for a specific use-case.

 

3) security

 

Only using the first and last characters is an awful assumption for a hash process in a common library. It may work for his own dataset, but it is a very unsafe practice.
This is the 101 of hash-table security: don't make the hash guessable, or you expose yourself to hash flooding: http://ocert.org/advisories/ocert-2012-001.html
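For illustration, the usual mitigation is to fold a per-process random seed into the hash, so an attacker cannot precompute colliding keys. A hedged sketch using FNV-1a (the seed value below is a stand-in; real code would draw it from a CSPRNG at startup):

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a with a seed folded into the offset basis, so hash values are
 * not predictable across processes (the usual anti-flooding measure). */
static uint32_t fnv1a_seeded(const char *s, size_t n, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;    /* FNV offset basis, perturbed */
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}
```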

 

4) one known algorithm for such a fixed keyword lookup

 

The purpose of this video is to quickly find a value within a fixed list of keywords.

 

And from what I have seen in practice, some algorithms would perform better because they don't involve a huge hash table and don't pollute the CPU cache.
For instance, this code is used on billions of computers, on billions of datasets, and works very well in practice:
https://sqlite.org/src/file?name=src/tokenize.c&ci=trunk

The code is generated by https://sqlite.org/src/file?name=tool/mkkeywordhash.c&ci=trunk

An extract is:
 

/* Check to see if z[0..n-1] is a keyword. If it is, write the
** parser symbol code for that keyword into *pType.  Always
** return the integer n (the length of the token). */
static int keywordCode(const char *z, int n, int *pType){
  int i, j;
  const char *zKW;
  if( n>=2 ){
    i = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n*1) % 127;
    for(i=((int)aKWHash[i])-1; i>=0; i=((int)aKWNext[i])-1){
      if( aKWLen[i]!=n ) continue;
      zKW = &zKWText[aKWOffset[i]];
      if( (z[0]&~0x20)!=zKW[0] ) continue;
      if( (z[1]&~0x20)!=zKW[1] ) continue;
      j = 2;
      while( j<n && (z[j]&~0x20)==zKW[j] ){ j++; }
      if( j<n ) continue;
      *pType = aKWCode[i];
      break;
    }
  }
  return n;
}

Its purpose was to reduce the code size, but in practice it also reduces CPU cache pollution and tends to be very fast, thanks to a 128-byte hash table.
This code is close to what the video proposes - just even more optimized.
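As a hedged illustration of the bucket line inside keywordCode above (using tolower as a simplified stand-in for SQLite's real charMap), the computation is case-insensitive and always lands in the 127-entry table:

```c
#include <ctype.h>

/* Simplified stand-in for SQLite's charMap: case folding only. */
static int charMap(char c) { return tolower((unsigned char)c); }

/* Same bucket computation as the first line inside keywordCode. */
static int kw_bucket(const char *z, int n) {
    return ((charMap(z[0]) * 4) ^ (charMap(z[n - 1]) * 3) ^ (n * 1)) % 127;
}
```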

Edited by Arnaud Bouchez

On 7/21/2023 at 3:13 PM, David Heffernan said:

He's probably using UTF-8 

From the asm shown in the video, it seems to compare WORD PTR characters, so it is likely to be UTF-16.


I watched the video in jumps, but here is what I understood:

- The guy needs a specialized function to check whether a word belongs to a very small set of predefined words

- These words are Latin

- He uses a "hash" as int32 = s[0], s[1], s[-2], s[-1] (inputs are ASCII/UTF-8, but the algo may also work with UTF-16 with only the first and last chars used as the hash) (hmm, but then he'd get an array of 2 billion pointers...)

- The hash is assumed to be unique for his set (he just takes wordlist[key] without any sub-arrays of items sharing the same hash)

- And he also checks the first char before running the full compare
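A sketch of that packing in C (my reading of the scheme, with hypothetical bounds handling): the first two and last two bytes go into one 32-bit value, and since arbitrary strings can collide on those four bytes, the full compare from the last bullet stays necessary.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack first two and last two bytes of the token into a 32-bit "hash".
 * Assumes n >= 2; shorter tokens would need a special case. */
static uint32_t corner_hash(const char *s, size_t n) {
    return ((uint32_t)(unsigned char)s[0]     << 24) |
           ((uint32_t)(unsigned char)s[1]     << 16) |
           ((uint32_t)(unsigned char)s[n - 2] <<  8) |
            (uint32_t)(unsigned char)s[n - 1];
}
```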

Edited by Fr0sT.Brutal
