Tommi Prami Posted July 21, 2023
Yellow, I watched this video last night and it had some interesting things in it: I haven't tested this in Delphi yet in any way, or figured out how to write it so it behaves the same way as the C++, but this trick gave quite a performance boost. Also, I had never looked into "perfect hash tables" before, and there were quite interesting tricks in there, especially when you have a suitable dataset to use them for. -Tee-
David Heffernan Posted July 21, 2023
What seems missing to me is how to restrict the input data to be just valid TypeScript keywords.
Fr0sT.Brutal Posted July 21, 2023
3 hours ago, David Heffernan said: What seems missing to me is how to restrict the input data to be just valid TypeScript keywords.
AFAIU he solves just the task of keyword lookup, not the general data.
David Heffernan Posted July 21, 2023
31 minutes ago, Fr0sT.Brutal said: AFAIU he solves just the task of keyword lookup, not the general data.
The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword? Perfect hashes are cool and all, but I wonder if this is the best example.
Fr0sT.Brutal Posted July 21, 2023
12 minutes ago, David Heffernan said: The thing is, how are you going to get to the stage where you have got a token that you know is a valid keyword?
Well, I just watched it super quickly with rewind, but my guess is that he optimizes a parser or something similar, where the software has already got a token that is known to be an identifier and he just wants to quickly check if it's a keyword. I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.
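As a rough illustration of that length pre-check idea (not the video's actual code: the keyword table, names and layout below are made up), the pattern would look something like this in C:

#include <string.h>

typedef struct {
    const char *text;
    size_t      len;    /* stored once, so no strlen() at lookup time */
} Keyword;

/* hypothetical keyword set, nothing to do with the video's table */
static const Keyword kKeywords[] = {
    { "break",    5 },
    { "return",   6 },
    { "function", 8 },
};

/* Return 1 if z[0..n-1] is one of the keywords, 0 otherwise. */
static int is_keyword(const char *z, size_t n)
{
    for (size_t i = 0; i < sizeof(kKeywords) / sizeof(kKeywords[0]); i++) {
        if (kKeywords[i].len != n) continue;        /* cheap reject: length    */
        if (kKeywords[i].text[0] != z[0]) continue; /* cheap reject: 1st char  */
        if (strncmp(kKeywords[i].text, z, n) == 0)  /* full compare comes last */
            return 1;
    }
    return 0;
}

Here a plain linear scan stands in for the hash lookup; the point is only the ordering: equal length and equal first character are checked before the comparatively expensive strncmp call.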
dummzeuch Posted July 21, 2023
59 minutes ago, Fr0sT.Brutal said: I wonder whether things would be even faster if he'd stored the lengths of the keywords and compared them before calling strncmp. That should give some gain in cases with the same first letter but different lengths.
He tried that too.
David Heffernan Posted July 21, 2023
1 hour ago, Fr0sT.Brutal said: ...known to be an identifier and he just wants to quickly check if it's a keyword.
Perfect hash only works if you can restrict your input domain, or am I wrong?
Sherlock Posted July 21, 2023
This is a highly specialized subject, and neither my job nor my personal interests got me all that involved in this area of computer science. So, as a layman, if I understand this correctly, he'll have to go back to the drawing board as soon as he discovers characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?
David Heffernan Posted July 21, 2023
10 minutes ago, Sherlock said: ...he'll have to go back to the drawing board as soon as he discovers characters are now UTF-8, UTF-16 or even UTF-32, because his finely tuned algorithm is designed for single-byte characters. Or is it as easy as switching data types?
He's probably using UTF-8.
dummzeuch Posted July 21, 2023
54 minutes ago, David Heffernan said: Perfect hash only works if you can restrict your input domain, or am I wrong?
Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function(*1), it should work with any input string. The important part is that the strings he is searching for are known beforehand, so he can fine-tune the hash function. (*1: At least he was still doing that when I stopped watching, at about 3/4 of the video.)
Stefan Glienke Posted July 21, 2023
I am not sure what y'all are discussing here, but the snippet that Tommi posted is about checking the first character before even calling strncmp (video at around 15:15). It's pretty obvious why that is faster - if the characters don't match, you get rid of the function call. The algo does not need to restrict anything - the hash table contains all keywords and the input is any string that occurs in the source code - if it's a keyword, it gets found; if it's not, then it won't be found, simple as that.
David Heffernan Posted July 21, 2023
3 hours ago, dummzeuch said: Since, as the last step, he does an actual comparison between the input string and the string "found" by the hash function(*1), it should work with any input string.
This is the part I didn't watch, but yeah, that makes so much more sense now!
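Putting those two observations together, the lookup pattern under discussion can be sketched roughly as follows; the keyword set, the toy hash and the table size here are hypothetical, not taken from the video:

#include <string.h>

#define TABLE_SIZE 64   /* hypothetical size, chosen so the fixed keyword set does not collide */

static const char *table[TABLE_SIZE];   /* slot -> keyword text, or NULL */

/* toy hash mixing first char, last char and length; assumes n >= 1 */
static unsigned toy_hash(const char *z, size_t n)
{
    return ((unsigned char)z[0] * 31u
          ^ (unsigned char)z[n - 1] * 7u
          ^ (unsigned)n) % TABLE_SIZE;
}

static void add_keyword(const char *kw)
{
    table[toy_hash(kw, strlen(kw))] = kw;   /* assumes no collisions for this fixed set */
}

static int is_keyword(const char *z, size_t n)
{
    const char *candidate = table[toy_hash(z, n)];
    if (candidate == NULL) return 0;
    if (candidate[0] != z[0]) return 0;          /* cheap first-char reject before the call */
    return strncmp(candidate, z, n) == 0         /* full comparison against the stored text  */
        && candidate[n] == '\0';                 /* ...including its length, so any input is safe */
}

The hash does nothing more than pick a single candidate slot; the final comparison against the stored keyword is what makes the answer correct for arbitrary input, which is the point made above.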
Arnaud Bouchez Posted July 24, 2023

1) TL;DR
Do not try to use those tricks in anything close to a common-use hash table, e.g. your own data library. So much micro-benchmarking for no benefit in practice: none of this is applicable to a common-use library. The Rust RTL has already been optimized and scrutinized by a lot of people and production workloads. I would never try to apply what he found for his own particular case to any hash table implementation.

2) Hash function
From profiling on real workloads, with a good enough hash function, there are only a few collisions. Of course, FNV is a slow function with a bad collision rate. With a good hash function, e.g. our AesNiHash32 from https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.core.asmx64.inc#L6500 which is inspired by the one in the Go RTL, comparing the first char is enough to reject most collisions, due to its almost-random hash spreading. Then, the idea of "tweaking the hash function" is just a pure waste of computing resources for a common-use library, once you have a realistic hash table with thousands (or millions) of items. Of course, this guy wants to hash a table of a few elements which are known in advance, so it is not a problem for him. So no hash function at all was the fastest. Of course. But this is not a hash table any more - it is a dedicated algorithm for a specific use case.

3) Security
Only using the first and last characters is an awful assumption for a hash process in a common library. It may work for his own dataset, but it is a very unsafe practice. This is the 101 of hash table security: don't make it guessable, or you expose yourself to hash flooding: http://ocert.org/advisories/ocert-2012-001.html

4) One known algorithm for such a fixed keyword lookup
The purpose of this video is to quickly find a value within a fixed list of keywords. And from what I have seen in practice, some algorithms would perform better because they won't involve a huge hash table and won't pollute the CPU cache. For instance, this code is used on billions of computers, on billions of datasets, and works very well in practice: https://sqlite.org/src/file?name=src/tokenize.c&ci=trunk The code is generated by https://sqlite.org/src/file?name=tool/mkkeywordhash.c&ci=trunk An extract is:

/* Check to see if z[0..n-1] is a keyword. If it is, write the
** parser symbol code for that keyword into *pType. Always
** return the integer n (the length of the token). */
static int keywordCode(const char *z, int n, int *pType){
  int i, j;
  const char *zKW;
  if( n>=2 ){
    i = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n*1) % 127;
    for(i=((int)aKWHash[i])-1; i>=0; i=((int)aKWNext[i])-1){
      if( aKWLen[i]!=n ) continue;
      zKW = &zKWText[aKWOffset[i]];
      if( (z[0]&~0x20)!=zKW[0] ) continue;
      if( (z[1]&~0x20)!=zKW[1] ) continue;
      j = 2;
      while( j<n && (z[j]&~0x20)==zKW[j] ){ j++; }
      if( j<n ) continue;
      *pType = aKWCode[i];
      break;
    }
  }
  return n;
}

Its purpose was to reduce the code size, but in practice it also reduces CPU cache pollution and tends to be very fast, thanks to a 128-byte hash table. This code is close to what the video proposes - just even more optimized.
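For readers who want to see how those generated tables hang together, here is a toy reconstruction of the same scheme in C, built at run time for a tiny made-up keyword set; the real SQLite tables are precomputed constants emitted by mkkeywordhash.c, so everything below (the keyword list, table sizes, function names other than charMap) is illustrative only:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NKW 3

static const char *kw[NKW] = { "SELECT", "FROM", "WHERE" };  /* hypothetical keyword set */
static unsigned char aKWHash[127];   /* hash slot -> 1-based keyword index, 0 = empty */
static unsigned char aKWNext[NKW];   /* collision chain, 1-based, 0 = end of chain    */
static unsigned char aKWLen[NKW];

static int charMap(int c){ return tolower((unsigned char)c); }

/* build the hash/chain tables once, instead of generating them offline */
static void build(void){
  for(int i=0; i<NKW; i++){
    int n = (int)strlen(kw[i]);
    int h = ((charMap(kw[i][0])*4) ^ (charMap(kw[i][n-1])*3) ^ n) % 127;
    aKWLen[i]  = (unsigned char)n;
    aKWNext[i] = aKWHash[h];             /* chain any previous occupant of this slot */
    aKWHash[h] = (unsigned char)(i+1);
  }
}

/* Return the 0-based keyword index, or -1 if z[0..n-1] is not a keyword. */
static int lookup(const char *z, int n){
  int h = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n) % 127;
  for(int i=(int)aKWHash[h]-1; i>=0; i=(int)aKWNext[i]-1){
    if(aKWLen[i]!=n) continue;                    /* length check first        */
    if((z[0]&~0x20)!=kw[i][0]) continue;          /* cheap first-char reject   */
    int j = 1;
    while(j<n && (z[j]&~0x20)==kw[i][j]) j++;     /* ASCII case-folded compare */
    if(j==n) return i;
  }
  return -1;
}

int main(void){
  build();
  printf("%d %d\n", lookup("select", 6), lookup("selects", 7));  /* prints: 0 -1 */
  return 0;
}

The real generator also reorders the keywords so they overlap inside a single concatenated zKWText string (for example, INDEX sits inside REINDEX) and stores offsets into it, which is where much of the size reduction comes from.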
Arnaud Bouchez Posted July 24, 2023
On 7/21/2023 at 3:13 PM, David Heffernan said: He's probably using UTF-8.
From the asm shown in the video, it seems to compare WORD PTR characters, so it is likely to be UTF-16.
Fr0sT.Brutal Posted July 24, 2023
I watched the video in jumps, but what I understood:
- The guy needs a specialized function to check whether a word belongs to a very short set of predefined words
- These words are Latin
- He uses a "hash" as int32 = s[0], s[1], s[-2], s[-1] (inputs are ASCII/UTF-8, but the algo may also work with UTF-16 with only the 1st and last char used as hash) (hmm, but then he'd get an array of 2 billion pointers...)
- The hash is assumed to be unique for his set (he just takes wordlist[key] without any sub-arrays of items with the same hash)
- And he also checks the 1st char before running the full compare
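To make that packed "hash" description concrete, here is a C sketch of what such a scheme could look like; note that reducing the packed 32-bit value to a small table index (the masking below) is an assumption here, the video may do the reduction differently, which would also answer the 2-billion-pointer concern above:

#include <stdint.h>
#include <string.h>

#define SLOT_MASK 0xFFu                   /* hypothetical 256-slot table */

static const char *slots[SLOT_MASK + 1];  /* slot -> keyword text, or NULL */

/* Pack the first two and last two bytes of the token into a 32-bit value.
   Assumes n >= 2; shorter tokens would need a special case. */
static uint32_t pack_hash(const char *z, size_t n)
{
    return ((uint32_t)(unsigned char)z[0]     << 24)
         | ((uint32_t)(unsigned char)z[1]     << 16)
         | ((uint32_t)(unsigned char)z[n - 2] <<  8)
         |  (uint32_t)(unsigned char)z[n - 1];
}

static void add_keyword(const char *kw)
{
    slots[pack_hash(kw, strlen(kw)) & SLOT_MASK] = kw;  /* assumes no collisions in the fixed set */
}

static int is_keyword(const char *z, size_t n)
{
    const char *candidate = slots[pack_hash(z, n) & SLOT_MASK];
    if (candidate == NULL || candidate[0] != z[0]) return 0;   /* first-char check before full compare */
    return strlen(candidate) == n && memcmp(candidate, z, n) == 0;
}

Whatever the exact reduction is, the final memcmp against the stored keyword is still what keeps the check correct when the input is not in the set.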