MarkShark · Posted March 16, 2022

Hi all! I often find myself needing to look up a string token in a dictionary. Usually, the token is part of a larger string being parsed. Currently I take a copy of that portion of the string using SetString or Copy and use that new string for the lookup. The code looks something like this:

// Not shown: parse some source string and get a pointer to the first char
// of the token and the length of said token.
// Now get a copy of that token so we can look it up in the dictionary.
SetString(TempToken, FTokenStart, FTokenLength);
if not FTokenDictionary.TryGetValue(TempToken, Result) then
  Result := UnknownToken;

I'm wondering if there's any way to avoid the SetString part of this. Thanks for any ideas!
Fr0sT.Brutal · Posted March 16, 2022

It's a hash table, so the string is only needed for computing the hash. You could compute the hash yourself and feed it to the class. The hash is computed by FComparer.GetHashCode(Key), and you can supply that comparer in the constructor of TDictionary. However, you'd have to define your own TKey instead of string (a record with a pointer to the first char and a length, perhaps) and a hashing method for it.

Or you can use string hashes as the keys and search like this:

FTokenDictionary.TryGetValue(HashFromPChar(FTokenStart, FTokenLength), Result)
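A minimal sketch of the hash-as-key variant, with System.Hash's THashFNV1a32 standing in for the hypothetical HashFromPChar; TokenDict, AddKeyword and LookupToken are made-up names, and (as discussed further down the thread) this silently assumes the keyword hashes never collide:

uses
  System.Hash, System.Generics.Collections;

var
  // Maps the FNV-1a hash of each keyword to its token id
  TokenDict: TDictionary<Integer, Integer>;

procedure AddKeyword(const Keyword: string; TokenId: Integer);
begin
  // Hash the keyword exactly the same way the lookup hashes the token slice
  TokenDict.Add(
    THashFNV1a32.GetHashValue(PChar(Keyword)^, Length(Keyword) * SizeOf(Char)),
    TokenId);
end;

function LookupToken(TokenStart: PChar; TokenLength, UnknownToken: Integer): Integer;
begin
  if not TokenDict.TryGetValue(
    THashFNV1a32.GetHashValue(TokenStart^, TokenLength * SizeOf(Char)), Result) then
    Result := UnknownToken;
end;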
A.M. Hoornweg · Posted March 16, 2022

43 minutes ago, MarkShark said:
  I'm wondering if there's any way to avoid the SetString part of this. Thanks for any ideas!

You could make the "key" in your token dictionary a UInt64 instead of a string, but you'd also have to check that your strings have a minimum length of 8 bytes or else you'll get an access violation.

type
  pUInt64 = ^UInt64;

{assuming unicode string}
if (fTokenLength >= 4) then
  if not fTokenDictionary.TryGetValue(pUInt64(@Tokenstart)^, result) then
    ...
Remy Lebeau · Posted March 16, 2022

1 hour ago, A.M. Hoornweg said:
  but you'd also have to check that your strings have a minimum length of 8 bytes or else you'll get an access violation.

Your example is checking for 4 bytes, not 8 bytes. And also, presumably @TokenStart should be FTokenStart, which is already a pointer. But in any case, if the strings are greater than SizeOf(UInt64) in length, you are only using those first 8 bytes for the "key", so you might not end up with unique keys for different string values.
MarkShark · Posted March 16, 2022

Thanks all! I ended up using Frost's suggestion of using the hash as the key, like so:

Hash := THashFNV1a32.GetHashValue(FTokenStart^, FTokenLength * SizeOf(Char));
if not FTokenDictionary.TryGetValue(Hash, Result) then
  Result := UnknownToken;

And it appears to be roughly three times faster than my old version using SetString! I used that particular hash because it looks like that's what Delphi's default string comparer uses.
MarkShark · Posted March 16, 2022

My technique above (using a hash as the key in a TDictionary) seems to work; however, it assumes that hashing my tokens will always give a unique value. Am I correct that that's a problem? Any thoughts on how to handle it?
Remy Lebeau · Posted March 16, 2022

17 minutes ago, MarkShark said:
  My technique above (using a hash as the key in a TDictionary) seems to work, however, it assumes that hashing my tokens will always give a unique value. Am I correct that that's a problem?

Yes. Depending on the hash algorithm used, the risk of hash collisions is small, but not non-existent. In this situation, since you want strings as keys, I would probably opt to define a custom record type that holds a PChar and a length, and then use that record as the dictionary key. You would just need to provide a custom IEqualityComparer implementation to compare the keys. See https://stackoverflow.com/a/27820226/65863 for an example.
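A minimal RTL-only sketch along the lines Remy describes; the record layout and the MakeTokenKeyComparer helper are made up for illustration, and whatever strings back the keys stored in the dictionary have to stay alive for as long as those entries exist:

uses
  System.SysUtils, System.Hash, System.Generics.Defaults,
  System.Generics.Collections;

type
  // A lightweight view into the source string being parsed
  TTokenKey = record
    Start: PChar;
    Len: Integer; // length in characters
  end;

function MakeTokenKeyComparer: IEqualityComparer<TTokenKey>;
begin
  Result := TEqualityComparer<TTokenKey>.Construct(
    function(const L, R: TTokenKey): Boolean
    begin
      // Same length and same character data
      Result := (L.Len = R.Len) and
        CompareMem(L.Start, R.Start, L.Len * SizeOf(Char));
    end,
    function(const V: TTokenKey): Integer
    begin
      // Hash the raw character data, no string copy needed
      Result := THashFNV1a32.GetHashValue(V.Start^, V.Len * SizeOf(Char));
    end);
end;

// Usage sketch:
// FTokenDictionary := TDictionary<TTokenKey, Integer>.Create(MakeTokenKeyComparer);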
MarkShark · Posted March 16, 2022

Excellent, thank you! I realize now that this is what Frost meant in his initial suggestion above; thank you, Remy, for the confirmation and the link. I've done exactly what was suggested, with the record as key and a custom equality comparer. It seems to work great and is still three times faster than my original version. Awesome!
David Heffernan · Posted March 16, 2022

I think I'd write a custom dict for this.
A.M. Hoornweg · Posted March 17, 2022

15 hours ago, Remy Lebeau said:
  Your example is checking for 4 bytes, not 8 bytes.

No, "{assuming unicode string}": 1 char is 2 bytes, so I'm actually testing for 8. The OP stated "Usually, the token is part of a larger string being parsed." So I assumed the OP wanted to do a quick lookup in a token dictionary to figure out whether there are any entries in the actual dictionary starting with that substring.
Stefan Glienke · Posted March 17, 2022

This is how I would solve it: using a record type to preserve the original string, plus a dedicated equality comparer. I'm using Spring4D collections here, but it would be similar with the RTL.

{$APPTYPE CONSOLE}

uses
  Spring.Collections,
  Spring.Comparers,
  System.SysUtils,
  System.StrUtils,
  System.Generics.Defaults;

type
  TTokenKey = record
    s: string;
    p: PChar;
    len: Integer;
    constructor Create(const s: string); overload;
    constructor Create(p: PChar; len: Integer); overload;
  end;

  TTokenKeyComparer = class(TInterfacedObject, IEqualityComparer<TTokenKey>)
  public
    function Equals(const left, right: TTokenKey): Boolean; reintroduce;
    function GetHashCode(const value: TTokenKey): Integer; reintroduce;
  end;

{ TTokenKey }

constructor TTokenKey.Create(const s: string);
begin
  self.s := s;
  p := Pointer(s);
  len := Length(s);
end;

constructor TTokenKey.Create(p: PChar; len: Integer);
begin
  self.p := p;
  self.len := len;
end;

{ TTokenKeyComparer }

function TTokenKeyComparer.Equals(const left, right: TTokenKey): Boolean;
begin
  Result := (left.len = right.len) and
    CompareMem(left.p, right.p, left.len * SizeOf(Char));
end;

function TTokenKeyComparer.GetHashCode(const value: TTokenKey): Integer;
begin
  Result := DefaultHashFunction(value.p^, value.len * SizeOf(Char));
end;

const
  sentence = 'the quick brown fox jumps over the lazy dog';

begin
  var tokens := TCollections.CreateSet<TTokenKey>(TTokenKeyComparer.Create);
  for var s in SplitString(sentence, ' ') do
    tokens.Add(TTokenKey.Create(s));

  Writeln(tokens.Count);
  Writeln(tokens.Contains(TTokenKey.Create(@sentence[5], 5))); // 'quick' -> True
  Writeln(tokens.Contains(TTokenKey.Create(@sentence[6], 4))); // 'uick'  -> False
  Readln;
end.
Fr0sT.Brutal · Posted March 17, 2022

15 minutes ago, Stefan Glienke said:
  This is how I would solve this

Yep, this generic approach is OK; however, it involves a managed record (which pulls a hidden finally section in with it) and the creation of a new record for every comparison. Depending on the usage scenario this could slow things down. I suppose if the creation of a short string was critical, any little detail could also matter.

BTW, Mark, make sure your dictionary lookup routine has no hidden finally section (set a breakpoint on the final "end" of the method, wait for the break, open the ASM view and look for "HandleFinally"). It can significantly reduce performance in frequently called functions, so it should be removed where possible.
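To illustrate the difference being discussed (type names are made up): a record holding a string is a managed type, so the compiler emits an implicit try/finally in any routine that declares a local of that type, while a plain pointer/length record needs no finalization at all.

type
  // Managed: the string field must be finalized, so every routine with a
  // local of this type gets a compiler-generated implicit try/finally.
  TManagedTokenKey = record
    Backing: string;
    Start: PChar;
    Len: Integer;
  end;

  // Unmanaged: just a pointer and an integer, no finalization, no hidden epilogue.
  TPlainTokenKey = record
    Start: PChar;
    Len: Integer;
  end;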
Stefan Glienke · Posted March 17, 2022

7 minutes ago, Fr0sT.Brutal said:
  Yep, this generic approach is OK, however it includes managed record ... and creation of new record for every comparison.

I am very sure that record finalization is far faster than the heap allocations caused by SetString. If it still has a noticeable impact after profiling, one could move the string keeping out of the record into some other structure. The point is that the strings need to be kept alive as long as they are inside the dictionary; how that happens does not matter.
MarkShark · Posted March 17, 2022

@Stefan Glienke Thank you! My code is very close to yours, which is a nice validation for me lol. I also put the "stored" string in the record. (@Frost I also tested keeping the strings in a separate list to avoid the record being managed, but it didn't seem to make any difference when I benchmarked it.) I would be using Spring4D for this in my own code; however, it may end up in SynEdit's highlighters, so I'm trying to keep dependencies down and leverage the RTL as much as possible.
MarkShark · Posted March 17, 2022

I realized that in my benchmarking above I was comparing a case-insensitive dictionary to a case-sensitive one. After correcting that, the performance improvement from this technique is 30%, which is still significant, but not what I reported above, for sure lol.

I do notice something odd. I'm loading about 300 SQL keywords (select, from, etc.) into a TDictionary<string, Integer>, and if I check FKeywords.Collisions I get a result of 90. Looking at the code for TDictionary, it looks like it takes the result of the hash and converts it to a positive integer for the array of buckets (I hope that's the correct terminology). I *think* that's where the collisions come in. It looks like Spring4D doesn't have the Collisions property, so I haven't been able to do a sanity check against it.
Stefan Glienke · Posted March 17, 2022

A little explanation of how the RTL dictionary works:

- As the capacity it always uses a power of two (2^n). That enables a fast calculation from the hashcode to the bucket index by simply bitmasking it (which makes it important to have a good hash function with good distribution in the lower bits of the hashcode; since Delphi 11 the RTL uses FNV-1a, which is quite good).

- Now, how to deal with collisions (and I mean those in the buckets, not in the hashcode itself, because for 300 entries there would be a capacity of 512). When distributing 300 items across 512 slots it is very likely that you get a few collisions. The RTL uses very simple so-called linear probing for that: it keeps looking at the next bucket until it finds a free one. While this provides good locality, it naturally leads to primary clustering, which you can see in the Collisions count. That number tells you how many items don't sit in the slot where their hashcode would have put them. As of 2.0, Spring uses a different hashtable implementation based on a clever algorithm from the Python dictionary, which mitigates this quite nicely.

FWIW, especially if you have a fixed set of strings such as keywords, it might be better not to use a hashtable at all because of the hash function overhead, but rather some handcrafted lookup table. Since keywords most likely always start with a letter, you can do something like DWS does (see https://github.com/EricGrange/DWScript/blob/master/Source/dwsTokenizer.pas#L798): just from the first letter you get a very small array of the keywords starting with that letter, and then you do a linear search, which beats a hashtable any time.
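A rough sketch of that kind of first-letter table; KeywordsByLetter, RegisterKeyword and LookupKeyword are made-up names, and it assumes lowercase ASCII keywords with a case-sensitive comparison, so a real SQL highlighter would still need to fold case first:

uses
  System.SysUtils; // CompareMem, TArray<T>

type
  TKeywordEntry = record
    Keyword: string;
    Token: Integer;
  end;

var
  // One small bucket of keywords per starting letter, built once at startup
  KeywordsByLetter: array['a'..'z'] of TArray<TKeywordEntry>;

procedure RegisterKeyword(const Keyword: string; Token: Integer);
var
  c: Char;
  n: Integer;
begin
  c := Keyword[1]; // assumed to be a lowercase ASCII letter
  n := Length(KeywordsByLetter[c]);
  SetLength(KeywordsByLetter[c], n + 1);
  KeywordsByLetter[c][n].Keyword := Keyword;
  KeywordsByLetter[c][n].Token := Token;
end;

function LookupKeyword(TokenStart: PChar; TokenLength, UnknownToken: Integer): Integer;
var
  c: Char;
  i: Integer;
begin
  c := TokenStart^;
  if (c >= 'a') and (c <= 'z') then
    // Linear search through the handful of keywords sharing the first letter
    for i := 0 to High(KeywordsByLetter[c]) do
      if (Length(KeywordsByLetter[c][i].Keyword) = TokenLength) and
         CompareMem(Pointer(KeywordsByLetter[c][i].Keyword), TokenStart,
           TokenLength * SizeOf(Char)) then
        Exit(KeywordsByLetter[c][i].Token);
  Result := UnknownToken;
end;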