dummzeuch 1505 Posted February 23, 2020 (edited) I think that some functionality in GExperts (and some of my other tools) could profit from switching from sorted TStringLists to a dictionary. Unfortunately I can't just use TDictionary<string, [whatever]> because I want to continue supporting Delphi versions that didn't have generics (in particular Delphi 2007). Does anybody know of a free dictionary implementation that stores pointers indexed by strings? I know of THashedStringList but it has too many gotchas to simply use it (e.g. the hashes get recalculated every time a new item was inserted and IndexOf is called, which is exactly what I would need to do). Yes, I could write one myself, but why reinvent the wheel if there already is one? Edited February 23, 2020 by dummzeuch Share this post Link to post
David Heffernan 2345 Posted February 23, 2020 (edited) Expand a generic dictionary implementation manually. Just replace the TKey and TValue textually. It would also be very valid to stop developing GExperts for pre generic delphi. Edited February 23, 2020 by David Heffernan 1 Share this post Link to post
Alexander Sviridenkov 356 Posted February 23, 2020 http://guildalfa.ru/alsha/node/32 Article is in russian, but at the bottom there is link to implementation file, Share this post Link to post
dummzeuch 1505 Posted February 23, 2020 (edited) 2 hours ago, David Heffernan said: Expand a generic dictionary implementation manually. Just replace the TKey and TValue textually. Even if that was as easy as it sounds, I doubt that this meets the definition of "free implementation". 2 hours ago, David Heffernan said: It would also be very valid to stop developing GExperts for pre generic delphi. Yes, this is what many people tell me. I might even drop some of them, but since I do most of my professional work with Delphi 2007 (the last pre Unicode version), dropping support for it in GExperts would not really be smart. Edited February 23, 2020 by dummzeuch 1 Share this post Link to post
dummzeuch 1505 Posted February 23, 2020 1 hour ago, Alexander Sviridenkov said: http://guildalfa.ru/alsha/node/32 Article is in russian, but at the bottom there is link to implementation file, What's the license of this code? Share this post Link to post
Alexander Sviridenkov 356 Posted February 23, 2020 @dummzeuch written in a unit header - "Free for any use" 1 1 Share this post Link to post
David Heffernan 2345 Posted February 23, 2020 37 minutes ago, dummzeuch said: Even if that was as easy as it sounds, I doubt that this meets the definition of "free implementation". It is easy. And there are open source dictionaries out there. As for whether they are compatible with the GExperts license, I've no idea. Share this post Link to post
Dave Novo 51 Posted February 24, 2020 I would suggest having a look at https://www.amazon.com/Tomes-Delphi-Algorithms-Data-Structures-ebook/dp/B007FKB0EI It has all those basic data structures for pre generic delphi. IIRC the code is all open source. If you can find an old version with the CD you might not even have to type anything in. I will see if I have a copy of the code in the office and will attach it here if I find it. Share this post Link to post
Thijs van Dien 9 Posted February 24, 2020 (edited) Did you consider IniFiles.TStringHash? It has a fixed number of buckets, but you could compose it and implement your own reallocation logic if desired. Also, you could inherit from it to use a better hash should the default not be good enough. Edited February 24, 2020 by Thijs van Dien Share this post Link to post
dummzeuch 1505 Posted February 24, 2020 (edited) 2 hours ago, Thijs van Dien said: Did you consider IniFiles.TStringHash? Yes: 12 hours ago, dummzeuch said: I know of THashedStringList but it has too many gotchas to simply use it I thought of improving it too, but it probably would be more work than writing my own. Just the fact that it's based on TStrings adds overhead that I would like to avoid. (I guess you meant THashedStringlist, because if I remember correctly, TStringHash isn't even exported.) Edited February 24, 2020 by dummzeuch Share this post Link to post
dummzeuch 1505 Posted February 24, 2020 3 hours ago, Dave Novo said: I would suggest having a look at https://www.amazon.com/Tomes-Delphi-Algorithms-Data-Structures-ebook/dp/B007FKB0EI It has all those basic data structures for pre generic delphi. IIRC the code is all open source. If I remember correctly (I bought Tomes of Delphi years ago), the source is not freely available. But it's a good idea, I'll have to check. Share this post Link to post
Uwe Raabe 2057 Posted February 24, 2020 4 hours ago, Dave Novo said: I would suggest having a look at https://www.amazon.com/Tomes-Delphi-Algorithms-Data-Structures-ebook/dp/B007FKB0EI It has all those basic data structures for pre generic delphi. Interestingly that was my first thought, too. Unfortunately I couldn't find any directory implementation described there. Share this post Link to post
luebbe 26 Posted February 24, 2020 (edited) A github search revealed these: - https://github.com/coolsoftware/VHashedStringList - https://github.com/bodo-hugo-barwich/hash-lists I didn't take a closer look. Just dumping the links here. Edited February 24, 2020 by luebbe 1 Share this post Link to post
timfrost 78 Posted February 24, 2020 I have never liked the idea of a hash, and I use an implementation of a ternary seach tree as a dictionary. I based it on an article in Dr Dobb's Journal in 1998, and it has stood the test of time! I have versions for C and Pascal; the latter for AnsiChar, WidwChar 32-bit and WideChar 64-bit keys. It is incredibly fast and easy to use, and has reasonable memory consumption, unless of course you load a million random 100-byte keys (which takes only a second or so). It has various ways to load and look up keys, and to traverse the set; values can be integer, object or string. I use it it in many different ways, both to match a full key and to find the longest initial partial key. On updating a value, the space used by the old value is not recovered, but that suits most of my uses as a dictionary; and the unit does include a function to rebuild the data store for a dynamic application, if you can spare a second to do so. In its simplest form: tree := tTSTtree.Create; tree.insert(key, value); q := tree.match(key); if (q <> nil) then s := string(pchar(q)); The source is reasonably well commented, and I would be happy let you have a copy, but I have never got around to properly documenting it and uploading it somewhere. 1 1 Share this post Link to post
David Heffernan 2345 Posted February 24, 2020 TST, like binary tree, requires there to be an ordering of the key values? Share this post Link to post
dummzeuch 1505 Posted February 24, 2020 4 hours ago, luebbe said: A github search revealed these: - https://github.com/coolsoftware/VHashedStringList - https://github.com/bodo-hugo-barwich/hash-lists I didn't take a closer look. Just dumping the links here. Both look interesting, thanks. Now, I just need to find the time to run timing tests. 😉 Share this post Link to post
dummzeuch 1505 Posted February 24, 2020 2 hours ago, timfrost said: I have never liked the idea of a hash, and I use an implementation of a ternary seach tree as a dictionary. I based it on an article in Dr Dobb's Journal in 1998, and it has stood the test of time! The source is reasonably well commented, and I would be happy let you have a copy, but I have never got around to properly documenting it and uploading it somewhere. Would you please be so kind and send it to me via private message? (Unless of course you prefer to attach it publicly to this thread.) Share this post Link to post
Angus Robertson 574 Posted February 24, 2020 ICS includes a unit OverbyteIcsAvlTrees.pas written by Arno Garrels, from the unit: Implements a fast cache-like data storage based on two linked AVL-Trees for primary and secondary indexing. Primary key of type string has to be unique, secondary key of type TDateTime may have duplicates. AVL or balanced binary trees are extremely efficient data structures for searching data. Finding an element in 65536 requires at most 16 compares. Uses an AVL-Tree as it is described in the book "Algorithms & Data Structures", Prof. Niklaus Wirth. No real dependencies on other ICS units. Angus 1 1 Share this post Link to post
Thijs van Dien 9 Posted February 24, 2020 (edited) 10 hours ago, dummzeuch said: Yes: I thought of improving it too, but it probably would be more work than writing my own. Just the fact that it's based on TStrings adds overhead that I would like to avoid. (I guess you meant THashedStringlist, because if I remember correctly, TStringHash isn't even exported.) TStringHash is a completely different class than THashedStringList and gives you exactly what you asked for: string to Integer (Pointer). I've been happily using it in D6 and it's still publicly available in D2010 (the latest version I have at hand). Sure it might not be the absolute optimal solution, but it will be a noticeable improvement over the status quo, is very understandable, and doesn't need any third party code. Edited February 24, 2020 by Thijs van Dien 1 Share this post Link to post
dummzeuch 1505 Posted February 24, 2020 36 minutes ago, Thijs van Dien said: TStringHash is a completely different class than THashedStringList and gives you exactly what you asked for: string to Integer (Pointer). I've been happily using it in D6 and it's still publicly available in D2010 (the latest version I have at hand). Sure it might not be the absolute optimal solution, but it will be a noticeable improvement over the status quo, is very understandable, and doesn't need any third party code. Hm, I must be doing something wrong. I can't see any improvement of a TStringHash over a sorted TStringList. Adding the same entries to a TStringHash took 3.12 seconds while it took 3.07 for a sorted TStringList. The same took 0.32 seconds for a TDictionary. Looking them up with ValueOf took approximately the same amount of time for TStringHash and sorted TStringList, while for TDictionary it took 0.32 seconds. I tried sizes of 1024 and 10240. This is the timing code: procedure TForm1.DoTiming(sl: TStringHash); const CYCLES = 100; var IdxCycles: Integer; Idx1: Integer; Idx2: Integer; Idx3: Integer; sw: TStopwatch; s: string; Idx: Integer; begin sw := TStopwatch.StartNew; for IdxCycles := 1 to CYCLES do begin for Idx1 := Ord('A') to Ord('Z') do begin for Idx2 := Ord('A') to Ord('Z') do begin for Idx3 := Ord('A') to Ord('Z') do begin s := chr(Idx1) + chr(Idx2) + chr(Idx3); if sl.ValueOf(s) = -1 then sl.Add(s, Idx1 * 10000 + Idx2 * 100 + Idx3); end; end; end; end; sw.Stop; m_Output.Lines.Add('Add: ' + sw.Elapsed.ToString); sw.Reset; sw.Start; for IdxCycles := 1 to CYCLES do begin for Idx1 := Ord('A') to Ord('Z') do begin for Idx2 := Ord('A') to Ord('Z') do begin for Idx3 := Ord('A') to Ord('Z') do begin s := chr(Idx1) + chr(Idx2) + chr(Idx3); Idx := sl.ValueOf(s); Assert(Idx = Idx1 * 10000 + Idx2 * 100 + Idx3); end; end; end; end; m_Output.Lines.Add('IndexOf: ' + sw.Elapsed.ToString); end; The code for TStringList is very similar. It uses AddObject instead of Add and IndexOf instead of ValueOf. procedure Tf_HashedStringListTest.DoTiming(sl: TStringList); const CYCLES = 100; var IdxCycles: Integer; Idx1: Integer; Idx2: Integer; Idx3: Integer; sw: TStopwatch; s: string; Idx: Integer; begin sl.Sorted := True; sl.CaseSensitive := True; sl.Duplicates := dupError; sw := TStopwatch.StartNew; sl.BeginUpdate; for IdxCycles := 1 to CYCLES do begin for Idx1 := Ord('A') to Ord('Z') do begin for Idx2 := Ord('A') to Ord('Z') do begin for Idx3 := Ord('A') to Ord('Z') do begin s := chr(Idx1) + chr(Idx2) + chr(Idx3); if sl.IndexOf(s) = -1 then sl.AddObject(s, Pointer(Idx1 * 10000 + Idx2 * 100 + Idx3)); end; end; end; end; sl.EndUpdate; sw.Stop; m_Output.Lines.Add(sl.Count.ToString + ': Add: ' + sw.Elapsed.ToString); sw.Reset; sw.Start; for IdxCycles := 1 to CYCLES do begin for Idx1 := Ord('A') to Ord('Z') do begin for Idx2 := Ord('A') to Ord('Z') do begin for Idx3 := Ord('A') to Ord('Z') do begin s := chr(Idx1) + chr(Idx2) + chr(Idx3); sl.IndexOf(s); end; end; end; end; m_Output.Lines.Add(sl.Count.ToString + ': IndexOf: ' + sw.Elapsed.ToString); And the one for the TDictionary just calls TryGetValue instead of IndexOf and Add instead of AddObject. I guess it has something to do with the limited length of the strings. Or I am overlooking something else entirely (it has been a long day and I'm not going to test the other solutions right now). Share this post Link to post
Anders Melander 1782 Posted February 24, 2020 1 hour ago, dummzeuch said: Or I am overlooking something else entirely It's a bit hard to see what exactly it is you're benchmarking with that code. Your first loop is supposedly meant to measure Add performance, but it also contains a Lookup. And after the first iteration of the outer loop it just produces duplicates, but then skips all those duplicates. Also your test is too synthetic. Use realistic data instead. I'm guessing your actual data isn't all possible three character strings with perfect distribution... And is sorting the stringlist after each insert really a realistic scenario? If you're really that interested in the performance of your dictionary I suggest you profile with real data instead of this guesswork. Share this post Link to post
Fr0sT.Brutal 900 Posted February 25, 2020 (edited) I believe mORMot has the fastest implementation of everything :)) There also big libs like Griijy , TheUnknownOnes etc. Check https://github.com/Fr0sT-Brutal/awesome-pascal Edited February 25, 2020 by Fr0sT.Brutal 2 Share this post Link to post
dummzeuch 1505 Posted February 25, 2020 18 minutes ago, Fr0sT.Brutal said: I believe mORMot has the fastest implementation of everything :)) There also big libs like Griijy , TheUnknownOnes etc. Check https://github.com/Fr0sT-Brutal/awesome-pascal Hey, I just noted that my dzlib is missing from that list. 😉 Thanks for the hints. 1 Share this post Link to post
dummzeuch 1505 Posted February 25, 2020 (edited) 15 hours ago, Anders Melander said: Your first loop is supposedly meant to measure Add performance, but it also contains a Lookup. And after the first iteration of the outer loop it just produces duplicates, but then skips all those duplicates. Also your test is too synthetic. Use realistic data instead. I'm guessing your actual data isn't all possible three character strings with perfect distribution... You are right about that this is not a good test. My goal was not to create a real performance test for it, just something easy to write that gives me an idea which possible solution to investigate further. But yes, inserting strings in ascending order is definitely going to create some bias. 15 hours ago, Anders Melander said: And is sorting the stringlist after each insert really a realistic scenario? Where am I sorting the string list? It's set to be sorted in the beginning and not changed. Actually I didn't post the code for the string list at all, just for the hash and the dictionary. (OK, i did) Edited February 25, 2020 by dummzeuch Share this post Link to post
Anders Melander 1782 Posted February 25, 2020 1 hour ago, dummzeuch said: Where am I sorting the string list? Here: 17 hours ago, dummzeuch said: procedure Tf_HashedStringListTest.DoTiming(sl: TStringList); ... begin sl.Sorted := True; ... 1 hour ago, dummzeuch said: It's set to be sorted in the beginning and not changed. Actually I didn't post the code for the string list at all, just for the hash and the dictionary. Okay, but setting Sorted=True and then adding to the list will keep the list sorted; It's using binary search to find the insertion point so it's not that bad, but it might be better to add unsorted and then sort at the end. It depends on the dataset. Share this post Link to post