Jump to content
dummzeuch

pre-generic dictionary class

Recommended Posts

I think that some functionality in GExperts (and some of my other tools) could profit from switching from sorted TStringLists to a dictionary. Unfortunately I can't just use TDictionary<string, [whatever]> because I want to continue supporting Delphi versions that didn't have generics (in particular Delphi 2007). Does anybody know of a free dictionary implementation that stores pointers indexed by strings? I know of THashedStringList but it has too many gotchas to simply use it (e.g. the hashes get recalculated every time a new item was inserted and IndexOf is called, which is exactly what I would need to do). Yes, I could write one myself, but why reinvent the wheel if there already is one?

Edited by dummzeuch

Share this post


Link to post

Expand a generic dictionary implementation manually. Just replace the TKey and TValue textually.

 

It would also be very valid to stop developing GExperts for pre generic delphi. 

 

Edited by David Heffernan
  • Like 1

Share this post


Link to post
2 hours ago, David Heffernan said:

Expand a generic dictionary implementation manually. Just replace the TKey and TValue textually.

Even if that was as easy as it sounds, I doubt that this meets the definition of "free implementation".

 

2 hours ago, David Heffernan said:

It would also be very valid to stop developing GExperts for pre generic delphi. 

Yes, this is what many people tell me. I might even drop some of them, but since I do most of my professional work with Delphi 2007 (the last pre Unicode version), dropping support for it in GExperts would not really be smart.

Edited by dummzeuch
  • Like 1

Share this post


Link to post
37 minutes ago, dummzeuch said:

Even if that was as easy as it sounds, I doubt that this meets the definition of "free implementation".

It is easy. And there are open source dictionaries out there. As for whether they are compatible with the GExperts license, I've no idea. 

Share this post


Link to post

I would suggest having a look at

https://www.amazon.com/Tomes-Delphi-Algorithms-Data-Structures-ebook/dp/B007FKB0EI

 

It has all those basic data structures for pre generic delphi.

 

IIRC the code is all open source. If you can find an old version with the CD you might not even have to type anything in. I will see if I have a copy of the code in the office and will attach it here if I find it.

Share this post


Link to post

Did you consider IniFiles.TStringHash? It has a fixed number of buckets, but you could compose it and implement your own reallocation logic if desired. Also, you could inherit from it to use a better hash should the default not be good enough.

Edited by Thijs van Dien

Share this post


Link to post
2 hours ago, Thijs van Dien said:

Did you consider IniFiles.TStringHash?

Yes:

12 hours ago, dummzeuch said:

I know of THashedStringList but it has too many gotchas to simply use it

I thought of improving it too, but it probably would be more work than writing my own. Just the fact that it's based on TStrings adds overhead that I would like to avoid.

 

(I guess you meant THashedStringlist, because if I remember correctly, TStringHash isn't even exported.)

Edited by dummzeuch

Share this post


Link to post
3 hours ago, Dave Novo said:

I would suggest having a look at

https://www.amazon.com/Tomes-Delphi-Algorithms-Data-Structures-ebook/dp/B007FKB0EI

 

It has all those basic data structures for pre generic delphi.

 

IIRC the code is all open source.

If I remember correctly (I bought Tomes of Delphi years ago), the source is not freely available. But it's a good idea, I'll have to check.

Share this post


Link to post

I have never liked the idea of a hash, and I use an implementation of a ternary seach tree as a dictionary. I based it on an article in Dr Dobb's Journal in 1998, and it has stood the test of time!

 

I have versions for C and Pascal; the latter for AnsiChar, WidwChar 32-bit and WideChar 64-bit keys. It is incredibly fast and easy to use, and has reasonable memory consumption, unless of course you load a million random 100-byte keys (which takes only a second or so). It has various ways to load and look up keys, and to traverse the set; values can be integer, object or string. I use it it in many different ways, both to match a full key and to find the longest initial partial key. On updating a value, the space used by the old value is not recovered, but that suits most of my uses as a dictionary; and the unit does include a function to rebuild the data store for a dynamic application, if you can spare a second to do so.  In its simplest form:

 

  tree := tTSTtree.Create;
  tree.insert(key, value);
  q := tree.match(key);
  if (q <> nil) then s := string(pchar(q));
 
The source is reasonably well commented, and I would be happy let you have a copy, but I have never got around to properly documenting it and uploading it somewhere.
 

  • Like 1
  • Thanks 1

Share this post


Link to post
2 hours ago, timfrost said:

I have never liked the idea of a hash, and I use an implementation of a ternary seach tree as a dictionary. I based it on an article in Dr Dobb's Journal in 1998, and it has stood the test of time!

 

The source is reasonably well commented, and I would be happy let you have a copy, but I have never got around to properly documenting it and uploading it somewhere.
 

Would you please be so kind and send it to me via private message? (Unless of course you prefer to attach it publicly to this thread.)

Share this post


Link to post

ICS includes a unit OverbyteIcsAvlTrees.pas written by Arno Garrels, from the unit:

 

Implements a fast cache-like data storage based on two linked AVL-Trees for primary and secondary indexing.  Primary key of type string has to be unique, secondary key of type TDateTime may have duplicates.   AVL or balanced binary trees are extremely efficient data structures for searching data. Finding an element in  65536 requires at most 16 compares.  Uses an AVL-Tree as it is described in the book "Algorithms & Data Structures", Prof. Niklaus Wirth.

 

No real dependencies on other ICS units.

 

Angus

  • Like 1
  • Thanks 1

Share this post


Link to post
10 hours ago, dummzeuch said:

Yes:

I thought of improving it too, but it probably would be more work than writing my own. Just the fact that it's based on TStrings adds overhead that I would like to avoid.

 

(I guess you meant THashedStringlist, because if I remember correctly, TStringHash isn't even exported.)

TStringHash is a completely different class than THashedStringList and gives you exactly what you asked for: string to Integer (Pointer). I've been happily using it in D6 and it's still publicly available in D2010 (the latest version I have at hand). Sure it might not be the absolute optimal solution, but it will be a noticeable improvement over the status quo, is very understandable, and doesn't need any third party code.

Edited by Thijs van Dien
  • Like 1

Share this post


Link to post
36 minutes ago, Thijs van Dien said:

TStringHash is a completely different class than THashedStringList and gives you exactly what you asked for: string to Integer (Pointer). I've been happily using it in D6 and it's still publicly available in D2010 (the latest version I have at hand). Sure it might not be the absolute optimal solution, but it will be a noticeable improvement over the status quo, is very understandable, and doesn't need any third party code.

Hm, I must be doing something wrong. I can't see any improvement of a TStringHash over a sorted TStringList. Adding the same entries to a TStringHash took 3.12 seconds while it took 3.07 for a sorted TStringList. The same took 0.32 seconds for a TDictionary.

Looking them up with ValueOf took approximately the same amount of time for TStringHash and sorted TStringList, while for TDictionary it took 0.32 seconds.

I tried sizes of 1024 and 10240.

 

This is the timing code:

procedure TForm1.DoTiming(sl: TStringHash);
const
  CYCLES = 100;
var
  IdxCycles: Integer;
  Idx1: Integer;
  Idx2: Integer;
  Idx3: Integer;
  sw: TStopwatch;
  s: string;
  Idx: Integer;
begin
  sw := TStopwatch.StartNew;
  for IdxCycles := 1 to CYCLES do begin
    for Idx1 := Ord('A') to Ord('Z') do begin
      for Idx2 := Ord('A') to Ord('Z') do begin
        for Idx3 := Ord('A') to Ord('Z') do begin
          s := chr(Idx1) + chr(Idx2) + chr(Idx3);
          if sl.ValueOf(s) = -1 then
            sl.Add(s, Idx1 * 10000 + Idx2 * 100 + Idx3);
        end;
      end;
    end;
  end;
  sw.Stop;
  m_Output.Lines.Add('Add: ' + sw.Elapsed.ToString);

  sw.Reset;
  sw.Start;
  for IdxCycles := 1 to CYCLES do begin
    for Idx1 := Ord('A') to Ord('Z') do begin
      for Idx2 := Ord('A') to Ord('Z') do begin
        for Idx3 := Ord('A') to Ord('Z') do begin
          s := chr(Idx1) + chr(Idx2) + chr(Idx3);
          Idx := sl.ValueOf(s);
          Assert(Idx = Idx1 * 10000 + Idx2 * 100 + Idx3);
        end;
      end;
    end;
  end;
  m_Output.Lines.Add('IndexOf: ' + sw.Elapsed.ToString);
end;

The code for TStringList is very similar. It uses AddObject instead of Add and IndexOf instead of ValueOf.

procedure Tf_HashedStringListTest.DoTiming(sl: TStringList);
const
  CYCLES = 100;
var
  IdxCycles: Integer;
  Idx1: Integer;
  Idx2: Integer;
  Idx3: Integer;
  sw: TStopwatch;
  s: string;
  Idx: Integer;
begin
  sl.Sorted := True;
  sl.CaseSensitive := True;
  sl.Duplicates := dupError;
  sw := TStopwatch.StartNew;
  sl.BeginUpdate;
  for IdxCycles := 1 to CYCLES do begin
    for Idx1 := Ord('A') to Ord('Z') do begin
      for Idx2 := Ord('A') to Ord('Z') do begin
        for Idx3 := Ord('A') to Ord('Z') do begin
          s := chr(Idx1) + chr(Idx2) + chr(Idx3);
          if sl.IndexOf(s) = -1 then
            sl.AddObject(s, Pointer(Idx1 * 10000 + Idx2 * 100 + Idx3));
        end;
      end;
    end;
  end;
  sl.EndUpdate;
  sw.Stop;
  m_Output.Lines.Add(sl.Count.ToString + ': Add: ' + sw.Elapsed.ToString);

  sw.Reset;
  sw.Start;
  for IdxCycles := 1 to CYCLES do begin
    for Idx1 := Ord('A') to Ord('Z') do begin
      for Idx2 := Ord('A') to Ord('Z') do begin
        for Idx3 := Ord('A') to Ord('Z') do begin
          s := chr(Idx1) + chr(Idx2) + chr(Idx3);
          sl.IndexOf(s);
        end;
      end;
    end;
  end;
  m_Output.Lines.Add(sl.Count.ToString + ': IndexOf: ' + sw.Elapsed.ToString);

And the one for the TDictionary just calls TryGetValue instead of IndexOf and Add instead of AddObject.

 

I guess it has something to do with the limited length of the strings. Or I am overlooking something else entirely (it has been a long day and I'm not going to test the other solutions right now).

Share this post


Link to post
1 hour ago, dummzeuch said:

Or I am overlooking something else entirely

It's a bit hard to see what exactly it is you're benchmarking with that code.

 

Your first loop is supposedly meant to measure Add performance, but it also contains a Lookup. And after the first iteration of the outer loop it just produces duplicates, but then skips all those duplicates.

Also your test is too synthetic. Use realistic data instead. I'm guessing your actual data isn't all possible three character strings with perfect distribution...

And is sorting the stringlist after each insert really a realistic scenario?

 

If you're really that interested in the performance of your dictionary I suggest you profile with real data instead of this guesswork.

Share this post


Link to post
15 hours ago, Anders Melander said:

Your first loop is supposedly meant to measure Add performance, but it also contains a Lookup. And after the first iteration of the outer loop it just produces duplicates, but then skips all those duplicates.

Also your test is too synthetic. Use realistic data instead. I'm guessing your actual data isn't all possible three character strings with perfect distribution...

You are right about that this is not a good test. My goal was not to create a real performance test for it, just something easy to write that gives me an idea which possible solution to investigate further. But yes, inserting strings in ascending order is definitely going to create some bias.

15 hours ago, Anders Melander said:

And is sorting the stringlist after each insert really a realistic scenario?

Where am I sorting the string list? It's set to be sorted in the beginning and not changed. Actually I didn't post the code for the string list at all, just for the hash and the dictionary. (OK, i did)

Edited by dummzeuch

Share this post


Link to post
1 hour ago, dummzeuch said:

Where am I sorting the string list?

Here:

17 hours ago, dummzeuch said:

procedure Tf_HashedStringListTest.DoTiming(sl: TStringList);
...
begin
  sl.Sorted := True;
...

 

 

1 hour ago, dummzeuch said:

It's set to be sorted in the beginning and not changed. Actually I didn't post the code for the string list at all, just for the hash and the dictionary.

Okay, but setting Sorted=True and then adding to the list will keep the list sorted; It's using binary search to find the insertion point so it's not that bad, but it might be better to add unsorted and then sort at the end. It depends on the dataset.

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×