Clément Posted February 14, 2020
Actually I guess it's the other way around! The more expert I get, the more refactoring I do! As @Lars Fosdal said: cryptic code, "WTF was I thinking" code, code that could benefit from the modern RTL if there's no performance penalty. For example, for lPair in SomeObjectList is slower than for i := 0 to SomeObjectList.Count - 1, so I won't refactor it; depending on where the code is placed, I will use one form or the other. Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!
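For illustration, a minimal sketch of the two loop forms (SomeObjectList as a TObjectList<TMyObject> and the Process method are assumptions for the example):

    uses System.Generics.Collections;

    var
      i: Integer;
      lPair: TMyObject;
    begin
      // for-in allocates an enumerator object for each loop
      for lPair in SomeObjectList do
        lPair.Process;

      // the indexed loop is a plain integer counter, no allocation
      for i := 0 to SomeObjectList.Count - 1 do
        SomeObjectList[i].Process;
    end;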
dummzeuch Posted February 14, 2020
29 minutes ago, Clément said:
> Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!
What kind of difference do you mean here? Performance or readability?
Clément Posted February 14, 2020
36 minutes ago, dummzeuch said:
> What kind of difference do you mean here? Performance or readability?
Actually both. I needed to write a small interface between SAP (text files with roughly ten thousand records) and another system that had to check whether an RFID card number was in a list. Depending on some card property (returned by the associated class), I had to call a certain REST API. I was using a simple sorted TStringList; when I switched to TDictionary, the performance boost was noticeable. Routines dropped from minutes to seconds. I replaced:

    idx := fCardList.IndexOf(aCardNumber);
    if idx > -1 then
      lObj := TMyObject(fCardList.Objects[idx]);

with:

    if fCardList.TryGetValue(aCardNumber, lObj) then
      case lObj.RoleType of
        [..]
      end;

Faster and more readable!
9 minutes ago, Dany Marmur said:
> IMHO Readability is key (1), performance is a boon (2). In that order.
It depends. When performance is the key, I usually write a lot of comments to compensate for the lack of readability. In those cases I know the routine must be lean and mean, so I will use the fastest data structure. When the routine is simpler, not called very often, or performance is not an issue, then I move to a "more readable" coding approach.
Bill Meyer Posted February 14, 2020
3 hours ago, dummzeuch said:
> What kind of difference do you mean here? Performance or readability?
From Primož Gabrijelčič's latest book: [two images, not reproduced here]
Clément Posted February 15, 2020
17 hours ago, Bill Meyer said:
> From Primož Gabrijelčič's latest book: [two images, not reproduced here]
TStringList has O(1) direct access (by index) if you are using it unsorted, and that is as fast as it gets. But when used as a dictionary, the TStringList must be sorted, and that's another ball game! When you add an element to a sorted list, it goes through a binary search (O(log n)); the same happens when you delete. When you need a lookup structure, TDictionary is the way to go, as it's always O(1).
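For illustration, a minimal sketch of the two lookup styles (the variable names and the string/integer payload are just for the example):

    uses System.Classes, System.Generics.Collections;

    var
      Names: TStringList;
      Lookup: TDictionary<string, Integer>;
      idx, Value: Integer;
    begin
      Names := TStringList.Create;
      try
        Names.Sorted := True;                  // required for binary search
        Names.AddObject('alpha', TObject(42));
        if Names.Find('alpha', idx) then       // binary search: O(log n)
          Value := Integer(Names.Objects[idx]);
      finally
        Names.Free;
      end;

      Lookup := TDictionary<string, Integer>.Create;
      try
        Lookup.Add('alpha', 42);
        if Lookup.TryGetValue('alpha', Value) then  // hash lookup: O(1) on average
          Writeln(Value);
      finally
        Lookup.Free;
      end;
    end;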
David Heffernan Posted February 15, 2020
2 hours ago, Clément said:
> When you need a lookup structure, TDictionary is the way to go, as it's always O(1).
Not so simple. For small collections the dictionary can be slower.
dummzeuch Posted February 15, 2020
Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course, it should be much faster for significantly large lists.
vfbb Posted February 15, 2020
20 hours ago, David Heffernan said:
> Not so simple. For small collections the dictionary can be slower.
Exactly! The dictionary is only O(1) in the best case; key hashes can collide, and resolving the collisions is a loop that can reach O(3n/4) in the worst case, although that is very unlikely. But what costs even more performance are the grows that can occur as items are added, because the implementation keeps the count at 50%-75% of the capacity, which I consider a high value. To avoid the many grows, and to allow fewer collisions, just create the dictionary with the desired capacity. For example, for a collection of 100 items, I simply start with a capacity of 200:

    LDictionary := TDictionary<TKey, TValue>.Create(200);

Remember, when you call Clear on the dictionary, it also discards the capacity, so you are better off recreating it instead of calling Clear.
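A minimal sketch of that recreate-instead-of-Clear advice (FCache and its type arguments are placeholder names):

    // Instead of FCache.Clear, which also throws away the bucket array,
    // recreate the dictionary with the desired capacity:
    FreeAndNil(FCache);
    FCache := TDictionary<string, TObject>.Create(200);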
David Heffernan Posted February 16, 2020
Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. I don't think the growth is such an issue. Collisions can be, and a better hash and better probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
vfbb Posted February 16, 2020
12 hours ago, David Heffernan said:
> Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. I don't think the growth is such an issue. Collisions can be, and a better hash and better probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
It is not an issue, it is the cost-benefit trade-off of higher performance vs. less memory. But the optimization of creating the dictionary with its capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.

    uses
      System.SysUtils, System.Generics.Collections, System.Diagnostics,
      FMX.Dialogs;

    procedure TestWithInitialCapacity;
    var
      I, J, LNewId: Integer;
      LDictionary: TDictionary<Integer, Boolean>;
      LStopwatch: TStopwatch;
    begin
      LStopwatch := TStopwatch.StartNew;
      for I := 0 to 10000 do
      begin
        LDictionary := TDictionary<Integer, Boolean>.Create(200);
        for J := 0 to 100 do
        begin
          repeat
            LNewId := Random(High(Integer));
          until not LDictionary.ContainsKey(LNewId);
          LDictionary.Add(LNewId, True);
        end;
        FreeAndNil(LDictionary);
      end;
      ShowMessage(Format('Test with initial capacity: %g',
        [LStopwatch.Elapsed.TotalMilliseconds]));
    end;

    procedure TestWithoutInitialCapacity;
    var
      I, J, LNewId: Integer;
      LDictionary: TDictionary<Integer, Boolean>;
      LStopwatch: TStopwatch;
    begin
      LStopwatch := TStopwatch.StartNew;
      for I := 0 to 10000 do
      begin
        LDictionary := TDictionary<Integer, Boolean>.Create;
        for J := 0 to 100 do
        begin
          repeat
            LNewId := Random(High(Integer));
          until not LDictionary.ContainsKey(LNewId);
          LDictionary.Add(LNewId, True);
        end;
        FreeAndNil(LDictionary);
      end;
      ShowMessage(Format('Test without initial capacity: %g',
        [LStopwatch.Elapsed.TotalMilliseconds]));
    end;
David Heffernan Posted February 16, 2020
2 hours ago, vfbb said:
> But the optimization of creating the dictionary with its capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.
I'm not suggesting that you don't preallocate with a sensible capacity if you know it. I would always do that. In multithreaded code especially, minimising heap allocation is always advisable.
Uwe Raabe Posted February 16, 2020
On 2/15/2020 at 7:36 PM, dummzeuch said:
> Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course, it should be much faster for significantly large lists.
Just a note regarding your article "TStringList vs. THashedStringList vs. TDictionary": in Delphi 10.3.3, THashedStringList is no longer used to speed up TMemIniFile. AFAIK it is not used anywhere in the standard libraries and is probably only there to stay compatible. Btw, I agree that it is poorly implemented.
Lars Fosdal Posted February 17, 2020
What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number? Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
Edit: Funnily enough, @Tommi Prami posted an alternative approach.
dummzeuch Posted February 17, 2020
58 minutes ago, Lars Fosdal said:
> What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number? Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
That Wikipedia article links to an article by Thomas Wang, "Prime Double Hash Table", from 1997, which explains the point of choosing a prime number. That sounds plausible to me, unless of course it is outdated by now due to different implementations of hash tables and hash functions.
Stefan Glienke Posted February 17, 2020
On 2/16/2020 at 2:34 AM, David Heffernan said:
> The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
Especially with the not-exactly-highly-optimized hash functions used in System.Generics.Defaults.
vfbb Posted February 17, 2020
4 hours ago, Lars Fosdal said:
> What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number?
It depends on the hash algorithm being used, but in general it will not make a difference.
4 hours ago, Lars Fosdal said:
> Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
For the dictionary implemented in Delphi, the best capacity is always twice the number of items for known collections and 0 for unknown collections.
Stefan Glienke Posted February 17, 2020
The capacity implementation of the RTL dictionary is a bit stupid. Typically, the capacity is the number of items you can store in a collection without a reallocation/grow happening. However, that is not the case here. (In Spring4D we actually implemented it that way: if you want to store 100 items without a grow happening, you pass capacity 100, and the internal mechanism handles overallocating to satisfy the max fill factor.) The RTL implementation takes the passed value, calculates the next power of 2 greater than it, allocates that many buckets, and sets the grow threshold to 75% of that value. In the benchmark code above, that means from the passed 200 it calculates 256 and sets the threshold to 192, so you can store 192 items before it will grow.
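A small sketch of that calculation (my own illustration of the numbers above, not the RTL source):

    // For a requested capacity, compute the bucket count and grow
    // threshold the RTL dictionary ends up with.
    function NextPowerOfTwo(Value: Integer): Integer;
    begin
      Result := 1;
      while Result < Value do
        Result := Result * 2;
    end;

    procedure ShowEffectiveCapacity(Requested: Integer);
    var
      Buckets, Threshold: Integer;
    begin
      Buckets := NextPowerOfTwo(Requested);
      Threshold := (Buckets * 3) div 4;  // 75% max fill factor
      Writeln('Requested ', Requested, ' -> ', Buckets,
        ' buckets, grows after ', Threshold, ' items');
    end;

    // ShowEffectiveCapacity(200) prints:
    //   Requested 200 -> 256 buckets, grows after 192 items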
David Heffernan Posted February 17, 2020
10 hours ago, Lars Fosdal said:
> Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
The latter.
Lars Fosdal Posted February 18, 2020
14 hours ago, David Heffernan said:
> The latter.
Is there a mathematical proof for that?
David Heffernan Posted February 18, 2020
51 minutes ago, Lars Fosdal said:
> Is there a mathematical proof for that?
Strange way round. Wouldn't it be normal for you to prove your prime claim?
Guest Posted February 18, 2020
On 2/17/2020 at 9:44 AM, Lars Fosdal said:
> Fact or fiction?
Fiction, plain and simple.
Lars Fosdal Posted February 18, 2020
18 minutes ago, David Heffernan said:
> Strange way round. Wouldn't it be normal for you to prove your prime claim?
I didn't make a claim. I asked a question. You answered. I asked if there was a mathematical proof that supported your answer. The other thread I mentioned seems to indicate that prime sizes have a purpose, so I'd like to understand. This is off topic for this thread, so please respond in the other thread or in DM.
Guest Posted February 18, 2020
To settle this proof thing for all of us, and especially for me after having said it is fiction: https://stackoverflow.com/questions/5929878/why-is-the-size-127-prime-better-than-128-for-a-hash-table/5930358#5930358
I can't write it better in English; the edit of that answer explains it right on point.
Lars Fosdal Posted February 18, 2020
Which applies to the TDictionary hashing, I see. Thank you, @Kas Ob. It is entirely possible that my prime size cargo cult stems from this article: https://web.archive.org/web/20080515141600/http://www.concentric.net/~Ttwang/tech/primehash.htm which describes a different algorithm than the Delphi one.
Guest Posted February 18, 2020
Just now, Lars Fosdal said:
> It is entirely possible that my prime size cargo cult stems from this article: https://web.archive.org/web/20080515141600/http://www.concentric.net/~Ttwang/tech/primehash.htm which describes a different algorithm than the Delphi one.
No and yes. If you search for "hash table size prime" you will find many pages and many SO answers; most of them recommend a prime size, and they all miss the truth, intentionally or accidentally. The SO answer I pasted explains it right, and even it misses something unintentionally:
> If you have a distribution between 1 and 10,000 that is biased, because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is exactly that prime size.)
My correction to all those resources: if your data is biased, then we are not talking about a hash table. If you expect the data to be x, 2x, 3x, ..., or you can otherwise predict it, then you are not really hashing, and we should be talking about buckets, dictionaries, etc., and there a prime can help; in fact, it is recommended. A hash, by definition, should not be biased.
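To make the biased-keys point concrete, a small sketch (my own illustration, not from the SO answer): keys that are all multiples of 8, dropped into 64 buckets vs. 61 (prime) buckets with a plain modulo "hash". The power-of-two table only ever uses 8 of its buckets; the prime-sized one uses all 61.

    program PrimeBuckets;
    {$APPTYPE CONSOLE}
    uses System.SysUtils;

    procedure CountUsedBuckets(BucketCount: Integer);
    var
      Used: array of Boolean;
      i, UsedCount: Integer;
    begin
      SetLength(Used, BucketCount);             // zero-initialized to False
      for i := 1 to 1000 do
        Used[(i * 8) mod BucketCount] := True;  // biased keys: 8, 16, 24, ...
      UsedCount := 0;
      for i := 0 to BucketCount - 1 do
        if Used[i] then
          Inc(UsedCount);
      Writeln(BucketCount, ' buckets: ', UsedCount, ' used');
    end;

    begin
      CountUsedBuckets(64);  // power of two: prints "64 buckets: 8 used"
      CountUsedBuckets(61);  // prime: prints "61 buckets: 61 used"
    end.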