
Actually, I guess it's the other way around! The more expert I get, the more refactoring I do!
 

As @Lars Fosdal said: cryptic code, "WTF was I thinking" code, and code that could benefit from the modern RTL if there's no performance penalty.
For example, for lPair in SomeObjectList is slower than for i := 0 to SomeObjectList.Count - 1, so I won't refactor that; depending on where the code is placed, I will use one form or the other.
Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!
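
Not from the original post, but a minimal sketch of what that replacement can look like (TItem and the card keys are invented for illustration); a TObjectDictionary can also own and free its values, which a TStringList used as a dictionary leaves to manual cleanup:

uses
  System.SysUtils, System.Generics.Collections;

type
  TItem = class
    Value: Integer;
  end;

procedure ReplaceStringListLookup;
var
  lItems: TObjectDictionary<string, TItem>;
  lItem: TItem;
begin
  // doOwnsValues: the dictionary frees the stored objects itself
  lItems := TObjectDictionary<string, TItem>.Create([doOwnsValues]);
  try
    lItem := TItem.Create;
    lItem.Value := 42;
    lItems.Add('card-0001', lItem);

    // O(1) average lookup instead of TStringList.IndexOf + Objects[] cast
    if lItems.TryGetValue('card-0001', lItem) then
      Writeln(lItem.Value);
  finally
    lItems.Free; // also frees the contained TItem instances
  end;
end;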
 

 

29 minutes ago, Clément said:

Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!

What kind of difference do you mean here? Performance or readability?

36 minutes ago, dummzeuch said:

What kind of difference do you mean here? Performance or readability?

Actually both. I needed to write a small interface between SAP (text files with roughly ten thousand records) and another system that needed to check whether an RFID card number was in a list. Depending on some card property (returned by the associated class), I had to call a certain REST API. I was using a simple sorted TStringList; when I switched to TDictionary the performance boost was noticeable. Routines dropped from minutes to seconds.

I replaced:
idx := fCardList.IndexOf(aCardNumber);
if idx > -1 then
  lObj := TMyObject(fCardList.Objects[idx]);

with

 

if fCardList.TryGetValue(aCardNumber, lObj) then
  case lObj.RoleType of
    [..]
  end;

Faster and more readable!

9 minutes ago, Dany Marmur said:

IMHO Readability is key (1) performance is a boon (2). In that order.

It depends. When performance is key, I usually write a lot of comments to compensate for the lack of readability. In those cases I know the routine must be lean and mean, so I will use the fastest data structure. When the routine is simpler, not called very often, or performance is not an issue, then I move to a "more readable" coding approach.

3 hours ago, dummzeuch said:

What kind of difference do you mean here? Performance or readability?

From Primož Gabrijelčič's latest book:

[image from the book]

vs.

[image from the book]

17 hours ago, Bill Meyer said:

From Primož Gabrijelčič's latest book:

[image from the book]

vs.

[image from the book]


TStringList has direct access of O(1) if you access it unsorted, and that is as fast as it gets. But when used as a dictionary, the TStringList must be sorted, and that's another ball game!
When you add an element to a sorted list it first goes through a binary search (O(log n)) to find the insertion point, and the elements after it have to be moved up; the same happens when you delete.

 

When you need a lookup structure, TDictionary is the way to go as it's always O(1).
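
Not from the thread, but a rough sketch comparing the two lookup styles (the names and the item count are arbitrary); a sorted TStringList binary-searches inside IndexOf, while the dictionary hashes the key:

uses
  System.SysUtils, System.Classes, System.Generics.Collections, System.Diagnostics;

procedure CompareLookups;
const
  N = 100000;
var
  I, LValue: Integer;
  LList: TStringList;
  LDict: TDictionary<string, Integer>;
  LStopwatch: TStopwatch;
begin
  LList := TStringList.Create;
  LDict := TDictionary<string, Integer>.Create(2 * N);
  try
    LList.Sorted := True; // IndexOf now uses a binary search, O(log n)
    for I := 0 to N - 1 do begin
      LList.Add(IntToStr(I));
      LDict.Add(IntToStr(I), I);
    end;

    LStopwatch := TStopwatch.StartNew;
    for I := 0 to N - 1 do
      if LList.IndexOf(IntToStr(I)) < 0 then
        raise Exception.Create('key not found');
    Writeln('Sorted TStringList: ', LStopwatch.ElapsedMilliseconds, ' ms');

    LStopwatch := TStopwatch.StartNew;
    for I := 0 to N - 1 do
      if not LDict.TryGetValue(IntToStr(I), LValue) then
        raise Exception.Create('key not found');
    Writeln('TDictionary:        ', LStopwatch.ElapsedMilliseconds, ' ms');
  finally
    LDict.Free;
    LList.Free;
  end;
end;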

 

2 hours ago, Clément said:


When you need a lookup structure, TDictionary is the way to go as it's always O(1).


 

Not so simple. For small collections the dictionary can be slower. 


Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course it should be much faster for significantly large lists.

20 hours ago, David Heffernan said:

Not so simple. For small collections the dictionary can be slower. 

Exactly! The dictionary is only O(1) in the best case, as there can be many collisions of the key hashes, and this creates a loop that can approach O(n * 3/4) in the worst case, although that is very unlikely. But what costs even more performance are the grows that can occur as items are added, because the implementation keeps the count at 50%-75% of the capacity, which I consider a high fill factor.

But to avoid the many grows and to allow fewer collisions, just create it with the desired capacity. For example, for a collection of 100 items, I simply start with a capacity of 200.

 

LDictionary := TDictionary<TKey, TValue>.Create(200);

 

Remember that when you call Clear on the dictionary it also releases the capacity, so you are better off recreating the dictionary instead of calling Clear.
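
A tiny sketch of that pattern, assuming the Clear behaviour described above (the key/value types and the capacity of 200 are placeholders):

uses
  System.SysUtils, System.Generics.Collections;

procedure ReuseDictionary(var ADict: TDictionary<Integer, string>);
begin
  // ADict.Clear;  // would also drop the preallocated capacity

  // Recreate with the same capacity instead, so the next batch of Adds
  // does not pay for the grows again:
  FreeAndNil(ADict);
  ADict := TDictionary<Integer, string>.Create(200);
end;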

Edited by vfbb


Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. 

 

I don't think the growth is such an issue. Collisions can be, and a better hash and probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.

Edited by David Heffernan

12 hours ago, David Heffernan said:

Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. 

 

I don't think the growth is such an issue. Collisions can be, and a better hash and probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.

It is not an issue; it is the cost-benefit trade-off of higher performance versus less memory. But the optimization of creating it with the capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.

uses
  System.SysUtils, System.Generics.Collections, System.Diagnostics, FMX.Dialogs;

procedure TestWithInitialCapacity;
var
  I, J, LNewId: Integer;
  LDictionary: TDictionary<Integer, Boolean>;
  LStopwatch: TStopwatch;
begin
  LStopwatch := TStopwatch.StartNew;
  for I := 0 to 10000 do begin
    // preallocate roughly twice the number of items that will be added
    LDictionary := TDictionary<Integer, Boolean>.Create(200);
    for J := 0 to 100 do begin
      // generate a random key that is not yet in the dictionary
      repeat
        LNewId := Random(High(Integer));
      until not LDictionary.ContainsKey(LNewId);
      LDictionary.Add(LNewId, True);
    end;
    FreeAndNil(LDictionary);
  end;
  ShowMessage(Format('Test with initial capacity: %g', [LStopwatch.Elapsed.TotalMilliseconds]));
end;

procedure TestWithoutInitialCapacity;
var
  I, J, LNewId: Integer;
  LDictionary: TDictionary<Integer, Boolean>;
  LStopwatch: TStopwatch;
begin
  LStopwatch := TStopwatch.StartNew;
  for I := 0 to 10000 do begin
    // default capacity: the dictionary grows several times while filling up
    LDictionary := TDictionary<Integer, Boolean>.Create;
    for J := 0 to 100 do begin
      repeat
        LNewId := Random(High(Integer));
      until not LDictionary.ContainsKey(LNewId);
      LDictionary.Add(LNewId, True);
    end;
    FreeAndNil(LDictionary);
  end;
  ShowMessage(Format('Test without initial capacity: %g', [LStopwatch.Elapsed.TotalMilliseconds]));
end;

 

 

2 hours ago, vfbb said:

But the optimization of creating it with the capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.

I'm not suggesting that you don't preallocate with a sensible capacity if you know it. I would always do that. In multithreaded code especially, minimising heap allocation is always advisable. 

On 2/15/2020 at 7:36 PM, dummzeuch said:

Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course it should be much faster for significantly large lists.

Just a note regarding your article TStringList vs. THashedStringList vs. TDictionary: in Delphi 10.3.3, THashedStringList is no longer used to speed up TMemIniFile. AFAIK it is not used anywhere in the standard libraries and is probably only there to stay compatible.

 

Btw, I agree that it is poorly implemented.


What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number?

Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure whether that is because it is sound advice or because it is a cargo-cult thing.

 

Edit: Funnily enough, @Tommi Prami posted an alternative approach.

 

58 minutes ago, Lars Fosdal said:

What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number?

Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure whether that is because it is sound advice or because it is a cargo-cult thing.

That Wikipedia article links to Prime Double Hash Table, a 1997 article by Thomas Wang which explains the point of choosing a prime number. That sounds plausible to me, unless of course it is outdated by now due to different implementations of hash tables and hash functions.

 

On 2/16/2020 at 2:34 AM, David Heffernan said:

The issue for small collections is simply the constant of proportionality. O(1) can be slower than O(log n) for small enough n. 

Especially with the not particularly optimized hash functions used in System.Generics.Defaults.

4 hours ago, Lars Fosdal said:

What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number?

It depends on the hash algorithm being used, but in general it will not make a difference.

4 hours ago, Lars Fosdal said:

Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure whether that is because it is sound advice or because it is a cargo-cult thing.

For the dictionary implemented in Delphi, the best capacity is always twice the number of items for collections of known size, and 0 for collections of unknown size.

 

Edited by vfbb


The capacity implementation of the RTL dictionary is a bit stupid. Typically the capacity is the number of items you can store in a collection without a reallocation/grow happening.

However, that is not the case here. (In Spring4D we actually implemented it that way: if you want to store 100 items without a grow happening, you pass capacity 100, and the internal mechanism handles overallocating to satisfy the maximum fill factor.)

 

The RTL implementation takes the passed value, calculates the next power of 2 greater than it, allocates that many buckets, and sets the grow threshold to 75% of that value.

In the benchmark code above that means that from the passed 200 it calculates 256 and sets the threshold to 192, so you can store 192 items before it will grow.
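
A back-of-the-envelope model of that calculation (not the actual RTL code):

// Rough model of the behaviour described above: round the requested
// capacity up to a power of two and take 75% of it as the number of
// items that fit before the dictionary grows.
function GrowThreshold(ARequestedCapacity: Integer): Integer;
var
  LBuckets: Integer;
begin
  LBuckets := 4;
  while LBuckets < ARequestedCapacity do
    LBuckets := LBuckets * 2;
  Result := (LBuckets * 3) div 4;
end;

// GrowThreshold(200) = 192: 200 is rounded up to 256 buckets,
// and 75% of 256 is 192 items before the next grow.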

Edited by Stefan Glienke
10 hours ago, Lars Fosdal said:

Personally, I do actually set the capacity to a prime, but I am not quite sure whether that is because it is sound advice or because it is a cargo-cult thing.

The latter. 

51 minutes ago, Lars Fosdal said:

Is there mathematical proof for that?

Strange way round. Wouldn't it be normal for you to prove your prime claim? 

Guest
On 2/17/2020 at 9:44 AM, Lars Fosdal said:

Fact or fiction?

Fiction, plain and simple.

 

18 minutes ago, David Heffernan said:

Strange way round. Wouldn't it be normal for you to prove your prime claim? 

I didn't make a claim.  I asked a question. You answered. I asked if there was a mathematical proof that supported your answer.

The other thread I mentioned seems to indicate that prime sizes have a purpose, so I'd like to understand.

This is off topic for this thread, so please respond in the other thread or in DM.

Guest
Just now, Lars Fosdal said:

It is entirely possible that my prime-size cargo cult stems from this article
https://web.archive.org/web/20080515141600/http://www.concentric.net/~Ttwang/tech/primehash.htm

which describes a different algorithm than the Delphi one.

No and Yes.

 

If you search for "hash table size prime" you will find many pages and many SO answers; most of them recommend a prime size, and they all miss the truth, intentionally or accidentally. The SO answer I pasted explains it right, though it too misses something unintentionally:

Quote

If you have a distribution between 1 and 10,000 that is biased because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is exactly that prime size.)

And my correction to all those resources would be: if you have data that is biased, then we are not really talking about a hash table. If you expect the data to be x, 2x, 3x, ..., or you can otherwise predict it, then you are not really hashing, and we should instead talk about buckets, dictionaries, etc.; there a prime size can help, in fact it is recommended.

A hash, by definition, should not be biased.
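
The {x, 2x, 3x, ...} case is easy to demonstrate with a throwaway sketch; the stride of 32 and the table sizes below are arbitrary choices:

// Counts how many distinct buckets the keys x, 2x, 3x, ... land in when
// the raw key is reduced modulo the table size (i.e. no real hashing).
function DistinctBuckets(AStride, ATableSize, ACount: Integer): Integer;
var
  I: Integer;
  LHit: array of Boolean;
begin
  SetLength(LHit, ATableSize);
  for I := 1 to ACount do
    LHit[(I * AStride) mod ATableSize] := True;
  Result := 0;
  for I := 0 to ATableSize - 1 do
    if LHit[I] then
      Inc(Result);
end;

// With a power-of-two table the multiples of 32 pile up in 8 buckets;
// with the neighbouring prime they spread across all of them:
//   DistinctBuckets(32, 256, 1000) = 8
//   DistinctBuckets(32, 257, 1000) = 257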

