Clément Posted February 14, 2020
Actually I guess it's the other way around! The more expert I get, the more refactoring I do! As @Lars Fosdal said: cryptic code, "WTF was I thinking" code, code that could benefit from the modern RTL if there's no performance penalty. For example, for lPair in SomeObjectList is slower than for i := 0 to SomeObjectList.Count - 1, so I won't refactor it; depending on where the code is placed, I will use one form or the other. Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!
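For illustration, a minimal sketch of the two loop forms (SomeObjectList as a TObjectList<TMyObject> and the Process method are assumptions for the example):

    uses System.Generics.Collections;

    var
      i: Integer;
      lPair: TMyObject;
    begin
      // for-in allocates an enumerator object for each loop
      for lPair in SomeObjectList do
        lPair.Process;

      // the indexed loop is a plain integer counter, no allocation
      for i := 0 to SomeObjectList.Count - 1 do
        SomeObjectList[i].Process;
    end;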
dummzeuch Posted February 14, 2020
29 minutes ago, Clément said:
> Replacing a TStringList (used as a dictionary) with a TDictionary/TObjectDictionary makes a huge difference!
What kind of difference do you mean here? Performance or readability?
Clément Posted February 14, 2020
36 minutes ago, dummzeuch said:
> What kind of difference do you mean here? Performance or readability?
Actually both. I needed to write a small interface between SAP (text files with roughly ten thousand records) and another system that had to check whether an RFID card number was in a list. Depending on some card property (returned by the associated class), I had to call a certain REST API. I was using a simple sorted TStringList; when I switched to TDictionary, the performance boost was noticeable. Routines dropped from minutes to seconds. I replaced:

    idx := fCardList.IndexOf(aCardNumber);
    if idx > -1 then
      lObj := TMyObject(fCardList.Objects[idx]);

with:

    if fCardList.TryGetValue(aCardNumber, lObj) then
      case lObj.RoleType of
        [..]
      end;

Faster and more readable!
9 minutes ago, Dany Marmur said:
> IMHO Readability is key (1), performance is a boon (2). In that order.
It depends. When performance is the key, I usually write a lot of comments to compensate for the lack of readability. In those cases I know the routine must be lean and mean, so I will use the fastest data structure. When the routine is simpler, not called very often, or performance is not an issue, then I move to a "more readable" coding approach.
Bill Meyer Posted February 14, 2020
3 hours ago, dummzeuch said:
> What kind of difference do you mean here? Performance or readability?
From Primož Gabrijelčič's latest book: [two images, not reproduced here]
Clément Posted February 15, 2020
17 hours ago, Bill Meyer said:
> From Primož Gabrijelčič's latest book: [two images, not reproduced here]
TStringList has O(1) direct access (by index) if you are using it unsorted, and that is as fast as it gets. But when used as a dictionary, the TStringList must be sorted, and that's another ball game! When you add an element to a sorted list, it goes through a binary search (O(log n)); the same happens when you delete. When you need a lookup structure, TDictionary is the way to go, as it's always O(1).
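For illustration, a minimal sketch of the two lookup styles (the variable names and the string/integer payload are just for the example):

    uses System.Classes, System.Generics.Collections;

    var
      Names: TStringList;
      Lookup: TDictionary<string, Integer>;
      idx, Value: Integer;
    begin
      Names := TStringList.Create;
      try
        Names.Sorted := True;                  // required for binary search
        Names.AddObject('alpha', TObject(42));
        if Names.Find('alpha', idx) then       // binary search: O(log n)
          Value := Integer(Names.Objects[idx]);
      finally
        Names.Free;
      end;

      Lookup := TDictionary<string, Integer>.Create;
      try
        Lookup.Add('alpha', 42);
        if Lookup.TryGetValue('alpha', Value) then  // hash lookup: O(1) on average
          Writeln(Value);
      finally
        Lookup.Free;
      end;
    end;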
David Heffernan Posted February 15, 2020
2 hours ago, Clément said:
> When you need a lookup structure, TDictionary is the way to go, as it's always O(1).
Not so simple. For small collections the dictionary can be slower.
dummzeuch Posted February 15, 2020
Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course, it should be much faster for significantly large lists.
vfbb Posted February 15, 2020
20 hours ago, David Heffernan said:
> Not so simple. For small collections the dictionary can be slower.
Exactly! The dictionary is only O(1) in the best case; key hashes can collide, and resolving the collisions is a loop that can reach O(3n/4) in the worst case, although that is very unlikely. But what costs even more performance are the grows that can occur as items are added, because the implementation keeps the count at 50%-75% of the capacity, which I consider a high value. To avoid the many grows, and to allow fewer collisions, just create the dictionary with the desired capacity. For example, for a collection of 100 items, I simply start with a capacity of 200:

    LDictionary := TDictionary<TKey, TValue>.Create(200);

Remember, when you call Clear on the dictionary, it also discards the capacity, so you are better off recreating it instead of calling Clear.
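A minimal sketch of that recreate-instead-of-Clear advice (FCache and its type arguments are placeholder names):

    // Instead of FCache.Clear, which also throws away the bucket array,
    // recreate the dictionary with the desired capacity:
    FreeAndNil(FCache);
    FCache := TDictionary<string, TObject>.Create(200);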
David Heffernan Posted February 16, 2020
Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. I don't think the growth is such an issue. Collisions can be, and a better hash and better probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
vfbb Posted February 16, 2020
12 hours ago, David Heffernan said:
> Ofc, you don't have to use the RTL dictionary. The spring4d dictionary has a number of powerful features. I don't think the growth is such an issue. Collisions can be, and a better hash and better probing would help. The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
It is not an issue, it is the cost-benefit trade-off of higher performance vs. less memory. But the optimization of creating the dictionary with its capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.

    uses
      System.SysUtils, System.Generics.Collections, System.Diagnostics,
      FMX.Dialogs;

    procedure TestWithInitialCapacity;
    var
      I, J, LNewId: Integer;
      LDictionary: TDictionary<Integer, Boolean>;
      LStopwatch: TStopwatch;
    begin
      LStopwatch := TStopwatch.StartNew;
      for I := 0 to 10000 do
      begin
        LDictionary := TDictionary<Integer, Boolean>.Create(200);
        for J := 0 to 100 do
        begin
          repeat
            LNewId := Random(High(Integer));
          until not LDictionary.ContainsKey(LNewId);
          LDictionary.Add(LNewId, True);
        end;
        FreeAndNil(LDictionary);
      end;
      ShowMessage(Format('Test with initial capacity: %g',
        [LStopwatch.Elapsed.TotalMilliseconds]));
    end;

    procedure TestWithoutInitialCapacity;
    var
      I, J, LNewId: Integer;
      LDictionary: TDictionary<Integer, Boolean>;
      LStopwatch: TStopwatch;
    begin
      LStopwatch := TStopwatch.StartNew;
      for I := 0 to 10000 do
      begin
        LDictionary := TDictionary<Integer, Boolean>.Create;
        for J := 0 to 100 do
        begin
          repeat
            LNewId := Random(High(Integer));
          until not LDictionary.ContainsKey(LNewId);
          LDictionary.Add(LNewId, True);
        end;
        FreeAndNil(LDictionary);
      end;
      ShowMessage(Format('Test without initial capacity: %g',
        [LStopwatch.Elapsed.TotalMilliseconds]));
    end;
David Heffernan Posted February 16, 2020
2 hours ago, vfbb said:
> But the optimization of creating the dictionary with its capacity is valid, considering that the performance increase is on average 30%, both for small collections and for large ones.
I'm not suggesting that you don't preallocate with a sensible capacity if you know it. I would always do that. In multithreaded code especially, minimising heap allocation is always advisable.
Uwe Raabe Posted February 16, 2020
On 2/15/2020 at 7:36 PM, dummzeuch said:
> Oh, TDictionary is hash table based. I wasn't aware of that. In that case, of course, it should be much faster for significantly large lists.
Just a note regarding your article "TStringList vs. THashedStringList vs. TDictionary": in Delphi 10.3.3, THashedStringList is no longer used to speed up TMemIniFile. AFAIK it is not used anywhere in the standard libraries and is probably only there to stay compatible. Btw, I agree that it is poorly implemented.
Lars Fosdal Posted February 17, 2020
What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number? Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
Edit: Funnily enough, @Tommi Prami posted an alternative approach.
dummzeuch Posted February 17, 2020
58 minutes ago, Lars Fosdal said:
> What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number? Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
That Wikipedia article links to an article by Thomas Wang, "Prime Double Hash Table", from 1997, which explains the point of choosing a prime number. That sounds plausible to me, unless of course it is outdated by now due to different implementations of hash tables and hash functions.
Stefan Glienke Posted February 17, 2020
On 2/16/2020 at 2:34 AM, David Heffernan said:
> The issue for small collections is simply the constant of proportionality: O(1) can be slower than O(log n) for small enough n.
Especially with the not-exactly-highly-optimized hash functions used in System.Generics.Defaults.
vfbb Posted February 17, 2020
4 hours ago, Lars Fosdal said:
> What about the theory that dictionaries have fewer collisions if you set the capacity to be a prime number?
It depends on the hash algorithm being used, but in general it will not make a difference.
4 hours ago, Lars Fosdal said:
> Fact or fiction? Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
For the dictionary implemented in Delphi, the best capacity is always twice the number of items for known collections and 0 for unknown collections.
Stefan Glienke Posted February 17, 2020
The capacity implementation of the RTL dictionary is a bit stupid. Typically, the capacity is the number of items you can store in a collection without a reallocation/grow happening. However, that is not the case here. (In Spring4D we actually implemented it that way: if you want to store 100 items without a grow happening, you pass capacity 100, and the internal mechanism handles overallocating to satisfy the max fill factor.) The RTL implementation takes the passed value, calculates the next power of 2 greater than it, allocates that many buckets, and sets the grow threshold to 75% of that value. In the benchmark code above, that means from the passed 200 it calculates 256 and sets the threshold to 192, so you can store 192 items before it will grow.
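A small sketch of that calculation (my own illustration of the numbers above, not the RTL source):

    // For a requested capacity, compute the bucket count and grow
    // threshold the RTL dictionary ends up with.
    function NextPowerOfTwo(Value: Integer): Integer;
    begin
      Result := 1;
      while Result < Value do
        Result := Result * 2;
    end;

    procedure ShowEffectiveCapacity(Requested: Integer);
    var
      Buckets, Threshold: Integer;
    begin
      Buckets := NextPowerOfTwo(Requested);
      Threshold := (Buckets * 3) div 4;  // 75% max fill factor
      Writeln('Requested ', Requested, ' -> ', Buckets,
        ' buckets, grows after ', Threshold, ' items');
    end;

    // ShowEffectiveCapacity(200) prints:
    //   Requested 200 -> 256 buckets, grows after 192 items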
David Heffernan Posted February 17, 2020
10 hours ago, Lars Fosdal said:
> Personally, I do actually set the capacity to a prime, but I am not quite sure if it is because it is sound advice or if it is a cargo cult thing.
The latter.
Lars Fosdal Posted February 18, 2020
14 hours ago, David Heffernan said:
> The latter.
Is there a mathematical proof for that?
David Heffernan Posted February 18, 2020
51 minutes ago, Lars Fosdal said:
> Is there a mathematical proof for that?
Strange way round. Wouldn't it be normal for you to prove your prime claim?
Guest Posted February 18, 2020
On 2/17/2020 at 9:44 AM, Lars Fosdal said:
> Fact or fiction?
Fiction, plain and simple.
Lars Fosdal Posted February 18, 2020
18 minutes ago, David Heffernan said:
> Strange way round. Wouldn't it be normal for you to prove your prime claim?
I didn't make a claim. I asked a question. You answered. I asked if there was a mathematical proof that supported your answer. The other thread I mentioned seems to indicate that prime sizes have a purpose, so I'd like to understand. This is off topic for this thread, so please respond in the other thread or in DM.
Guest Posted February 18, 2020
To settle this proof thing for all of us, and especially for me after having said it is fiction: https://stackoverflow.com/questions/5929878/why-is-the-size-127-prime-better-than-128-for-a-hash-table/5930358#5930358
I can't write it better in English; the edit of that answer explains it right on point.
Lars Fosdal Posted February 18, 2020
Which applies to the TDictionary hashing, I see. Thank you, @Kas Ob. It is entirely possible that my prime size cargo cult stems from this article: https://web.archive.org/web/20080515141600/http://www.concentric.net/~Ttwang/tech/primehash.htm which describes a different algorithm than the Delphi one.
Guest Posted February 18, 2020
Just now, Lars Fosdal said:
> It is entirely possible that my prime size cargo cult stems from this article: https://web.archive.org/web/20080515141600/http://www.concentric.net/~Ttwang/tech/primehash.htm which describes a different algorithm than the Delphi one.
No and yes. If you search for "hash table size prime" you will find many pages and many SO answers; most of them recommend a prime size, and they all miss the truth, intentionally or accidentally. The SO answer I pasted explains it right, and even it misses something unintentionally:
> If you have a distribution between 1 and 10,000 that is biased, because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is exactly that prime size.)
My correction to all those resources: if your data is biased, then we are not talking about a hash table. If you expect the data to be x, 2x, 3x, ..., or you can otherwise predict it, then you are not really hashing, and we should be talking about buckets, dictionaries, etc., and there a prime can help; in fact, it is recommended. A hash, by definition, should not be biased.
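To make the biased-keys point concrete, a small sketch (my own illustration, not from the SO answer): keys that are all multiples of 8, dropped into 64 buckets vs. 61 (prime) buckets with a plain modulo "hash". The power-of-two table only ever uses 8 of its buckets; the prime-sized one uses all 61.

    program PrimeBuckets;
    {$APPTYPE CONSOLE}
    uses System.SysUtils;

    procedure CountUsedBuckets(BucketCount: Integer);
    var
      Used: array of Boolean;
      i, UsedCount: Integer;
    begin
      SetLength(Used, BucketCount);             // zero-initialized to False
      for i := 1 to 1000 do
        Used[(i * 8) mod BucketCount] := True;  // biased keys: 8, 16, 24, ...
      UsedCount := 0;
      for i := 0 to BucketCount - 1 do
        if Used[i] then
          Inc(UsedCount);
      Writeln(BucketCount, ' buckets: ', UsedCount, ' used');
    end;

    begin
      CountUsedBuckets(64);  // power of two: prints "64 buckets: 8 used"
      CountUsedBuckets(61);  // prime: prints "61 buckets: 61 used"
    end.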