Stefan Glienke 2009 Posted June 14, 2021 https://delphisorcery.blogspot.com/2021/06/spring4d-20-sneak-peek-evolution-of.html 11 5 Share this post Link to post
Lars Fosdal 1792 Posted June 15, 2021 Interesting read. About avoiding RTTI as much as possible - how is this achieved? Edit: Or rather - what tradeoffs were necessary to eliminate the RTTI dependencies? Share this post Link to post
mvanrijnen 123 Posted June 15, 2021 there is a strange comment post at your blog Share this post Link to post
Arnaud Bouchez 407 Posted June 15, 2021 Nice reading. If you make some performance comparisons with other libraries, ensure you use the latest mORMot 2 source code - not mORMot 1, but mORMot 2. Some rewrite and optimization have been included in latest mORMot 2. And consider using TDynArrayHashed instead of TSynDictionary if you want a fair comparison - TSynDictionary has a lock to be thread-safe, and copy the retrieved data at lookup so it is not fair comparison with regular mapping classes. Share this post Link to post
David Heffernan 2347 Posted June 15, 2021 36 minutes ago, Arnaud Bouchez said: Nice reading. If you make some performance comparisons with other libraries, ensure you use the latest mORMot 2 source code - not mORMot 1, but mORMot 2. Some rewrite and optimization have been included in latest mORMot 2. And consider using TDynArrayHashed instead of TSynDictionary if you want a fair comparison - TSynDictionary has a lock to be thread-safe, and copy the retrieved data at lookup so it is not fair comparison with regular mapping classes. Are the collections in mORMot generic? Because if not then there's no way to make a fair comparison. Share this post Link to post
Arnaud Bouchez 407 Posted June 15, 2021 (edited) TDynArrayHashed is a wrapper around an existing dynamic array. It doesn't hold any data but works with TArray<T>. You could even maintain several hashes for several fields of the array item at the same time. TSynDictionary is a class holding two dynamic arrays defined at runtime (one for the keys, one for the values). So it is closer to the regular collections. But it has additional features (like thread safety, JSON or binary serialization, deprecation after timeout, embedded data search...), which could make the comparison unfair: it does more than lookups. The default syntax is using old fashioned TypeInfo() but you could wrap it into generics syntax instead, using TypeInfo(T). So I guess the comparison is fair. And the executable size would not be bloated because there is no generic involved for the actual implementation of the TSynDictionary class in mORMot. Removing generic/parametrized code as soon as possible is what every library is trying to do, both the Delphi RTL and Delphi Spring. The RTL and Delphi String were rewritten several times to keep the generic syntax as high as possible, and try to leverage compiler magic as much as possible to avoid duplicated or dead code. This is what String4D 2.0 is all about. mORMot just allows to avoid duplicated or dead code regeneration completely. Edited June 15, 2021 by Arnaud Bouchez Share this post Link to post
Vincent Parrett 754 Posted June 15, 2021 A few months ago I replaced all RTL collections (apart from TStringList) with spring4d 2.0 collections and I saw noticeably smaller binaries (exe/bpl) and faster overall application performance. Spring4D collections are just so much nicer to use, being interface based is a plus for me, and the LINQ like functionality makes it really easy to do sorting, filtering, projections etc - if only Delphi had lambdas so the predicates etc were not so damned verbose! I'm extremely thankful for the work @Stefan Glienke has put into Spring4D 👍 12 2 Share this post Link to post
Rollo62 538 Posted June 15, 2021 @Stefan Glienke Thanks for that nice library, its really outstanding. I still have to explore more of the billion features, and meanwhile I'm always looking "Spring-first" if I need a new feature What I always notice here, is that Spring.Collections.dcu is growing to be really by orders the biggest file in my projects, same as the compille time of Spring.Collections is the longest when compiling from source. In the Spring.Collections I see a lot of types defined, do you think that further separation and modularization of those types into separate units would help to limit the maximum file size to an acceptable level, if not reduce the overal size? Share this post Link to post
Stefan Glienke 2009 Posted June 15, 2021 Just now, Rollo62 said: What I always notice here, is that Spring.Collections.dcu is growing to be really by orders the biggest file in my projects, same as the compille time of Spring.Collections is the longest when compiling from source. In the Spring.Collections I see a lot of types defined, do you think that further separation and modularization of those types into separate units would help to limit the maximum file size to an acceptable level, if not reduce the overal size? The opposite - by forcing many specialized types into Spring.Collections.dcu it avoids stuffing them into each and every other dcu that might use them. Try following: create 10 units each with a simple class and a function that calls TCollections.CreateList<TThatClass> - now compile the project once with 1.2 and once with 2.0 and look at the dcus. Also I can only repeat my suggestion: precompile Spring and use its dcus and don't recompile it over and over every time you compile your project. Because then you will gain the main benefit - most commonly used specializations are in Spring.Collections.dcu and the compiler just needs to reference them. Compiletime and memory usage of the compiler for projects using Spring 2.0 have dropped significantly. 7 Share this post Link to post
Rollo62 538 Posted June 15, 2021 (edited) Yes, I have projects with precompiled S4D and some earlier projects with S4D from sources, which I hold in different VM's and not need to compile very often. Still that is pretty much OK for me, even when compiling from Sources, but once V2.0 is released I might migrate all of them to the precompiled version. Big promise Edited June 15, 2021 by Rollo62 Share this post Link to post
Arnaud Bouchez 407 Posted June 22, 2021 @Stefan Glienke If you can, please take a look at a new mORMot 2 unit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas In respect to generics.collections, this unit uses interfaces as variable holders, and leverage them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC. Most of the unit is in fact embedding some core collection types to mormot.core.collections.dcu to reduce the user units and executable size for Delphi XE7+ and FPC 3.2+. Thanks a lot for the ideas! It also publishes TDynArray and TSynDictionary high-level features like JSON/binary serialization or thread safety with Generics strong typing. More TDynArray features (like sorting and search) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming. 3 Share this post Link to post
Arnaud Bouchez 407 Posted June 23, 2021 (edited) Is it only me or awful and undocumented problems like " F2084 Internal Error: AV004513DE-R00024F47-0 " occur when working on Delphi with generics? @Stefan Glienke How did you manage to circumvent the Delphi compiler limitations? On XE7 for instance, as soon as I use intrinsics I encounter those problems - and the worse is that they are random: sometimes, the compilation succeed, but other times there is the Internal Error, sometimes on Win32 sometimes on Win64... 😞 I get rid of as many "inlined" code as possible, since I discovered generics do not like calling "inline" code. In comparison, FPC seems much more stable. The Lazarus IDE is somewhat lost within generics code (code navigation is not effective) - but at least, it doesn't crash and you can work as you expect. FPC 3.2 generics support is somewhat mature: I can see by disassembling the .o that intrinsics are properly handled with this compiler. Which is good news, since I rely heavily on it for folding base generics specialization using interface, as you did with Spring4D. Edited June 23, 2021 by Arnaud Bouchez Share this post Link to post
Stefan Glienke 2009 Posted June 23, 2021 Yes, developing generic code can be absolutely frustrating depending on the Delphi version. And codegen within generics can be less effective than it would if you would write the same identical code directly. Guess you could get a glimpse of what I have to endure Share this post Link to post
Arnaud Bouchez 407 Posted June 23, 2021 I just found out that Delphi has big troubles with static arrays like THash128 = array[0..15] of byte. If I remove those types, then the Delphi compiler does emit "F2084 Internal Error" any more... So I will stick with the same kind of types as you did (byte/word/integer/int64/string/interface/variant) and keep those bigger ordinals (THash128/TGUID) to use regular (bloated) generics code on Delphi (they work with no problem on FPC 3.2). Yes, I get a glimpse of what you endure, and I thank you much for having found such nice tricks in Spring4D. The interface + folded types pattern is awesome. 😄 1 Share this post Link to post
Edwin Yip 154 Posted June 24, 2021 12 hours ago, Arnaud Bouchez said: Is it only me or awful and undocumented problems like " F2084 Internal Error: AV004513DE-R00024F47-0 " occur when working on Delphi with generics? You are not alone! And sadly that's the reason I use as less as generics as possible... sometimes a re-build clears the issue, sometimes I have to re-boot the IDE, it's really unproductive... And, I thought the bug only exists in my old XE4 compiler, I didn't expect it also lives in the new IDEs... then I wonder, what are the new versions charge for??? ... Share this post Link to post
Edwin Yip 154 Posted June 24, 2021 On 6/23/2021 at 6:07 AM, Arnaud Bouchez said: @Stefan Glienke If you can, please take a look at a new mORMot 2 unit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas In respect to generics.collections, this unit uses interfaces as variable holders, and leverage them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC. Most of the unit is in fact embedding some core collection types to mormot.core.collections.dcu to reduce the user units and executable size for Delphi XE7+ and FPC 3.2+. Thanks a lot for the ideas! It also publishes TDynArray and TSynDictionary high-level features like JSON/binary serialization or thread safety with Generics strong typing. More TDynArray features (like sorting and search) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming. Wow! So the collections in mORMot2 is/will be even more powerful! Great to hear that, and as always, we can rely on mORMt/2 ;) Share this post Link to post
Arnaud Bouchez 407 Posted June 24, 2021 I discovered that using a record as TEnumerator makes the code faster, and allow some kind of inlining, with no memory allocation. My TSynEnumerator<T> record only uses 3 pointers on the stack with no temporary allocation. I prefer using pointers here to avoid any temporary storage e.g. for managed types. And the GetCurrent and MoveNext functions have no branch and inline very aggressively. And it works with Delphi 2010! 😉 Record here sounds like a much more efficient idea than a class + interface, as @Stefan Glienke did in Spring4D. Stefan, have you any reason to prefer interfaces also for the enumerator instead of good old record? From my finding, performance is better with a record, especially thanks to inlining - which is perfect on FPC, but Delphi still calls MoveNext and don't inline it. It also avoid a try...finally for most simple functions, nor any heap allocation. Please check https://github.com/synopse/mORMot2/commit/17b7a2753bb54057ad0b6d03bd757e370d80dce2 Share this post Link to post
Stefan Glienke 2009 Posted June 24, 2021 (edited) Latest Delphi versions inline MoveNext - since 10.3 or so The reason to use an interface is simple: API compatibility - try building something like IEnumerable<T> and all other collection types based upon and compatible with that with record-based enumerators that inline. It's not possible. Record-based enumerators are only possible if you make them specifically for each collection type as in your case because you don't have a common base type for your collections that is also enumerable. IList<T> in Spring can have multiple implementations (and indeed does) not all being wrappers around a dynamic array. FWIW here are some performance comparisons between RTL that uses a class for enumerator with inlined MoveNext GetCurrent, a modified RTL version via helper with GetEnumerator that returns a record, and Spring. The test just does iteration over differently sized lists and counting the odd items (the lists are filled with numbers 1..n in random order. You can indeed see the smaller list have way lower items/sec as the loop overhead is higher. However, I argue that the benefit of the overall design and its flexibility with all the things you can achieve with IEnumerable<T> compensates for the higher cost of setting up the loop and not being able to inline MoveNext and GetCurrent. Furthermore, the enumerator in Spring does an additional check in its MoveNext to prevent accidentally modifying the currently being iterated collection - which is a not so uncommon mistake to happen. Also since IEnumerable<T> and IEnumerator<T> in Spring are composable and are being used with the streaming operations an enumerator always holds the value after a MoveNext and does not only fetch it from the underlying collections like a as a naive and fast list iterator would do by just holding the lists array pointer and an index and the inlined GetCurrent be Result := items[index] Remember when I wrote in my article that Spring provides the best balance between the best possible speed and a rich API? This is one of the decisions I had to make and I am constantly exploring options to make this better. Especially on small lists, the loop overhead can be huge compared to the actual work inside the loop. FWIW especially for lists, I am currently looking into providing a similar (but safer!) API as the RTL does by giving direct raw access to the backing array. Using this API can use a record enumerator and blows the performance totally out of the water. 10.4.2 - win32 Run on (4 X 3410,01 MHz CPU s) CPU Caches: L1 Data 32 K (x4) L1 Instruction 32 K (x4) L2 Unified 256 K (x4) L3 Unified 6144 K (x1) ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------ iterate-rtl/10 79,6 ns 78,5 ns 8960000 items_per_second=127.431M/s iterate-rtl/100 291 ns 295 ns 2488889 items_per_second=338.913M/s iterate-rtl/1000 2286 ns 2302 ns 298667 items_per_second=434.425M/s iterate-rtl/10000 22287 ns 22461 ns 32000 items_per_second=445.217M/s iterate-rtl/100000 222090 ns 219702 ns 2987 items_per_second=455.162M/s iterate-rtl-record/10 17,0 ns 17,3 ns 40727273 items_per_second=579.232M/s iterate-rtl-record/100 185 ns 184 ns 3733333 items_per_second=543.03M/s iterate-rtl-record/1000 1737 ns 1716 ns 373333 items_per_second=582.764M/s iterate-rtl-record/10000 18495 ns 18415 ns 37333 items_per_second=543.025M/s iterate-rtl-record/100000 179492 ns 179983 ns 3733 items_per_second=555.609M/s iterate-spring/10 90,2 ns 90,0 ns 7466667 items_per_second=111.132M/s iterate-spring/100 410 ns 408 ns 1723077 items_per_second=245.06M/s iterate-spring/1000 3699 ns 3683 ns 186667 items_per_second=271.516M/s iterate-spring/10000 36136 ns 36098 ns 19478 items_per_second=277.02M/s iterate-spring/100000 365107 ns 368968 ns 1948 items_per_second=271.026M/s 10.4.2 - win64 Run on (4 X 3410,01 MHz CPU s) CPU Caches: L1 Data 32 K (x4) L1 Instruction 32 K (x4) L2 Unified 256 K (x4) L3 Unified 6144 K (x1) ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------ iterate-rtl/10 112 ns 112 ns 6400000 items_per_second=89.0435M/s iterate-rtl/100 538 ns 530 ns 1120000 items_per_second=188.632M/s iterate-rtl/1000 4570 ns 4499 ns 149333 items_per_second=222.263M/s iterate-rtl/10000 45814 ns 46527 ns 15448 items_per_second=214.929M/s iterate-rtl/100000 457608 ns 455097 ns 1545 items_per_second=219.733M/s iterate-rtl-record/10 20,1 ns 19,9 ns 34461538 items_per_second=501.259M/s iterate-rtl-record/100 197 ns 197 ns 3733333 items_per_second=508.369M/s iterate-rtl-record/1000 1863 ns 1842 ns 373333 items_per_second=543.03M/s iterate-rtl-record/10000 18664 ns 18834 ns 37333 items_per_second=530.958M/s iterate-rtl-record/100000 186418 ns 188354 ns 3733 items_per_second=530.916M/s iterate-spring/10 107 ns 107 ns 6400000 items_per_second=93.0909M/s iterate-spring/100 493 ns 500 ns 1000000 items_per_second=200M/s iterate-spring/1000 4298 ns 4332 ns 165926 items_per_second=230.854M/s iterate-spring/10000 42277 ns 42165 ns 14452 items_per_second=237.161M/s iterate-spring/100000 422194 ns 423825 ns 1659 items_per_second=235.947M/s Edited June 24, 2021 by Stefan Glienke 1 1 Share this post Link to post
Arnaud Bouchez 407 Posted June 24, 2021 Nice timings. Yes, you are right, in mORMot we only use basic iteration to be used in "for ... in .. do" statement with no further composition and flexibility as available with IEnumerable<T>. The difference for small loops is huge (almost 10 times) and for big loops is still relevant (2 times) when records are used. I guess mORMot pointer-based records could be slightly faster than RTL index-based values, especially when managed types are involved. In practice, I find "for .. in .. do" to be the main place for iterations. So to my understanding, records are the way to go for mORMot. Then nothing prevents another method returning complex and fluid IEnumerable<T>. We just don't go that way in mORMot yet. Share this post Link to post
Vincent Parrett 754 Posted June 24, 2021 4 hours ago, Stefan Glienke said: Remember when I wrote in my article that Spring provides the best balance between the best possible speed and a rich API? It's this rich api that makes Spring4D collections worthwhile. The perf hit is still worth it for this alone. In a real world application (FinalBuilder) I have not noticed any perf issues that could be attributed to the spring4d collections, but then I'm rarely enumerating huge collections - certainly in my profiling efforts there were plenty of areas to optimise, but spring4d wasn't showing up as an area of concern in the profiler (vtune). Now if only we could have non ref counted interfaces (ie no calls to add/release) and we could implement interfaces on records - best of both worlds 😉 2 Share this post Link to post
Stefan Glienke 2009 Posted June 25, 2021 (edited) FWIW the difference you saw between interfaced based enumerator and a record based one might have been bigger than what you see with Spring because some collection types (I am considering applying this to some more) are using not classic implemented-by-an-object interfaces for IEnumerator but a handcrafted IMT - seehttps://bitbucket.org/sglienke/spring4d/src/3e9160575e06b3956c0e90d2aebe5e57d931cd19/Source/Base/Collections/Spring.Collections.Lists.pas#lines-56 - this avoids the adjustor thunks which gives a noticeable performance improvement. I also looked into avoiding the heap allocation for the enumerator by embedding one into the list itself which it returns and only do the heap allocation once a second enumeration happens. I did not run any benchmarks on that yet to evaluate if that's something worthwhile. After all, if you want to have that maximum raw speed avoiding assignments I already mentioned another approach I am looking into - which also provides for-in loop capability but will not have T as loop element but a ^T. I have that in an experimental branch - I can add some benchmark of that tomorrow. Edited June 25, 2021 by Stefan Glienke 1 Share this post Link to post
Arnaud Bouchez 407 Posted June 25, 2021 Yes, I have seen the handcrafted IMT. But I am not convinced the "sub rcx, xxx; jmp xxx" code block makes a noticeable performance penalty - perhaps only on irrelevant micro benchmarks. The CPU lock involved in calling a Delphi virtual method through an interface has a cost for sure https://www.idefixpack.de/blog/2016/05/whats-wrong-with-virtual-methods-called-through-an-interface - but not a sub + jmp. Also the enumerator instance included into the list itself seems a premature optimization to me if we want to be cross-platform as we want. Calling GetThreadID on a non Windows platform has a true cost if you call the pthread library. And the resulting code complexity makes me wonder if it is really worth it in practice. Better switch to a better memory manager, or just use a record and rely on inlining + stack allocation of an unmanaged enumerator. Since you implemented both of those optimizations: OK, just continue to use them. But I won't go that way with mORMot. Whereas I still find your interface + pre-compiled folded classes an effective and nice trick (even if I just disable byte and word stubs for dictionaries: if you have a dictionary, the hash table will be bigger than the size of the byte/word data itself - so just use integers instead of byte/word for dictionaries in end-user code; but I still have byte/word specializations for IList<> which does not have any hash by default, but may on demand). Share this post Link to post