Stefan Glienke

Spring4D 2.0 sneak peek - the evolution of performance

Recommended Posts

Interesting read.  About avoiding RTTI as much as possible - how is this achieved?

Edit: Or rather - what tradeoffs were necessary to eliminate the RTTI dependencies?


Nice reading.

 

If you make performance comparisons with other libraries, make sure you use the latest mORMot 2 source code - not mORMot 1, but mORMot 2.
Several rewrites and optimizations have gone into the latest mORMot 2.
And consider using TDynArrayHashed instead of TSynDictionary if you want a fair comparison - TSynDictionary takes a lock to be thread-safe and copies the retrieved data at lookup, so it is not a fair comparison with regular mapping classes.

36 minutes ago, Arnaud Bouchez said:

Nice reading.

 

If you make performance comparisons with other libraries, make sure you use the latest mORMot 2 source code - not mORMot 1, but mORMot 2.
Several rewrites and optimizations have gone into the latest mORMot 2.
And consider using TDynArrayHashed instead of TSynDictionary if you want a fair comparison - TSynDictionary takes a lock to be thread-safe and copies the retrieved data at lookup, so it is not a fair comparison with regular mapping classes.

Are the collections in mORMot generic? Because if not, there's no way to make a fair comparison.

Posted (edited)

TDynArrayHashed is a wrapper around an existing dynamic array. It doesn't hold any data but works with TArray<T>. You could even maintain several hashes for several fields of the array item at the same time.
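From memory, binding the wrapper to an existing array looks roughly like this (method names recalled from the mORMot 1/2 API - check mormot.core.data.pas for the exact signatures before relying on them):

```pascal
uses
  mormot.core.base,
  mormot.core.data;

var
  values: TIntegerDynArray;  // the data lives in a plain dynamic array
  da: TDynArrayHashed;       // the wrapper holds only pointers/metadata
begin
  // bind the wrapper to the existing array - no data is copied
  da.Init(TypeInfo(TIntegerDynArray), values);
  da.Add(42);                // grows the underlying dynamic array
  // "values" is still a plain dynamic array, usable directly:
  assert(values[0] = 42);
end;
```

Because the wrapper and the array are separate, several TDynArrayHashed instances could index different fields of the same item type over one shared array, as described above.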

TSynDictionary is a class holding two dynamic arrays defined at runtime (one for the keys, one for the values). So it is closer to the regular collections. But it has additional features (like thread safety, JSON or binary serialization, deprecation after timeout, embedded data search...), which could make the comparison unfair: it does more than lookups.

 

The default syntax uses the old-fashioned TypeInfo(), but you could wrap it into generics syntax instead, using TypeInfo(T).
So I guess the comparison is fair. And the executable size would not be bloated, because no generics are involved in the actual implementation of the TSynDictionary class in mORMot.
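The trick Arnaud describes can be sketched like this (hypothetical type names, not actual mORMot code): the whole engine is compiled once against PTypeInfo, and each generic specialization only generates a trivial constructor:

```pascal
uses
  System.TypInfo;

type
  // compiled once: works on untyped buffers + RTTI, no generic code at all
  TRawDictionary = class
  private
    FKeyInfo, FValueInfo: PTypeInfo;
  public
    constructor Create(keyInfo, valueInfo: PTypeInfo);
    // Add/Find/Delete would operate on untyped data via the type info
  end;

  // per specialization, only this tiny shell is generated
  TDict<TKey, TValue> = class(TRawDictionary)
  public
    constructor Create;
  end;

constructor TRawDictionary.Create(keyInfo, valueInfo: PTypeInfo);
begin
  FKeyInfo := keyInfo;
  FValueInfo := valueInfo;
end;

constructor TDict<TKey, TValue>.Create;
begin
  // TypeInfo(TKey/TValue) compiles to a constant per instantiation;
  // the non-generic engine does all the real work
  inherited Create(TypeInfo(TKey), TypeInfo(TValue));
end;
```

Under this scheme, instantiating TDict<string, Integer> and TDict<string, Double> duplicates only the one-line constructors, not the dictionary logic.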

Removing generic/parametrized code as soon as possible is what every library tries to do, both the Delphi RTL and Spring4D. The RTL and Spring were rewritten several times to keep the generic syntax as high as possible, and to leverage compiler magic as much as possible to avoid duplicated or dead code. This is what Spring4D 2.0 is all about. mORMot just avoids duplicated or dead code generation completely.

Edited by Arnaud Bouchez


A few months ago I replaced all RTL collections (apart from TStringList) with Spring4D 2.0 collections, and I saw noticeably smaller binaries (exe/bpl) and faster overall application performance. Spring4D collections are just so much nicer to use: being interface based is a plus for me, and the LINQ-like functionality makes it really easy to do sorting, filtering, projections etc. - if only Delphi had lambdas so the predicates etc. were not so damned verbose!

 

I'm extremely thankful for the work @Stefan Glienke has put into Spring4D 👍

  

  • Like 10
  • Thanks 2


@Stefan Glienke

Thanks for that nice library, it's really outstanding.

I still have to explore more of the billion features, and meanwhile I'm always looking "Spring-first" if I need a new feature :classic_biggrin:

What I always notice here is that Spring.Collections.dcu grows to be by far the biggest file in my projects,
and likewise the compile time of Spring.Collections is the longest when compiling from source.

In Spring.Collections I see a lot of types defined.
Do you think that further separating and modularizing those types into separate units would help to limit the maximum file size to an acceptable level,

if not reduce the overall size?

 

Just now, Rollo62 said:

What I always notice here is that Spring.Collections.dcu grows to be by far the biggest file in my projects,
and likewise the compile time of Spring.Collections is the longest when compiling from source.

In Spring.Collections I see a lot of types defined.
Do you think that further separating and modularizing those types into separate units would help to limit the maximum file size to an acceptable level,

if not reduce the overall size?

The opposite - by forcing many specialized types into Spring.Collections.dcu, it avoids stuffing them into each and every other dcu that might use them.

Try the following: create 10 units, each with a simple class and a function that calls TCollections.CreateList<TThatClass> - now compile the project once with 1.2 and once with 2.0, and look at the dcus.
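The "folding" that makes this work can be illustrated conceptually (a simplified sketch, not the actual Spring.Collections source - TFoldedObjectList is a hypothetical name): all lists of class types share one compiled implementation, so the code lands once in Spring.Collections.dcu instead of being regenerated in every consuming unit:

```pascal
class function TCollections.CreateList<T>: IList<T>;
var
  folded: IList<TObject>;
begin
  if GetTypeKind(T) = tkClass then
  begin
    // one shared implementation, compiled once in Spring.Collections;
    // the element type info is passed at runtime for casts/ownership
    folded := TFoldedObjectList.Create(TypeInfo(T));
    // IList<TSomeClass> and IList<TObject> have identical memory layout,
    // so a hard cast folds every class-type list onto the same code
    IList<TObject>(Result) := folded;
  end
  else
    Result := TList<T>.Create; // value types get a real specialization
end;
```

With this pattern, the ten units in Stefan's experiment each compile down to little more than a call into the already-compiled folded implementation, rather than each embedding a full TList<TThatClass> body.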

 

Also, I can only repeat my suggestion: precompile Spring, use its dcus, and don't recompile it over and over every time you compile your project. That is where you gain the main benefit - the most commonly used specializations are in Spring.Collections.dcu, and the compiler just needs to reference them.

Compile time and memory usage of the compiler have dropped significantly for projects using Spring 2.0.

  • Like 7

Posted (edited)

Yes, I have projects with precompiled S4D and some earlier projects with S4D from sources, which I keep in different VMs and don't need to compile very often.

Still, that is pretty much OK for me, even when compiling from sources, but once V2.0 is released I might migrate all of them to the precompiled version.

Big promise :classic_biggrin:

Edited by Rollo62


@Stefan Glienke

If you can, please take a look at a new mORMot 2 unit:
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas

 

Compared to generics.collections, this unit uses interfaces as variable holders and leverages them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC.

Most of the unit is in fact about embedding some core collection types into mormot.core.collections.dcu, to reduce user-unit and executable size for Delphi XE7+ and FPC 3.2+.
Thanks a lot for the ideas!

 

It also publishes TDynArray and TSynDictionary high-level features like JSON/binary serialization and thread safety with generics' strong typing.
More TDynArray features (like sorting and searching) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming.

  • Like 4

Posted (edited)

Is it only me, or do awful and undocumented problems like "F2084 Internal Error: AV004513DE-R00024F47-0" occur when working with generics in Delphi?

@Stefan Glienke How did you manage to circumvent the Delphi compiler limitations?

On XE7, for instance, as soon as I use intrinsics I encounter those problems - and the worst part is that they are random: sometimes the compilation succeeds, but other times there is the internal error, sometimes on Win32, sometimes on Win64... 😞

I got rid of as much "inlined" code as possible, since I discovered that generics do not like calling "inline" code.

 

In comparison, FPC seems much more stable. The Lazarus IDE is somewhat lost within generic code (code navigation is not effective) - but at least it doesn't crash, and you can work as you expect.

FPC 3.2's generics support is somewhat mature: I can see by disassembling the .o files that intrinsics are properly handled by this compiler. Which is good news, since I rely heavily on them for folding base generic specializations using interfaces, as you did with Spring4D.

 

Edited by Arnaud Bouchez


Yes, developing generic code can be absolutely frustrating depending on the Delphi version.

And codegen within generics can be less effective than if you wrote the identical code directly.

Guess you could get a glimpse of what I have to endure :classic_laugh:


I just found out that Delphi has big trouble with static arrays like THash128 = array[0..15] of Byte.
If I remove those types, the Delphi compiler no longer emits the "F2084 Internal Error"...

 

So I will stick with the same kinds of types as you did (byte/word/integer/int64/string/interface/variant) and let those bigger ordinals (THash128/TGUID) use regular (bloated) generics code on Delphi (they work with no problem on FPC 3.2).

 

Yes, I get a glimpse of what you endure, and I thank you very much for having found such nice tricks in Spring4D. The interface + folded types pattern is awesome. 😄

  • Thanks 1

12 hours ago, Arnaud Bouchez said:

Is it only me, or do awful and undocumented problems like "F2084 Internal Error: AV004513DE-R00024F47-0" occur when working with generics in Delphi?

 

You are not alone! And sadly, that's the reason I use generics as little as possible... sometimes a rebuild clears the issue, sometimes I have to restart the IDE; it's really unproductive...

 

And I thought the bug only existed in my old XE4 compiler - I didn't expect it to also live in the new IDEs... then I wonder, what do the new versions charge for??? ...

On 6/23/2021 at 6:07 AM, Arnaud Bouchez said:

@Stefan Glienke

If you can, please take a look at a new mORMot 2 unit:
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas

 

Compared to generics.collections, this unit uses interfaces as variable holders and leverages them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC.

Most of the unit is in fact about embedding some core collection types into mormot.core.collections.dcu, to reduce user-unit and executable size for Delphi XE7+ and FPC 3.2+.
Thanks a lot for the ideas!

 

It also publishes TDynArray and TSynDictionary high-level features like JSON/binary serialization and thread safety with generics' strong typing.
More TDynArray features (like sorting and searching) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming.

Wow! So the collections in mORMot 2 are / will be even more powerful!

Great to hear, and as always, we can rely on mORMot 2 ;)


I discovered that using a record as the TEnumerator makes the code faster and allows some kind of inlining, with no memory allocation.

My TSynEnumerator<T> record only uses 3 pointers on the stack, with no temporary allocation. I prefer using pointers here to avoid any temporary storage, e.g. for managed types. And the GetCurrent and MoveNext functions have no branches and inline very aggressively.
And it works with Delphi 2010! 😉
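A pointer-based record enumerator of the kind Arnaud describes might look like this (a simplified sketch, not the actual TSynEnumerator source, and with a branch that his branch-free version presumably avoids; it assumes compiler support for a pointer to the type parameter, PT = ^T):

```pascal
type
  TArrayEnumerator<T> = record
  private type
    PT = ^T;
  private
    FNext: PT;              // next item to yield
    FCurrent: PT;           // item returned by Current
    FRemaining: NativeInt;  // items left to enumerate
    function GetCurrent: T; inline;
  public
    // the caller must keep Values alive while enumerating:
    // the record holds raw pointers, not an array reference
    procedure Init(const Values: TArray<T>);
    function MoveNext: Boolean; inline;
    property Current: T read GetCurrent;
  end;

procedure TArrayEnumerator<T>.Init(const Values: TArray<T>);
begin
  FNext := pointer(Values);
  FRemaining := Length(Values);
end;

function TArrayEnumerator<T>.MoveNext: Boolean;
begin
  Result := FRemaining > 0;
  if Result then
  begin
    FCurrent := FNext;   // pointer bump only - no copy, no refcounting
    Inc(FNext);
    Dec(FRemaining);
  end;
end;

function TArrayEnumerator<T>.GetCurrent: T;
begin
  Result := FCurrent^;   // the only copy happens when the loop body asks
end;
```

For a for-in loop, the compiler only needs a GetEnumerator function returning such a record; since MoveNext and GetCurrent inline, the loop can compile down to little more than a pointer increment and a compare, with no heap allocation and no try...finally.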

 

A record here sounds like a much more efficient idea than a class + interface, as @Stefan Glienke did in Spring4D. Stefan, do you have any reason to prefer interfaces for the enumerator as well, instead of a good old record?
From my findings, performance is better with a record, especially thanks to inlining - which is perfect on FPC, whereas Delphi still calls MoveNext and doesn't inline it. A record also avoids a try...finally for most simple functions, and any heap allocation.

 

Please check https://github.com/synopse/mORMot2/commit/17b7a2753bb54057ad0b6d03bd757e370d80dce2

Posted (edited)

The latest Delphi versions inline MoveNext - since 10.3 or so.

 

The reason to use an interface is simple: API compatibility - try building something like IEnumerable<T>, and all the other collection types based upon and compatible with it, using record-based enumerators that inline. It's not possible.

Record-based enumerators are only possible if you write them specifically for each collection type, as in your case, because you don't have a common base type for your collections that is also enumerable.

IList<T> in Spring can have multiple implementations (and indeed does) not all being wrappers around a dynamic array.

 

FWIW, here are some performance comparisons between the RTL (which uses a class for the enumerator, with inlined MoveNext and GetCurrent), a modified RTL version via a helper whose GetEnumerator returns a record, and Spring.

The test just iterates over differently sized lists, counting the odd items (the lists are filled with the numbers 1..n in random order). You can indeed see that the smaller lists have way lower items/sec, as the loop overhead is higher.
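The measured loop is essentially of this shape (my reconstruction for illustration, not Stefan's actual benchmark code; IList<Integer> comes from Spring.Collections):

```pascal
uses
  Spring.Collections;

function CountOdd(const list: IList<Integer>): Integer;
var
  i: Integer;
begin
  Result := 0;
  // the for-in loop is what the benchmark measures: enumerator setup
  // plus one MoveNext/GetCurrent pair per element
  for i in list do
    if Odd(i) then
      Inc(Result);
end;
```

With almost no work per element, the numbers below are dominated by enumerator overhead, which is exactly what the comparison is meant to expose.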

However, I argue that the benefit of the overall design and its flexibility with all the things you can achieve with IEnumerable<T> compensates for the higher cost of setting up the loop and not being able to inline MoveNext and GetCurrent.

 

Furthermore, the enumerator in Spring does an additional check in its MoveNext to prevent accidentally modifying the collection that is currently being iterated - a not-so-uncommon mistake.
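Such a guard is typically implemented with a version counter that every mutating operation increments (a simplified sketch of the general technique, not Spring's actual code; FVersion, FIndex and the Version property are hypothetical names):

```pascal
function TListEnumerator<T>.MoveNext: Boolean;
begin
  // fail fast if the list was mutated since GetEnumerator was called
  if FVersion <> FList.Version then
    raise EInvalidOperationException.Create(
      'Collection was modified during enumeration');
  Inc(FIndex);
  Result := FIndex < FList.Count;
end;
```

The check costs one extra compare per iteration, but turns a silent corruption bug into an immediate, diagnosable exception.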

Also, since IEnumerable<T> and IEnumerator<T> in Spring are composable and are used with the streaming operations, an enumerator always holds the value after a MoveNext and does not just fetch it from the underlying collection, as a naive and fast list iterator would do by holding the list's array pointer and an index, with the inlined GetCurrent being Result := items[index].

 

Remember when I wrote in my article that Spring provides the best balance between the best possible speed and a rich API? This is one of the decisions I had to make and I am constantly exploring options to make this better.

Especially on small lists, the loop overhead can be huge compared to the actual work inside the loop. FWIW, especially for lists I am currently looking into providing a similar (but safer!) API to the one the RTL offers, giving direct raw access to the backing array. Code using this API can use a record enumerator, which blows the performance totally out of the water.

 

10.4.2 - win32

Run on (4 X 3410,01 MHz CPU s)
CPU Caches:
  L1 Data 32 K (x4)
  L1 Instruction 32 K (x4)
  L2 Unified 256 K (x4)
  L3 Unified 6144 K (x1)
------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
iterate-rtl/10                  79,6 ns         78,5 ns      8960000 items_per_second=127.431M/s
iterate-rtl/100                  291 ns          295 ns      2488889 items_per_second=338.913M/s
iterate-rtl/1000                2286 ns         2302 ns       298667 items_per_second=434.425M/s
iterate-rtl/10000              22287 ns        22461 ns        32000 items_per_second=445.217M/s
iterate-rtl/100000            222090 ns       219702 ns         2987 items_per_second=455.162M/s
iterate-rtl-record/10           17,0 ns         17,3 ns     40727273 items_per_second=579.232M/s
iterate-rtl-record/100           185 ns          184 ns      3733333 items_per_second=543.03M/s
iterate-rtl-record/1000         1737 ns         1716 ns       373333 items_per_second=582.764M/s
iterate-rtl-record/10000       18495 ns        18415 ns        37333 items_per_second=543.025M/s
iterate-rtl-record/100000     179492 ns       179983 ns         3733 items_per_second=555.609M/s
iterate-spring/10               90,2 ns         90,0 ns      7466667 items_per_second=111.132M/s
iterate-spring/100               410 ns          408 ns      1723077 items_per_second=245.06M/s
iterate-spring/1000             3699 ns         3683 ns       186667 items_per_second=271.516M/s
iterate-spring/10000           36136 ns        36098 ns        19478 items_per_second=277.02M/s
iterate-spring/100000         365107 ns       368968 ns         1948 items_per_second=271.026M/s


10.4.2 - win64

Run on (4 X 3410,01 MHz CPU s)
CPU Caches:
  L1 Data 32 K (x4)
  L1 Instruction 32 K (x4)
  L2 Unified 256 K (x4)
  L3 Unified 6144 K (x1)
------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
iterate-rtl/10                   112 ns          112 ns      6400000 items_per_second=89.0435M/s
iterate-rtl/100                  538 ns          530 ns      1120000 items_per_second=188.632M/s
iterate-rtl/1000                4570 ns         4499 ns       149333 items_per_second=222.263M/s
iterate-rtl/10000              45814 ns        46527 ns        15448 items_per_second=214.929M/s
iterate-rtl/100000            457608 ns       455097 ns         1545 items_per_second=219.733M/s
iterate-rtl-record/10           20,1 ns         19,9 ns     34461538 items_per_second=501.259M/s
iterate-rtl-record/100           197 ns          197 ns      3733333 items_per_second=508.369M/s
iterate-rtl-record/1000         1863 ns         1842 ns       373333 items_per_second=543.03M/s
iterate-rtl-record/10000       18664 ns        18834 ns        37333 items_per_second=530.958M/s
iterate-rtl-record/100000     186418 ns       188354 ns         3733 items_per_second=530.916M/s
iterate-spring/10                107 ns          107 ns      6400000 items_per_second=93.0909M/s
iterate-spring/100               493 ns          500 ns      1000000 items_per_second=200M/s
iterate-spring/1000             4298 ns         4332 ns       165926 items_per_second=230.854M/s
iterate-spring/10000           42277 ns        42165 ns        14452 items_per_second=237.161M/s
iterate-spring/100000         422194 ns       423825 ns         1659 items_per_second=235.947M/s
Edited by Stefan Glienke
  • Thanks 1


Nice timings.

Yes, you are right: in mORMot we only use basic iteration in "for ... in ... do" statements, with none of the further composition and flexibility available with IEnumerable<T>.

 

The difference for small loops is huge (almost 10 times), and for big loops it is still relevant (2 times), when records are used. I guess mORMot's pointer-based records could be slightly faster than the RTL's index-based values, especially when managed types are involved.


In practice, I find "for .. in .. do" to be the main place for iterations. So to my understanding, records are the way to go for mORMot.

Then nothing prevents another method from returning a complex and fluid IEnumerable<T>.

We just don't go that way in mORMot yet.
 

4 hours ago, Stefan Glienke said:

Remember when I wrote in my article that Spring provides the best balance between the best possible speed and a rich API?

It's this rich API that makes Spring4D collections worthwhile. The perf hit is still worth it for this alone. In a real-world application (FinalBuilder) I have not noticed any perf issues that could be attributed to the Spring4D collections, but then I'm rarely enumerating huge collections - certainly in my profiling efforts there were plenty of areas to optimize, but Spring4D wasn't showing up as an area of concern in the profiler (VTune).

 

Now if only we could have non-ref-counted interfaces (i.e. no calls to _AddRef/_Release), and we could implement interfaces on records - best of both worlds 😉

  • Like 1

Posted (edited)

FWIW, the difference you saw between an interface-based enumerator and a record-based one might have been bigger than what you see with Spring, because some collection types (and I am considering applying this to some more) are not using classic implemented-by-an-object interfaces for IEnumerator but a handcrafted IMT - see https://bitbucket.org/sglienke/spring4d/src/3e9160575e06b3956c0e90d2aebe5e57d931cd19/Source/Base/Collections/Spring.Collections.Lists.pas#lines-56 - this avoids the adjustor thunks, which gives a noticeable performance improvement.

I also looked into avoiding the heap allocation for the enumerator by embedding one into the list itself, which it returns, only doing a heap allocation once a second enumeration happens. I have not run any benchmarks on that yet to evaluate whether it is worthwhile.

After all, if you want maximum raw speed, avoiding the assignments I already mentioned, there is another approach I am looking into - it also provides for-in loop capability, but the loop element will be ^T rather than T. I have that in an experimental branch - I can add some benchmarks of that tomorrow.

Edited by Stefan Glienke


Yes, I have seen the handcrafted IMT.

But I am not convinced the "sub rcx, xxx; jmp xxx" code block causes a noticeable performance penalty - perhaps only in irrelevant micro-benchmarks.

The CPU lock involved in calling a Delphi virtual method through an interface has a cost for sure - see https://www.idefixpack.de/blog/2016/05/whats-wrong-with-virtual-methods-called-through-an-interface - but not a sub + jmp.

 

Also, embedding the enumerator instance in the list itself seems like a premature optimization to me if we want to stay cross-platform, as we do.

Calling GetThreadID on a non-Windows platform has a real cost if you call the pthread library. And the resulting code complexity makes me wonder if it is really worth it in practice.

Better to switch to a better memory manager, or just use a record and rely on inlining + stack allocation of an unmanaged enumerator.

 

Since you implemented both of those optimizations: OK, just continue to use them.

But I won't go that way with mORMot.

Whereas I still find your interface + precompiled folded classes an effective and nice trick (even if I just disable the byte and word stubs for dictionaries: if you have a dictionary, the hash table will be bigger than the byte/word data itself - so just use integers instead of byte/word for dictionaries in end-user code; but I still have byte/word specializations for IList<>, which has no hash by default, but may have one on demand).

