Jump to content
Erik@Grijjy

Allocation-Free Collections

Recommended Posts

A list not allocated in the heap? How is it even possible? :classic_huh:

Start reading... Good idea! I know where I can try it in my code.

Edited by Kryvich

Share this post


Link to post

In fact, it is not allocation-free, it is heap-allocation free, since it needs to allocate its memory from the stack.

 

And note that the stack size can be pretty limited...
For instance, if the process is run within IIS, stack size is only 256KB (on 32-bit), which could be too small for using allocation-free collections everywhere.
Under Linux, the default thread stack size can be pretty limited in comparison to Windows, too.

 

What we do in practice in our servers code when we need such temporary storage, is to use a 4KB stack-allocated buffer, with optional heap allocation.

See https://synopse.info/files/html/api-1.18/SynCommons.html#TSYNTEMPBUFFER
There are some methods to pre-initialize the buffer with common patterns (zero bytes, increasing integers, random numbers)....
But you need to use pointer arithmetic, whereas this article uses generics so may a little safer.
4KB is OS memory page size, so should already be there, and avoid to reach 256KB numbers.

 

 

  • Like 2

Share this post


Link to post

You are right, it is only heap-allocation free. I think I clarified that in the post.

 

And as I said at the end, "use if appropriate, not just because you can". So if you are running on 32-bit in IIS, then don't use it with large buffers. Small stack-based collections (256 bytes up to 1KB) can still be useful in those situations though...

Share this post


Link to post

And if you then write the Add and GetItem methods better so they don't do unnecessary jumps every time and can be inlined it will be even faster.

Edited by Stefan Glienke

Share this post


Link to post

You mean by moving those checks to a separate method? I noticed that method calls are pretty expensive on ARM devices, so I don’t know if it would increase performance in that case. Worth some testing...

 

We could also assume that the programmer knows what (s)he is doing and remove the check or replace it with an assertion if speed is really that important...

Share this post


Link to post
On 1/25/2019 at 6:35 PM, Erik@Grijjy said:

You mean by moving those checks to a separate method?

No. I don't know about the ARM codegen but on windows if you do the pattern:

 

if out of range then

  raise exception;

do stuff;

 

This always generates a jump above the raise to get to do stuff for the common case of "in range".

When you write:

 

if in range then

  do stuff

else

  raise exception;

 

In this case you don't have a jump for the common case of being in range. Depending on how you structure the code you then have two returns in each branch of the if so no jump inside the method.

If you like to inline this method it is also a good thing to put the raise exception into its own subroutine to not bloat the inlined code.

  

Share this post


Link to post

@Stefan Glienke Interesting... Do modern processors can not predict the branch that will be executed in the common case?

Share this post


Link to post

Yes, and performance of these things sometimes heavily differs between the different CPUs (for example I had a case where I got some severe slowdown in some code on my ivy bridge i5 whereas it was not there on a skylake i7 one).

Anyhow not having a jump is still better than a predicted one. But that is micro optimization (which I did invest some time into as the code I was testing is core framework code, which I usually consider collection types part of).

Edited by Stefan Glienke
  • Like 1

Share this post


Link to post

Has anyone done benchmarking against regular Delphi lists (Generic or not). Would be interesting to see how much this helps. 

 

I was just pondering that I can't even make guess what the difference will be.

Share this post


Link to post

Any pure speed benchmark would be rather useless/biased because it all depends on what you are in fact doing apart from creating the list. If this workload is small enough the object creation would mean a significant overhead.

In the end you would just benchmark object creation. Also what affects performance is pointer indirection and memory, CPU cache. In some micro benchmark this would not occur as all memory fits into L2 or even L1 cache.

 

Benchmarking the effect on heap usage/fragmentation is way harder because that cannot be done in pure isolation and usually only manifests in production.

 

FWIW performance for System.Generics.Collections.TList<T>.Add (and other methods) between 10.2 and 10.3 also differs quite significantly (got better).

 

And another thing that sometimes significantly improves performance when adding items to a list is to preset its Capacity to avoid reallocations - in the example given in the blog article with adding 64 integers this already improves performance by 33%.

Edited by Stefan Glienke
  • Like 1

Share this post


Link to post
58 minutes ago, Stefan Glienke said:

Any pure speed benchmark would be rather useless/biased because it all depends on what you are in fact doing apart from creating the list.

You are mostly right about this. If you have a small list size, you are mostly measuring creation/destruction time. But that is exactly one of the reasons you may want to use a stack-based list instead.

Although my main motivation wasn't about speed (it was more about memory fragmentation, which is hard to test as you mentioned), I added some speed tests to the repo anyway.

 

Like you said, any speed test in this area is biased, and does not represent a real-life situation, so you should take the results with a grain of salt. I tested 4 stack list implementations against Delphi's TList<Integer>:

  1. TStackList1 is the original stack list from example 2 (with a fixed configurable size, but doesn't allow managed types).
  2. TStackList2: as TStackList1 but takes Stefan's "put exception at the bottom" suggestion into account.
  3. TStackList3: as TStackList2 but uses inlining
  4. TStackList4: as TStackList1 but uses inlining

I tested these lists with a buffer of 256 bytes on Win32 using Delphi Rio. This shows how much faster each list is compared to Delphi's TList<Integer>.

  1. TStackList1: 2.78 times faster
  2. TStackList2: 2.77 times faster
  3. TStackList3: 2.81 times faster
  4. TStackList4: 3.03 times faster

Note that measurements vary a bit from session to session. Sometimes, "putting the exception at the bottom" has a positive impact, sometimes it doesn't. TStackList4 is always the fastest on Win32 though, but the differences aren't that big. On Win64, the results are about the same, but Delphi's TList<Integer> seems to perform a little bit better than on Win32.

 

Also, as you said, performance depends a lot on the list size. For a list size of 32 bytes, the stack lists are about 10x faster, but for a list size of 2048 bytes, they are only about 1.7x faster. This reinforces my suggestion that stack lists are most suitable for smaller temporary lists. For larger lists, just stick to heap-based collections instead. And you can tweak the capacity as you mentioned to avoid re-allocations.

 

You can't really test heap usage and fragmentation with these tests. Because we are creating and destroying lists in a loop, a smart memory manager just reuses the same heap memory over and over again without fragmentation. In real-life however, there will be lots of other things going on between the recreation of lists, and fragmentation will be more likely.

  • Like 2

Share this post


Link to post

As I mentioned before the jump optimization highly depends on the CPU - and of course it only affects methods that are being called and not inlined. When inlined you have a jump no matter what. Only when being called it can benefit from two returns in each branch to avoid the jump over. 

Edit: Clarification - when inlined, putting the raise first is better because it only needs one conditional jump instruction (jump over raise call or not), taken in the common case.

When putting it below it still has the conditional jump which is not taken in the common case but has a jump over this code in all cases when the common path is being used.

 

What I would do regardless is to put the exception raising code into a non inlined subroutine to reduce generated binary code because then the code being jumped over is just a call (5bytes on 32bit).

 

Another optimization is to make use of {$POINTERMATH ON} for your pointer type because then you can write:

 

  Target := @FData;
  Target[FCount] := AItem;

 

And another neat trick to check if an Integer index is in range that only uses one compare:

 

if Cardinal(AIndex) >= Cardinal(FCount) then

 

By the way - putting all cases into the same binary often influences results and even the order in which you execute them changes the output - I had a similar case where just changing order made one or another faster or slower.

Measuring performance is not as easy at it sounds :)

Edited by Stefan Glienke
  • Like 3

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×