Jump to content
angusj

Profiling Clipper2

Recommended Posts

Anders said in another thread ...

8 hours ago, Anders Melander said:

O.T. but have you tried profiling your Delphi code? I just did and there appears to be quite a few low hanging fruits that could improve the performance.

image.png.f7118335206007825c9bcdabada590e9.png

Hi Anders.

Thanks for the feedback, much appreciated. 

Yes, I have profiled my Clipper library's Delphi code (and not infrequently), but until now I haven't profiled into Delphi's Runtime Library.

And I see your point re TList.Get, if you're alluding to its relatively expensive range checking?

However, I can't see an easy hack (eg by overriding Get, since it isn't virtual).

But I could write a wrapper Get function (using TList's public List property) and bypass the default TList.items property altogether.

Is this what you're suggesting, or would you suggest I do something else?

And are you also seeing other "low hanging fruit" that would improve performance?

 

Edit: 

I've replaced all implicit calls to TList.Get with the following function and as a consequence have seen quite a substantial improvement in performance (and Delphi now slightly out-performs C#).

function UnsafeGet(List: TList; Index: Integer): Pointer; inline; 
begin
 // caution: no bounds checking
 Result := List.List[Index];
end;

 

 

 

Edited by angusj
  • Like 1

Share this post


Link to post

I simply don't use the RTL collections anywhere in my code and use my own implementation so I have full control of that. This is a serious undertaking, and not for everyone. So I'd recommend using the spring4d collections. 

Share this post


Link to post
50 minutes ago, David Heffernan said:

So I'd recommend using the spring4d collections. 

While I appreciate your recommendation this would not solve anything here - the performance difference is not from a poor getter implementation but from the fact that it's a call instead of directly indexing into the backing array.

Since spring4d collections are all interfaces they hardly are any faster wrt to calling the getter.

Share this post


Link to post
5 hours ago, angusj said:

I could write a wrapper Get function (using TList's public List property) and bypass the default TList.items property altogether.

Is this what you're suggesting, or would you suggest I do something else?

And are you also seeing other "low hanging fruit" that would improve performance?

I hadn't thought about using TLists.List so I would probably have tried avoiding TList altogether. Maybe TLists.List is good enough and it's certainly an easier fix that replacing TList. A new profile run would answer that.

 

I only spent 5 minutes looking at the profile result yesterday so I can't give much details, but one thing I noticed is that you're iterating a list sequentially in ProcessIntersectList. Since TList is basically just a wrapper around a contiguous array, a sequential iteration can be done by just incrementing a pointer inside the loop thus bypassing the getter altogether.

 

The profile results also hints that some things might be made slightly faster just by moving things around a bit. A better optimizer would have done that for us but alas we have to work with what we got.

I will try to find time to look at this more closely this week end.

Share this post


Link to post
7 minutes ago, Anders Melander said:

one thing I noticed is that you're iterating a list sequentially in ProcessIntersectList. Since TList is basically just a wrapper around a contiguous array, a sequential iteration can be done by just incrementing a pointer inside the loop thus bypassing the getter altogether.

Yes, of course, that'd be better still, and I've done that in lots of places too.

Perhaps I'm getting infected with C# and now and again I'm forgetting to use pointers 😜.

But I will indeed swap over to pointers in due course, but even with just bypassing TList.Get the performance gains I got really surprised me.

15 minutes ago, Anders Melander said:

I will try to find time to look at this more closely this week end.

Great. And thanks.

Share this post


Link to post
4 hours ago, David Heffernan said:

I simply don't use the RTL collections anywhere in my code and use my own implementation so I have full control of that.

Yes, sometimes the metaphorical "wheel" isn't very round and while handcrafting one (not so much reinventing it) is time consuming, it will often perform much better.

 

Edit: I've also avoided many of Delphi's newer language features because I'm targeting the library at a wider user base. (It's my impression that many Delphi die-hards are no longer spending their money upgrading, especially when any improvements are arguably marginal at best. I can't remember seeing any language features since Delphi 7 that are really game changing (apart from Unicode support, and 64bit compiling - so perhaps I'm exagerating slightly), but I expect the compiler has made at least incremental gains since then.) And looking at the TListEnumerator class just now, I'm pretty sure it'd also be less efficient than my UnsafeGet function above since it compares Index with Count for every increment.

Edited by angusj

Share this post


Link to post

That's one of the great things they've been doing in the JIT/compiler if you read through the blog article - it can detect when list access won't be out of its range so it can omit any superfluous range checks.

 

A few additional improvements to your source code:

 

function IntersectListSort(node1, node2: Pointer): Integer;
var
  pt1, pt2: ^TPoint64;
  i: Int64;
begin
  // note to self - can't return int64 values :)
  pt1 := @PIntersectNode(node1).pt;
  pt2 := @PIntersectNode(node2).pt;
  i := pt2.Y - pt1.Y;
  if (i = 0) then
  begin
    if (pt1 = pt2) then
    begin
      Result := 0;
      Exit;
    end;
    // Sort by X too. Not essential, but it significantly
    // speeds up the secondary sort in ProcessIntersectList .
    i := pt1.X - pt2.X;
  end;

  if i > 0 then Result := 1
  else if i < 0 then Result := -1
  else result := 0;
end;

This eliminates as many repeatedly indirections as possible - that reduces register pressure (especially on x86)

 

Next improvement is in TClipperBase.ProcessIntersectList which takes the majority of overall time in the benchmark - namely the inner loop that increments j:

 

  for i := 0 to highI do
  begin
    // make sure edges are adjacent, otherwise
    // change the intersection order before proceeding
    node := UnsafeGet(FIntersectList, i);
    if not EdgesAdjacentInAEL(node) then
    begin
      j := i;

      repeat
        inc(j);
      until EdgesAdjacentInAEL(UnsafeGet(FIntersectList, j));

      // now swap intersection order
      FIntersectList.List[i] := UnsafeGet(FIntersectList, j);
      FIntersectList.List[j] := node;
    end;

First we store the node at i because its the same after the loop - so no repeatly getting it necessary.

Since you incremented j by 1 anyway I chose to use a repeat loop because I know that it won't do a tail jump to check the condition first. Also it generates slightly better code with an inlined function as its condition. Which leads the the third improvement:

 

function EdgesAdjacentInAEL(node: PIntersectNode): Boolean;
  {$IFDEF INLINING} inline; {$ENDIF}
var
  active1, active2: PActive;
begin
  active1 := node.active1;
  active2 := node.active2;
  Result := (active1.nextInAEL = active2) or (active1.prevInAEL = active2);
end;

Same story as before: eliminate any repeatly indirections, fetch active1 and active2 only once instead of accessing both two times (that with you used there before does not change that)

 

Another big chunk of time can probably be shaved off by using a hand-coded quicksort that does not use the repeated call into IntersectListSort but avoids that call altogether.

Edited by Stefan Glienke
  • Thanks 1

Share this post


Link to post

What Stefan wrote 🙂

 

I had my focus on the exact same three areas and had made almost the same changes. When you look at the profiling of the generated asm it becomes pretty clear what the bottlenecks are - down to the individual statements.

In addition to Stefan's modifications I would try to see if it made any noticeable difference to test for (i = 0) last in IntersectListSort because presumably the two other conditions are met more often. It could eliminate one or two branches.

I would also consider if quick sort is the best algorithm to use. For example if the list is already "almost sorted" (for example because a sorted list was modified), insertions sort might be a better choice.

Share this post


Link to post

I've just added in Stefan's changes and, while it's certainly better code, it's made very little difference to performance.

(Unlike avoiding TList.Get which made a very big difference.)

Share this post


Link to post
2 minutes ago, Anders Melander said:

In addition to Stefan's modifications I would try to see if it made any noticeable difference to test for (i = 0) last in IntersectListSort because presumably the two other conditions are met more often. It could eliminate one or two branches.

First of all, using Int64 on x86 causes quite some impact already. You notice that simply compiling the benchmark as is on x64 makes it run faster - that is not because the x64 compiler is so amazing but simply because now all the Int64 operations are simple register operations.

Second - and that is something I have been asking for quite some time: let the compiler generate less conditional jump instructions in favor or conditional mov instruction - especially in comparers that can improve performance quite significantly. To a certain degree one can code that in pascal but that's usually not very pretty.

As for the sorting: I did a try using IntroSort which gave a slight improvement over the RTL Quicksort.

Share this post


Link to post
3 minutes ago, Stefan Glienke said:

First of all, using Int64 on x86 causes quite some impact already. You notice that simply compiling the benchmark as is on x64 makes it run faster - that is not because the x64 compiler is so amazing but simply because now all the Int64 operations are simple register operations.

With Delphi 10.3 the benchmark is slower when compiled for 64-bit compared to 32-bit. Are you seeing something else?

 

8 minutes ago, angusj said:

I've just added in Stefan's changes and, while it's certainly better code, it's made very little difference to performance.

They make a pretty big difference on my system when compiled for 32-bit. I'd say around 20% improvement. I could be wrong because I'm working on my day job at the same time.

 

I just tried using a pointer instead of the list getter inside the loop in ProcessIntersectList and that seems pay off:

image.thumb.png.ebc29bee284212e5c19a8f4037f6953b.png

 

procedure TClipperBase.ProcessIntersectList;
var
  ...
  node: PIntersectNode;
  node2ptr: ^PIntersectNode;
  ...
begin
  ...
  for i := 0 to highI do
  begin
    node := UnsafeGet(FIntersectList, i);
    if not EdgesAdjacentInAEL(node) then
    begin
      j := i;
      node2ptr := @FIntersectList.List[j];

      repeat
        inc(node2ptr);
      until EdgesAdjacentInAEL(node2ptr^);

      // now swap intersection order
      FIntersectList.List[i] := node2ptr^;
      node2ptr^ := node;
    end;
...

 

Share this post


Link to post

FWIW I see more improvement after my changes - but that can have various reasons.

 

Before optimization

Win32
Testing edge count: 1000 time: 145 msecs
Testing edge count: 2000 time: 689 msecs
Testing edge count: 3000 time: 2.585 msecs

Win64
Testing edge count: 1000 time: 128 msecs
Testing edge count: 2000 time: 573 msecs
Testing edge count: 3000 time: 2.087 msecs


Commit "Improved Delphi performance"

Win32
Testing edge count: 1000 time: 149 msecs
Testing edge count: 2000 time: 626 msecs
Testing edge count: 3000 time: 2.379 msecs

Win64
Testing edge count: 1000 time: 127 msecs
Testing edge count: 2000 time: 497 msecs
Testing edge count: 3000 time: 1.767 msecs


Further improvements

Win32
Testing edge count: 1000 time: 141 msecs
Testing edge count: 2000 time: 552 msecs
Testing edge count: 3000 time: 1.840 msecs

Win64
Testing edge count: 1000 time: 124 msecs
Testing edge count: 2000 time: 493 msecs
Testing edge count: 3000 time: 1.630 msecs

What we can clearly see from the results though is that your code has O(n²) - that is where you might want to invest some time into

Share this post


Link to post
17 minutes ago, Stefan Glienke said:

FWIW I see more improvement after my changes - but that can have various reasons.

I've only tested using 64bit compiles so far (intersecting up to 6000 edges), but I couldn't detect measurable differences between compiles (cleaned and release builds), and just the Chrome browser and Windows Explorer visibly running. Nevertheless, it certainly won't be slower so it's now ready to be committed and uploaded.

 

17 minutes ago, Stefan Glienke said:

What we can clearly see from the results though is that your code has O(n²) - that is where you might want to invest some time into

I thought it's closer to O(n³) but yeah, I'd love to fix that but I haven't been clever enough to figure out how.

Edited by angusj

Share this post


Link to post
23 minutes ago, Anders Melander said:

I just tried using a pointer instead of the list getter inside the loop in ProcessIntersectList and that seems pay off:

Sure did!

Share this post


Link to post

Pointer with increment vs direct array indexing should be very similar. In anything I think the latter tends to be a little quicker. 

Share this post


Link to post
4 minutes ago, David Heffernan said:

Pointer with increment vs direct array indexing should be very similar. In anything I think the latter tends to be a little quicker. 

Yes it should be but I think there might be too much register pressure here for the compiler to generate the code we want. Haven't tried it though. Also we need the pointer anyway after the iteration.

Left as an exercise to the reader :-)

Share this post


Link to post
2 hours ago, Anders Melander said:

there might be too much register pressure here

That can be the reason: array accessing needs (at least) two registers: array pointer and index, incrementing pointer only needs one

With how for-to loops are working we need 3, the array pointer, the incrementing index and one compiler generated one that counts down to 0 which is actually being used for the loop.

Doing a for i := 1 to count loop not actually using i but the shifting pointer we need 2 registers.

 

However, I assume the original code has another issue: too many indirections.

It does not access into the dynamic array but it first accesses the TList reference.

That means we have three indirections: first the field access, second the backing array access, regardless of using a getter or the List property, third indexing into the array. (you see the 3 mov eax, ... instructions following each other)

These indirections are causing a data dependency - modern CPUs can execute multiple instructions at the same time if they don't depend on each other - in this case they do so these instructions cannot execute in parallel leaving part of the CPU without anything to do.

 

That is the main difference you will see in the code below! If you would store the TPointerList (not directly as that type because then you have implicit finally block generated by the compiler because its a dynamic array, doh) then you would have probably similar runtime because in this code there are enough registers available.


Also, make sure to use NativeInteger for index variables whenever possible to avoid unnecessary move with sign-extention instructions on 64bit.

 

j being ^PIntersectNode the asm looks like this:

x86

Clipper.Engine.pas.3206: inc(j);
00BD954C 83C104           add ecx,$04
Clipper.Engine.pas.3205: repeat
00BD954F 8B01             mov eax,[ecx]
00BD9551 8B10             mov edx,[eax]
00BD9553 8B4004           mov eax,[eax+$04]
00BD9556 3B4244           cmp eax,[edx+$44]
00BD9559 7405             jz $00bd9560
00BD955B 3B4240           cmp eax,[edx+$40]
00BD955E 75EC             jnz $00bd954c

x64

Clipper.Engine.pas.3206: inc(j);
0000000000E9CDCB 4883C008         add rax,$08
Clipper.Engine.pas.3205: repeat
0000000000E9CDCF 488B10           mov rdx,[rax]
0000000000E9CDD2 4C8B02           mov r8,[rdx]
0000000000E9CDD5 488B5208         mov rdx,[rdx+$08]
0000000000E9CDD9 49395050         cmp [r8+$50],rdx
0000000000E9CDDD 7406             jz TClipperBase.ProcessIntersectList + $75
0000000000E9CDDF 49395048         cmp [r8+$48],rdx
0000000000E9CDE3 75E6             jnz TClipperBase.ProcessIntersectList + $5B

with indexing its this:

 

x86

Clipper.Engine.pas.3206: inc(j);
004C954B 42               inc edx
Clipper.Engine.pas.3205: repeat
004C954C 8B461C           mov eax,[esi+$1c]
004C954F 8B4004           mov eax,[eax+$04]
004C9552 8B0490           mov eax,[eax+edx*4]
004C9555 8B28             mov ebp,[eax]
004C9557 8B4004           mov eax,[eax+$04]
004C955A 3B4544           cmp eax,[ebp+$44]
004C955D 7405             jz $004c9564
004C955F 3B4540           cmp eax,[ebp+$40]
004C9562 75E7             jnz $004c954b

x64

Clipper.Engine.pas.3206: inc(j);
000000000031CDD2 4883C001         add rax,$01
Clipper.Engine.pas.3205: repeat
000000000031CDD6 488B5320         mov rdx,[rbx+$20]
000000000031CDDA 488B5208         mov rdx,[rdx+$08]
000000000031CDDE 488B14C2         mov rdx,[rdx+rax*8]
000000000031CDE2 4C8B02           mov r8,[rdx]
000000000031CDE5 488B5208         mov rdx,[rdx+$08]
000000000031CDE9 49395050         cmp [r8+$50],rdx
000000000031CDED 7406             jz TClipperBase.ProcessIntersectList + $85
000000000031CDEF 49395048         cmp [r8+$48],rdx
000000000031CDF3 75DD             jnz TClipperBase.ProcessIntersectList + $62

However, incrementing pointer is not always faster because when using indexing into an array the CPU sometimes can better pipeline those instructions.

 

Now we use a local variable of type ^Pointer (with pointermath on to be able to index into it) like this:

 

  list := Pointer(FIntersectList.List);
  for i := 0 to FIntersectList.Count-1 do
  begin
    // make sure edges are adjacent, otherwise
    // change the intersection order before proceeding
    if not EdgesAdjacentInAEL(list[i]) then
    begin
      j := i;

      repeat
        inc(j);
      until EdgesAdjacentInAEL(list[j]);

      // now swap intersection order
      node := list[i];
      list[i] := list[j];
      list[j] := node;
    end;

and we get this asm for the inner repeat loop (performance is basically equal to using the pointer) - there can easily be some variations as soon as one uses an additional local variable at the "wrong" spot because the register allocator of the Delphi compiler sucks:

Clipper.Engine.pas.3207: inc(j);
00E89554 42               inc edx
Clipper.Engine.pas.3206: repeat
00E89555 8B0496           mov eax,[esi+edx*4]
00E89558 8B08             mov ecx,[eax]
00E8955A 8B4004           mov eax,[eax+$04]
00E8955D 3B4144           cmp eax,[ecx+$44]
00E89560 7405             jz $00e89567
00E89562 3B4140           cmp eax,[ecx+$40]
00E89565 75ED             jnz $00e89554

I cannot find it right now to verify but I think I've read somewhere that as previously noted this way of instructions might perform better (modifying index and addressing into array over shifting pointer) because the CPU is able to fuse these instructions

 

FWIW it's kinda interesting to see what some C++ compilers emit for some loop: https://godbolt.org/z/oq9Gzexa3

Edited by Stefan Glienke

Share this post


Link to post

Thanks again Stephan & Anders.

Unfortunately, the POINTERMATH directive was only introduced with D2009, and I'm still targetting the library back to D7.

 

So this is where I'm currently at ...


 

  nodeQ: PIntersectNode;
  nodeI, nodeJ: ^PIntersectNode;
  op1, op2: POutpt;
begin
  FIntersectList.Sort(IntersectListSort);
  for i := 0 to FIntersectList.Count - 1 do
  begin
    nodeI := @FIntersectList.List[i];
    if not EdgesAdjacentInAEL(nodeI^) then
    begin
      nodeJ := nodeI;
      repeat
        inc(nodeJ);
      until EdgesAdjacentInAEL(nodeJ^);

      // now swap intersection order
      nodeQ := nodeI^;
      nodeI^ := nodeJ^;
      nodeJ^ := nodeQ;
    end;

 

Share this post


Link to post
11 minutes ago, angusj said:

I'm still targetting the library back to D7

Why?

I get it that there are people who are stuck on Delphi 7 and refuse to upgrade but do those people really need your library?

 

14 minutes ago, angusj said:

So this is where I'm currently at ...

You can move the assignment of nodeI outside the loop and increment it inside the loop. I doubt that it makes any difference in performance but at least it will not trigger my OCD.

 

I'd be interested in seeing how the performance compares against C++ and C# at this point. Can you post the updated numbers, please?

Share this post


Link to post
1 hour ago, Anders Melander said:

I get it that there are people who are stuck on Delphi 7 and refuse to upgrade but do those people really need your library?

Well surprisingly they do.

And honestly I haven't encountered any newer language features (eg simpler code, better performance) that compel me to ditch support for these very old compilers.

 

1 hour ago, Anders Melander said:

I'd be interested in seeing how the performance compares against C++ and C# at this point. Can you post the updated numbers, please?

 

Performance2.pdf

Edited by angusj

Share this post


Link to post
34 minutes ago, angusj said:

And honestly I haven't encountered any newer language features (eg simpler code, better performance) that compel me to ditch support for these very old compilers.

Um... Unicode, 64 bit, generics, anonymous methods, inlining, record methods....?

 

39 minutes ago, angusj said:

Performance2.pdf

Thanks. Halfway there 🙂

Unfortunately I don't think it's a realistic goal to beat C++ without rewriting the hotspots in assembler.

Share this post


Link to post
6 minutes ago, Anders Melander said:

Um... Unicode, 64 bit, generics, anonymous methods, inlining, record methods....?

I was referring to the Clipper2 library. And Inlining is there for compilers that support it.

Can these other things significantly improve the library's performance?

Share this post


Link to post
On 9/3/2022 at 1:13 AM, angusj said:

Unfortunately, the POINTERMATH directive was only introduced with D2009, and I'm still targetting the library back to D7.

Directive is just a syntax sugar.

 

THugeArray = array[0..999999] of X;

PHugeArray = ^THugeArray;

 

PHugeArray(list)^[N] :=...

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×