angusj 126 Posted September 2, 2022 (edited) Anders said in another thread ... 8 hours ago, Anders Melander said: O.T. but have you tried profiling your Delphi code? I just did and there appears to be quite a few low hanging fruits that could improve the performance. Hi Anders. Thanks for the feedback, much appreciated. Yes, I have profiled my Clipper library's Delphi code (and not infrequently), but until now I haven't profiled into Delphi's Runtime Library. And I see your point re TList.Get, if you're alluding to its relatively expensive range checking? However, I can't see an easy hack (eg by overriding Get, since it isn't virtual). But I could write a wrapper Get function (using TList's public List property) and bypass the default TList.items property altogether. Is this what you're suggesting, or would you suggest I do something else? And are you also seeing other "low hanging fruit" that would improve performance? Edit: I've replaced all implicit calls to TList.Get with the following function and as a consequence have seen quite a substantial improvement in performance (and Delphi now slightly out-performs C#). function UnsafeGet(List: TList; Index: Integer): Pointer; inline; begin // caution: no bounds checking Result := List.List[Index]; end; Edited September 2, 2022 by angusj 1 Share this post Link to post
David Heffernan 2345 Posted September 2, 2022 I simply don't use the RTL collections anywhere in my code and use my own implementation so I have full control of that. This is a serious undertaking, and not for everyone. So I'd recommend using the spring4d collections. Share this post Link to post
Stefan Glienke 2002 Posted September 2, 2022 50 minutes ago, David Heffernan said: So I'd recommend using the spring4d collections. While I appreciate your recommendation this would not solve anything here - the performance difference is not from a poor getter implementation but from the fact that it's a call instead of directly indexing into the backing array. Since spring4d collections are all interfaces they hardly are any faster wrt to calling the getter. Share this post Link to post
Anders Melander 1783 Posted September 2, 2022 5 hours ago, angusj said: I could write a wrapper Get function (using TList's public List property) and bypass the default TList.items property altogether. Is this what you're suggesting, or would you suggest I do something else? And are you also seeing other "low hanging fruit" that would improve performance? I hadn't thought about using TLists.List so I would probably have tried avoiding TList altogether. Maybe TLists.List is good enough and it's certainly an easier fix that replacing TList. A new profile run would answer that. I only spent 5 minutes looking at the profile result yesterday so I can't give much details, but one thing I noticed is that you're iterating a list sequentially in ProcessIntersectList. Since TList is basically just a wrapper around a contiguous array, a sequential iteration can be done by just incrementing a pointer inside the loop thus bypassing the getter altogether. The profile results also hints that some things might be made slightly faster just by moving things around a bit. A better optimizer would have done that for us but alas we have to work with what we got. I will try to find time to look at this more closely this week end. Share this post Link to post
angusj 126 Posted September 2, 2022 7 minutes ago, Anders Melander said: one thing I noticed is that you're iterating a list sequentially in ProcessIntersectList. Since TList is basically just a wrapper around a contiguous array, a sequential iteration can be done by just incrementing a pointer inside the loop thus bypassing the getter altogether. Yes, of course, that'd be better still, and I've done that in lots of places too. Perhaps I'm getting infected with C# and now and again I'm forgetting to use pointers 😜. But I will indeed swap over to pointers in due course, but even with just bypassing TList.Get the performance gains I got really surprised me. 15 minutes ago, Anders Melander said: I will try to find time to look at this more closely this week end. Great. And thanks. Share this post Link to post
angusj 126 Posted September 2, 2022 (edited) 4 hours ago, David Heffernan said: I simply don't use the RTL collections anywhere in my code and use my own implementation so I have full control of that. Yes, sometimes the metaphorical "wheel" isn't very round and while handcrafting one (not so much reinventing it) is time consuming, it will often perform much better. Edit: I've also avoided many of Delphi's newer language features because I'm targeting the library at a wider user base. (It's my impression that many Delphi die-hards are no longer spending their money upgrading, especially when any improvements are arguably marginal at best. I can't remember seeing any language features since Delphi 7 that are really game changing (apart from Unicode support, and 64bit compiling - so perhaps I'm exagerating slightly), but I expect the compiler has made at least incremental gains since then.) And looking at the TListEnumerator class just now, I'm pretty sure it'd also be less efficient than my UnsafeGet function above since it compares Index with Count for every increment. Edited September 2, 2022 by angusj Share this post Link to post
Stefan Glienke 2002 Posted September 2, 2022 (edited) That's one of the great things they've been doing in the JIT/compiler if you read through the blog article - it can detect when list access won't be out of its range so it can omit any superfluous range checks. A few additional improvements to your source code: function IntersectListSort(node1, node2: Pointer): Integer; var pt1, pt2: ^TPoint64; i: Int64; begin // note to self - can't return int64 values :) pt1 := @PIntersectNode(node1).pt; pt2 := @PIntersectNode(node2).pt; i := pt2.Y - pt1.Y; if (i = 0) then begin if (pt1 = pt2) then begin Result := 0; Exit; end; // Sort by X too. Not essential, but it significantly // speeds up the secondary sort in ProcessIntersectList . i := pt1.X - pt2.X; end; if i > 0 then Result := 1 else if i < 0 then Result := -1 else result := 0; end; This eliminates as many repeatedly indirections as possible - that reduces register pressure (especially on x86) Next improvement is in TClipperBase.ProcessIntersectList which takes the majority of overall time in the benchmark - namely the inner loop that increments j: for i := 0 to highI do begin // make sure edges are adjacent, otherwise // change the intersection order before proceeding node := UnsafeGet(FIntersectList, i); if not EdgesAdjacentInAEL(node) then begin j := i; repeat inc(j); until EdgesAdjacentInAEL(UnsafeGet(FIntersectList, j)); // now swap intersection order FIntersectList.List[i] := UnsafeGet(FIntersectList, j); FIntersectList.List[j] := node; end; First we store the node at i because its the same after the loop - so no repeatly getting it necessary. Since you incremented j by 1 anyway I chose to use a repeat loop because I know that it won't do a tail jump to check the condition first. Also it generates slightly better code with an inlined function as its condition. Which leads the the third improvement: function EdgesAdjacentInAEL(node: PIntersectNode): Boolean; {$IFDEF INLINING} inline; {$ENDIF} var active1, active2: PActive; begin active1 := node.active1; active2 := node.active2; Result := (active1.nextInAEL = active2) or (active1.prevInAEL = active2); end; Same story as before: eliminate any repeatly indirections, fetch active1 and active2 only once instead of accessing both two times (that with you used there before does not change that) Another big chunk of time can probably be shaved off by using a hand-coded quicksort that does not use the repeated call into IntersectListSort but avoids that call altogether. Edited September 2, 2022 by Stefan Glienke 1 Share this post Link to post
Anders Melander 1783 Posted September 2, 2022 What Stefan wrote 🙂 I had my focus on the exact same three areas and had made almost the same changes. When you look at the profiling of the generated asm it becomes pretty clear what the bottlenecks are - down to the individual statements. In addition to Stefan's modifications I would try to see if it made any noticeable difference to test for (i = 0) last in IntersectListSort because presumably the two other conditions are met more often. It could eliminate one or two branches. I would also consider if quick sort is the best algorithm to use. For example if the list is already "almost sorted" (for example because a sorted list was modified), insertions sort might be a better choice. Share this post Link to post
angusj 126 Posted September 2, 2022 I've just added in Stefan's changes and, while it's certainly better code, it's made very little difference to performance. (Unlike avoiding TList.Get which made a very big difference.) Share this post Link to post
Stefan Glienke 2002 Posted September 2, 2022 2 minutes ago, Anders Melander said: In addition to Stefan's modifications I would try to see if it made any noticeable difference to test for (i = 0) last in IntersectListSort because presumably the two other conditions are met more often. It could eliminate one or two branches. First of all, using Int64 on x86 causes quite some impact already. You notice that simply compiling the benchmark as is on x64 makes it run faster - that is not because the x64 compiler is so amazing but simply because now all the Int64 operations are simple register operations. Second - and that is something I have been asking for quite some time: let the compiler generate less conditional jump instructions in favor or conditional mov instruction - especially in comparers that can improve performance quite significantly. To a certain degree one can code that in pascal but that's usually not very pretty. As for the sorting: I did a try using IntroSort which gave a slight improvement over the RTL Quicksort. Share this post Link to post
Anders Melander 1783 Posted September 2, 2022 3 minutes ago, Stefan Glienke said: First of all, using Int64 on x86 causes quite some impact already. You notice that simply compiling the benchmark as is on x64 makes it run faster - that is not because the x64 compiler is so amazing but simply because now all the Int64 operations are simple register operations. With Delphi 10.3 the benchmark is slower when compiled for 64-bit compared to 32-bit. Are you seeing something else? 8 minutes ago, angusj said: I've just added in Stefan's changes and, while it's certainly better code, it's made very little difference to performance. They make a pretty big difference on my system when compiled for 32-bit. I'd say around 20% improvement. I could be wrong because I'm working on my day job at the same time. I just tried using a pointer instead of the list getter inside the loop in ProcessIntersectList and that seems pay off: procedure TClipperBase.ProcessIntersectList; var ... node: PIntersectNode; node2ptr: ^PIntersectNode; ... begin ... for i := 0 to highI do begin node := UnsafeGet(FIntersectList, i); if not EdgesAdjacentInAEL(node) then begin j := i; node2ptr := @FIntersectList.List[j]; repeat inc(node2ptr); until EdgesAdjacentInAEL(node2ptr^); // now swap intersection order FIntersectList.List[i] := node2ptr^; node2ptr^ := node; end; ... Share this post Link to post
Stefan Glienke 2002 Posted September 2, 2022 FWIW I see more improvement after my changes - but that can have various reasons. Before optimization Win32 Testing edge count: 1000 time: 145 msecs Testing edge count: 2000 time: 689 msecs Testing edge count: 3000 time: 2.585 msecs Win64 Testing edge count: 1000 time: 128 msecs Testing edge count: 2000 time: 573 msecs Testing edge count: 3000 time: 2.087 msecs Commit "Improved Delphi performance" Win32 Testing edge count: 1000 time: 149 msecs Testing edge count: 2000 time: 626 msecs Testing edge count: 3000 time: 2.379 msecs Win64 Testing edge count: 1000 time: 127 msecs Testing edge count: 2000 time: 497 msecs Testing edge count: 3000 time: 1.767 msecs Further improvements Win32 Testing edge count: 1000 time: 141 msecs Testing edge count: 2000 time: 552 msecs Testing edge count: 3000 time: 1.840 msecs Win64 Testing edge count: 1000 time: 124 msecs Testing edge count: 2000 time: 493 msecs Testing edge count: 3000 time: 1.630 msecs What we can clearly see from the results though is that your code has O(n²) - that is where you might want to invest some time into Share this post Link to post
angusj 126 Posted September 2, 2022 (edited) 17 minutes ago, Stefan Glienke said: FWIW I see more improvement after my changes - but that can have various reasons. I've only tested using 64bit compiles so far (intersecting up to 6000 edges), but I couldn't detect measurable differences between compiles (cleaned and release builds), and just the Chrome browser and Windows Explorer visibly running. Nevertheless, it certainly won't be slower so it's now ready to be committed and uploaded. 17 minutes ago, Stefan Glienke said: What we can clearly see from the results though is that your code has O(n²) - that is where you might want to invest some time into I thought it's closer to O(n³) but yeah, I'd love to fix that but I haven't been clever enough to figure out how. Edited September 2, 2022 by angusj Share this post Link to post
angusj 126 Posted September 2, 2022 23 minutes ago, Anders Melander said: I just tried using a pointer instead of the list getter inside the loop in ProcessIntersectList and that seems pay off: Sure did! Share this post Link to post
David Heffernan 2345 Posted September 2, 2022 Pointer with increment vs direct array indexing should be very similar. In anything I think the latter tends to be a little quicker. Share this post Link to post
Anders Melander 1783 Posted September 2, 2022 4 minutes ago, David Heffernan said: Pointer with increment vs direct array indexing should be very similar. In anything I think the latter tends to be a little quicker. Yes it should be but I think there might be too much register pressure here for the compiler to generate the code we want. Haven't tried it though. Also we need the pointer anyway after the iteration. Left as an exercise to the reader :-) Share this post Link to post
Stefan Glienke 2002 Posted September 2, 2022 (edited) 2 hours ago, Anders Melander said: there might be too much register pressure here That can be the reason: array accessing needs (at least) two registers: array pointer and index, incrementing pointer only needs one With how for-to loops are working we need 3, the array pointer, the incrementing index and one compiler generated one that counts down to 0 which is actually being used for the loop. Doing a for i := 1 to count loop not actually using i but the shifting pointer we need 2 registers. However, I assume the original code has another issue: too many indirections. It does not access into the dynamic array but it first accesses the TList reference. That means we have three indirections: first the field access, second the backing array access, regardless of using a getter or the List property, third indexing into the array. (you see the 3 mov eax, ... instructions following each other) These indirections are causing a data dependency - modern CPUs can execute multiple instructions at the same time if they don't depend on each other - in this case they do so these instructions cannot execute in parallel leaving part of the CPU without anything to do. That is the main difference you will see in the code below! If you would store the TPointerList (not directly as that type because then you have implicit finally block generated by the compiler because its a dynamic array, doh) then you would have probably similar runtime because in this code there are enough registers available. Also, make sure to use NativeInteger for index variables whenever possible to avoid unnecessary move with sign-extention instructions on 64bit. j being ^PIntersectNode the asm looks like this: x86 Clipper.Engine.pas.3206: inc(j); 00BD954C 83C104 add ecx,$04 Clipper.Engine.pas.3205: repeat 00BD954F 8B01 mov eax,[ecx] 00BD9551 8B10 mov edx,[eax] 00BD9553 8B4004 mov eax,[eax+$04] 00BD9556 3B4244 cmp eax,[edx+$44] 00BD9559 7405 jz $00bd9560 00BD955B 3B4240 cmp eax,[edx+$40] 00BD955E 75EC jnz $00bd954c x64 Clipper.Engine.pas.3206: inc(j); 0000000000E9CDCB 4883C008 add rax,$08 Clipper.Engine.pas.3205: repeat 0000000000E9CDCF 488B10 mov rdx,[rax] 0000000000E9CDD2 4C8B02 mov r8,[rdx] 0000000000E9CDD5 488B5208 mov rdx,[rdx+$08] 0000000000E9CDD9 49395050 cmp [r8+$50],rdx 0000000000E9CDDD 7406 jz TClipperBase.ProcessIntersectList + $75 0000000000E9CDDF 49395048 cmp [r8+$48],rdx 0000000000E9CDE3 75E6 jnz TClipperBase.ProcessIntersectList + $5B with indexing its this: x86 Clipper.Engine.pas.3206: inc(j); 004C954B 42 inc edx Clipper.Engine.pas.3205: repeat 004C954C 8B461C mov eax,[esi+$1c] 004C954F 8B4004 mov eax,[eax+$04] 004C9552 8B0490 mov eax,[eax+edx*4] 004C9555 8B28 mov ebp,[eax] 004C9557 8B4004 mov eax,[eax+$04] 004C955A 3B4544 cmp eax,[ebp+$44] 004C955D 7405 jz $004c9564 004C955F 3B4540 cmp eax,[ebp+$40] 004C9562 75E7 jnz $004c954b x64 Clipper.Engine.pas.3206: inc(j); 000000000031CDD2 4883C001 add rax,$01 Clipper.Engine.pas.3205: repeat 000000000031CDD6 488B5320 mov rdx,[rbx+$20] 000000000031CDDA 488B5208 mov rdx,[rdx+$08] 000000000031CDDE 488B14C2 mov rdx,[rdx+rax*8] 000000000031CDE2 4C8B02 mov r8,[rdx] 000000000031CDE5 488B5208 mov rdx,[rdx+$08] 000000000031CDE9 49395050 cmp [r8+$50],rdx 000000000031CDED 7406 jz TClipperBase.ProcessIntersectList + $85 000000000031CDEF 49395048 cmp [r8+$48],rdx 000000000031CDF3 75DD jnz TClipperBase.ProcessIntersectList + $62 However, incrementing pointer is not always faster because when using indexing into an array the CPU sometimes can better pipeline those instructions. Now we use a local variable of type ^Pointer (with pointermath on to be able to index into it) like this: list := Pointer(FIntersectList.List); for i := 0 to FIntersectList.Count-1 do begin // make sure edges are adjacent, otherwise // change the intersection order before proceeding if not EdgesAdjacentInAEL(list[i]) then begin j := i; repeat inc(j); until EdgesAdjacentInAEL(list[j]); // now swap intersection order node := list[i]; list[i] := list[j]; list[j] := node; end; and we get this asm for the inner repeat loop (performance is basically equal to using the pointer) - there can easily be some variations as soon as one uses an additional local variable at the "wrong" spot because the register allocator of the Delphi compiler sucks: Clipper.Engine.pas.3207: inc(j); 00E89554 42 inc edx Clipper.Engine.pas.3206: repeat 00E89555 8B0496 mov eax,[esi+edx*4] 00E89558 8B08 mov ecx,[eax] 00E8955A 8B4004 mov eax,[eax+$04] 00E8955D 3B4144 cmp eax,[ecx+$44] 00E89560 7405 jz $00e89567 00E89562 3B4140 cmp eax,[ecx+$40] 00E89565 75ED jnz $00e89554 I cannot find it right now to verify but I think I've read somewhere that as previously noted this way of instructions might perform better (modifying index and addressing into array over shifting pointer) because the CPU is able to fuse these instructions FWIW it's kinda interesting to see what some C++ compilers emit for some loop: https://godbolt.org/z/oq9Gzexa3 Edited September 2, 2022 by Stefan Glienke Share this post Link to post
angusj 126 Posted September 2, 2022 Thanks again Stephan & Anders. Unfortunately, the POINTERMATH directive was only introduced with D2009, and I'm still targetting the library back to D7. So this is where I'm currently at ... nodeQ: PIntersectNode; nodeI, nodeJ: ^PIntersectNode; op1, op2: POutpt; begin FIntersectList.Sort(IntersectListSort); for i := 0 to FIntersectList.Count - 1 do begin nodeI := @FIntersectList.List[i]; if not EdgesAdjacentInAEL(nodeI^) then begin nodeJ := nodeI; repeat inc(nodeJ); until EdgesAdjacentInAEL(nodeJ^); // now swap intersection order nodeQ := nodeI^; nodeI^ := nodeJ^; nodeJ^ := nodeQ; end; Share this post Link to post
Anders Melander 1783 Posted September 2, 2022 11 minutes ago, angusj said: I'm still targetting the library back to D7 Why? I get it that there are people who are stuck on Delphi 7 and refuse to upgrade but do those people really need your library? 14 minutes ago, angusj said: So this is where I'm currently at ... You can move the assignment of nodeI outside the loop and increment it inside the loop. I doubt that it makes any difference in performance but at least it will not trigger my OCD. I'd be interested in seeing how the performance compares against C++ and C# at this point. Can you post the updated numbers, please? Share this post Link to post
angusj 126 Posted September 2, 2022 (edited) 1 hour ago, Anders Melander said: I get it that there are people who are stuck on Delphi 7 and refuse to upgrade but do those people really need your library? Well surprisingly they do. And honestly I haven't encountered any newer language features (eg simpler code, better performance) that compel me to ditch support for these very old compilers. 1 hour ago, Anders Melander said: I'd be interested in seeing how the performance compares against C++ and C# at this point. Can you post the updated numbers, please? Performance2.pdf Edited September 2, 2022 by angusj Share this post Link to post
Anders Melander 1783 Posted September 3, 2022 34 minutes ago, angusj said: And honestly I haven't encountered any newer language features (eg simpler code, better performance) that compel me to ditch support for these very old compilers. Um... Unicode, 64 bit, generics, anonymous methods, inlining, record methods....? 39 minutes ago, angusj said: Performance2.pdf Thanks. Halfway there 🙂 Unfortunately I don't think it's a realistic goal to beat C++ without rewriting the hotspots in assembler. Share this post Link to post
angusj 126 Posted September 3, 2022 6 minutes ago, Anders Melander said: Um... Unicode, 64 bit, generics, anonymous methods, inlining, record methods....? I was referring to the Clipper2 library. And Inlining is there for compilers that support it. Can these other things significantly improve the library's performance? Share this post Link to post
Fr0sT.Brutal 900 Posted September 19, 2022 On 9/3/2022 at 1:13 AM, angusj said: Unfortunately, the POINTERMATH directive was only introduced with D2009, and I'm still targetting the library back to D7. Directive is just a syntax sugar. THugeArray = array[0..999999] of X; PHugeArray = ^THugeArray; PHugeArray(list)^[N] :=... Share this post Link to post