Everything posted by Stefan Glienke
-
As a reaction to one of his answers during the Q&A I wrote a blog post. Having said that, and personally loving generics for various use cases (as shown in the blog post), there are also things that are solved suboptimally - which I also wrote about. Also, if you have used generics in C# then you will likely miss co- and contravariance - oh look, I also wrote about that. If you are going really fancy with generics and code that uses RTTI, you have to be aware of some peculiarities - guess what: wrote about it. Now, because with generics you basically get the lowest common denominator and we lack ways to specify traits of the supported types via constraints, there are several things that you cannot do or have to fall back to indirections for: the most common example is using comparer interfaces for generic sorting algorithms or hashtables. That is mostly where a naively implemented generic algorithm might be slower than handcrafted code for a specific type, unless you heavily optimize for various cases (as I have done in Spring).
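To illustrate the comparer indirection - this is a made-up minimal sketch, not the Spring4D or RTL code; type and method names are purely illustrative:

uses
  System.Generics.Defaults;

type
  TNaiveSort = class
    // insertion sort just to keep the example short - the point is that every
    // comparison has to go through the IComparer<T> interface because the
    // constraint system cannot express "T has a < operator"
    class procedure Sort<T>(var Values: array of T; const Comparer: IComparer<T>); static;
  end;

class procedure TNaiveSort.Sort<T>(var Values: array of T; const Comparer: IComparer<T>);
var
  i, j: Integer;
  Temp: T;
begin
  for i := 1 to High(Values) do
  begin
    Temp := Values[i];
    j := i - 1;
    // a virtual call per comparison - this indirection is exactly what a
    // handcrafted Integer or string sort simply would not have
    while (j >= 0) and (Comparer.Compare(Values[j], Temp) > 0) do
    begin
      Values[j + 1] := Values[j];
      Dec(j);
    end;
    Values[j + 1] := Temp;
  end;
end;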
-
Sure, gonna work on that silver tag badge
-
In a single-threaded application, FastMM5 will not give you any noticeable improvements. It was designed to overcome the issues of V4 under heavy multithreading.
-
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
It still surprises me that people are surprised by how much performance improves under heavy multithreading when not using the default MM. AFAIK mORMot does not use the default MM anyway. -
I can tell you from experience that traveling by airplane has significant overhead if you live approx. 100 km from an airport that has flights to your destination. You missed the point - when David mentioned that nobody should be using Extended, he was most likely making a well-meant suggestion and did not include the "I started using Extended decades ago and don't wanna change it" case.
-
Carriages were also once the fastest way to travel - yet nobody today complains that they can't take one onto the highway. On topic: if the 64bit Delphi compiler(s) (and significant parts of the RTL) were not even worse than the 32bit ones, and the debugging experience were not an absolute nightmare, I would switch instantly - it's proven that even though a 64bit application might use a bit more memory because the pointers are double the size, due to architectural differences it can simply perform better. Simply having more registers is already a huge gain.
-
Too bad it's the implicit default type in many places - be it float literals or parameter types.
-
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
Tests that run for approx one second are really the way to go when deciding on the proper memory manager. -
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
Been using FastMM5 in production for over 2 years now and never looked back. I don't know of any reliability or fragmentation issues. -
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
That's another reason why precompiled binaries are bad - if I had to guess I would say they are compiled for CPUs that support AVX which Nehalem did not have. -
That can be the reason: array accessing needs (at least) two registers - the array pointer and the index - while incrementing a pointer only needs one. With how for-to loops work we need 3: the array pointer, the incrementing index, and one compiler-generated counter that counts down to 0, which is what is actually used for the loop. Doing a for i := 1 to count loop that does not actually use i but the shifting pointer, we need 2 registers. However, I assume the original code has another issue: too many indirections. It does not access into the dynamic array directly but first accesses the TList reference. That means we have three indirections: first the field access, second the backing array access (regardless of using a getter or the List property), third the indexing into the array (you see the 3 mov eax, ... instructions following each other). These indirections cause a data dependency - modern CPUs can execute multiple instructions at the same time if they don't depend on each other - but in this case they do, so these instructions cannot execute in parallel, leaving part of the CPU with nothing to do. That is the main difference you will see in the code below! If you would store the TPointerList (not directly as that type, because then you get an implicit try/finally block generated by the compiler since it's a dynamic array, doh) then you would probably have a similar runtime, because in this code there are enough registers available. Also, make sure to use NativeInt for index variables whenever possible to avoid unnecessary move-with-sign-extension instructions on 64bit.

With j being ^PIntersectNode the asm looks like this:

x86

Clipper.Engine.pas.3206: inc(j);
00BD954C 83C104 add ecx,$04
Clipper.Engine.pas.3205: repeat
00BD954F 8B01 mov eax,[ecx]
00BD9551 8B10 mov edx,[eax]
00BD9553 8B4004 mov eax,[eax+$04]
00BD9556 3B4244 cmp eax,[edx+$44]
00BD9559 7405 jz $00bd9560
00BD955B 3B4240 cmp eax,[edx+$40]
00BD955E 75EC jnz $00bd954c

x64

Clipper.Engine.pas.3206: inc(j);
0000000000E9CDCB 4883C008 add rax,$08
Clipper.Engine.pas.3205: repeat
0000000000E9CDCF 488B10 mov rdx,[rax]
0000000000E9CDD2 4C8B02 mov r8,[rdx]
0000000000E9CDD5 488B5208 mov rdx,[rdx+$08]
0000000000E9CDD9 49395050 cmp [r8+$50],rdx
0000000000E9CDDD 7406 jz TClipperBase.ProcessIntersectList + $75
0000000000E9CDDF 49395048 cmp [r8+$48],rdx
0000000000E9CDE3 75E6 jnz TClipperBase.ProcessIntersectList + $5B

With indexing it is this:

x86

Clipper.Engine.pas.3206: inc(j);
004C954B 42 inc edx
Clipper.Engine.pas.3205: repeat
004C954C 8B461C mov eax,[esi+$1c]
004C954F 8B4004 mov eax,[eax+$04]
004C9552 8B0490 mov eax,[eax+edx*4]
004C9555 8B28 mov ebp,[eax]
004C9557 8B4004 mov eax,[eax+$04]
004C955A 3B4544 cmp eax,[ebp+$44]
004C955D 7405 jz $004c9564
004C955F 3B4540 cmp eax,[ebp+$40]
004C9562 75E7 jnz $004c954b

x64

Clipper.Engine.pas.3206: inc(j);
000000000031CDD2 4883C001 add rax,$01
Clipper.Engine.pas.3205: repeat
000000000031CDD6 488B5320 mov rdx,[rbx+$20]
000000000031CDDA 488B5208 mov rdx,[rdx+$08]
000000000031CDDE 488B14C2 mov rdx,[rdx+rax*8]
000000000031CDE2 4C8B02 mov r8,[rdx]
000000000031CDE5 488B5208 mov rdx,[rdx+$08]
000000000031CDE9 49395050 cmp [r8+$50],rdx
000000000031CDED 7406 jz TClipperBase.ProcessIntersectList + $85
000000000031CDEF 49395048 cmp [r8+$48],rdx
000000000031CDF3 75DD jnz TClipperBase.ProcessIntersectList + $62

However, incrementing a pointer is not always faster, because when indexing into an array the CPU can sometimes pipeline those instructions better.
Now we use a local variable of type ^Pointer (with POINTERMATH on to be able to index into it) like this:

list := Pointer(FIntersectList.List);
for i := 0 to FIntersectList.Count-1 do
begin
  // make sure edges are adjacent, otherwise
  // change the intersection order before proceeding
  if not EdgesAdjacentInAEL(list[i]) then
  begin
    j := i;
    repeat
      inc(j);
    until EdgesAdjacentInAEL(list[j]);

    // now swap intersection order
    node := list[i];
    list[i] := list[j];
    list[j] := node;
  end;
end;

and we get this asm for the inner repeat loop (performance is basically equal to using the pointer) - there can easily be some variation as soon as one uses an additional local variable at the "wrong" spot, because the register allocator of the Delphi compiler sucks:

Clipper.Engine.pas.3207: inc(j);
00E89554 42 inc edx
Clipper.Engine.pas.3206: repeat
00E89555 8B0496 mov eax,[esi+edx*4]
00E89558 8B08 mov ecx,[eax]
00E8955A 8B4004 mov eax,[eax+$04]
00E8955D 3B4144 cmp eax,[ecx+$44]
00E89560 7405 jz $00e89567
00E89562 3B4140 cmp eax,[ecx+$40]
00E89565 75ED jnz $00e89554

I cannot find it right now to verify, but I think I've read somewhere that, as previously noted, this form of instructions (modifying an index and addressing into the array rather than shifting a pointer) might perform better because the CPU is able to fuse these instructions. FWIW it's kinda interesting to see what some C++ compilers emit for such a loop: https://godbolt.org/z/oq9Gzexa3
-
FWIW I see more improvement after my changes - but that can have various reasons.

Before optimization
Win32
Testing edge count: 1000 time: 145 msecs
Testing edge count: 2000 time: 689 msecs
Testing edge count: 3000 time: 2.585 msecs
Win64
Testing edge count: 1000 time: 128 msecs
Testing edge count: 2000 time: 573 msecs
Testing edge count: 3000 time: 2.087 msecs

Commit "Improved Delphi performance"
Win32
Testing edge count: 1000 time: 149 msecs
Testing edge count: 2000 time: 626 msecs
Testing edge count: 3000 time: 2.379 msecs
Win64
Testing edge count: 1000 time: 127 msecs
Testing edge count: 2000 time: 497 msecs
Testing edge count: 3000 time: 1.767 msecs

Further improvements
Win32
Testing edge count: 1000 time: 141 msecs
Testing edge count: 2000 time: 552 msecs
Testing edge count: 3000 time: 1.840 msecs
Win64
Testing edge count: 1000 time: 124 msecs
Testing edge count: 2000 time: 493 msecs
Testing edge count: 3000 time: 1.630 msecs

What we can clearly see from the results, though, is that your code is O(n²) - that is where you might want to invest some time.
-
First of all, using Int64 on x86 causes quite an impact already. You notice that simply compiling the benchmark as-is on x64 makes it run faster - that is not because the x64 compiler is so amazing but simply because all the Int64 operations are now simple register operations. Second - and that is something I have been asking for for quite some time: let the compiler generate fewer conditional jump instructions in favor of conditional mov instructions - especially in comparers that can improve performance quite significantly. To a certain degree one can code that in Pascal, but it's usually not very pretty. As for the sorting: I tried IntroSort, which gave a slight improvement over the RTL QuickSort.
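For the record, the (not very pretty) Pascal pattern I mean looks roughly like this - whether the compiler actually turns it into setcc/cmov instead of branches depends on the compiler version and target, so treat it as a sketch:

function CompareInteger(const Left, Right: Integer): Integer;
begin
  // Ord(Boolean) yields 0 or 1, so the result is computed arithmetically
  // instead of via two conditional jumps - if the compiler plays along
  Result := Ord(Left > Right) - Ord(Left < Right);
end;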
-
That's one of the great things they've been doing in the JIT/compiler if you read through the blog article - it can detect when a list access won't be out of range, so it can omit any superfluous range checks.

A few additional improvements to your source code:

function IntersectListSort(node1, node2: Pointer): Integer;
var
  pt1, pt2: ^TPoint64;
  i: Int64;
begin
  // note to self - can't return int64 values :)
  pt1 := @PIntersectNode(node1).pt;
  pt2 := @PIntersectNode(node2).pt;
  i := pt2.Y - pt1.Y;
  if (i = 0) then
  begin
    if (pt1 = pt2) then
    begin
      Result := 0;
      Exit;
    end;
    // Sort by X too. Not essential, but it significantly
    // speeds up the secondary sort in ProcessIntersectList.
    i := pt1.X - pt2.X;
  end;
  if i > 0 then
    Result := 1
  else if i < 0 then
    Result := -1
  else
    Result := 0;
end;

This eliminates as many repeated indirections as possible - that reduces register pressure (especially on x86).

The next improvement is in TClipperBase.ProcessIntersectList, which takes the majority of the overall time in the benchmark - namely the inner loop that increments j:

for i := 0 to highI do
begin
  // make sure edges are adjacent, otherwise
  // change the intersection order before proceeding
  node := UnsafeGet(FIntersectList, i);
  if not EdgesAdjacentInAEL(node) then
  begin
    j := i;
    repeat
      inc(j);
    until EdgesAdjacentInAEL(UnsafeGet(FIntersectList, j));

    // now swap intersection order
    FIntersectList.List[i] := UnsafeGet(FIntersectList, j);
    FIntersectList.List[j] := node;
  end;
end;

First, we store the node at i because it is the same after the loop - so there is no need to fetch it repeatedly. Since you incremented j by 1 anyway, I chose a repeat loop because I know it won't do a tail jump to check the condition first; it also generates slightly better code with an inlined function as its condition. Which leads to the third improvement:

function EdgesAdjacentInAEL(node: PIntersectNode): Boolean; {$IFDEF INLINING} inline; {$ENDIF}
var
  active1, active2: PActive;
begin
  active1 := node.active1;
  active2 := node.active2;
  Result := (active1.nextInAEL = active2) or (active1.prevInAEL = active2);
end;

Same story as before: eliminate any repeated indirections, fetch active1 and active2 only once instead of accessing each of them twice (the with you used there before does not change that). Another big chunk of time can probably be shaved off by using a hand-coded quicksort that does not make the repeated call into IntersectListSort but avoids that call altogether.
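A rough sketch of what I mean by that - take it as an illustration only: TArray<PIntersectNode> and the exact field names are assumptions based on the snippets above, and the inlined comparison mirrors IntersectListSort (descending by pt.Y, then ascending by pt.X):

procedure QuickSortIntersects(var a: TArray<PIntersectNode>; lo, hi: Integer);
var
  i, j: Integer;
  pivot, temp: PIntersectNode;
begin
  while lo < hi do
  begin
    i := lo;
    j := hi;
    pivot := a[(lo + hi) shr 1];
    repeat
      // comparison inlined right here - same ordering as IntersectListSort,
      // but without a function call per element
      while (a[i].pt.Y > pivot.pt.Y) or
        ((a[i].pt.Y = pivot.pt.Y) and (a[i].pt.X < pivot.pt.X)) do Inc(i);
      while (a[j].pt.Y < pivot.pt.Y) or
        ((a[j].pt.Y = pivot.pt.Y) and (a[j].pt.X > pivot.pt.X)) do Dec(j);
      if i <= j then
      begin
        temp := a[i]; a[i] := a[j]; a[j] := temp;
        Inc(i);
        Dec(j);
      end;
    until i > j;
    // recurse into the smaller partition, loop over the larger one
    // to keep the stack depth at O(log n)
    if j - lo < hi - i then
    begin
      if lo < j then QuickSortIntersects(a, lo, j);
      lo := i;
    end
    else
    begin
      if i < hi then QuickSortIntersects(a, i, hi);
      hi := j;
    end;
  end;
end;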
-
While I appreciate your recommendation, this would not solve anything here - the performance difference does not come from a poor getter implementation but from the fact that it's a call instead of directly indexing into the backing array. Since Spring4D collections are all interface-based, they would hardly be any faster with regard to calling the getter.
-
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
Then please put the DLL projects into the repo with documentation on how to get the missing pieces to produce them ourselves, instead of distributing binaries. As for the RTL routines - as mentioned, FillChar has already been improved in 11.1 (and hopefully will get another final improvement - some of which is detailed here: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/). Pos has also been improved (mostly Win64 is affected by this, as the new code is basically the purepascal code of what was the asm implementation for Win32). Hopefully for 12 we will get the improved Move I was working on (using SSE2, as that is the instruction set the RTL can assume is supported). Because Move handles forward/backward and overlap, it covers what memcpy and memmove do. That leaves CompareMem, which I already looked into in the context of faster string comparison, UpperCase and Replace. -
Probably with https://github.com/MahdiSafsafi/DDetours
-
I Broke Delphi ->> Low bound exceeds high bound & Duplicate case label if you swap'em (Case Z of) Table error
Stefan Glienke replied to Al T's topic in Algorithms, Data Structures and Class Design
Because it's wrong - you are handling the result of v shr 32 as if it were v div 10000000000
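A minimal illustration of the difference (the value is picked arbitrarily):

program ShrVsDiv;
{$APPTYPE CONSOLE}
var
  v: UInt64;
begin
  v := 20000000000;              // 2 * 10^10
  Writeln(v shr 32);             // 4 - shr 32 divides by 2^32 (4294967296)
  Writeln(v div 10000000000);    // 2 - which is what the code pretends shr 32 does
end.
-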
64bit RTL patches with Intel OneApi and TBB
Stefan Glienke replied to RDP1974's topic in RTL and Delphi Object Pascal
First of all, there is no 10.5 - I assume you meant 11. That's like the mother of pointless benchmarks. We know that the default MM is prone to problems with heavy multithreading. Test with FastMM5, because that has addressed that issue. For using TBB, I think it was mentioned in another thread that the memory footprint should also be considered. Part of the improvement might come from better system routines such as Move (FillChar has already been improved in 11.1) - I have been working on an improved version but I am afraid we will not get it before Delphi 12. Also - I think I mentioned this before: in the age of supply chain attacks and malicious code being distributed via open source platforms, I would be very careful about using binaries from somewhere. You mentioned before how you compiled them: why are you reluctant to post the code on GitHub so everyone can compile it themselves? -
I Broke Delphi ->> Low bound exceeds high bound & Duplicate case label if you swap'em (Case Z of) Table error
Stefan Glienke replied to Al T's topic in Algorithms, Data Structures and Class Design
What? A case is slower than chained if statements in such a case because it uses a series of add and sub instructions combined with conditional jumps. That usually causes a data dependency. Also, the function using case is wrong - you must not use shr 32 and then compare that to base-10 values. As for the other version: when checking whether the number is in Integer range, do the less-than-zero check once and then do an Integer hardcast for the other checks - that avoids the full 64bit compare, which needs two cmp and conditional jump instructions on 32bit. Also, FWIW, the check against MaxLongint is unnecessary.
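Roughly what I mean, as a sketch only - this is not the code from the thread; the function name and the exact ladder are just for illustration of the "compare as Integer where possible" idea:

function CountDigits(v: UInt64): Integer;
var
  i: Integer;
begin
  // test the 32-bit range exactly once, then compare as Integer - on the
  // 32-bit compiler each Int64/UInt64 comparison costs two cmp plus
  // conditional jumps, while an Integer comparison costs one
  if v <= High(Integer) then
  begin
    i := Integer(v);
    if i < 10 then Exit(1);
    if i < 100 then Exit(2);
    if i < 1000 then Exit(3);
    if i < 10000 then Exit(4);
    if i < 100000 then Exit(5);
    if i < 1000000 then Exit(6);
    if i < 10000000 then Exit(7);
    if i < 100000000 then Exit(8);
    if i < 1000000000 then Exit(9);
    Exit(10);
  end;
  // only large values ever reach the full-width comparisons
  if v < 10000000000 then Exit(10);
  if v < 100000000000 then Exit(11);
  if v < 1000000000000 then Exit(12);
  if v < 10000000000000 then Exit(13);
  if v < 100000000000000 then Exit(14);
  if v < 1000000000000000 then Exit(15);
  if v < 10000000000000000 then Exit(16);
  if v < 100000000000000000 then Exit(17);
  if v < 1000000000000000000 then Exit(18);
  // v div 10 < 10^18 is equivalent to v < 10^19 and avoids a literal
  // that would not fit into Int64
  if v div 10 < 1000000000000000000 then Exit(19);
  Result := 20;
end;
-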
I Broke Delphi ->> Low bound exceeds high bound & Duplicate case label if you swap'em (Case Z of) Table error
Stefan Glienke replied to Al T's topic in Algorithms, Data Structures and Class Design
That code reminds me of what I have seen in SysUtils _IntToStr64 and _IntToStr32 (getting the number of digits). Also see https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 -
Because that is not pascal-ish! /s P.S. Yes, I know about proposals to make it look pascal-ish - I even think we discussed that in another thread already.
-
Ask for comments to improve this code
Stefan Glienke replied to Berocoder's topic in Algorithms, Data Structures and Class Design
People overuse generics and forget that the implementation often does not need generics at all:

type
  TSupport = class helper for TObject
  private
    function __Supports(out ResObj: TObject; cls: TClass): Boolean;
  public
    function Supports<T: class>(out ResObj: T): Boolean; inline;
  end;

function TSupport.__Supports(out ResObj: TObject; cls: TClass): Boolean;
begin
  if Assigned(Self) and Self.InheritsFrom(cls)
    and not (Self.InheritsFrom(TBoldObject) and TBoldObject(Self).BoldObjectIsDeleted) then
  begin
    ResObj := Self;
    Exit(True);
  end;
  ResObj := nil;
  Result := False;
end;

function TSupport.Supports<T>(out ResObj: T): Boolean;
begin
  Result := __Supports(TObject(ResObj), T);
end;

This code will not cause any bloat at all in the final executable (see below for the situation during compilation, though). What @ventiseis wrote is correct - the issues with long compile times and possibly running out of memory (more so under the LLVM-based compilers because they have a larger memory usage already) are real. They are caused by the compiler first compiling unit by unit and stuffing everything generic being used into each dcu, just to (possibly) remove it again later during link time. The linker, however, will only remove exactly identical duplicates: if you have TList<TFoo> in multiple units, the executable will only contain it once, but it will also contain TList<TBar> and TList<TQux>, even though they are also just lists of objects. Since XE7 most methods on TList<T> are marked as inline, and unless you are using runtime packages it is not them being called in your executable but the internal TListHelper ones - but since all RTL classes have $RTTI turned on, the binary will still contain all these methods. You only partially get rid of them by using {$WEAKLINKRTTI ON}, which you need to put into each and every unit (years ago you could put that into the dpr and it affected all units, but that was actually a compiler bug that made flags with local scope act globally - any unit that changed them then also changed them globally, which caused quite some weird defects). That is why I did the huge refactoring in Spring4D regarding collections and the generic container registration API: if you use a list from Spring with a thousand different classes, the binary will only contain the code for the list once, unlike with the RTL, where you would have it a thousand times. The issues are known and reported but obviously are not severe enough or are too complex (especially the second one would require some architectural changes to how the compiler deals with generics):
https://quality.embarcadero.com/browse/RSP-16520
https://quality.embarcadero.com/browse/RSP-18080
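A call site then looks like this (TCustomer, FindSomeObject and DoSomething are placeholders just for the example); since the generic method is inline and only forwards to the non-generic one, each instantiation adds next to nothing to the binary:

var
  obj: TObject;
  customer: TCustomer;
begin
  obj := FindSomeObject;   // wherever the instance comes from - placeholder
  if obj.Supports<TCustomer>(customer) then
    customer.DoSomething;
end;
-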
Looks to be an issue with https - over http it works
-
Function with 2 return values ?
Stefan Glienke replied to Henry Olive's topic in RTL and Delphi Object Pascal
A static array of Double with size 3 - the exact type name would be context-specific, and since you are not giving any context, I cannot name it.
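Purely for illustration (the names are made up; any context-appropriate names work):

type
  TDouble3 = array[0..2] of Double;

function GetCoordinates: TDouble3;   // hypothetical function returning 3 values at once
begin
  Result[0] := 1.0;
  Result[1] := 2.0;
  Result[2] := 3.0;
end;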