Everything posted by Stefan Glienke

  1. Stefan Glienke

    IsZero or SameValue

    Certainly not like that, because it does not consider a value next to the potential hit that might be an even closer match - with a different number of elements in the list the result could be different.
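    A rough sketch of the point (a hypothetical FindClosest helper, not the code from the thread; requires System.Generics.Collections): search exactly first and only then compare the neighboring elements, instead of putting the tolerance into the search itself.

      function FindClosest(const Values: array of Double; const Value: Double): Integer;
      var
        i: Integer;
      begin
        if Length(Values) = 0 then
          Exit(-1);
        // plain binary search on the ascending sorted values - no tolerance here
        if TArray.BinarySearch<Double>(Values, Value, i) then
          Exit(i);                     // exact hit
        if i = 0 then
          Exit(0);                     // Value lies below the first element
        if i = Length(Values) then
          Exit(High(Values));          // Value lies above the last element
        // i is the insertion point: pick the closer of the two neighbors,
        // which a tolerance check inside the search cannot guarantee
        if Abs(Values[i] - Value) < Abs(Value - Values[i - 1]) then
          Result := i
        else
          Result := i - 1;
      end;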
  2. Stefan Glienke

    Function with 2 return values ?

    And what exact benefit would a non-ref-counted interface have? Once you pass one down somewhere you lose any control over its lifetime.
  3. Stefan Glienke

    Function with 2 return values ?

    With the current memory model that would be a complete disaster. The main appeal of using interfaces when doing some DI architecture is the ref counting. And the mantra "program to an interface, not an implementation" is factually wrong: read https://blog.ploeh.dk/2010/12/02/Interfacesarenotabstractions/
  4. Stefan Glienke

    Array size 64bits

    The difference in performance is clear - accessing a dynamic array that is a field inside the class takes two indirections, while accessing a static array that is a field inside the class takes only one. If you inspect the generated assembly code you will see that every access to the dynamic array has more instructions. This happens every time you have repeated access to a field inside a method, because the compiler does not store the field away as if it were a local variable and read it directly, but basically does Self.Table every time.

    For this exact reason I have explicitly written code that first reads the dynamic array into a local pointer variable (to avoid the extra reference counting) and then operates on that one via a hard cast back to a dynamic array (or via pointer math). That way the compiler can keep it in a register and index into it directly rather than dereferencing Self every time to read that dynamic array.

    To try it out, add this code to your Button1Click:

      {$POINTERMATH ON}
      var
        Table: ^DWord;
      {$POINTERMATH OFF}
      begin
        SetLength(Self.Table, NbPrime);
        Table := @Self.Table[0];

    Now the code accesses the local Table variable, which most likely is stored in a register.
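    A minimal self-contained sketch of that pattern (TSieve, FTable and Fill are made-up names, not the code from this thread):

      type
        TSieve = class
        private
          FTable: array of Cardinal;
        public
          procedure Fill(Count: Integer);
        end;

      procedure TSieve.Fill(Count: Integer);
      {$POINTERMATH ON}
      var
        Table: ^Cardinal;
        i: NativeInt;
      {$POINTERMATH OFF}
      begin
        SetLength(FTable, Count);
        if Count = 0 then
          Exit;
        Table := @FTable[0];          // single field read, no reference counting
        for i := 0 to Count - 1 do
          Table[i] := Cardinal(i);    // indexes the local pointer, not Self.FTable
      end;

    The loop body only touches the local pointer, which the compiler can keep in a register.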
  5. Stefan Glienke

    A gem from the past (Goto)

    If you are using XE2 as your profile says, you could be affected by this: https://quality.embarcadero.com/browse/RSP-27375 And as David mentions, depending on what is inside the try block the compiler easily throws any register usage overboard and operates via the stack.
  6. Stefan Glienke

    Use of inline variables..

    https://quality.embarcadero.com/browse/RSP-23096
  7. That Poker Benchmark is completely pointless as it has almost zero memory allocations - the majority of CPU time is spent sorting cards and stuff.
  8. Stefan Glienke

    Implement logic in TListView

    I would probably design it like this (enabling and disabling the Checkboxes depending on RadioButton3 Checked)
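    A minimal sketch of what that could look like (the original attachment is not reproduced here; CheckBox1..CheckBox3 and the shared OnClick handler are assumptions):

      procedure TForm1.RadioButtonClick(Sender: TObject);
      begin
        // assign this handler to all radio buttons so the check boxes
        // follow the state of RadioButton3
        CheckBox1.Enabled := RadioButton3.Checked;
        CheckBox2.Enabled := RadioButton3.Checked;
        CheckBox3.Enabled := RadioButton3.Checked;
      end;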
  9. Stefan Glienke

    Implement logic in TListView

    Use the OnChanging event to allow or disallow selecting an item depending on the existing selection
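    A minimal sketch, assuming a hypothetical rule of "at most three selected items" (the thread's actual rule is not shown here):

      procedure TForm1.ListView1Changing(Sender: TObject; Item: TListItem;
        Change: TItemChange; var AllowChange: Boolean);
      begin
        // only react to state changes of items that are about to become selected
        if (Change = ctState) and not Item.Selected then
          AllowChange := ListView1.SelCount < 3;
      end;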
  10. Stefan Glienke

    Delphifeeds.com??

    That is exactly what I respond when someone tells me that Embarcadero should integrate TestInsight or Spring. Parnassus plugins ...
  11. Stefan Glienke

    generics

    Doesn't that make it a container? šŸ˜œ
  12. Stefan Glienke

    generics

    Think of collections as "algorithms and datatypes for any type" - then you know the use case for generics: any algorithm and/or datatype that is not specific to one exact type.
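    A small illustration (not from the thread; requires System.Generics.Defaults): the same lookup algorithm written once for any type T instead of once per concrete type.

      type
        TArrayUtils = class
          class function IndexOf<T>(const Values: array of T; const Value: T): Integer; static;
        end;

      class function TArrayUtils.IndexOf<T>(const Values: array of T; const Value: T): Integer;
      var
        i: Integer;
        comparer: IEqualityComparer<T>;
      begin
        // the default comparer provides equality for any T - that is what
        // makes the routine work "for any type"
        comparer := TEqualityComparer<T>.Default;
        for i := 0 to High(Values) do
          if comparer.Equals(Values[i], Value) then
            Exit(i);
        Result := -1;
      end;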
  13. Stefan Glienke

    The Delphi 11.2 release thread

    11.2 is a nightmare - prior to LSP the worst was that ctrl+click did not work. Now, most of the time nothing at all works because LSP dies all the time.
  14. Stefan Glienke

    generics

    As a reaction to one of his answers during the Q&A I wrote a blog post. Having said that, and personally loving generics for various use cases (as shown in the blog post), there are also things that are solved suboptimally - which I also wrote about. Also, if you have used generics in C# you will likely miss co- and contravariance - oh look, I also wrote about that. If you are going really fancy with generics and code that uses RTTI you have to be aware of some peculiarities - guess what: wrote about it. Now, because in generics you basically have the lowest common denominator and we are lacking quite a few ways to specify traits of the supported types via constraints, there are several things that you cannot do or have to fall back to indirections for: the most common example is using comparer interfaces for a generic sorting algorithm or hashtable. That is mostly where a naively implemented generic algorithm might be slower than some handcrafted code for the specific type, unless you heavily optimize for various cases (as I have done in Spring).
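    A sketch of the comparer indirection described above (not Spring4D code; requires System.Generics.Defaults): a generic routine cannot use "<" on T, so every comparison goes through IComparer<T>.

      type
        TSorter = class
          class procedure InsertionSort<T>(var Values: array of T;
            const Comparer: IComparer<T>); static;
        end;

      class procedure TSorter.InsertionSort<T>(var Values: array of T;
        const Comparer: IComparer<T>);
      var
        i, j: Integer;
        temp: T;
      begin
        for i := 1 to High(Values) do
        begin
          temp := Values[i];
          j := i - 1;
          // each comparison is an interface call - the indirection that a
          // handcrafted sort for a concrete type would not have to pay
          while (j >= 0) and (Comparer.Compare(Values[j], temp) > 0) do
          begin
            Values[j + 1] := Values[j];
            Dec(j);
          end;
          Values[j + 1] := temp;
        end;
      end;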
  15. Stefan Glienke

    Best place for Spring4D questions

    Sure, gonna work on that silver tag badge
  16. Stefan Glienke

    Any advantage to using FastMM5??

    In a single-threaded application, FastMM5 will not give you any noticeable improvements. It was designed to overcome the issues of V4 under heavy multithreading.
  17. It still surprises me that people are surprised by how much performance improves under heavy multithreading when not using the default MM. AFAIK mORMot does not use the default MM anyway.
  18. Stefan Glienke

    32bit vs 64bit

    I can tell you from experience that traveling by airplane has significant overhead if you live approx 100 km from an airport that has flights going to your destination. You missed the point - when David mentioned that nobody should be using Extended, he most likely meant it as a well-intended suggestion and did not have the "I started using Extended like decades ago and don't wanna change it" case in mind.
  19. Stefan Glienke

    32bit vs 64bit

    Carriages also once were the fastest way to travel - yet nobody today complains that they can't go onto the highway with one. On topic: if the 64bit Delphi compiler(s) (and significant parts of the RTL) were not even worse than the 32bit ones and the debugging experience were not an absolute nightmare, I would switch instantly - even though a 64bit application might use a bit more memory because pointers are double the size, the architectural differences mean it can simply perform better. Simply having more registers available is already a huge gain.
  20. Stefan Glienke

    32bit vs 64bit

    Too bad it's the implicit default type in many places - be it float literals or parameter types.
  21. Tests that run for approx one second are really the way to go when deciding on the proper memory manager. šŸ˜‚
  22. Been using FastMM5 in production for over 2 years now and never looked back. I don't know of any reliability or fragmentation issues.
  23. That's another reason why precompiled binaries are bad - if I had to guess I would say they are compiled for CPUs that support AVX which Nehalem did not have.
  24. Stefan Glienke

    Profiling Clipper2

    That can be the reason: accessing an array needs (at least) two registers - the array pointer and the index - while incrementing a pointer only needs one. With how for-to loops work we need three: the array pointer, the incrementing index, and one compiler-generated counter that counts down to 0 and is actually used for the loop. With a for i := 1 to count loop that does not actually use i but a shifting pointer, we need two registers.

    However, I assume the original code has another issue: too many indirections. It does not access the dynamic array directly but first goes through the TList reference. That means we have three indirections: first the field access, second the backing array access (regardless of using a getter or the List property), third indexing into the array (you can see the three consecutive mov eax,... instructions). These indirections cause a data dependency - modern CPUs can execute multiple instructions at the same time if they don't depend on each other, but in this case they do, so these instructions cannot execute in parallel, leaving part of the CPU with nothing to do. That is the main difference you will see in the code below! If you stored the TPointerList (not directly as that type, because then the compiler generates an implicit try/finally block since it is a dynamic array, doh) then you would probably get similar runtime, because in this code there are enough registers available. Also, make sure to use NativeInt for index variables whenever possible to avoid unnecessary move-with-sign-extension instructions on 64bit.

    With j being ^PIntersectNode the asm looks like this:

    x86

      Clipper.Engine.pas.3206: inc(j);
      00BD954C 83C104           add ecx,$04
      Clipper.Engine.pas.3205: repeat
      00BD954F 8B01             mov eax,[ecx]
      00BD9551 8B10             mov edx,[eax]
      00BD9553 8B4004           mov eax,[eax+$04]
      00BD9556 3B4244           cmp eax,[edx+$44]
      00BD9559 7405             jz $00bd9560
      00BD955B 3B4240           cmp eax,[edx+$40]
      00BD955E 75EC             jnz $00bd954c

    x64

      Clipper.Engine.pas.3206: inc(j);
      0000000000E9CDCB 4883C008         add rax,$08
      Clipper.Engine.pas.3205: repeat
      0000000000E9CDCF 488B10           mov rdx,[rax]
      0000000000E9CDD2 4C8B02           mov r8,[rdx]
      0000000000E9CDD5 488B5208         mov rdx,[rdx+$08]
      0000000000E9CDD9 49395050         cmp [r8+$50],rdx
      0000000000E9CDDD 7406             jz TClipperBase.ProcessIntersectList + $75
      0000000000E9CDDF 49395048         cmp [r8+$48],rdx
      0000000000E9CDE3 75E6             jnz TClipperBase.ProcessIntersectList + $5B

    With indexing it is this:

    x86

      Clipper.Engine.pas.3206: inc(j);
      004C954B 42               inc edx
      Clipper.Engine.pas.3205: repeat
      004C954C 8B461C           mov eax,[esi+$1c]
      004C954F 8B4004           mov eax,[eax+$04]
      004C9552 8B0490           mov eax,[eax+edx*4]
      004C9555 8B28             mov ebp,[eax]
      004C9557 8B4004           mov eax,[eax+$04]
      004C955A 3B4544           cmp eax,[ebp+$44]
      004C955D 7405             jz $004c9564
      004C955F 3B4540           cmp eax,[ebp+$40]
      004C9562 75E7             jnz $004c954b

    x64

      Clipper.Engine.pas.3206: inc(j);
      000000000031CDD2 4883C001         add rax,$01
      Clipper.Engine.pas.3205: repeat
      000000000031CDD6 488B5320         mov rdx,[rbx+$20]
      000000000031CDDA 488B5208         mov rdx,[rdx+$08]
      000000000031CDDE 488B14C2         mov rdx,[rdx+rax*8]
      000000000031CDE2 4C8B02           mov r8,[rdx]
      000000000031CDE5 488B5208         mov rdx,[rdx+$08]
      000000000031CDE9 49395050         cmp [r8+$50],rdx
      000000000031CDED 7406             jz TClipperBase.ProcessIntersectList + $85
      000000000031CDEF 49395048         cmp [r8+$48],rdx
      000000000031CDF3 75DD             jnz TClipperBase.ProcessIntersectList + $62

    However, incrementing a pointer is not always faster, because when indexing into an array the CPU can sometimes pipeline those instructions better.

    Now we use a local variable of type ^Pointer (with pointer math on to be able to index into it) like this:

      list := Pointer(FIntersectList.List);
      for i := 0 to FIntersectList.Count - 1 do
      begin
        // make sure edges are adjacent, otherwise
        // change the intersection order before proceeding
        if not EdgesAdjacentInAEL(list[i]) then
        begin
          j := i;
          repeat
            inc(j);
          until EdgesAdjacentInAEL(list[j]);

          // now swap intersection order
          node := list[i];
          list[i] := list[j];
          list[j] := node;
        end;
      end;

    and we get this asm for the inner repeat loop (performance is basically equal to using the pointer) - there can easily be some variations as soon as one uses an additional local variable at the "wrong" spot, because the register allocator of the Delphi compiler sucks:

      Clipper.Engine.pas.3207: inc(j);
      00E89554 42               inc edx
      Clipper.Engine.pas.3206: repeat
      00E89555 8B0496           mov eax,[esi+edx*4]
      00E89558 8B08             mov ecx,[eax]
      00E8955A 8B4004           mov eax,[eax+$04]
      00E8955D 3B4144           cmp eax,[ecx+$44]
      00E89560 7405             jz $00e89567
      00E89562 3B4140           cmp eax,[ecx+$40]
      00E89565 75ED             jnz $00e89554

    I cannot find it right now to verify, but I think I have read somewhere that, as previously noted, this kind of instruction sequence (modifying the index and addressing into the array rather than shifting a pointer) might perform better because the CPU is able to fuse these instructions. FWIW it's kinda interesting to see what some C++ compilers emit for such a loop: https://godbolt.org/z/oq9Gzexa3
  25. Stefan Glienke

    Profiling Clipper2

    FWIW I see more improvement after my changes - but that can have various reasons.

    Before optimization
      Win32
        Testing edge count: 1000 time: 145 msecs
        Testing edge count: 2000 time: 689 msecs
        Testing edge count: 3000 time: 2.585 msecs
      Win64
        Testing edge count: 1000 time: 128 msecs
        Testing edge count: 2000 time: 573 msecs
        Testing edge count: 3000 time: 2.087 msecs

    Commit "Improved Delphi performance"
      Win32
        Testing edge count: 1000 time: 149 msecs
        Testing edge count: 2000 time: 626 msecs
        Testing edge count: 3000 time: 2.379 msecs
      Win64
        Testing edge count: 1000 time: 127 msecs
        Testing edge count: 2000 time: 497 msecs
        Testing edge count: 3000 time: 1.767 msecs

    Further improvements
      Win32
        Testing edge count: 1000 time: 141 msecs
        Testing edge count: 2000 time: 552 msecs
        Testing edge count: 3000 time: 1.840 msecs
      Win64
        Testing edge count: 1000 time: 124 msecs
        Testing edge count: 2000 time: 493 msecs
        Testing edge count: 3000 time: 1.630 msecs

    What we can clearly see from the results, though, is that your code is O(nĀ²) - that is where you might want to invest some time.