Posts posted by Arnaud Bouchez


  1. To benchmark instructions, you need specific SW tooling, and also proper HW.

    The reference is https://www.agner.org/optimize/#testp

     

    Benchmarking Sleep() doesn't make any sense, especially on Windows.

    On Windows, the Sleep() resolution is bound to the system timer, which typically ticks every 14-20 ms.

    Sleep() waits "at least" for the number of milliseconds specified.
    So in a waiting loop, you should never count the number of Sleep() iterations, but check GetTickCount64 against a timeout.
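
    A minimal sketch of such a loop, with a hypothetical readiness callback passed as a parameter (illustration only, not production code):

    type
      TReadyCheck = function: boolean; // hypothetical readiness callback

    function WaitFor(const ready: TReadyCheck; timeoutMS: cardinal): boolean;
    var
      start: UInt64;
    begin
      start := GetTickCount64; // Windows unit (SysUtils on FPC)
      repeat
        if ready() then
          exit(true);
        Sleep(1); // actually waits "at least" 1 ms, often 15 ms or more
      until GetTickCount64 - start >= timeoutMS;
      result := false; // timeout reached, measured by the clock - not by counting Sleep() calls
    end;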

    • Thanks 1

  2. 8 hours ago, Mike Torrettinni said:

    TSynDictionary (from mORMot) is also very fast, but I don't use mORMot and the license is not friendly for my commercial project.

    As David wrote, mORMot is tri-licensed (MPL/GPL/LGPL) - if you use the MPL, it is very commercial-project-friendly.

    TL;DR: nothing to pay, just mention somewhere in your software that you used it, and publish any modification you make to the source code.

     

    Another article worth looking at:
    https://www.delphitools.info/2015/03/17/long-strings-hash-vs-sorted-vs-unsorted/

    It depends on what you expect.
    Also note that for long strings, hashing may have a cost - this is why we implemented https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

    • Like 3
    • Thanks 1

  3. Personal note: each time I see GetIt involved, I remember that mORMot was never accepted as part of it because it was "breaking their license policy". In short, you could use any Delphi version (even the free edition) and create Client-Server apps with it. I guess this is the same reason the great ZEOS library or even UniDAC are not part of it, if I checked their registration correctly.

    That's why I prefer more open package solutions like Delphinus, and I have high hopes for the very promising https://github.com/DelphiPackageManager/DPM

    Old but still relevant discussion at https://synopse.info/forum/viewtopic.php?pid=17453#p17453
     

    • Thanks 1

  4. Side note: inlining is not necessarily faster.

    Sometimes, after inlining, the compiler has trouble assigning the registers properly: a sub-function with a loop may be faster when NOT inlined, because the loop index and pointer can stay in registers, whereas once inlined the stack may be used instead.

     

    The worst use of "inline;" I have seen is perhaps the inlining of FileCreate/FileClose/DeleteFile/RenameFile in SysUtils, which requires the Windows unit to be part of any unit calling them.

    With obviously no performance benefit, because those calls are slow by nature.
    Embarcadero made this mistake in early versions of Delphi, then fixed some of it (but RenameFile is still inlined!), and re-used "inline;" again when POSIX support was introduced... 😞
    I had to redefine those functions in mormot.core.os.pas so that I was not forced to write {$ifdef MsWindows} Windows, {$endif} in my units when writing cross-platform code...
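
    For illustration, a hedged sketch of such a wrapper (the actual mormot.core.os.pas code is more complete), so the Windows dependency stays in a single unit:

    unit project.os.wrappers; // hypothetical unit name, for illustration only

    interface

    procedure FileCloseSafe(F: THandle); // plain call, NOT inlined

    implementation

    uses
      {$ifdef MSWINDOWS} Windows, {$endif}
      SysUtils;

    procedure FileCloseSafe(F: THandle);
    begin
      FileClose(F); // the Windows unit is only needed here, not in the callers
    end;

    end.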

     

     

    • Like 1

  5. Your code is sometimes not correct.

    For instance, CustomSplitWithPrecount() exits directly without setting result := nil, so it won't change the value passed as input (remember that an array result is in fact an appended "var" argument).

     

    All those are micro-optimizations - not worth it unless you really need them.
    I would not use TStringList, for sure. But any other method is good enough in most cases.

     

    Also, there is no need to use a PChar and increment it.
    In practice, a loop with an index over the string is safer - and slightly faster, since you only use the i variable, which is already incremented each time.

     

    To optimize any further, I would use PosEx() to find the delimiter, which may be faster than your manual search on some targets.
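
    For illustration, a hedged sketch of such a split (a hypothetical CustomSplit, not your code), combining result := nil, an index loop and PosEx():

    uses
      SysUtils, StrUtils, Types; // PosEx() is in StrUtils, TStringDynArray in Types

    // hypothetical sketch: an index loop with PosEx(), no PChar arithmetic
    function CustomSplit(const s: string; delim: char): TStringDynArray;
    var
      n, i, j: integer;
    begin
      result := nil; // mandatory: an array result is a hidden var argument
      n := 0;
      i := 1;
      repeat
        j := PosEx(delim, s, i);
        if j = 0 then
          j := length(s) + 1;       // last field: up to the end of the string
        SetLength(result, n + 1);   // pre-counting the delimiters would avoid this
        result[n] := copy(s, i, j - i);
        inc(n);
        i := j + 1;
      until i > length(s) + 1;
    end;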

     

    The golden rule is to make it right first.

    Then make it fast - only if it is worth it, and I don't see why it would be worth it.

    • Thanks 1

  6. First, all managed types (string, variant, dynamic arrays) are already initialized to zero by the compiler.

     

    What you can do is define all the local variables to be cleared inside a record, then call FillChar() on it.
    There won't be any performance penalty:

     

    procedure MyFunction;
    var
      i: integer;
      loc: record
        x,y,z: integer;
        a: array[0..10] of double;
      end;
    begin
      writeln(loc.x); // write random value (may be 0)
      FillChar(loc, SizeOf(loc), 0);
      writeln(loc.x); // write 0 for sure
      for i := 0 to high(loc.a) do
        writeln(loc.a[i]); // will write 0 values
    end;

        

     

    But as a drawback, all those variables will be forced onto the stack, so the compiler won't be able to optimize them into registers - which is what happens for the "i" variable above.
    So don't put ALL local variables in the record, only those which need to be initialized.

     

    Anyway, if you have a lot of variables and a lot of code in a method, it may be time to refactor it and use a dedicated class to implement the logic.
    This class could contain the variables of the "record" in my sample code.
    You could keep this class in the implementation section of your unit, for safety.
    It will be the safest way to debug - and test!
    One huge benefit of a dedicated class for any complex process is that it can be tested.
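
    A minimal sketch of such a refactoring (hypothetical names), keeping the class in the implementation section:

    type
      // hypothetical class holding what used to be the local "record"
      TMyProcess = class
      private
        x, y, z: integer;
        a: array[0..10] of double;
      public
        procedure Run;
      end;

    procedure TMyProcess.Run;
    begin
      // instance fields are zero-initialized at Create, so no FillChar() is needed
      writeln(x, ' ', a[0]); // both fields read as zero
    end;

    procedure MyFunction;
    var
      process: TMyProcess;
    begin
      process := TMyProcess.Create;
      try
        process.Run; // the logic is now isolated - and can be unit-tested
      finally
        process.Free;
      end;
    end;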


  7. If you look at the asm - at least on FPC - CtrNistCarryBigEndian() is in fact inlined, so it has very little impact. It is called only 1/256th of the time, and adds just a couple of inc/test opcodes.
    Using branchless instructions seems pointless in this part of the loop: DoBlock() takes dozens of cycles for sure, and the bottleneck is likely to be the critical section.

    Also note that 2^24 depends on the re-seed parameter, which may be set to something more than 2^24*16 bytes (NIST even seems to allow up to 2^48), so a 3-byte counter won't be enough.

    CtrNistCarryBigEndian() is a nice and readable solution, in the context of filling a single block of 16 bytes.
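
    For illustration only (not the actual mORMot code), such a big-endian counter increment over a 16-byte block looks like this - the carry loop runs only when the low byte wraps, i.e. 1/256th of the time:

    type
      TAesBlock16 = array[0..15] of byte; // local type, for illustration

    procedure CtrIncBigEndian(var iv: TAesBlock16);
    var
      i: integer;
    begin
      inc(iv[15]);        // fast path: just increment the last (lowest) byte
      if iv[15] = 0 then  // wrapped: propagate the carry upwards
        for i := 14 downto 0 do
        begin
          inc(iv[i]);
          if iv[i] <> 0 then
            break;        // stop as soon as there is no more carry
        end;
    end;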


    The current 32MB default for the reseed value is still far below the NIST advice of 2^48. We chose 32MB from a user perspective - the previous limit was 1MB, which was really paranoid.
    Anyway, if an application needs a lot of random values, it will instantiate its own TAesPrng, with a proper reseed, for each huge random need.


  8. On 1/19/2021 at 1:04 PM, RDP1974 said:

    I'm using with great satisfaction Delphi x Linux compiler with Firedac pooling, SOAP indy based custom SSL webservices -> very small and very fast, nobody is using the same toolchain?

    Nope: FPC on Linux + the mORMot DB and SOA layers, for years now. With high performance and stability - we had servers handling thousands of requests per second, receiving TB of data, running for months with no restart and no problem. Especially with our MM, which uses much less memory than TBB.

     

    One problem I noticed on Linux with C memory managers running FPC services is that they are subject to SIGABRT if they encounter any memory problem.
    This is why we worked on our own https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas which consumes much less memory than TBB, and if there is a problem in our code, we get a GPF exception we can trace, not a SIGABRT which kills the process. I can tell you that a SIGABRT for a service is a disaster - it always happens when you are far AFK and can't react quickly. And if you need to install something like https://mmonit.com/monit/ on your server, it becomes complicated...


  9. Two blog posts to share:

     

    https://blog.synopse.info/?post/2021/02/13/Fastest-AES-PRNG%2C-AES-CTR-and-AES-GCM-Delphi-implementation

     

    https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

     

    TL;DR: new AES assembly code boosts the AES-CTR, AES-GCM, AES-PRNG and AES-HASH implementations, especially on x86_64, for mORMot 2.
    It outperforms OpenSSL for AES-CTR and AES-PRNG, and is orders of magnitude faster than every other Delphi library I know about.

    • Like 4

  10. New hasher in town, to test and benchmark:
    https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

     

    Murmur and xxHash are left far behind, in terms of speed, and I guess also in terms of collisions... 15GB/s on my Core i3, on both Win32 and Win64.

     

    The smallest lengths of 0-15 bytes are handled without any branch, 16-128 bytes involve no loop, and 129+ bytes are hashed 128 bytes per iteration.
    Also note its anti-DoS ability, thanks to its random seed at process startup.
    So it was especially tuned for a hashmap/dictionary.

    • Like 1
    • Thanks 1

  11. 1. Use RawByteString instead of AnsiString if you don't want to force any conversion.

    2. Note that the Ansi*() functions are not all meant to deal with AnsiString: most expect string/UnicodeString types, and deal with the current system locale, e.g. for comparison or case folding...
    A bit confusing indeed...

     

    3. Consider using your own version of such functions - as we did with https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas - so you are sure there is no hidden conversion.

     

    4. The main trick is indeed to never leave any 'Implicit string cast' warning unfixed.
    And sometimes use Alt+F2 to see the generated asm, and check that there is no hidden "call" during the conversion (see the small sketch after this list).

     

    5. Another good idea is to write some unit tests of your core process, decoupled from TCP itself: write them in the original pre-Unicode Delphi, then recompile the code with the Unicode version of Delphi and ensure they still pass.
    It will save you a lot of time!
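
    As a small sketch of points 1 and 4 (hypothetical code, assuming the peer sends UTF-8):

    // keep raw network bytes in a RawByteString: no code-page conversion
    // is forced behind your back, whatever the bytes contain
    procedure OnTcpData(const data: RawByteString);
    var
      txt: string;
    begin
      if data = '' then
        exit;
      // make any conversion explicit, so no 'Implicit string cast' warning
      // is left unfixed, and no hidden "call" appears in the generated asm
      txt := UTF8ToString(data); // assumption: the peer sent UTF-8
      writeln(txt);
    end;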

     

     


  12. From my experience, Delphi has a lot of trouble running under Wine. IIRC Delphi 7 starts, but debugging is not possible. Newer versions didn't start without adding some dependencies.

    So Wine is not an option for the Delphi IDE itself.

     

    On the contrary, regular VCL apps work well on Wine, if the UI components are mostly standard.
    You may also check https://winebottler.kronenberg.org/ which is a way of packaging a Windows executable into a Mac app, embedding Wine within the package.


  13. 21 hours ago, Fr0sT.Brutal said:

    If that "eaten" memory would be unused otherwise why you bother about that consumption? I suspect they just dynamically reserve as much memory as possible for internal needs.

    No, it was not just "reserved", there were a lot more dirty pages with Intel TBB.

    We tried it in production on Linux, on high-end servers with heavy multi-threaded processes, and the resident size (RES) was much bigger - not only the virtual/shared memory (VIRT/SHR).

     

    Also the guys from https://unitybase.info - who run very demanding services - evaluated and rejected Intel TBB. Either the glibc MM https://sourceware.org/glibc/wiki/MallocInternals or our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas gives good results on Linux, with low memory consumption.


    Anyway, I wouldn't use Windows to host demanding services. So if you have a Windows server with a lot of memory, you are free to use Intel TBB if you prefer.

    • Like 1

  14. The 3rd party dlls are Intel TBB, if I am correct.

    So you should at least mention it, with the proper licence terms, and provide a link.

     

    About memory management, from my tests the Intel TBB MM is indeed fast, but it eats all the memory, so it is not usable for any serious server-side software running for a long time.

    Some numbers, tested on FPC/Linux, but you get the idea:

        - FPC default heap
         500000 interning 8 KB in 77.34ms i.e. 6,464,959/s, aver. 0us, 98.6 MB/s
         500000 direct 7.6 MB in 100.73ms i.e. 4,963,518/s, aver. 0us, 75.7 MB/s
        - glibc 2.23
         500000 interning 8 KB in 76.06ms i.e. 6,573,152/s, aver. 0us, 100.2 MB/s
         500000 direct 7.6 MB in 36.64ms i.e. 13,645,915/s, aver. 0us, 208.2 MB/s
        - jemalloc 3.6
         500000 interning 8 KB in 78.60ms i.e. 6,361,323/s, aver. 0us, 97 MB/s
         500000 direct 7.6 MB in 58.08ms i.e. 8,608,667/s, aver. 0us, 131.3 MB/s
        - Intel TBB 4.4
         500000 interning 8 KB in 61.96ms i.e. 8,068,810/s, aver. 0us, 123.1 MB/s
         500000 direct 7.6 MB in 36.46ms i.e. 13,711,402/s, aver. 0us, 209.2 MB/s
        for multi-threaded processes, we observed the best scaling with TBB on this system
        BUT memory consumption rose to about 60x more space (glibc=2.6GB vs TBB=170GB)!
        -> so for serious server work, glibc (FPC_SYNCMEM) sounds like the best candidate

     

    • Like 1

  15. If a method does two different actions, define two methods.

     

    If a method performs an action which is something on/off or enabled/disabled, then you can use a boolean, provided the false/true meaning is clearly defined by the naming of the method.

     

    If a method performs something, but with a custom behavior, don't use a boolean (or several booleans) but an enumeration, or even better a set.

    It will be much easier to understand what it does, without looking at the parameter names, and it will be more open to new options/behaviors.

     

    function TMyObject.SaveTo(json: boolean): string;
    // what is the behavior with json=false?
    
    function TMyObject.SaveToJson(expanded: boolean): string;
    // what does SaveToJson(true/false) mean without knowing the parameter name?
    
    function TMyObject.SaveToJson(expanded, usecache: boolean): string;
    // what does SaveToJson(true/false, true/false) mean without knowing the parameters names?
    
    type
      TMyObjectSaveToJsonOptions = set of (sjoExpanded, sjoUseCache);
      
    function TMyObject.SaveToJson(options: TMyObjectSaveToJsonOptions): string;
    // you understand what SaveToJson([]) or SaveToJson([sjoExpanded]) or SaveToJson([sjoExpanded, sjoUseCache]) means

     

    • Like 1
    • Thanks 1

  16. All this is a pointless discussion.

    This code is just broken and should be fixed. It has nothing to do with const or whatever. The compiler is doing what it should, but the code is plain wrong.

    I fully agree with @David Heffernan here.

     

    About the "address", it should be pointer(Value) not @Value.

    pointer(value) returns the actual pointer of the string content in heap, so will change. It is a faster alternative to @value[1] which works also with value='' -> pointer(value)=nil.

    @Value returns the memory adress of the local Value variable on the stack, so won't change.
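
    A small sketch to illustrate the difference:

    procedure ShowStringPointers;
    var
      s: string;
    begin
      s := 'abc';
      writeln(NativeUInt(@s));          // address of the local variable: stable
      writeln(NativeUInt(pointer(s)));  // address of the string content
      s := s + 'def';                   // reallocation: the content may move
      writeln(NativeUInt(@s));          // same value as before
      writeln(NativeUInt(pointer(s)));  // most likely a different value now
      s := '';
      writeln(NativeUInt(pointer(s)));  // 0, since pointer('') = nil
    end;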

    • Like 1

  17. Binary is not text, so it is pointless for your problem. You need the integers to be written as text, not as 4-byte binary values.

     

    You could also write directly to the TWriteCachedFileStream, without any temporary string.
    And append the integer values using a shortstring and the old str() procedure instead of IntToStr(), which uses the heap.
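
    A hedged sketch of that approach, for any TStream (such as the TWriteCachedFileStream mentioned above):

    uses
      Classes; // TStream

    // convert each integer on the stack with str(), then write the
    // shortstring bytes directly: no heap allocation is involved
    procedure AppendIntegersAsText(stream: TStream; const values: array of integer);
    var
      i: integer;
      tmp: shortstring;
    begin
      for i := 0 to high(values) do
      begin
        str(values[i], tmp);   // text conversion into a stack buffer
        tmp := tmp + #13#10;   // shortstring concat: still no heap
        stream.WriteBuffer(tmp[1], length(tmp));
      end;
    end;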


  18. Signing the executable is the key here.
    Also run a minimal security audit: a password should be hashed, and never stored in the executable itself.

    It has nothing to do with Delphi. It was poor security design in the application.

     

    Regarding logic security and reverse engineering, Java or C# are much worse than Delphi.
    You can easily decompile Java or C# executables... unless they have been explicitly obfuscated.

    I can tell you that I have "hacked" quite a few C# dlls for which we had lost the source... 😉

    Whereas a Delphi exe is compiled to native code, and lacks a lot of RTTI, so it is much more difficult to extract something from it.

    • Like 1

  19. Note that you don't store the content, you re-assign it for each new line.

    So you are testing something unrealistic, which is not worth trying to optimize.

     

    What is slow is not moving the data, but the memory allocation.

    One performance problem is the temporary string allocation, if you call Integer.ToString.

    For our mORMot TTextWriter, we don't use any temporary allocation, and we even have pre-computed text for the smallest integers.

     

    Note that the Delphi TStringBuilder will actually be slower on Win32 than naive concatenation.

    It also allocates a temporary string when appending an integer... 😞

    https://www.delphitools.info/2013/10/30/efficient-string-building-in-delphi/3/

     

    I would stick with naive string concatenation, and I guess it will be fast enough in practice.

    It would be premature optimization otherwise.
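
    For illustration, the kind of naive concatenation I mean (each IntToStr() still allocates a small temporary string, but it is usually good enough):

    uses
      SysUtils; // IntToStr()

    function BuildLines(count: integer): string;
    var
      i: integer;
    begin
      result := '';
      for i := 1 to count do
        result := result + IntToStr(i) + sLineBreak; // naive, but fast enough in practice
    end;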

     

     

    • Thanks 1

  20. Git for Desktop is just a bloated Electron app... I would not recommend it.

     

    I don't use any GUI tool for git.

    For a simple git workflow:

    - on Linux, I use some simple scripts: https://github.com/synopse/mORMot2/blob/master/commit.sh and https://github.com/synopse/mORMot2/blob/master/kompare.sh

    - for mORMot, I made a simple VCL app which calls a source comparison tool, then calls some scripts: https://github.com/synopse/mORMot/tree/master/SQLite3/Documentation/SourceCodeRep
    (which also updates a fossil repository together with GitHub - for standalone/private repositories, https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki is awesome, Windows native, with many more features than git, a built-in web UI, and the ability to mirror to git)
