Posts posted by Arnaud Bouchez


  1. To benchmark instructions, you need specific SW tooling, and also proper HW.

    The reference is https://www.agner.org/optimize/#testp

     

    Benchmarking Sleep() doesn't make any sense, especially on Windows.

    On Windows, the Sleep() resolution is bound to the system timer, which typically ticks every 14-20 ms.

    Sleep() waits "at least" for the number of milliseconds specified.
    So in a waiting loop, you should never count the number of Sleep() iterations, but check GetTickCount64 against a timeout.
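
    A minimal sketch of such a loop, with a hypothetical readiness callback passed as a parameter (illustration only, not production code):

    type
      TReadyCheck = function: boolean; // hypothetical readiness callback

    function WaitFor(const ready: TReadyCheck; timeoutMS: cardinal): boolean;
    var
      start: UInt64;
    begin
      start := GetTickCount64; // Windows unit (SysUtils on FPC)
      repeat
        if ready() then
          exit(true);
        Sleep(1); // actually waits "at least" 1 ms, often 15 ms or more
      until GetTickCount64 - start >= timeoutMS;
      result := false; // timeout reached, measured by the clock - not by counting Sleep() calls
    end;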

    • Thanks 1

  2. 8 hours ago, Mike Torrettinni said:

    TSynDictionary (from mORMot) is also very fast, but I don't use mORMot and the license is not friendly for my commercial project.

    As David wrote, mORMot is tri-licensed (MPL/GPL/LGPL) - if you use the MPL, it is very commercial-project-friendly.

    TL;DR: nothing to pay, just mention somewhere in your software that you used it, and publish any modification you make to the source code.

     

    Another article worth looking at:
    https://www.delphitools.info/2015/03/17/long-strings-hash-vs-sorted-vs-unsorted/

    It depends on what you expect.
    Also note that for long strings, hashing may have a cost - this is why we implemented https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

    • Like 3
    • Thanks 1

  3. Personal note: each time I see GetIt involved, I remember that mORMot was never accepted as part of it because it was "breaking their license policy". In short, you could use any Delphi version (even the free edition) and create Client-Server apps with it. I guess this is the same reason the great ZEOS library or even UniDAC are not part of it, if I checked their registration correctly.

    That's why I prefer more open package solutions like Delphinus, and I have high hopes for the very promising https://github.com/DelphiPackageManager/DPM

    Old but still relevant discussion at https://synopse.info/forum/viewtopic.php?pid=17453#p17453
     

    • Thanks 1

  4. Side note: inlining is not necessarily faster.

    Sometimes, after inlining, the compiler has trouble assigning the registers properly: a sub-function with a loop may be faster when NOT inlined, because the loop index and pointer can stay in registers, whereas once inlined the stack may be used instead.

     

    The worst use of "inline;" I have seen is perhaps the inlining of FileCreate/FileClose/DeleteFile/RenameFile in SysUtils, which requires the Windows unit to be part of any unit calling them.

    With obviously no performance benefit, because those calls are slow by nature.
    Embarcadero made this mistake in early versions of Delphi, then fixed some of it (but RenameFile is still inlined!), and re-used "inline;" again when POSIX support was introduced... 😞
    I had to redefine those functions in mormot.core.os.pas so that I was not forced to write {$ifdef MsWindows} Windows, {$endif} in my units when writing cross-platform code...
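
    For illustration, a hedged sketch of such a wrapper (the actual mormot.core.os.pas code is more complete), so the Windows dependency stays in a single unit:

    unit project.os.wrappers; // hypothetical unit name, for illustration only

    interface

    procedure FileCloseSafe(F: THandle); // plain call, NOT inlined

    implementation

    uses
      {$ifdef MSWINDOWS} Windows, {$endif}
      SysUtils;

    procedure FileCloseSafe(F: THandle);
    begin
      FileClose(F); // the Windows unit is only needed here, not in the callers
    end;

    end.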

     

     

    • Like 1

  5. Your code is sometimes not correct.

    For instance, CustomSplitWithPrecount() exits directly without setting result := nil, so it won't change the value passed as input (remember that an array result is in fact an appended "var" argument).

     

    All those are micro-optimizations - not worth it unless you really need them.
    I would not use TStringList, for sure. But any other method is good enough in most cases.

     

    Also, there is no need to use a PChar and increment it.
    In practice, a loop with an index over the string is safer - and slightly faster, since you only use the i variable, which is already incremented each time.

     

    To optimize any further, I would use PosEx() to find the delimiter, which may be faster than your manual search on some targets.
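
    For illustration, a hedged sketch of such a split (a hypothetical CustomSplit, not your code), combining result := nil, an index loop and PosEx():

    uses
      SysUtils, StrUtils, Types; // PosEx() is in StrUtils, TStringDynArray in Types

    // hypothetical sketch: an index loop with PosEx(), no PChar arithmetic
    function CustomSplit(const s: string; delim: char): TStringDynArray;
    var
      n, i, j: integer;
    begin
      result := nil; // mandatory: an array result is a hidden var argument
      n := 0;
      i := 1;
      repeat
        j := PosEx(delim, s, i);
        if j = 0 then
          j := length(s) + 1;       // last field: up to the end of the string
        SetLength(result, n + 1);   // pre-counting the delimiters would avoid this
        result[n] := copy(s, i, j - i);
        inc(n);
        i := j + 1;
      until i > length(s) + 1;
    end;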

     

    The golden rule is to make it right first.

    Then make it fast - only if it is worth it, and I don't see why it would be worth it.

    • Thanks 1

  6. First, all managed types (string, variant, dynamic arrays) are already initialized to zero by the compiler.

     

    What you can do is define all the local variables to be cleared inside a record, then call FillChar() on it.
    There won't be any performance penalty:

     

    procedure MyFunction;
    var
      i: integer;
      loc: record
        x,y,z: integer;
        a: array[0..10] of double;
      end;
    begin
      writeln(loc.x); // write random value (may be 0)
      FillChar(loc, SizeOf(loc), 0);
      writeln(loc.x); // write 0 for sure
      for i := 0 to high(loc.a) do
        writeln(loc.a[i]); // will write 0 values
    end;

        

     

    But as a drawback, all those variables will be forced onto the stack, so the compiler won't be able to optimize them into registers - which is what happens for the "i" variable above.
    So don't put ALL local variables in the record, only those which need to be initialized.

     

    Anyway, if you have a lot of variables and a lot of code in a method, it may be time to refactor it and use a dedicated class to implement the logic.
    This class could contain the variables of the "record" in my sample code.
    You could keep this class in the implementation section of your unit, for safety.
    It will be the safest way to debug - and test!
    One huge benefit of a dedicated class for any complex process is that it can be tested.
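
    A minimal sketch of such a refactoring (hypothetical names), keeping the class in the implementation section:

    type
      // hypothetical class holding what used to be the local "record"
      TMyProcess = class
      private
        x, y, z: integer;
        a: array[0..10] of double;
      public
        procedure Run;
      end;

    procedure TMyProcess.Run;
    begin
      // instance fields are zero-initialized at Create, so no FillChar() is needed
      writeln(x, ' ', a[0]); // both fields read as zero
    end;

    procedure MyFunction;
    var
      process: TMyProcess;
    begin
      process := TMyProcess.Create;
      try
        process.Run; // the logic is now isolated - and can be unit-tested
      finally
        process.Free;
      end;
    end;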


  7. If you look at the asm - at least on FPC - CtrNistCarryBigEndian() is in fact inlined, so it has very little impact. It is called only 1/256th of the time, and adds just a couple of inc/test opcodes.
    Using branchless instructions seems pointless in this part of the loop: DoBlock() takes dozens of cycles for sure, and the bottleneck is likely to be the critical section.

    Also note that 2^24 depends on the re-seed parameter, which may be set to something more than 2^24*16 bytes (NIST even seems to allow up to 2^48), so a 3-byte counter won't be enough.

    CtrNistCarryBigEndian() is a nice and readable solution, in the context of filling a single block of 16 bytes.
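
    For illustration only (not the actual mORMot code), such a big-endian counter increment over a 16-byte block looks like this - the carry loop runs only when the low byte wraps, i.e. 1/256th of the time:

    type
      TAesBlock16 = array[0..15] of byte; // local type, for illustration

    procedure CtrIncBigEndian(var iv: TAesBlock16);
    var
      i: integer;
    begin
      inc(iv[15]);        // fast path: just increment the last (lowest) byte
      if iv[15] = 0 then  // wrapped: propagate the carry upwards
        for i := 14 downto 0 do
        begin
          inc(iv[i]);
          if iv[i] <> 0 then
            break;        // stop as soon as there is no more carry
        end;
    end;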


    The current 32MB default for the reseed value is still far below the NIST advice of 2^48. We chose 32MB from a user perspective - the previous limit was 1MB, which was really paranoid.
    Anyway, if an application needs a lot of random values, it will instantiate its own TAesPrng, with a proper reseed, for each huge random need.


  8. On 1/19/2021 at 1:04 PM, RDP1974 said:

    I'm using with great satisfaction Delphi x Linux compiler with Firedac pooling, SOAP indy based custom SSL webservices -> very small and very fast, nobody is using the same toolchain?

    Nope: FPC on Linux + the mORMot DB and SOA layers, for years now. With high performance and stability - we had servers handling thousands of requests per second, receiving TB of data, running for months with no restart and no problem. Especially with our MM, which uses much less memory than TBB.

     

    One problem I noticed on Linux with C memory managers running FPC services is that they are subject to SIGABRT if they encounter any memory problem.
    This is why we worked on our own https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas which consumes much less memory than TBB, and if there is a problem in our code, we get a GPF exception we can trace, not a SIGABRT which kills the process. I can tell you that a SIGABRT for a service is a disaster - it always happens when you are far AFK and can't react quickly. And if you need to install something like https://mmonit.com/monit/ on your server, it becomes complicated...


  9. Two blog posts to share:

     

    https://blog.synopse.info/?post/2021/02/13/Fastest-AES-PRNG%2C-AES-CTR-and-AES-GCM-Delphi-implementation

     

    https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

     

    TL;DR: new AES assembly code boosts the AES-CTR, AES-GCM, AES-PRNG and AES-HASH implementations, especially on x86_64, for mORMot 2.
    It outperforms OpenSSL for AES-CTR and AES-PRNG, and is orders of magnitude faster than every other Delphi library I know about.

    • Like 4

  10. New hasher in town, to test and benchmark:
    https://blog.synopse.info/?post/2021/02/12/New-AesNiHash-for-mORMot-2

     

    Murmur and xxHash are left far behind, in terms of speed, and I guess also in terms of collisions... 15GB/s on my Core i3, on both Win32 and Win64.

     

    The smallest lengths of 0-15 bytes are handled without any branch, 16-128 bytes involve no loop, and 129+ bytes are hashed 128 bytes per iteration.
    Also note its anti-DoS ability, thanks to its random seed at process startup.
    So it was especially tuned for a hashmap/dictionary.

    • Like 1
    • Thanks 1

  11. 1. Use RawByteString instead of AnsiString if you don't want to force any conversion.

    2. Note that the Ansi*() functions are not all meant to deal with AnsiString: most expect string/UnicodeString types, and deal with the current system locale, e.g. for comparison or case folding...
    A bit confusing indeed...

     

    3. Consider using your own version of such functions - as we did with https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas - so you are sure there is no hidden conversion.

     

    4. The main trick is indeed to never leave any 'Implicit string cast' warning unfixed.
    And sometimes use Alt+F2 to see the generated asm, and check that there is no hidden "call" during the conversion (see the small sketch after this list).

     

    5. Another good idea is to write some unit tests of your core process, decoupled from TCP itself: write them in the original pre-Unicode Delphi, then recompile the code with the Unicode version of Delphi and ensure they still pass.
    It will save you a lot of time!
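
    As a small sketch of points 1 and 4 (hypothetical code, assuming the peer sends UTF-8):

    // keep raw network bytes in a RawByteString: no code-page conversion
    // is forced behind your back, whatever the bytes contain
    procedure OnTcpData(const data: RawByteString);
    var
      txt: string;
    begin
      if data = '' then
        exit;
      // make any conversion explicit, so no 'Implicit string cast' warning
      // is left unfixed, and no hidden "call" appears in the generated asm
      txt := UTF8ToString(data); // assumption: the peer sent UTF-8
      writeln(txt);
    end;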

     

     


  12. From my experience, Delphi has a lot of trouble running under Wine. IIRC Delphi 7 starts, but debugging is not possible. Newer versions didn't start without adding some dependencies.

    So Wine is not an option for the Delphi IDE itself.

     

    On the contrary, regular VCL apps work well on Wine, if the UI components are mostly standard.
    You may also check https://winebottler.kronenberg.org/ which is a way of packaging a Windows executable into a Mac app, embedding Wine within the package.


  13. 21 hours ago, Fr0sT.Brutal said:

    If that "eaten" memory would be unused otherwise why you bother about that consumption? I suspect they just dynamically reserve as much memory as possible for internal needs.

    No, it was not just "reserved", there were a lot more dirty pages with Intel TBB.

    We tried it in production on Linux, on high-end servers with heavy multi-threaded processes, and the resident size (RES) was much bigger - not only the virtual/shared memory (VIRT/SHR).

     

    Also the guys from https://unitybase.info - who run very demanding services - evaluated and rejected Intel TBB. Either the glibc MM https://sourceware.org/glibc/wiki/MallocInternals or our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas gives good results on Linux, with low memory consumption.


    Anyway, I wouldn't use Windows to host demanding services. So if you have a Windows server with a lot of memory, you are free to use Intel TBB if you prefer.

    • Like 1

  14. The 3rd party dlls are Intel TBB, if I am correct.

    So you should at least mention it, with the proper licence terms, and provide a link.

     

    About memory management, from my tests the Intel TBB MM is indeed fast, but it eats all the memory, so it is not usable for any serious server-side software running for a long time.

    Some numbers, tested on FPC/Linux, but you get the idea:

        - FPC default heap
         500000 interning 8 KB in 77.34ms i.e. 6,464,959/s, aver. 0us, 98.6 MB/s
         500000 direct 7.6 MB in 100.73ms i.e. 4,963,518/s, aver. 0us, 75.7 MB/s
        - glibc 2.23
         500000 interning 8 KB in 76.06ms i.e. 6,573,152/s, aver. 0us, 100.2 MB/s
         500000 direct 7.6 MB in 36.64ms i.e. 13,645,915/s, aver. 0us, 208.2 MB/s
        - jemalloc 3.6
         500000 interning 8 KB in 78.60ms i.e. 6,361,323/s, aver. 0us, 97 MB/s
         500000 direct 7.6 MB in 58.08ms i.e. 8,608,667/s, aver. 0us, 131.3 MB/s
        - Intel TBB 4.4
         500000 interning 8 KB in 61.96ms i.e. 8,068,810/s, aver. 0us, 123.1 MB/s
         500000 direct 7.6 MB in 36.46ms i.e. 13,711,402/s, aver. 0us, 209.2 MB/s
        for multi-threaded processes, we observed the best scaling with TBB on this system
        BUT memory consumption rose to about 60x more space (glibc=2.6GB vs TBB=170GB)!
        -> so for serious server work, glibc (FPC_SYNCMEM) sounds like the best candidate

     

    • Like 1

  15. If a method does two different actions, define two methods.

     

    If a method performs an action which is something on/off or enabled/disabled, then you can use a boolean, provided the false/true meaning is clearly defined by the naming of the method.

     

    If a method performs something, but with a custom behavior, don't use a boolean (or several booleans) but an enumeration, or even better a set.

    It will be much easier to understand what it does, without looking at the parameter names, and it will be more open to new options/behaviors.

     

    function TMyObject.SaveTo(json: boolean): string;
    // what is the behavior with json=false?
    
    function TMyObject.SaveToJson(expanded: boolean): string;
    // what does SaveToJson(true/false) mean without knowing the parameter name?
    
    function TMyObject.SaveToJson(expanded, usecache: boolean): string;
    // what does SaveToJson(true/false, true/false) mean without knowing the parameters names?
    
    type
      TMyObjectSaveToJsonOptions = set of (sjoExpanded, sjoUseCache);
      
    function TMyObject.SaveToJson(options: TMyObjectSaveToJsonOptions): string;
    // you understand what SaveToJson([]) or SaveToJson([sjoExpanded]) or SaveToJson([sjoExpanded, sjoUseCache]) means

     

    • Like 1
    • Thanks 1

  16. All this is a pointless discussion.

    This code is just broken and should be fixed. It has nothing to do with const or whatever. The compiler is doing what it should, but the code is plain wrong.

    I fully agree with @David Heffernan here.

     

    About the "address", it should be pointer(Value) not @Value.

    pointer(value) returns the actual pointer of the string content in heap, so will change. It is a faster alternative to @value[1] which works also with value='' -> pointer(value)=nil.

    @Value returns the memory adress of the local Value variable on the stack, so won't change.
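
    A small sketch to illustrate the difference:

    procedure ShowStringPointers;
    var
      s: string;
    begin
      s := 'abc';
      writeln(NativeUInt(@s));          // address of the local variable: stable
      writeln(NativeUInt(pointer(s)));  // address of the string content
      s := s + 'def';                   // reallocation: the content may move
      writeln(NativeUInt(@s));          // same value as before
      writeln(NativeUInt(pointer(s)));  // most likely a different value now
      s := '';
      writeln(NativeUInt(pointer(s)));  // 0, since pointer('') = nil
    end;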

    • Like 1

  17. Binary is not text, so it is pointless for your problem. You need the integers to be written as text, not as 4-byte binary values.

     

    You could also write directly to the TWriteCachedFileStream, without any temporary string.
    And append the integer values using a shortstring and the old str() procedure instead of IntToStr(), which uses the heap.
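
    A hedged sketch of that approach, for any TStream (such as the TWriteCachedFileStream mentioned above):

    uses
      Classes; // TStream

    // convert each integer on the stack with str(), then write the
    // shortstring bytes directly: no heap allocation is involved
    procedure AppendIntegersAsText(stream: TStream; const values: array of integer);
    var
      i: integer;
      tmp: shortstring;
    begin
      for i := 0 to high(values) do
      begin
        str(values[i], tmp);   // text conversion into a stack buffer
        tmp := tmp + #13#10;   // shortstring concat: still no heap
        stream.WriteBuffer(tmp[1], length(tmp));
      end;
    end;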


  18. Signing the executable is the key here.
    Also run a minimal security audit: a password should be hashed, and never stored in the executable itself.

    It has nothing to do with Delphi. It was poor security design in the application.

     

    Regarding logic security and reverse engineering, Java or C# are much worse than Delphi.
    You can easily decompile Java or C# executables... unless they have been explicitly obfuscated.

    I can tell you that I have "hacked" quite a few C# dlls for which we had lost the source... 😉

    Whereas a Delphi exe is compiled to native code, and lacks a lot of RTTI, so it is much more difficult to extract something from it.

    • Like 1

  19. Note that you don't store the content, you re-assign it for each new line.

    So you are testing something unrealistic, which is not worth trying to optimize.

     

    What is slow is not moving the data, but the memory allocation.

    One performance problem is the temporary string allocation, if you call Integer.ToString.

    For our mORMot TTextWriter, we don't use any temporary allocation, and we even have pre-computed text for the smallest integers.

     

    Note that the Delphi TStringBuilder will actually be slower on Win32 than naive concatenation.

    It also allocates a temporary string when appending an integer... 😞

    https://www.delphitools.info/2013/10/30/efficient-string-building-in-delphi/3/

     

    I would stick with naive string concatenation, and I guess it will be fast enough in practice.

    It would be premature optimization otherwise.
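
    For illustration, the kind of naive concatenation I mean (each IntToStr() still allocates a small temporary string, but it is usually good enough):

    uses
      SysUtils; // IntToStr()

    function BuildLines(count: integer): string;
    var
      i: integer;
    begin
      result := '';
      for i := 1 to count do
        result := result + IntToStr(i) + sLineBreak; // naive, but fast enough in practice
    end;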

     

     

    • Thanks 1

  20. Git for Desktop is just a bloated Electron app... I would not recommend it.

     

    I don't use any GUI tool for git.

    For a simple git workflow:

    - on Linux, I use some simple scripts: https://github.com/synopse/mORMot2/blob/master/commit.sh and https://github.com/synopse/mORMot2/blob/master/kompare.sh

    - for mORMot, I made a simple VCL app which calls a source comparison tool, then calls some scripts: https://github.com/synopse/mORMot/tree/master/SQLite3/Documentation/SourceCodeRep
    (which also updates a fossil repository together with GitHub - for standalone/private repositories, https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki is awesome, Windows native, with many more features than git, a built-in web UI, and the ability to mirror to git)
