Everything posted by Arnaud Bouchez

  1. Note: if you read the file from start to end, memory-mapped files are not faster than reading the whole file into memory. The page faults make them slower than a regular single FileRead() call. For huge files on Win32 which cannot be loaded entirely into memory, you may use temporary chunks (e.g. 128MB). And if you really load it once and don't want to pollute the OS disk cache, consider using the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows. This is what we do with mORMot's FileOpenSequentialRead(). https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493
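A minimal sketch of what such a sequential-read open may look like with the raw Windows API (Windows-only; mORMot's FileOpenSequentialRead() wraps something similar, but its exact implementation may differ):

```pascal
uses
  Windows;

// Open a file for read-only sequential access, hinting the OS cache
// with FILE_FLAG_SEQUENTIAL_SCAN so read-ahead is favored and pages
// are evicted quickly once consumed - the file content does not
// pollute the system cache after a single pass.
function OpenSequentialRead(const FileName: string): THandle;
begin
  Result := CreateFile(PChar(FileName), GENERIC_READ,
    FILE_SHARE_READ or FILE_SHARE_WRITE, nil, OPEN_EXISTING,
    FILE_FLAG_SEQUENTIAL_SCAN, 0);
end;
```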
  2. Yes, delete() is as bad as copy(), David is right! The idea is to keep the input string untouched, then append the output to a new output string, preallocated once with the maximum potential size. Then call SetLength() once at the end, which is likely to do nothing or shrink the content in place, thanks to the heap manager.
  3. You could do it with NO copy() call at all. Just write a small state machine and read the input one char per char.
  4. The main trick is to avoid memory allocation, i.e. temporary string allocations. For instance, the fewer copy() calls, the better. Try to rewrite your code to allocate one single output string per input string. Just parse the input string from left to right, applying the quotes or dates processing on the fly. Then you could also avoid any input line allocation, and parse the whole input buffer at once. Parsing 100,000 lines could be done much quicker, if properly written. I guess around 500MB/s is easy to reach. For instance, within mORMot, we parse and convert 900MB/s of JSON in pure pascal, including string unquoting.
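As a sketch of the pattern described above (a hypothetical quote-doubling helper, not mORMot code): preallocate the output once with its maximum potential size, walk the input one char at a time as a tiny state machine, and call SetLength() a single time at the end:

```pascal
// Copy Input to Result, doubling any single quote on the fly,
// with one single output allocation and no copy()/delete() calls.
function QuoteValues(const Input: string): string;
var
  s, d: PChar;
begin
  SetLength(Result, Length(Input) * 2); // maximum potential size
  s := PChar(Input);
  d := PChar(Result);
  while s^ <> #0 do
  begin
    d^ := s^;
    Inc(d);
    if s^ = '''' then
    begin
      d^ := ''''; // double the quote
      Inc(d);
    end;
    Inc(s);
  end;
  SetLength(Result, d - PChar(Result)); // single final adjustment
end;
```

The final SetLength() only shrinks the string, so the heap manager is likely to keep the buffer in place instead of reallocating it.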
  5. This is a bit tricky (a COM object) but it works fine on Win32. You have the source code at https://github.com/synopse/SynProject We embedded the COM object as a resource within the main exe, and it is uncompressed and registered for the current user.
  6. ShortStrings have a big advantage: they can be allocated on the stack. So they are perfect for small ASCII text, e.g. number-to-text conversion, or when logging information in plain English. Using str() on a shortstring is for instance faster than using IntToStr() and a temporary string (or AnsiString) in a multi-threaded program, because the heap is not involved. Of course, we could use a static array of AnsiChar, but ShortStrings have the advantage of carrying their own length, so they are safer and faster than #0-terminated strings. So on mobile platforms, we could end up creating a new record type, re-inventing the wheel, whereas the ShortString type is in fact still supported and generated by the compiler, and even used by the RTL at its lowest system level. ShortStrings have been deprecated... and hidden. They can even be restored/unhidden by some tricks like https://www.idefixpack.de/blog/2016/05/system-bytestrings-for-10-1-berlin Why? Because some people at Embarcadero thought it was confusing, and that the language should be "cleaned up" - which I translate as "made more C#/Java-like", with a single string type. This was the very same reason they hid RawByteString and AnsiString... More a marketing strategy than a technical decision IMHO. I prefer the more "conservative" FPC way of preserving backward compatibility. It is worth noting that the FPC compiler source code itself uses a lot of shortstrings internally, so the type will never be deprecated on FPC for sure. 😉
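A sketch of the heap-free pattern described above (assuming a Delphi or FPC compiler where str() and shortstring are still available):

```pascal
// Convert a number to text with no heap allocation: the shortstring
// lives entirely on the stack, so no lock-protected heap call is
// involved - which matters in a multi-threaded program.
procedure LogValue(Value: Integer);
var
  tmp: shortstring;
begin
  str(Value, tmp); // stack-only conversion, heap untouched
  // ... append tmp to a log buffer, e.g. Write(tmp)
end;
```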
  7. About TGUID type...

    Using a local TGUID constant seems the best solution. It is clean and fast (no string/hex conversion involved, just copy the TGUID record bytes). No need to make anything else, or ask for a feature request I guess, because it would be purely cosmetic.
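For illustration, such a local TGUID constant may be declared like this (the GUID value below is hypothetical):

```pascal
const
  // The compiler converts the hex literal into the 16 raw TGUID
  // bytes at compile time - no string parsing happens at runtime.
  CLSID_MyService: TGUID = '{D91A9911-3B28-4A71-9B9E-6C2A0F5E1234}';
```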
  8. Why compiler allows this difference in declaration?

    IIRC it was almost mandatory to work with Ole Automation and Word/Excel. Named parameters are the usual way of calling the Word/Excel Ole API, from Office macros or Visual Basic. So Delphi had to support this too. And it is easy to implement named parameters with OLE, because in its ABI, parameters are... indeed named. Whereas for a regular native code function ABI, parameters are not named, but passed in order in registers or on the stack. So implementing named parameters in Delphi would have been feasible, but would require more magic, especially for the "unnamed" parameters. It was requested several times in the past, especially by people coming from a Python background. Named parameters can make calls more explicit. So the question is more for Embarcadero people. 😉 You can emulate this using a record or a class to pass the values, but it is not as elegant.
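The record-based emulation mentioned above could be sketched like this (all names here are hypothetical, for illustration only):

```pascal
type
  // Group the "named parameters" into a record with default values.
  TSaveOptions = record
    FileName: string;
    ReadOnly: Boolean;
    Password: string;
  end;

procedure SaveDocument(const Options: TSaveOptions);
begin
  // ... use Options.FileName, Options.ReadOnly, Options.Password
end;

procedure Demo;
var
  opt: TSaveOptions;
begin
  opt := Default(TSaveOptions); // all fields zeroed/empty
  opt.FileName := 'report.doc'; // only "name" the values you need
  opt.ReadOnly := True;
  SaveDocument(opt);
end;
```

The call site reads almost like named parameters, at the cost of one extra local variable.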
  9. Yes, I have seen the handcrafted IMT. But I am not convinced the "sub rcx, xxx; jmp xxx" code block makes a noticeable performance penalty - perhaps only in irrelevant micro-benchmarks. The CPU lock involved in calling a Delphi virtual method through an interface has a cost for sure https://www.idefixpack.de/blog/2016/05/whats-wrong-with-virtual-methods-called-through-an-interface - but not a sub + jmp. Also, the enumerator instance included in the list itself seems a premature optimization to me if we want to stay cross-platform. Calling GetThreadID on a non-Windows platform has a real cost if you call the pthread library. And the resulting code complexity makes me wonder if it is really worth it in practice. Better to switch to a faster memory manager, or just use a record and rely on inlining + stack allocation of an unmanaged enumerator. Since you implemented both of those optimizations: OK, just continue to use them. But I won't go that way with mORMot. Whereas I still find your interface + pre-compiled folded classes an effective and nice trick (even if I just disable byte and word stubs for dictionaries: if you have a dictionary, the hash table will be bigger than the size of the byte/word data itself - so just use integers instead of byte/word for dictionaries in end-user code; but I still have byte/word specializations for IList<>, which has no hash by default, but may build one on demand).
  10. Nice timings. Yes, you are right: in mORMot we only use basic iteration for "for ... in ... do" statements, without the further composition and flexibility available with IEnumerable<T>. The difference for small loops is huge (almost 10 times) and for big loops is still relevant (2 times) when records are used. I guess mORMot pointer-based records could be slightly faster than RTL index-based values, especially when managed types are involved. In practice, I find "for ... in ... do" to be the main place for iterations. So to my understanding, records are the way to go for mORMot. Then nothing prevents another method from returning a complex and fluid IEnumerable<T>. We just don't go that way in mORMot yet.
  11. I discovered that using a record as TEnumerator makes the code faster, and allows some kind of inlining, with no memory allocation. My TSynEnumerator<T> record only uses 3 pointers on the stack, with no temporary allocation. I prefer using pointers here to avoid any temporary storage, e.g. for managed types. And the GetCurrent and MoveNext functions have no branch and inline very aggressively. And it works with Delphi 2010! 😉 A record here sounds like a much more efficient idea than a class + interface, as @Stefan Glienke did in Spring4D. Stefan, do you have any reason to prefer interfaces for the enumerator instead of a good old record? From my findings, performance is better with a record, especially thanks to inlining - which is perfect on FPC, whereas Delphi still calls MoveNext and doesn't inline it. It also avoids a try...finally block for most simple functions, and any heap allocation. Please check https://github.com/synopse/mORMot2/commit/17b7a2753bb54057ad0b6d03bd757e370d80dce2
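As a sketch of the record-based enumerator pattern (a hypothetical integer-only enumerator, not the actual generic TSynEnumerator<T> code):

```pascal
type
  // A record enumerator over a dynamic array: two pointers on the
  // stack, no class instance, no interface, no try..finally needed.
  TIntegerEnumerator = record
  private
    FCurrent, FAfterLast: PInteger;
  public
    procedure Init(const Values: TArray<Integer>);
    function MoveNext: Boolean; inline;
    function GetCurrent: Integer; inline;
    property Current: Integer read GetCurrent;
  end;

procedure TIntegerEnumerator.Init(const Values: TArray<Integer>);
begin
  FCurrent := Pointer(Values);
  FAfterLast := FCurrent;
  if FCurrent <> nil then
    Inc(FAfterLast, Length(Values));
  Dec(FCurrent); // MoveNext will pre-increment
end;

function TIntegerEnumerator.MoveNext: Boolean;
begin
  Inc(FCurrent);
  Result := PtrUInt(FCurrent) < PtrUInt(FAfterLast); // single test
end;

function TIntegerEnumerator.GetCurrent: Integer;
begin
  Result := FCurrent^; // branchless, trivially inlined
end;
```

Both methods are single-expression bodies, so FPC (and, for GetCurrent, Delphi) can inline them into the calling loop.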
  12. I just found out that Delphi has big troubles with static arrays like THash128 = array[0..15] of byte. If I remove those types, then the Delphi compiler does not emit "F2084 Internal Error" any more... So I will stick with the same kind of types as you did (byte/word/integer/int64/string/interface/variant) and keep those bigger ordinals (THash128/TGUID) using regular (bloated) generics code on Delphi (they work with no problem on FPC 3.2). Yes, I get a glimpse of what you endure, and I thank you very much for having found such nice tricks in Spring4D. The interface + folded types pattern is awesome. 😄
  13. Is it only me, or do awful and undocumented problems like "F2084 Internal Error: AV004513DE-R00024F47-0" occur when working with generics in Delphi? @Stefan Glienke How did you manage to circumvent the Delphi compiler limitations? On XE7 for instance, as soon as I use intrinsics I encounter those problems - and the worst is that they are random: sometimes the compilation succeeds, but other times there is the Internal Error, sometimes on Win32, sometimes on Win64... 😞 I got rid of as much "inlined" code as possible, since I discovered generics do not like calling "inline" code. In comparison, FPC seems much more stable. The Lazarus IDE is somewhat lost within generics code (code navigation is not effective) - but at least, it doesn't crash and you can work as you expect. FPC 3.2 generics support is somewhat mature: I can see by disassembling the .o files that intrinsics are properly handled by this compiler. Which is good news, since I rely heavily on them for folding base generics specializations using interfaces, as you did with Spring4D.
  14. 64 bit compiler problem

    My guess is that the default data alignment may have changed between Delphi 10.2 and 10.3, so the static arrays don't have the same size. You may try to use "packed" for all the internal structures of the array.
  15. @Stefan Glienke If you can, please take a look at a new mORMot 2 unit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas In respect to generics.collections, this unit uses interfaces as variable holders, and leverages them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC. Most of the unit is in fact about embedding some core collection types into mormot.core.collections.dcu, to reduce the user units and executable size, for Delphi XE7+ and FPC 3.2+. Thanks a lot for the ideas! It also publishes TDynArray and TSynDictionary high-level features, like JSON/binary serialization or thread safety, with generics strong typing. More TDynArray features (like sorting and search) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming.
  16. 64 bit compiler problem

    The stack is always "unloaded" when the method returns. That is a fact for sure. There is a "mov rbp, rsp" in the function prolog, and a reversed "mov rsp, rbp" in the function epilog. Nice and easy. Look at the stack trace in the debugger when you reach the stack overflow problem: you will find out the exact context. And switch to a dynamic array, using proper Copy() calls if you want to work on a local copy.
  17. 64 bit compiler problem

    @Kas Ob. The Win32 code is fine: it fills the stack variables with zeros, and my guess is that the size fits the TProgrammaCalcio size. The problem is really a stack overflow, not infinite recursion. The compiler has no bug here; the user code is faulty. @marcocir Did you read my answer? There is no unexpected recursion involved. The problem is that this TProgrammaCalcio is too huge to fit on the stack. On Win64, this is what mov [rsp+rax],al does: it reserves the stack space by probing it in 4KB chunks (the memory page size), writing one single byte per page to trigger a page fault - and a stack overflow if there is no stack space available. And you are using too much of it with your local variable. On Win32, it doesn't probe the memory in 4KB chunks, but fills the stack with zeros - which is slower, but has exactly the same effect. What does SizeOf(TProgrammaCalcio) return? Is it really a static array? You should switch to a dynamic array instead. A good practice is not to allocate more than 64KB on the stack, especially if there are some internal calls to other functions. In some execution contexts (e.g. a dll hosted by IIS), the per-thread stack size could be as little as 128KB.
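The suggested fix - moving the big data from the stack to the heap - can be sketched like this (the actual definition of TProgrammaCalcio is not shown in the thread, so the types below are hypothetical):

```pascal
type
  TPanel = record
    Name: string[31];
    Score: Integer;
  end;

  // Before: a huge static array - allocated on the stack of every
  // procedure declaring a local variable of this type:
  //   TProgrammaCalcio = array[0..99999] of TPanel;

  // After: a dynamic array - the local variable is just a pointer,
  // and the data lives on the heap.
  TProgrammaCalcio = array of TPanel;

procedure Elabora;
var
  prog: TProgrammaCalcio; // one pointer on the stack
begin
  SetLength(prog, 100000); // heap allocation, zero-initialized
  // ... work with prog[i] exactly as before
end;
```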
  18. TDynArrayHashed is a wrapper around an existing dynamic array. It doesn't hold any data, but works with TArray<T>. You could even maintain several hashes for several fields of the array items at the same time. TSynDictionary is a class holding two dynamic arrays defined at runtime (one for the keys, one for the values). So it is closer to the regular collections. But it has additional features (like thread safety, JSON or binary serialization, deprecation after timeout, embedded data search...), which could make the comparison unfair: it does more than lookups. The default syntax is using the old-fashioned TypeInfo(), but you could wrap it into generics syntax instead, using TypeInfo(T). So I guess the comparison is fair. And the executable size would not be bloated, because there is no generic involved in the actual implementation of the TSynDictionary class in mORMot. Removing generic/parametrized code as soon as possible is what every library is trying to do, both the Delphi RTL and Spring4D. The RTL and Spring4D were rewritten several times to keep the generic code at the syntax level as much as possible, and try to leverage compiler magic as much as possible to avoid duplicated or dead code. This is what Spring4D 2.0 is all about. mORMot just allows avoiding duplicated or dead code regeneration completely.
  19. 64 bit compiler problem

    The Win64 assembly is not inefficient. It is almost the same code as the Win32 version. And my guess is that the Win64 code would be faster. The on-stack TPanel is initialized with an explicit zero fill of all bytes in a loop on Win32, whereas on Win64 only the managed fields within TPanel are zero-filled, if I understand the generated asm correctly. Then the RTL calls are comparable. So at execution, my guess is that the Win64 version is likely to be faster. So the compiler is not faulty. On the other hand, your code could be enhanced. One simple optimization would be to write "const Programma: TProgrammaCalcio;" for your parameter, to avoid an RTL array copy call (stack allocation + movsd/movsq + AddRefArray). Using "const" for managed types (string, arrays, records) is always a good idea. The actual stack variable use may be the real cause of the Stack Overflow problem. My guess is that you are abusing huge variables on the stack. Each call seems to consume $8cb08 bytes of stack. Much too big! So to fix it:
    0) read more about how records and static arrays are used by Delphi, especially about memory allocation and parameter passing;
    1) use const to avoid a temporary copy on the local stack;
    2) don't use fixed-size arrays for TProgrammaCalcio, but a dynamic array allocated on the heap;
    3) TPanel may also be somewhat big - consider using classes and not records, to use the heap instead of the stack;
    4) if you are sure that heap allocation is a bottleneck (I doubt it here), you may create a temporary TPanel within the main TfrmFixedCalcio instance and reuse it.
    Before wondering about the code generated by the Delphi compiler, please review your own code, which is likely to be the real bottleneck for performance. And in such a method, the FMX process and the GUI computation are likely to be the real bottleneck, not your Delphi code, and especially not the prolog/epilog generated by the Delphi compiler. "Premature Optimization is the root of all evil"! (DK)
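The const suggestion above, as a minimal sketch (TProgrammaCalcio's real definition is not shown in the thread, so the payload type here is hypothetical):

```pascal
type
  TProgrammaCalcio = array of Integer; // hypothetical payload

// Without const, passing the dynamic array by value triggers an RTL
// helper call on every invocation (reference counting via
// AddRefArray, plus stack setup for the local copy reference).
procedure ProcessByValue(Programma: TProgrammaCalcio);
begin
  // ...
end;

// With const, only a reference is passed: no copy, no ref-counting,
// no hidden try..finally to release the local reference.
procedure ProcessByConst(const Programma: TProgrammaCalcio);
begin
  // ...
end;
```

The same rule applies to string and record parameters: declare them const whenever the callee does not modify them.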
  20. Nice reading. If you make some performance comparisons with other libraries, ensure you use the latest mORMot 2 source code - not mORMot 1, but mORMot 2. Some rewrites and optimizations have been included in the latest mORMot 2. And consider using TDynArrayHashed instead of TSynDictionary if you want a fair comparison - TSynDictionary takes a lock to be thread-safe, and copies the retrieved data at lookup, so it is not a fair comparison with regular mapping classes.
  21. Here is for instance how FPC compiles our PosExPas() function for Linux x86_64: https://gist.github.com/synopse/1e30b30a77f6b0288310115085401c1e You can see the resulting asm is very efficient, thanks to the goto used in the pascal code, and proper use of pointers/registers. 😉 You may find some inspiring string processing in our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.unicode.pas#L1404 unit. This link gives you some efficient case-insensitive processing over not only the Latin A-Z chars, but the whole Unicode 10.0 case folding tables (i.e. it works with Greek, Cyrillic, or other folds). This code is UTF-8 focused, because we use it instead of UTF-16 in our framework, for faster processing with minimal memory allocation and usage. But you will find some pascal code which is as fast as manual asm without SIMD support. For better performance, branchless SIMD is the key, but it is much more complex and error-prone. The main trick about case-insensitive search is that a branchless version using a lookup table for A-Z chars is faster than cascaded cmp/ja/jb branches on modern CPUs. We just extended this idea to Unicode case folding.
  22. In practice, SSE 4.2 is not faster than regular SSE 2 code for strcmp and strlen. More complex processing may benefit from SSE 4.2 - but the fastest JSON parser I have seen doesn't use it, and focuses on micro-parallelized branchless processing with regular SIMD instructions - see https://github.com/simdjson/simdjson Memory access is the bottleneck. This is what Agner measured. Regarding any asm, it is mandatory to refer to https://agner.org/optimize There you will find reference code and reference documentation about how modern asm should be written. The PosEx_Sha_Pas_2 version is one of the fastest, and probably faster than your version, even if it is written in pure pascal. For instance, reading a register then doing shr + cmp is not the fastest pattern today. The pascal version will also work on Mac and Linux, whereas your asm version would need additional code to support the POSIX ABI. We included it (with minimal tweaks, like using NativeInt instead of integer, and using an inline stub for register usage) in https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas#L7974 The first thing to do is to benchmark and compare your code with proper timing, and regular tests. Try with some complex processing, not naive loops, which tend to be biased because in naive tests the data remains in the CPU L1 cache, so the numbers are not realistic.
  23. Without a JIT, I don't see much interest in WebAsm, to be honest... I know that projects like https://github.com/wasm3/wasm3 are good enough to run most simple processes. Having a Delphi version - sad it is not FPC compatible - is a good idea, even if it is Wintel only. In terms of interpreting some logic, with proper sandboxing, a transpiler from other languages to JavaScript could be of better interest, since most high-level features (like object mapping/dictionaries, or strings) are already coded in compiled/native code. Since everyone is throwing in references, take a look at https://github.com/saghul/txiki.js coupling quickjs (the fastest interpreter, though slower than Chakra, V8 or SM) and wasm3. 🙂 Edit: the FPC WebAssembly backend is more a proof of concept, just like the LLVM backend - on which it is somehow based, at least to produce the binary output. It lacks some pascal features, and is not ready to be used on any project sharing the same code as regular FPC pascal code. Also, in this respect, the JavaScript FPC transpiler is much more advanced and usable.
  24. On Windows, you can use the UniScribe API to get the glyphs from UTF-16 sequences. We did that for our https://github.com/synopse/mORMot/blob/master/SynPdf.pas library. But it is not cross-platform at all! ICU is huge, but it is the cross-platform way of doing it. This is what we did in mORMot 2 - not for glyphs, but for case folding, comparison and codepage conversions. Check also https://github.com/BeRo1985/pucu/tree/master/src which is a pure-pascal Unicode library... I recently found https://github.com/bellard/quickjs/blob/master/libunicode.h which is very well written, and is eventually available in our mORMot2 QuickJS wrapper.
  25. QueryPerformanceCounter precision

    To benchmark instructions, you need specific software tooling, and also proper hardware. The reference is https://www.agner.org/optimize/#testp Benchmarking Sleep() doesn't make any sense, especially on Windows. On Windows, the Sleep() resolution is around the system timer interval, which is typically between 14-20ms. Sleep() waits "at least" the number of milliseconds specified. So in a waiting loop, you should never count the number of Sleep() iterations, but call GetTickCount64 with a timeout.
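The Sleep()-plus-deadline pattern described above may be sketched like this (Windows-only; TCondition and WaitFor are hypothetical names for illustration):

```pascal
uses
  Windows;

type
  TCondition = function: Boolean;

// Wait until Condition is met, or up to TimeOutMS milliseconds.
// The deadline comes from GetTickCount64 - never from counting
// Sleep() iterations, whose real duration is unreliable.
function WaitFor(Condition: TCondition; TimeOutMS: Cardinal): Boolean;
var
  deadline: UInt64;
begin
  deadline := GetTickCount64 + TimeOutMS;
  repeat
    if Condition then
      Exit(True);
    Sleep(1); // may actually last a whole timer tick (up to ~16ms)
  until GetTickCount64 >= deadline;
  Result := False;
end;
```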