Jump to content

Leaderboard


Popular Content

Showing content with the highest reputation on 06/03/21 in Posts

  1. I have found it necessary to write small apps to derive what I need from reports out of PAL. The noise level is quite high, and the options do not address things I have needed.
  2. Btw, TranslateMessage only deals with virtual keys so is useless for worker threads. Removing it allows to save 3 lines 😉
  3. Here is for instance how FPC compile our PosExPas() function for Linux x86_64. https://gist.github.com/synopse/1e30b30a77f6b0288310115085401c1e You can see the resulting asm is very efficient. Thanks to the goto used in the pascal code, and proper use of pointers/registers, to be fair. 😉 You may find some inspiring string process in our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.unicode.pas#L1404 unit. This link give you some efficient case-insensitive process over not only latin A-Z chars, but on the whole Unicode 10.0 case folding tables (i.e. work with greek, cyrilic, or other folds). This code is UTF-8 focused, because we use it instead of UTF-16 in our framework for faster processing with minimal memory allocation and usage. But you would find some pascal code which is as fast as manual asm with no SIMD support. For better performance, branchless SIMD is the key, but it is much more complex and error prone. The main trick about case insensitive search is that a branchless version using a lookup table for A-Z chars is faster than cascaded cmp/ja/jb branches, on modern CPUs. We just enhanced this idea to Unicode case folding.
  4. In practice, SSE 4.2 is not faster than regular SSE 2 code for strcmp and strlen. More complex process may benefit of SSE 4.2 - but the fastest JSON parser I have seen doesn't use it, and focuses on micro-parallelized branchless process with regular SIMD instructions - see https://github.com/simdjson/simdjson Memory access is the bottleneck. This is what Agner measured. About any asm, it is mandatory to refer to https://agner.org/optimize There are reference code and reference documentation about how modern asm should be written. The PosEx_Sha_Pas_2 version is one of the fastest, and probably faster than your version, even if it is written in pure pascal. For instance, reading a register then shr + cmp is not the fastest pattern today. Pascal version will also work on Mac and Linux, whereas your asm version would need additional code to support the POSIX ABI. We included it (with minimal tweaks like using NativeInt instead of integer, and using an inline stub for register usage) in https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas#L7974 First thing is to benchmark and compare your code with proper timing, and regular tests. Try with some complex process, not in naive loops which tends to be biased because in naive tests the data remains in the CPU L1 cache, so numbers are not realistic.
×