Jump to content
Sign in to follow this  
Tom de Neef

Fast Pos & StringReplace for 64 bit

Recommended Posts

I've seen several queries about 64 bit equivalents for the fast string routines in Win32. And I needed them myself as well.

I have published (assembly based) functions that mimic system.Pos and sysUtils.StringReplace for 32 & 64 bit compiling (both String and AnsiString). Extra feature: searching can be case sensitive or case insensitive.

 

Timing for string search (1k substrings in 1M string, microsec, see documentation; NextStrPos is new):

mode

32 bit

 

64 bit

search

System.pos

NextStrPos

 

System.pos

NextStrPos

 

 

 

 

 

 

String

550

500

 

730

400

ANSIstring

530

460

 

3.500

360

 

Timing for String replace (1k substrings in 1M string, microsec, see documentation; StrReplace is new):

mode

32 bit

 

64 bit

replace

SysUtils.

StringReplace

StrReplace

 

SysUtils.

StringReplace

StrReplace

 

 

 

 

 

 

String

6.500

1.050

 

6.000

1.050

ANSIstring

7.400

900

 

7.000

850

 

Files and documentation on https://sourceforge.net/projects/delphi-fast-pos-stringreplace/

  • Like 3

Share this post


Link to post

Pls explain your comment about the license. I have no wish to limit use in any way. What license condition should I use then ?

Share this post


Link to post
28 minutes ago, Tom de Neef said:

Pls explain your comment about the license. I have no wish to limit use in any way. What license condition should I use then ?

Apache or MPL are the most commonly used when you want few restrictions.

Edited by dummzeuch

Share this post


Link to post
On 6/1/2021 at 10:54 PM, Stefan Glienke said:

Certainly better than the RTL - however if you want it really fast and justify implementing in asm then SSE should be used.

For reference: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2

In practice, SSE 4.2 is not faster than regular SSE 2 code for strcmp and strlen. More complex process may benefit of SSE 4.2 - but the fastest JSON parser I have seen doesn't use it, and focuses on micro-parallelized branchless process with regular SIMD instructions - see https://github.com/simdjson/simdjson
Memory access is the bottleneck. This is what Agner measured.

 

About any asm, it is mandatory to refer to https://agner.org/optimize

There are reference code and reference documentation about how modern asm should be written.

 

The PosEx_Sha_Pas_2 version is one of the fastest, and probably faster than your version, even if it is written in pure pascal. For instance, reading a register then shr + cmp is not the fastest pattern today.
Pascal version will also work on Mac and  Linux, whereas your asm version would need additional code to support the POSIX ABI.
We included it (with minimal tweaks like using NativeInt instead of integer, and using an inline stub for register usage) in https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas#L7974

 

First thing is to benchmark and compare your code with proper timing, and regular tests.
Try with some complex process, not in naive loops which tends to be biased because in naive tests the data remains in the CPU L1 cache, so numbers are not realistic.

 

Edited by Arnaud Bouchez
  • Thanks 1

Share this post


Link to post

Here is for instance how FPC compile our PosExPas() function for Linux x86_64.

https://gist.github.com/synopse/1e30b30a77f6b0288310115085401c1e

 

You can see the resulting asm is very efficient.
Thanks to the goto used in the pascal code, and proper use of pointers/registers, to be fair. 😉

 

You may find some inspiring string process in our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.unicode.pas#L1404 unit.

This link give you some efficient case-insensitive process over not only latin A-Z chars, but on the whole Unicode 10.0 case folding tables (i.e. work with greek, cyrilic, or other folds).

This code is UTF-8 focused, because we use it instead of UTF-16 in our framework for faster processing with minimal memory allocation and usage.
But you would find some pascal code which is as fast as manual asm with no SIMD support. For better performance, branchless SIMD is the key, but it is much more complex and error prone.
The main trick about case insensitive search is that a branchless version using a lookup table for A-Z chars is faster than cascaded cmp/ja/jb branches, on modern CPUs. We just enhanced this idea to Unicode case folding.

Edited by Arnaud Bouchez
  • Like 1

Share this post


Link to post

I have taken up two suggestions: change the license (to MIT) and use SSE4.2 (both 32 and 64 bit Windows platforms). Timing over 10M trials are as follows:

SEARCH:

mode

32 bit

 

64 bit

search

System.pos

NextStrPos

 

System.pos

NextStrPos

 

 

 

 

 

 

String

550

250

 

730

400

ANSIstring

530

460

 

3.500

360

 

REPLACE:

mode

32 bit

 

64 bit

replace

SysUtils.

StringReplace

StrReplace

 

SysUtils.

StringReplace

StrReplace

 

 

 

 

 

 

String

6.500

1.050

 

6.000

1.050

ANSIstring

7.400

900

 

7.000

850

 

Share this post


Link to post

64bit StrPos raises AV when either string is empty - also the overall design of your API is a bit weird to me (or I misunderstand something) - why do I need different functions when doing a one time search vs repeated searches. For example when using StrPos in a loop it gets terribly slow because of all the initialization happen in that function.

Edited by Stefan Glienke

Share this post


Link to post

Thanks for the report - I've corrected.

 

The functions are meant for repeated searches:

var t : TstrPosRec; k : integer;

begin

  t:=SetupStrPos(target,source,ignoreCase);

  repeat

     k:=NextStrPos(t);

     if k=0 then break;

     // user code, based on k

  until false;

 

Thus the initialization overhead is outside the loop.

 

When you code like this:

  k:=1-target.length;

  repeat

     k:=StrPos(target,source,k+target.length);

     if k=0 then break;

     // user code, based on k

  until false;

you bring all the initialization within the loop. The StrPos function is provided only as a replacement for system.pos (and does not offer ignoreCase for that reason).

  • Like 1

Share this post


Link to post

And I correct the timings. As someone wrote, they vary a lot with circumstances. But the relativities are a reasonable indication.

 

SEARCH:

mode

32 bit

 

64 bit

search

System.pos

NextStrPos

 

System.pos

NextStrPos

 

 

 

 

 

 

String

550

250

 

650

135

ANSIstring

530

460

 

3.500

65

 

REPLACE:

mode

32 bit

 

64 bit

replace

SysUtils.

StringReplace

StrReplace

 

SysUtils.

StringReplace

StrReplace

 

 

 

 

 

 

String

6.500

1.050

 

6.000

800

ANSIstring

7.400

900

 

7.000

650

  • Like 2

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×