Jump to content
pyscripter

CharInSet revisited (again)

Recommended Posts

Further to a previous post of mine, I have done some benchmarking to compare the different alternatives.  I am using some old benchmark code from Barry Kelly and added a few more options (word of char in, Char.IsInArray and if then).

Here are the 6 ways compared:

  1. while not CharInSet(cp^, [#0, #13, #10]) do
  2. case statement
  3.   while not (cp^ in [#0, #13, #10]) do  (raises a warning)
  4.   while not (Word(cp^) in [$0, $A, $D]) do
  5.   while not (cp^.IsInArray([#0, #13, #10])) do
  6.   while (cp^<>#0) and (cp^<>#$A) and (cp^<>#$D) do

 

Results with optimization turned on.

 

Win32

1. CharInSet:        0.037 seconds
2. case stmt:        0.039 seconds
3. set test:         0.036 seconds
4. Word set test:    0.037 seconds
5. InArray test:     0.219 seconds
6. if and test:      0.037 seconds

 

Win64

1. CharInSet:        0.106 seconds
2. case stmt:        0.047 seconds
3. set test:         0.042 seconds
4. Word set test:    0.044 seconds
5. InArray test:     0.248 seconds
6. if and test:      0.036 seconds

What is worrying is that in 64 bits both CharInSet and IsInArray (the recommended options) are much slower than the alternatives.  CharInSet is an inlined version of Option 3, but the code cannot take advantage of the specific contents of the set.  Here is the assembly code:

 

CharInSet.dpr.15: while not CharInSet(cp^, [#0, #13, #10]) do
0000000000429DAA 480FB708         movzx rcx,word ptr [rax]
0000000000429DAE 6681F9FF00       cmp cx,$00ff
0000000000429DB3 7711             jnbe P1 + $36
0000000000429DB5 480FB7C9         movzx rcx,cx
0000000000429DB9 480FA30D37000000 bt [rel $00000037],rcx
0000000000429DC1 0F92C1           setb cl
0000000000429DC4 EB02             jmp P1 + $38
0000000000429DC6 33C9             xor ecx,ecx
0000000000429DC8 84C9             test cl,cl
0000000000429DCA 74DA             jz P1 + $16

CharInSet.dpr.41: while not (cp^ in [#0, #13, #10]) do
000000000042A05A 480FB708         movzx rcx,word ptr [rax]
000000000042A05E 6683F90F         cmp cx,$0f
000000000042A062 7716             jnbe P3 + $3A
000000000042A064 66BA0100         mov dx,$0001
000000000042A068 D3E2             shl edx,cl
000000000042A06A 480FB70D3A000000 movzx rcx,word ptr [rel $0000003a]
000000000042A072 6685CA           test dx,cx
000000000042A075 0F95C1           setnz cl
000000000042A078 EB02             jmp P3 + $3C
000000000042A07A 33C9             xor ecx,ecx
000000000042A07C 84C9             test cl,cl
000000000042A07E 74D6             jz P3 + $16

CharInSet.dpr.51: while not (Word(cp^) in [$0, $A, $D]) do
000000000042A1AA 480FB708         movzx rcx,word ptr [rax]
000000000042A1AE 6683F90F         cmp cx,$0f
000000000042A1B2 7716             jnbe P4 + $3A
000000000042A1B4 66BA0100         mov dx,$0001
000000000042A1B8 D3E2             shl edx,cl
000000000042A1BA 480FB70D3A000000 movzx rcx,word ptr [rel $0000003a]
000000000042A1C2 6685CA           test dx,cx
000000000042A1C5 0F95C1           setnz cl
000000000042A1C8 EB02             jmp P4 + $3C
000000000042A1CA 33C9             xor ecx,ecx
000000000042A1CC 84C9             test cl,cl
000000000042A1CE 74D6             jz P4 + $16

Options 2 and 6 are too verbose and Options 1 and 5 too slow. Option 4 (Word(cp^) in [$0, $A, $D]) produces the same code as option 3 (cp^ in [#0, #13, #10]) without raising an exception and it sounds like a good choice.   Any downsides? 

 

Incidentally combining any of the above options with an initial test for the most common case e.g.   

while (Word(cp^) > $D) or not (Word(cp^) in [$0, $A, $D]) do

is much faster!

 

Source code attached.

CharInSet.dpr

Edited by pyscripter
  • Like 1

Share this post


Link to post

While this could be the fastest option, I'd recommend using char literals instead of ASCII codes: Word(#13) or Ord(#13) or even better Ord(CR)

Edited by Fr0sT.Brutal

Share this post


Link to post
26 minutes ago, Lars Fosdal said:

Does CharInSet support multi-byte Unicode characters?

function CharInSet(C: WideChar; const CharSet: TSysCharSet): Boolean;

Given this declaration, how could it?

 

Note that I assume that you mean UTF-16 surrogate pairs.

Share this post


Link to post
1 hour ago, Lars Fosdal said:

Does CharInSet support multi-byte Unicode characters?

All of the above assume that the characters you are looking for have a value less than 256 so that the set can be represented as a Delphi Set.  Otherwise TCharHelper.IsInArray or the case statement are the only readily available options.

Edited by pyscripter

Share this post


Link to post

@pyscripter - A char in a Unicode string is a word, not a byte.  The old set operator is useless for Scandinavian chars, as well as German umlauts, etc., unless you convert to ANSI - and that is not desirable.

Share this post


Link to post
8 minutes ago, Lars Fosdal said:

A char in a Unicode string is a word, not a byte.  The old set operator is useless for Norwegian characters, unless you convert to ANSI - and that is not desirable.

Sure.  The CharInSet and the like are useful if you want to search your unicode string for unicode chars with values less than 255 and typically less than 128.  Chars like " : CR LF \ /.   These values can be represented as a Delphi set. This is a common operation and that's why doing it efficiently is important.

Edited by pyscripter

Share this post


Link to post
18 minutes ago, pyscripter said:

Sure.  The CharInSet and the like are useful if you want to search your unicode string for unicode chars with values less than 255 and typically less than 128.  Chars like " : CR LF \ /.   These values can be represented as a Delphi set. This is a common operation and that's why doing it efficiently is important.

That the set can't have a full range of char values doesn't mean that the char C can't have any value in the BMP. For sure you aren't ever going to find a C that has ordinal above 255 but the value of C may not be known at compile time. For instance you may be looping through the elements in a string. 

  • Like 1

Share this post


Link to post
24 minutes ago, David Heffernan said:

For instance you may be looping through the elements in a string. 

Exactly.  For instance in Classes.TStringList,SetTextStr you have the following:

 

       

 while (P < PEndVal) and not (P^ in [#10, #13]) do Inc(P);

 

P is a PChar (unicode characther).  

 

The problem is that if I put such code in my program it raises a warning.  The whole (only) point of CharInSet is to avoid the warning, alas it appears suffering a speed penalty in Win64.  My original posts examines alternatives that avoid the warning and the speed penalty.

 

The warning is idiotic.   You get the same warning if you write

 

        while (P < PEndVal) and not (P^ in [AnsiChar(#10), AnsiChar(#13)]) do Inc(P);

 

So the warning is about downcasting the Unicode char to an AnsiChar.  But if you look at the generated code, the compiler does the right thing and does not downcast the Unicode char.   It compares the Word value of the Unicode char to the values of the elements of the set.  It does the same thing as if you wrote:

        while (P < PEndVal) and not (Word(P^) in [Ord(#10), Ord(#13)]) do Inc(P);

which does not raise a warning.  I would appreciate an explanation of the rationale of this warning.

Edited by pyscripter

Share this post


Link to post

Where applicable, I handle this case with, umm, case.

if CharInSet(ch, ['Ň'..'Ř'])

=>

case ch of
 'Ň'..'Ř': ..
end

but of course this only works with constants. While I sometimes miss ability to use predefined sets in case, it never been too big trouble for me. All formats I had to parse were like "ASCII for markup, Unicode for content" so I just didn't need to examine Unicode but if I had to, Unicode categories defined in System.Character could be real helpers

Share this post


Link to post

Why not suppress the warning? Or would you then be suppressing valuable warnings too? 

Edited by David Heffernan
  • Like 1

Share this post


Link to post

You could suppress just this warning by 

{$WARN WIDECHAR_REDUCED OFF}
while (P < PEndVal) and not (P^ in [#10, #13]) do Inc(P);
{$WARN WIDECHAR_REDUCED ON}

I prefer the shorter

while (P < PEndVal) and not (Ord(P^) in [10, 13]) do Inc(P);

The question still is whether the warning makes any sense.

Share this post


Link to post
1 minute ago, pyscripter said:

The question still is whether the warning makes any sense.

Unless I missed something, that's not the question in the original post. Is that what you are asking?

Share this post


Link to post

This has nothing to do with CharInSet per se but the fact that the Win64 compiler obviously is doing a poor job inlining it.

 

Win32 code looks like this (i modified it slightly using a loop var rather than increasing the PChar):

CharInSet.dpr.18: while not CharInSet(cp[i], [#0, #13, #10]) do
004C999C 0FB71446         movzx edx,[esi+eax*2]
004C99A0 6685D2           test dx,dx
004C99A3 740C             jz $004c99b1
004C99A5 6683EA0A         sub dx,$0a
004C99A9 7406             jz $004c99b1
004C99AB 6683EA03         sub dx,$03
004C99AF 75EA             jnz $004c999b

CharInSet.dpr.48: while not (cp[i] in [#0, #13, #10]) do
004C9B5B 0FB70C42         movzx ecx,[edx+eax*2]
004C9B5F 6685C9           test cx,cx
004C9B62 740C             jz $004c9b70
004C9B64 6683E90A         sub cx,$0a
004C9B68 7406             jz $004c9b70
004C9B6A 6683E903         sub cx,$03
004C9B6E 75EA             jnz $004c9b5a

Caution: if you go down that road you might feel the urge to rewrite half (if not more) of your code that is using RTL functions because the Win64 compiler does some terrible job on many of them.

 

Edit: reported codegen issue: https://quality.embarcadero.com/browse/RSP-35981

Edited by Stefan Glienke
  • Like 1

Share this post


Link to post
11 minutes ago, David Heffernan said:

Unless I missed something, that's not the question in the original post. Is that what you are asking?

The original post was about the best way to replace P^ in [#10, #13]  with something concise and efficient in order to avoid the warning assuming the warning is valid.   It also showed that the recommended way CharInSet is slow in Win64.  So I guess there are two parts to the question:

  • Can I safely ignore the warning? 
  • What is a concise and efficient way to achieve the same without a warning?

The post also suggests that Ord(P^) in [10, 13] appears to be a good way of avoiding the warning without sacrificing speed or conciseness.

Edited by pyscripter

Share this post


Link to post
2 minutes ago, pyscripter said:

Can I safely ignore the warning? 

Yes you can. After all, CharInSet is implemented using the in operator. Which is why it should give the same performance as the in operator, because of inlining, but the inline engine is failing for 64 bit.

 

For sure you can measure a difference in performance in the micro-benchmark, but can you measure it in the real program?

Edited by David Heffernan

Share this post


Link to post
7 minutes ago, David Heffernan said:

can you measure it in the real program

Yes indeed.  StringLists, TTextReader, SynEdit are full of this kind of shit.   When you load a big file scanning millions of characters that has an impact.  

Edited by pyscripter

Share this post


Link to post

If the values are in an int array, would it be possible to use a different kind of operation than a loop + comparison, f.x. the SIMD instructions?

 

Share this post


Link to post
26 minutes ago, David Heffernan said:

Yes you can. After all, CharInSet is implemented using the in operator.

That begs the question why have the warning and introduce CharInSet only to avoid the warning in the first place.  Why din't Embarcadero just eliminate the warning?

Edited by pyscripter
  • Like 1

Share this post


Link to post
35 minutes ago, Stefan Glienke said:

Caution: if you go down that road you might feel the urge to rewrite half (if not more) of your code that is using RTL functions because the Win64 compiler does some terrible job on many of them

Isn't this what you are doing with Spring4D? :classic_biggrin:

Share this post


Link to post
7 minutes ago, pyscripter said:

Isn't this what you are doing with Spring4D? :classic_biggrin:

Hence my word of warning :classic_cool: You might not want to suffer the same amount of pain as I do

Edited by Stefan Glienke
  • Like 1
  • Haha 3

Share this post


Link to post
1 hour ago, pyscripter said:

Yes indeed.  StringLists, TTextReader, SynEdit are full of this kind of shit.   When you load a big file scanning millions of characters that has an impact.  

I find that hard to believe. That operation is disk bound, or if the file is in disk cache then it is memory bound. 

  • Like 1

Share this post


Link to post
1 hour ago, pyscripter said:

That begs the question why have the warning and introduce CharInSet only to avoid the warning in the first place.  Why din't Embarcadero just eliminate the warning?

I guess there are scenarios where the warning is useful. Perhaps what they did wrong was not to make it more discerning. 

  • Like 1

Share this post


Link to post
2 hours ago, pyscripter said:

That begs the question why have the warning and introduce CharInSet only to avoid the warning in the first place.  Why din't Embarcadero just eliminate the warning?

The warning was introduced in D2009 as sets can only contain only ANSI characters not Unicode.  CharInSet was introduced that you signal to the compiler that you're aware of the limitation and know what "you  think you're doing" and how the compiler will handle the code. 

 

Of course they could handle the entire thing differently.

Edited by Lajos Juhász
  • Like 1

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×