Help with string extraction function

July 17, 2020

No misunderstanding at all, we were discussing the speed of a function even just David came to help and asked if you have a bottleneck, your answer was like, i don't know yet ! i might with this function but not sure.

I just laughed loud, remembered the shepherd and the wolf tale, though this is not the case, but how many here are bored looking for something exciting to do ! or just a reason to not do what must be done, valuing helping others.

David Heffernan · July 17, 2020

28 minutes ago, Mike Torrettinni said:

No, not using any of it, yet. Planning to do some profiling and benchmarking on real data and will see if there are any bottlenecks in this area.

If you don't have a performance bottleneck, make sure that you use a version of the code that is easy to read and maintain.

Mike Torrettinni · July 17, 2020

14 minutes ago, Kas Ob. said:

valuing helping others.

I thought my first post shows my level of Delphi expertise.

If it was expected from me to implement all the suggested new functions, benchmark on real data and report if any of them are bottlenecks... in less than 24h, then I'm clearly way over my head in this topic.

It started such a nice topic, practice for me, discovering new approaches, so I feel bad that I mislead you on the purpose. Perhaps admin should delete this topic so nobody else gets mislead.

Mike Torrettinni · July 17, 2020

4 minutes ago, David Heffernan said:

If you don't have a performance bottleneck, make sure that you use a version of the code that is easy to read and maintain.

Yes, I agree. Or, this would be most commented function, as I would probably need to comment in details every other line 🙂

July 17, 2020

18 minutes ago, Mike Torrettinni said:

It started such a nice topic, practice for me, discovering new approaches, so I feel bad that I mislead you on the purpose. Perhaps admin should delete this topic so nobody else gets mislead.

What are you talking about !, it is nice and still nice, don't overthink it,

It was the way the question was from David and your answer, look at it this way and you might understand the funny side, David came to help why group of people discussing and what re they struggling with, where is the bottleneck, like we all did, we all came to beat it into big open gate !, and you were like : what bottleneck ?!

got it now ?

Anyway, things is great here and the result function is great even to be pushed into the RTL.

Mike Torrettinni · July 17, 2020

1 minute ago, Kas Ob. said:

What are you talking about !, it is nice and still nice, don't overthink it,

It was the way the question was from David and your answer, look at it this way and you might understand the funny side, David came to help why group of people discussing and what re they struggling with, where is the bottleneck, like we all did, we all came to beat it into big open gate !, and you were like : what bottleneck ?!

got it now ?

Anyway, things is great here and the result function is great even to be pushed into the RTL.

Ok, I misunderstood your reaction. All good now 🙂

Mahdi Safsafi · July 17, 2020

3 hours ago, Stefan Glienke said:

No, because if you are doing something with whatever is at x[ i ] its faster to use a pointer that starts at @x[ 0 ] and just inc that rather than having a loop which then has 2 counter variables (they like to count down to zero while maintaining the upwards counting i). And on x86 that is a precious register that can be used for something else.

Yes, you're absolutely right.

Anders Melander · July 17, 2020

5 hours ago, Stefan Glienke said:

No, because if you are doing something with whatever is at x[ i ] its faster to use a pointer that starts at @x[ 0 ] and just inc that rather than having a loop which then has 2 counter variables (they like to count down to zero while maintaining the upwards counting i). And on x86 that is a precious register that can be used for something else.

My experience is that in some cases indexed access can be much faster than incrementing a pointer, so I guess "it depends".

I'm afraid I can't remember any specific cases.

Mahdi Safsafi · July 17, 2020

1 hour ago, Anders Melander said:

My experience is that in some cases indexed access can be much faster than incrementing a pointer, so I guess "it depends".

I'm afraid I can't remember any specific cases.

It even depends on your compiler. If your compiler is smart enough, it will generate better code sometime better than using pointer because compiler can understand better integer behavior than pointer. Using MSVC (which is not really smart enough) to compile those two function, generates EXACTLY the SAME opcodes for both !

__declspec(noinline)
bool test1(const char* str) {
	int i = 0;
	while (str[i]) if ('0' == str[i++]) return true;
	return false;
}

__declspec(noinline)
bool test2(const char* str) {
	char* p = (char*)&str[0];
	while (*p) if ('0'  == *p++ ) 	return true;
	return false;
}

Anders Melander · July 17, 2020

26 minutes ago, Mahdi Safsafi said:

generates EXACTLY the SAME opcodes for both

So I'm guessing the asm it generates uses indexed addressing rather that indirect addressing. Right?

Mahdi Safsafi · July 17, 2020

18 minutes ago, Anders Melander said:

So I'm guessing the asm it generates uses indexed addressing rather that indirect addressing. Right?

No

mov         cl,byte ptr [eax]

At CPU level indexed addressing = indirect addressing. The only difference is that an indexed addressing can have an expression inside []. Sometime this could be just good for alignment purpose. [eax] = [eax+0x00].
But Stefan was referring to something else : register usage for each load instruction.

Stefan Glienke · July 17, 2020

2 hours ago, Mahdi Safsafi said:

If your compiler is smart enough, ...

I stopped reading after that because we are in a Delphi forum...

Well - I did not - lets see - gcc and clang both generate this (so does vc++ I guess):

test1(char const*):
.L14:
  movzx edx, BYTE PTR [eax]
  test dl, dl
  je .L15
  add eax, 1
  cmp dl, 48
  jne .L14
  mov eax, 1
  ret
.L15:
  xor eax, eax
  ret

and this is what Delphi makes out of this function:

function Test(str: PAnsiChar): Boolean;
var
  i: Integer;
begin
  i := 0;
  while str[i] <> #0 do
  begin
    if '0' = str[i] then Exit(True);
    Inc(i);
  end;
  Result := False;
end;

Project25.dpr.13: i := 0;
0041BD90 33D2             xor edx,edx
0041BD92 EB09             jmp $0041bd9d
Project25.dpr.16: if '0' = str[i] then Exit(True);
0041BD94 80F930           cmp cl,$30
0041BD97 7503             jnz $0041bd9c
0041BD99 B001             mov al,$01
0041BD9B C3               ret 
Project25.dpr.17: Inc(i);
0041BD9C 42               inc edx
Project25.dpr.14: while str[i] <> #0 do
0041BD9D 0FB60C10         movzx ecx,[eax+edx]
0041BDA1 84C9             test cl,cl
0041BDA3 75EF             jnz $0041bd94
Project25.dpr.19: Result := False;
0041BDA5 33C0             xor eax,eax
Project25.dpr.20: end;
0041BDA7 C3               ret

And the other one:

function Test2(str: PAnsiChar): Boolean;
var
  p: PAnsiChar;
begin
  p := @str[0];
  while p^ <> #0 do
  begin
    if '0' = p^ then Exit(True);
    Inc(p);
  end;
  Result := False;
end;

Project25.dpr.26: p := @str[0];
0041BD90 EB09             jmp $0041bd9b
Project25.dpr.29: if '0' = p^ then Exit(True);
0041BD92 80FA30           cmp dl,$30
0041BD95 7503             jnz $0041bd9a
0041BD97 B001             mov al,$01
0041BD99 C3               ret 
Project25.dpr.30: Inc(p);
0041BD9A 40               inc eax
Project25.dpr.27: while p^ <> #0 do
0041BD9B 0FB610           movzx edx,[eax]
0041BD9E 84D2             test dl,dl
0041BDA0 75F0             jnz $0041bd92
Project25.dpr.32: Result := False;
0041BDA2 33C0             xor eax,eax
Project25.dpr.33: end;
0041BDA4 C3               ret

What we can see is that the Delphi compiler is clearly not clever but rather sticks to some hardcoded behavior such as that it is addicted to put loop conditions after the body (which usually is a good pattern) even if it makes little sense to do so. Another thing has to do with the post increment operator that we don't have in Delphi which causes the instruction after the check inside the loop which causes the additional jump because it cannot just restart the loop after it did not find the '0'.

Things like that require writing loops in a very ass backwards way in Delphi to get the same code generated.

In order to get the same code that the C++ compilers emit you have to write unreadable shit like this:

function Test3(str: PAnsiChar): Boolean;
var
  c: AnsiChar;
begin
  repeat
    c := str^;
    if c = #0 then Break;
    Inc(str);
    if c = '0' then Exit(True);
  until False;
  Result := False;
end;

When I used the local variable p here that I assigned str to it did not even keep using eax but insisted in putting a useless mov into a new register.

Edited July 17, 2020 by Stefan Glienke

Mike Torrettinni · July 17, 2020

I did some testing on largest real data set:

strings: 1,203,265

avg length: 25

standard deviation: 41

longest string: 2,243

All functions take between 100ms-171ms, so all are very fast.

Most common real data set is 10%-25% of this largest data, so these functions will perform under 50ms in those cases.

My current function that I use does the same things with: if Copy(s, 1,1) = '{'... and result := result + Copy(s, 1,1), so very very inefficient, but it takes about 1,8s on same data.

So this is not a bottleneck in my project, but still almost 20x faster is definitely something i will use - I just have to decide which one.

Mahdi Safsafi · July 17, 2020

24 minutes ago, Stefan Glienke said:

I stopped reading after that because we are in a Delphi forum...

Keep some faith Stefan ... remember a tortoise wins a hare once.

Stefan Glienke · July 17, 2020

5 minutes ago, Mahdi Safsafi said:

Keep some faith Stefan ... remember a tortoise wins a hare once.

At least regarding life expectancy 😉

Mahdi Safsafi · July 18, 2020

@Stefan Glienke

Quote

Well - I did not - lets see - gcc and clang both generate this (so does vc++ I guess):

As I said before, MSVC is somehow far behind gcc and clang. In fact, it used 8-bit registers heavily which are known to be slow as they require internal manipulation by the CPU:

00241000 8A 0D 10 21 24 00    mov         cl,byte ptr [string "test" (0242110h)]  
00241006 B8 10 21 24 00       mov         eax,offset string "test" (0242110h)  
0024100B 0F 1F 44 00 00       nop         dword ptr [eax+eax]  
00241010 40                   inc         eax  
00241011 80 F9 30             cmp         cl,30h  
00241014 74 09                je          test1+1Fh (024101Fh)  
00241016 8A 08                mov         cl,byte ptr [eax]  
00241018 84 C9                test        cl,cl  
0024101A 75 F4                jne         test1+10h (0241010h)  
0024101C 32 C0                xor         al,al  
0024101E C3                   ret  
0024101F B0 01                mov         al,1  
00241021 C3                   ret

For Delphi, output of Test1 and Test2 is very close. In fact if you change to PChar, you may find that unlike Test1, Test2 is using add reg, imm instruction.

Quote

What we can see is that the Delphi compiler is clearly not clever but rather sticks to some hardcoded behavior such as that it is addicted to put loop conditions after the body (which usually is a good pattern)

No its not a good pattern :

0041BD92 EB09             jmp $0041bd9d  // unecessary relative branch.
0041BD94 80F930           cmp cl,$30     // suppose this was a load op => bad for OoOE.
0041BD97 7503             jnz $0041bd9c   
0041BD99 B001             mov al,$01  
0041BD9B C3               ret

Quote

At least regarding life expectancy 😉

Just hope for better.

Edited July 18, 2020 by Mahdi Safsafi

Stefan Glienke · July 18, 2020

Here it is clearly not - but I was referring to the fact that mostly it is the way to do - see this exhaustive answer.

But what you typically don't do is to statically jump over the loop body to the condition at the bottom to then conditionally jump back up because the loop very likely runs at least once.

Either the compiler restructures the loop or simply puts the loop condition as if ontop of the loop as well (sometimes dcc even does it)

Edited July 18, 2020 by Stefan Glienke

Mahdi Safsafi · July 18, 2020

7 minutes ago, Stefan Glienke said:

Here it is clearly not - but I was referring to the fact that mostly it is the way to do - see this exhaustive answer.

But what you typically don't do is to statically jump over the loop body to the condition at the bottom to then conditionally jump back up because the loop runs at least once.

Either the compiler restructures the loop or simply puts the loop condition as if ontop of the loop as well (sometimes dcc even does it)

Today no high end compiler is doing that because it affects an important optimization part OoOE. Delphi uses a very old code generator that is not up to date.

Mahdi Safsafi · July 18, 2020

Moreover saying that "Fewer instructions / uops inside the loop = better" is somehow wrong. I can certainly prove that its wrong. using SIMD instructions where using n instruction to reduce loop count is better than having a single/few instruction inside that loop.

Edited July 18, 2020 by Mahdi Safsafi

Stefan Glienke · July 18, 2020

26 minutes ago, Mahdi Safsafi said:

Today no high end compiler is doing that because it affects an important optimization part OoOE. Delphi uses a very old code generator that is not up to date.

Just to make sure you don't misunderstand:

The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said). The code you posted could get rid of the check at the bottom because there was nothing else to do after the check for '0' which results in a result true so that basically serves as this loop breaking check.

The first check makes sure the loop is even entered at all, otherwise it jumps over it (0 iterations) - the "real" loop condition is at the bottom of the loop body.

What is usually not good is to only put the loop condition at the bottom and statically jump over the loop body (because the duplicated condition at the top is missing). Thats two unnecessary jumps for the common case of "at least 1 iteration".

Edited July 18, 2020 by Stefan Glienke

Mahdi Safsafi · July 18, 2020

10 minutes ago, Stefan Glienke said:

Just to make sure you don't misunderstand:

The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said). The code you posted could get rid of the check at the bottom because there was nothing else to do after the check for '0' which results in a result true so that basically serves as this loop breaking check.

The first check makes sure the loop is even entered at all, otherwise it jumps over it (0 iterations) - the "real" loop condition is at the bottom of the loop body.

What is usually not good is to only put the loop condition at the bottom and statically jump over the loop body (because the duplicated condition at the top is missing). Thats two unnecessary jumps for the common case of "at least 1 iteration".

Yeah Stefan, I clearly understand you. But the same thing you described can be achieved by branch prediction. So compiler should not generate such code unless you are targeting an old cpu.

Arnaud Bouchez · July 23, 2020

On 7/18/2020 at 2:48 AM, Stefan Glienke said:

The common way of compilers (check it for yourself using compiler explorer) to write while x do is to turn it into if x then repeat until not x (simply said).

Indeed. This is why I use the "if x then repeat until not x" explicit pattern in time-critical loops of my framework - e.g. when processing JSON.

As a nice side effect, the variables uses in "x" are more likely to be assigned to registers, since they will be used more often. Sometimes an explicit temporary local variable is needed. Or use of the 'result' variable if it is a PChar.
Both Delphi and FPC require this, unless "x" is a single simple test, where I have seen FPC able to optimize it IIRC.

Arnaud Bouchez · July 23, 2020

On 7/18/2020 at 3:18 AM, Mahdi Safsafi said:

Yeah Stefan, I clearly understand you. But the same thing you described can be achieved by branch prediction. So compiler should not generate such code unless you are targeting an old cpu.

Branch prediction can do wonders in some micro-benchmarks, but they always have a cost, until the CPU logic actually "warmed up". No branch is always better, since the real problem is branch misprediction.

Using the asm generated by a recent GCC with tuned optimization flags as a reference is a good hint of what is likely to be more efficient. Intel and AMD are major GCC contributors.
And a wall clock of a realistic process (not micro benchmark) is the ultimate reference. We use some part of our automated tests as performance benchmark, running some close-to-the reality scenarios.

Edited July 23, 2020 by Arnaud Bouchez

Mahdi Safsafi · July 23, 2020

@Stefan Glienke @Arnaud Bouchez

Ok, I did some research on that and found that I was wrong. In fact there is more benefits about using inversion loop(Sadly it does not apply to Delphi):

var
  i: Integer;
begin
 // legacy loop:
  for i := 0 to 10 - 1 do
  begin
    dosomething(i);
  end;

  //  loop inversion:
  i := 0;
  if (i < 10) then // this branch can be emited if i holds at compile time.
  begin
    repeat
      dosomething(i);
      inc(i);
    until (i = 10);
  end;
end;

There is a good paper : Generalizing loop-invariant code motion in a real-world compiler by Paul Colea that describes many loop optimization.

@Kas Ob you may be interested on reading this paper.

Stefan Glienke · July 23, 2020

@Mahdi Safsafi I would not use that code as evidence because there actually the for loop produces better code than the other one. I was talking about while x do loop vs if x then repeat until not x conversion.

Also it makes a difference if you have a static count or not - the benefit of the for to loop is that the compiler does no additional cmp after it decrements the internal "counting down to zero" variable but does the conditional jmp depending on the outcome of the dec.

An until loop produces an additional cmp instruction (reported as https://quality.embarcadero.com/browse/RSP-22124)

As always: it depends - while I certainly would not use a for-to loop when dealing with some string parsing but rather prever PChar I just found recently that using a for-to loop with indexing into a pointer was faster than inc(thatpointer).

Edited July 23, 2020 by Stefan Glienke

Sign In

Help with string extraction function

Recommended Posts

Guest

Share this post

Link to post

David Heffernan 2482

Share this post

Link to post

Mike Torrettinni 199

Share this post

Link to post

Mike Torrettinni 199

Share this post

Link to post

Guest

Share this post

Link to post

Mike Torrettinni 199

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Anders Melander 2114

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Anders Melander 2114

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2175

Share this post

Link to post

Mike Torrettinni 199

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2175

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2175

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2175

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Arnaud Bouchez 413

Share this post

Link to post

Arnaud Bouchez 413

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2175

Share this post

Link to post

Create an account or sign in to comment

Create an account