Jump to content
Mike Torrettinni

Help with string extraction function

Recommended Posts

I saw this example of ExtractContent function: https://stackoverflow.com/a/62918216/5198394 - by Andreas Rejbrandt

It should return string without whatever is within brackets:

 

function ExtractContent(const S: string): string;
var
  i, c: Integer;
  InBracket: Boolean;
begin
  SetLength(Result, S.Length);
  InBracket := False;
  c := 0;
  for i := 1 to S.Length do
  begin
    if S[i] = '{' then
      InBracket := True
    else if S[i]= '}' then
      InBracket := False
    else if not InBracket then
    begin
      Inc(c);
      Result[c] := S[i];
    end;
  end;
  SetLength(Result, c);
end;

Then the last comment for this answer is :"... I'd use a pchar indexer and omit the bool flag and use two mutually exclusive (accept and reject) loops..." - by MartynA

 

So, I was trying to implement his suggestions, but this is not my area of expertise. So here is my attempt:

 

Renamed and brackets as parameters:

function RemoveTextBetweenChars(const aString: string; const aChar1, aChar2: Char): string;
var
  c     : integer;
  vP    : PChar;
  vSkip : boolean; // when to skip text
begin
  SetLength(Result, aString.Length);

  c     := 0;
  vP    := PChar(aString);
  vSkip := False;

  while vP^ <> #0 do
  begin
    if vP^ = aChar1 then
      vSkip := True
    else if (vP^ = aChar2) then
      vSkip := false
    else if Not vSkip then
    begin
      Inc(c);
      Result[c] := vP^;
    end;

    Inc(vP);
  end;

  SetLength(Result, c);
end;

 

But I was only able to implement PChar indexer. I have no idea how to do this without boolean variable and I jhave no idea what two mutually exclusive (accept and reject) loops are.

 

Any help is appreciated!

Share this post


Link to post
Guest
function ExtractContent(const S: string): string;
var
  p, r: pchar;
begin
  SetLength(Result, S.Length);
  r := @Result[1];
  p := @S[1];
  while p^ <> #0 do
  begin
    if p^ = '{' then
    begin
      while p^ <> '}' do
        Inc(p);
      Inc(p);
    end;
    r^ := p^;
    Inc(r);
    Inc(p);
  end;
  SetLength(Result, r - @Result[1]);
end;

 

Share this post


Link to post
Guest

OK, there is bug !

 

in the inner ( the refuse loop ), i missed check for end of S

function ExtractContent(const S: string): string;
var
  p, r: pchar;
begin
  SetLength(Result, S.Length);
  r := @Result[1];
  p := @S[1];
  while p^ <> #0 do
  begin
    if p^ = '{' then
    begin
      while p^ <> '}' do
        if p^ = #0 then
          Break
        else
          Inc(p);
      Inc(p);
    end;
    r^ := p^;
    Inc(r);
    Inc(p);
  end;
  SetLength(Result, r - @Result[1]);
end;

 

Share this post


Link to post
Guest

still not right

Mike, can you spot the bug case and fix it ?

Share this post


Link to post
6 minutes ago, Kas Ob. said:

still not right

Mike, can you spot the bug case and fix it ?

Hm, I guess I can't see the bug... and it works. Any hints on the bug?

Share this post


Link to post
Guest

look for the margin cases like lack of end closing bracket

Share this post


Link to post
12 minutes ago, Kas Ob. said:

look for the margin cases like lack of end closing bracket

Aha, it can go past the #0 if missing closing bracket. This is better right:

 

function ExtractContent3(const S: string): string;
var
  p, r: pchar;
begin
  SetLength(Result, S.Length);
  r := @Result[1];
  p := @S[1];
  while p^ <> #0 do
  begin
    if p^ = '{' then
    begin
      while p^ <> '}' do
        if p^ = #0 then
        begin
        sleep(0);
          Break;
        end
        else
          Inc(p);
      if p^ = #0 then // if inner loop gets to the end
        Break
      else
        Inc(p);
    end;
    r^ := p^;
    Inc(r);
    Inc(p);
  end;
  SetLength(Result, r - @Result[1]);
end;

 

Share this post


Link to post
Guest

right, but you ended with two if's and two loops means four conditional flow, just the original text !!

function ExtractContent(const S: string): string;
var
  p, r: pchar;
begin
  SetLength(Result, S.Length);

  r := @Result[1];
  p := @S[1];
  while p^ <> #0 do
  begin
    r^ := p^;
    if p^ = '{' then
    begin
      Dec(r);
      while p^ <> '}' do
        if p^ = #0 then
        begin
          Dec(p);
          Break;
        end else
          Inc(p);
    end;
    Inc(r);
    Inc(p);
  end;

  SetLength(Result, r - @Result[1]);
end;

I hope it is right now, i am lousy programmer !

Edited by Guest

Share this post


Link to post
6 minutes ago, Kas Ob. said:

right, but you ended with two if's and two loops means four conditional flow, just the original text !!


function ExtractContent(const S: string): string;
var
  p, r: pchar;
begin
  SetLength(Result, S.Length);

  r := @Result[1];
  p := @S[1];
  while p^ <> #0 do
  begin
    r^ := p^;
    if p^ = '{' then
    begin
      Dec(r);
      while p^ <> '}' do
        if p^ = #0 then
        begin
          Dec(p);
          Break;
        end else
          Inc(p);
    end;
    Inc(r);
    Inc(p);
  end;

  SetLength(Result, r - @Result[1]);
end;

I hope it is right now, i am lousy programmer ! 

Thank you, looks good to me! 🙂

Share this post


Link to post

But the original code was surely much easier to  understand and maintain.  And who can say whether the last one is 'better'?

 

What surprises me is that nobody has jumped in with a one line incomprehensible Regex.

Share this post


Link to post

New function is only about 17% faster than original - my simple tests. I was hoping that using pointers and all the changes would result in bigger performance gain.

 

2 minutes ago, timfrost said:

But the original code was surely much easier to  understand and maintain

I agree. This was more exercise for me and trying new things.

Share this post


Link to post
5 minutes ago, timfrost said:

What surprises me is that nobody has jumped in with a one line incomprehensible Regex.

Actually RegEx was accepted answer in that SO question. But I'm staying away from RegEx as much as possible.

Edited by Mike Torrettinni
  • Like 2

Share this post


Link to post
Guest
28 minutes ago, Mike Torrettinni said:

New function is only about 17% faster than original

So you are after speed, would you please confirm if this is faster

function ExtractContent(const S: string): string;
var
  p, ep, r, sr: pchar;
  len: Integer;
begin
  len := S.Length;
  SetLength(Result, len);
  r := Pointer(Result);    //r := @Result[1];
  sr := r;
  p := Pointer(S);         //p := @S[1];
  ep := p + len;
  while p < ep do
  begin
    r^ := p^;
    Inc(r);
    if p^ = '{' then
    begin
      Dec(r);
      while Byte(Ord(p^) * (Ord(p^) - Ord('}'))) <> 0 do
        Inc(p);
    end;
    Inc(p);
  end;

  SetLength(Result, r - sr);
end;

This prefer short string, the shorter the faster it will be,

Share this post


Link to post
10 minutes ago, Kas Ob. said:

So you are after speed, would you please confirm if this is faster 


function ExtractContent(const S: string): string;
var
  p, ep, r, sr: pchar;
  len: Integer;
begin
  len := S.Length;
  SetLength(Result, len);
  r := Pointer(Result);    //r := @Result[1];
  sr := r;
  p := Pointer(S);         //p := @S[1];
  ep := p + len;
  while p < ep do
  begin
    r^ := p^;
    Inc(r);
    if p^ = '{' then
    begin
      Dec(r);
      while Byte(Ord(p^) * (Ord(p^) - Ord('}'))) <> 0 do
        Inc(p);
    end;
    Inc(p);
  end;

  SetLength(Result, r - sr);
end;

This prefer short string, the shorter the faster it will be,

No significant change, some runs it's a little slower (1587ms vs 1610ms).

 

Wasn't really looking for speed performance, but it would be nice.

Share this post


Link to post

I'm testing for speed performance with this text example:

 

Contrary to popular belief, {Lorem Ipsum} is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. {Richard 
McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going 
through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of} "de Finibus Bonorum et 
{Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of} 
Lorem Ipsum, {"Lorem ipsum dolor sit amet.."}, comes from a line in section 1.10.32.

1mio iterations:

 

Stopwatch := TStopwatch.StartNew;
for i := 1 to 1000000 do
  ExtractContent5(s);
Stopwatch.Stop;
Memo5.Lines.Add(stopwatch.ElapsedMilliseconds.ToString);

 

 

 

Share this post


Link to post
Guest

with such long text, you might be better to use PosEx and Move instead of walking it.

Share this post


Link to post

Player 3 entered the game

 

(edited)

function RemoveTextBetweenChars(const aString: string; const aChar1, aChar2: Char): string;
label
  exit;
var
  c: integer;
  P, start, res: PChar;
begin
  if aString <> '' then
  begin
    SetLength(Result, aString.Length);
    res := Pointer(Result);
    P := Pointer(aString);
    c := 0;
    while True do
    begin
      start := P;
      while P^ <> aChar1 do
      begin
        inc(P);
        if P^ = #0 then
        begin
          Move(start^, res[c], (P - start) * SizeOf(Char));
          inc(c, P - start);
          goto exit;
        end;
      end;
      Move(start^, res[c], (P - start) * SizeOf(Char));
      inc(c, P - start);
      while P^ <> aChar2 do
      begin
        inc(P);
        if P^ = #0 then
          goto exit;
      end;
      inc(P);
      if P^ = #0 then
        goto exit;
    end;
exit:
  SetLength(Result, c);
  end;
end;

 

Edited by Attila Kovacs
  • Thanks 1

Share this post


Link to post
16 minutes ago, Attila Kovacs said:

Player 3 entered the game

 


function RemoveTextBetweenChars(const aString: string; const aChar1, aChar2: Char): string;
var
  c: integer;
  P, start: PChar;
begin
  if aString <> '' then
  begin
    SetLength(Result, aString.Length);
    P := Pointer(aString);
    c := 1;
    while True do
    begin
      start := P;
      while P^ <> aChar1 do
      begin
        inc(P);
        if P^ = #0 then
        begin
          Move(start^, Result[c], (P - start) * SizeOf(Char));
          SetLength(Result, P - PChar(aString));
          Exit;
        end;
      end;
      Move(start^, Result[c], (P - start) * SizeOf(Char));
      inc(c, P - start);
      while P^ <> aChar2 do
      begin
        inc(P);
        if P^ = #0 then
          Exit;
      end;
      inc(P);
      if P^ = #0 then
        Exit;
    end;
  end;
end;

 

OK, this one is 25% faster, pretty good!

Share this post


Link to post

For parsing purposes it might not be a good idea actually to return a string as that causes heap allocation every time you extract a substring (*) which I would guess does not only happen once but many times. Something like a PChar with additional Length information might be a better fit. If you really need the text as new string entity you can still materialize it.

 

(*) You will just not notice in your benchmark because in the loop the same string instance is being reused all the time - but if you would run this in 10.4 story might be different (see my comment in https://quality.embarcadero.com/browse/RSP-29450)

Edited by Stefan Glienke

Share this post


Link to post

You are right, even around the Move() was some string magic. With passing PChar is even faster. As for the result type, I'll let it for Mike to tune it.

Edited by Attila Kovacs

Share this post


Link to post
27 minutes ago, Attila Kovacs said:

Looking into the Move() implementation, wth is happened to rep movs*? Is it too slow nowadays?

AFAIK it gets a little bit better on modern CPU. However it's getting abandoned on favor of SIMD instructions.

IMO, I think that the RTL Move routine should be implemented as an intrinsic (not a function) in that way the compiler can generate a much better code. 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×