Mike Torrettinni 198 Posted July 16, 2020 51 minutes ago, Stefan Glienke said: For parsing purposes it might not be a good idea actually to return a string as that causes heap allocation every time you extract a substring (*) which I would guess does not only happen once but many times. Something like a PChar with additional Length information might be a better fit. If you really need the text as new string entity you can still materialize it. (*) You will just not notice in your benchmark because in the loop the same string instance is being reused all the time - but if you would run this in 10.4 story might be different (see my comment in https://quality.embarcadero.com/browse/RSP-29450) Wow, I hope they fix this by 10.5. I'm skipping 10.4 anyway. Share this post Link to post
Mike Torrettinni 198 Posted July 16, 2020 1 hour ago, Attila Kovacs said: goto exit; Oh, you used GoTo... brave man 🙂 It is fastest fuction so far 🙂 2 Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 6 hours ago, Kas Ob. said: with such long text, you might be better to use PosEx and Move instead of walking it. If I look at PosEx code in System.pas it also walks char by char. No? @Attila Kovacs solutions uses Move, but not PosEx. You think PosEx could be better option for longer texts? Share this post Link to post
Guest Posted July 17, 2020 5 hours ago, Mike Torrettinni said: solutions uses Move, but not PosEx. You think PosEx could be better option for longer texts? No, No i looked at PosEX, it will not . but wait how your result is very far from mine, i mean my version is way faster for short string and for your string it is even faster than Atilla's program MikeStrings; {$APPTYPE CONSOLE} {$R *.res} uses SysUtils, Windows; var iStartTime, iEndTime, iFreq: Int64; Duration: Int64; procedure StartTiming; begin QueryPerformanceFrequency(iFreq); QueryPerformanceCounter(iStartTime); end; function StopTiming: string; begin QueryPerformanceCounter(iEndTime); Duration := iEndTime - iStartTime; Result := FloatToStrF(Duration * 1000{for ms} / iFreq, ffGeneral, 6, 3) + ' ms'; end; const MIKE_MID_STRING = 'Contrary to popular belief, {Lorem Ipsum} is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. {Richard ' + 'McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going ' + 'through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of} "de Finibus Bonorum et ' + '{Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of} ' + 'Lorem Ipsum, {"Lorem ipsum dolor sit amet.."}, comes from a line in section 1.10.32.'; JUST_SHORT_STRING = '123456{6789}ABCDEF'; function ExtractContentOriginal(const S: string): string; var i, c: Integer; InBracket: Boolean; begin SetLength(Result, S.Length); InBracket := False; c := 0; for i := 1 to S.Length do begin if S[i] = '{' then InBracket := True else if S[i] = '}' then InBracket := False else if not InBracket then begin Inc(c); Result[c] := S[i]; end; end; SetLength(Result, c); end; function ExtractContentKasFixed(const S: string): string; var p, ep, r, sr: pchar; len: Integer; begin len := S.Length; SetLength(Result, len); r := Pointer(Result); sr := r; p := Pointer(S); ep := p + len; while p < ep do begin r^ := p^; Inc(r); if p^ = '{' then begin Dec(r); while Byte(Ord(p^) * (Ord(p^) - Ord('}'))) <> 0 do Inc(p); end; Inc(p); end; SetLength(Result, r - sr); end; function ExtractContentKasNotFixed(const S: string; const aChar1, aChar2: Char): string; var p, ep, r, sr: pchar; len: Integer; begin len := S.Length; SetLength(Result, len); r := Pointer(Result); sr := r; p := Pointer(S); ep := p + len; while p < ep do begin r^ := p^; Inc(r); if p^ = aChar1 then begin Dec(r); while Byte(Ord(p^) * (Ord(p^) - Ord(aChar2))) <> 0 do Inc(p); end; Inc(p); end; SetLength(Result, r - sr); end; function ExtractContentAtilla1NotFixed(const aString: string; const aChar1, aChar2: Char): string; label exit; var c: integer; P, start, res: PChar; begin if aString <> '' then begin SetLength(Result, aString.Length); res := Pointer(Result); P := Pointer(aString); c := 0; while True do begin start := P; while P^ <> aChar1 do begin inc(P); if P^ = #0 then begin Move(start^, res[c], (P - start) * SizeOf(Char)); inc(c, P - start); goto exit; end; end; Move(start^, res[c], (P - start) * SizeOf(Char)); inc(c, P - start); while P^ <> aChar2 do begin inc(P); if P^ = #0 then goto exit; end; inc(P); if P^ = #0 then goto exit; end; exit: SetLength(Result, c); end; end; function ExtractContentAtilla1Fixed(const aString: string): string; label exit; var c: integer; P, start, res: PChar; begin if aString <> '' then begin SetLength(Result, aString.Length); res := Pointer(Result); P := Pointer(aString); c := 0; while True do begin start := P; while P^ <> '{' do begin inc(P); if P^ = #0 then begin Move(start^, res[c], (P - start) * SizeOf(Char)); inc(c, P - start); goto exit; end; end; Move(start^, res[c], (P - start) * SizeOf(Char)); inc(c, P - start); while P^ <> '}' do begin inc(P); if P^ = #0 then goto exit; end; inc(P); if P^ = #0 then goto exit; end; exit: SetLength(Result, c); end; end; procedure RunMikeTest(Loops: Integer); var I: Integer; begin Writeln('Mike test'); Write('Mid string'#9); StartTiming; for I := 1 to Loops do ExtractContentOriginal(MIKE_MID_STRING); Writeln(StopTiming); Write('Short string'#9); StartTiming; for I := 1 to Loops do ExtractContentOriginal(JUST_SHORT_STRING); Writeln(StopTiming); end; procedure RunKasFixedTest(Loops: Integer); var I: Integer; begin Writeln('Kas fixed test'); Write('Mid string'#9); StartTiming; for I := 1 to Loops do ExtractContentKasFixed(MIKE_MID_STRING); Writeln(StopTiming); Write('Short string'#9); StartTiming; for I := 1 to Loops do ExtractContentKasFixed(JUST_SHORT_STRING); Writeln(StopTiming); end; procedure RunKasNotFixedTest(Loops: Integer); var I: Integer; begin Writeln('Kas not fixed test'); Write('Mid string'#9); StartTiming; for I := 1 to Loops do ExtractContentKasNotFixed(MIKE_MID_STRING, '{', '}'); Writeln(StopTiming); Write('Short string'#9); StartTiming; for I := 1 to Loops do ExtractContentKasNotFixed(JUST_SHORT_STRING, '{', '}'); Writeln(StopTiming); end; procedure RunAtilla1NotFixedTest(Loops: Integer); var I: Integer; begin Writeln('Atilla not fixed test'); Write('Mid string'#9); StartTiming; for I := 1 to Loops do ExtractContentAtilla1NotFixed(MIKE_MID_STRING, '{', '}'); Writeln(StopTiming); Write('Short string'#9); StartTiming; for I := 1 to Loops do ExtractContentAtilla1NotFixed(JUST_SHORT_STRING, '{', '}'); Writeln(StopTiming); end; procedure RunAtilla1TestFixed(Loops: Integer); var I: Integer; begin Writeln('Atilla fixed test'); Write('Mid string'#9); StartTiming; for I := 1 to Loops do ExtractContentAtilla1Fixed(MIKE_MID_STRING); Writeln(StopTiming); Write('Short string'#9); StartTiming; for I := 1 to Loops do ExtractContentAtilla1Fixed(JUST_SHORT_STRING); Writeln(StopTiming); end; const ITERATIONS = 1000000; procedure RunTests; begin RunMikeTest(ITERATIONS); RunKasFixedTest(ITERATIONS); RunKasNotFixedTest(ITERATIONS); RunAtilla1NotFixedTest(ITERATIONS); RunAtilla1TestFixed(ITERATIONS); end; begin RunTests; Readln; end. And here my result Quote Mike test Mid string 1768.8 ms Short string 81.5028 ms Kas fixed test Mid string 533.336 ms Short string 30.154 ms Kas not fixed test Mid string 658.88 ms Short string 36.1251 ms Atilla not fixed test Mid string 612.567 ms Short string 42.1707 ms Atilla fixed test Mid string 604.432 ms Short string 40.2711 ms Share this post Link to post
aehimself 396 Posted July 17, 2020 When I made a parser like this I was working with indexes returned by String Helpers. Look for the first opener, look for the first closer from the opener. Make sure there are no quotes (or even number of quotes) in between. Make sure there are equal amount of openers and equal amount of closers in between. I'm not saying it was the fastest solution, but it worked 🙂 And it handled nested sections correctly, too. Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) @Kas Ob. Your benchmark is flawed - every time you test multiple implementations in the same test you have a bias because of sorts of things such as memory layout/alignment, branch prediction, prefetcher and so on. For example when I run the code you posted "Kas not fixed wins" - when I comment out all others the code runs slower. When I comment out one of the tests all routines suddenly run faster. This clearly is an indicator that depending on where the code is put by the compiler sometimes they might be sitting in the cache more beneficial than some other times. Unfortunately at least to my knowledge with the Delphi compiler you have no control over code alignment - unless you write stuff in asm if course... Edited July 17, 2020 by Stefan Glienke 1 Share this post Link to post
Guest Posted July 17, 2020 7 minutes ago, Stefan Glienke said: For example when I run the code you posted "Kas not fixed wins" - when I comment out all others the code runs slower. When I comment out one of the tests all routines suddenly run faster. This clearly is an indicator that depending on where the code is put by the compiler sometimes they might be sitting in the cache more beneficial than some other times. Unfortunately at least to my knowledge with the Delphi compiler you have no control over code alignment - unless you write stuff in asm if course... Adding to your comment switching the position of the functions in the same unit will change the result too. It mostly depend on the loops destination address if it is aligned on 4,2,8... bytes or not, Delphi compiler has didn't got the memo ! FPC and LLVM do add nops in some places to align internal loops with aggressive optimization. But still flawed is flawed no one arguing, may more like non optimal, one thing though all of those you mentioned (memory layout/alignment, branch prediction, prefetcher) will not play huge role in this case, as the data the iterations is very huge that will eliminate the cache effect while the data and the code are small that will be cached and fitted in small execution window, so it is not the main player here Try to move the destination of those tight loops !, that is the case and adding $CODEALIGN 16 will not help much too as my outer loop will always will land on 2n, and if you want for the sake of educational purposes, try the same asm but add many as many nops needed just to make the conditional jmp land on multiplication of 16, 8,4,2, you will have better-to-less-better performance at the same order, then try to land on 3 ! see the performance drop 10-50% but will not be completely at random on the contrary it will repeatable process, this will happen because the data are not moving around so this exact effect destination address alignment is amplified. also this behaviour will be completely different on virtualized machines, there do exist random and unpredictable bahaviour. Share this post Link to post
Guest Posted July 17, 2020 Almost forgot to complain too. I mentioned 16..2 alignments but for the top performance it will be on 64, Delphi added code alignment, but not sure if this is still the case for latest version ( after Seattle), someone did good with it, but decided that 32 and 16 alignment is not needed, unlike any compiler out there, because who need to read Intel manuals. That worth opening QP or QC or whatever it called. For me, i do sometimes low level optimization and that ruin the effect even with code compiled with FPC as obj, Delphi is not allowing it to be aligned right, so i do copy the code at runtime to memory aligned and execute those critical parts, also for the future ( hope not far and not in far galaxy ) when low level optimization of the RTL start such alignment (32 and 64) will be mission critical. Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) I was merely commenting on the things that can happen when using common practice for benchmarking and comparing different algorithms or implementation by simply putting them all into the same executable and then running them after each other in any order and less so on which of them are affecting this particular test. FWIW here is an optimized version of your routine (tested on 10.3.3 with $O+,W-): - no unnecessary Inc(r)/Dec(r) when '{' was found - no unnecessary check on the inner loop before first Inc(p) - better code generated for inner loop condition - yours generated imul - could even use until (p^ in ['}', #0]) which is even faster but generates W1050 but produces valid code as it uses the 16bit registers still. function ExtractContent(const S: string): string; var p, ep, r, sr: pchar; len: Integer; begin len := S.Length; SetLength(Result, len); r := Pointer(Result); sr := r; p := Pointer(S); ep := p + len; while p < ep do begin r^ := p^; if p^ = '{' then begin repeat Inc(p); until (p^ = '}') or (p^ = #0); end else Inc(r); Inc(p); end; SetLength(Result, r - sr); end; Edited July 17, 2020 by Stefan Glienke 1 Share this post Link to post
Attila Kovacs 629 Posted July 17, 2020 (edited) 1 hour ago, Stefan Glienke said: putting them all into the same executable and then running them after each other no-no, the compiler runs out of registers if do so, always testing separately Nice routine. I don't know why but if you use parameters instead if consts it performs even better. (Despite comparing against the stack) ExtractContent(const S: string; const aChar1, aChar2: Char); this boost it even further: var zero: Char; zero := #0; until (P^ = aChar2) or (P^ = zero); or even better: until (P^ = aChar2) or (P = ep); Edited July 17, 2020 by Attila Kovacs 1 Share this post Link to post
Guest Posted July 17, 2020 @Stefan Glienke you are right there, and your version is faster It is fascinating how FPC generate that inner loop with multiplication Quote MikeTest.pas:110 Inc(p); 0000000100001878 4883c602 add $0x2,%rsi MikeTest.pas:109 while Byte(Ord(p^) * (Ord(p^) - Ord('}'))) <> 0 do 000000010000187C 440fb706 movzwl (%rsi),%r8d 0000000100001880 67458d4883 lea -0x7d(%r8d),%r9d 0000000100001885 450fafc1 imul %r9d,%r8d 0000000100001889 4584c0 test %r8b,%r8b MikeTest.pas:95 begin 000000010000188C 75ea jne 0x100001878 <EXTRACTCONTENTKASFIXED+88> and this how Delphi does it Quote MikeStrings.dpr.106: while Byte(Ord(p^) * (Ord(p^) - Ord('}'))) <> 0 do 00000000004256BC 4C0FB611 movzx r10,byte ptr [rcx] 00000000004256C0 4180EA7D sub r10b,$7d 00000000004256C4 480FB601 movzx rax,byte ptr [rcx] 00000000004256C8 41F6EA imul r10b 00000000004256CB 84C0 test al,al 00000000004256CD 75E9 jnz ExtractContentKasFixed + $58 MikeStrings.dpr.109: Inc(p); 00000000004256CF 4883C102 add rcx,$02 MikeStrings.dpr.99: while p < ep do 00000000004256D3 483BCA cmp rcx,rdx 00000000004256D6 72C8 jb ExtractContentKasFixed + $40 00000000004256D8 90 nop I have no idea where this nop come from, as it does exactly the opposite of aligning the next line which is landing for a branch, this Delphi XE5, my IDE preferred for the light weight 64, you have such nop in latest versions ? Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) 20 minutes ago, Kas Ob. said: I have no idea where this nop come from, as it does exactly the opposite of aligning the next line which is landing for a branch, this Delphi XE5, my IDE preferred for the light weight 64, you have such nop in latest versions? Yes but who cares for the nop (well for this one at least, yes the Win64 compiler inserts a ton of them in weird places), the next line is the SetLength before the ret. Edited July 17, 2020 by Stefan Glienke Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 @Kas Ob. I noticed that you take optimization very seriously. That's really good but just as an advice from someone that did, don't be driven too much. Behind that optimization door resides the evil ! In fact, I spent much time reading Intel doc/books, comparing different compilers (gcc, clang, ...). At the end I become more obsessional and less productive as I started paying much attention to my codes than what is required like taking some time to decide whether I should write if a = b or if a <> b. 2 Share this post Link to post
Attila Kovacs 629 Posted July 17, 2020 (edited) Slightly modified Stefan's beautiful routine. If I did not screw it up, it's faster because of comparing against the stack (parameters aChar1, aChar2), (I can't explain that), and omitting "sr" leaves more register for the compiler, which was a speed boost again. Also, the initial "while" became a "repeat until", and instead of "P^ = #0" there is a " P = ep". function ExtractContent4(const S: string; const aChar1, aChar2: Char): string; var P, ep, r: PChar; len: integer; begin len := S.Length; if len = 0 then exit(S); SetLength(Result, len); r := Pointer(Result); P := Pointer(S); ep := P + len; repeat r^ := P^; if P^ = aChar1 then begin repeat inc(P); until (P^ = aChar2) or (P = ep); end else inc(r); inc(P); until P > ep; SetLength(Result, r - Pointer(Result)); end; Edited July 17, 2020 by Attila Kovacs 1 Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) 19 minutes ago, Attila Kovacs said: Also, the initial "while" became a "repeat until". Congrats, you just broke it for empty S. Also not storing Pointer(Result) makes it slower at least for me. 43 minutes ago, Mahdi Safsafi said: @Kas Ob. I noticed that you take optimization very seriously. That's really good but just as an advice from someone that did, don't be driven too much. Behind that optimization door resides the evil ! In fact, I spent much time reading Intel doc/books, comparing different compilers (gcc, clang, ...). At the end I become more obsessional and less productive as I started paying much attention to my codes than what is required like taking some time to decide whether I should write if a = b or if a <> b. You have to know where it's worth to put effort and where it does not matter. Edited July 17, 2020 by Stefan Glienke Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 @Stefan Glienke For some time it was totally out of my control Now its getting better but yet I still fight my self some time. Share this post Link to post
Attila Kovacs 629 Posted July 17, 2020 @Stefan Glienke thx, updated. Strange, with 10.1U2 it's definitely much faster if not storing it. O+ W- asm: ec4.txt Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 Interesting stuff: asm, registries, aligning... it's like a black-box to me. Reading how little insignificant changes (order of functions calls) affect the performance, is kind of scary. But it shows how real performance optimization is based on case by case. It's hard (or impossible) to make 1 super performant algorithm for all cases. Just from this thread we can see the differences on optimization for short vs long strings, or like @aehimself pointed out if we need to consider nested brackets, escaped chars, quoted chars... Here I am thinking if increasing char pointer is faster than using integer iterator, while you guys are discussing condition branching and using stack and full set of registries 🙂 Share this post Link to post
Attila Kovacs 629 Posted July 17, 2020 (edited) @Mike Torrettinni And the same code can perform differently on different CPU's. 9 minutes ago, Mike Torrettinni said: increasing char pointer is faster than using integer iterator Because the "integer iterator" does nothing else just calculates the pointer + i over and over. So basically it's a shortcut. Edited July 17, 2020 by Attila Kovacs Share this post Link to post
Mahdi Safsafi 225 Posted July 17, 2020 3 minutes ago, Mike Torrettinni said: Here I am thinking if increasing char pointer is faster than using integer iterator, while you guys are discussing condition branching and using stack and full set of registries 🙂 On x86, sizeof(pointer) = sizeof(integer) so its much likely you are going to have the same performance. On x64, sizeof(pointer) = sizeof(int64) so using integer could be some how better. This was just a general rule, but does not apply always as it depends on the way you're using your code. Share this post Link to post
Stefan Glienke 2002 Posted July 17, 2020 (edited) 30 minutes ago, Mahdi Safsafi said: On x86, sizeof(pointer) = sizeof(integer) so its much likely you are going to have the same performance. No, because if you are doing something with whatever is at x[ i ] its faster to use a pointer that starts at @x[ 0 ] and just inc that rather than having a loop which then has 2 counter variables (they like to count down to zero while maintaining the upwards counting i). And on x86 that is a precious register that can be used for something else. Edited July 17, 2020 by Stefan Glienke Share this post Link to post
David Heffernan 2345 Posted July 17, 2020 1 hour ago, Mike Torrettinni said: Interesting stuff: asm, registries, aligning... it's like a black-box to me. Reading how little insignificant changes (order of functions calls) affect the performance, is kind of scary. But it shows how real performance optimization is based on case by case. It's hard (or impossible) to make 1 super performant algorithm for all cases. Just from this thread we can see the differences on optimization for short vs long strings, or like @aehimself pointed out if we need to consider nested brackets, escaped chars, quoted chars... Here I am thinking if increasing char pointer is faster than using integer iterator, while you guys are discussing condition branching and using stack and full set of registries 🙂 Do you have a performance bottleneck with this code? Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 9 minutes ago, David Heffernan said: Do you have a performance bottleneck with this code? No, not using any of it, yet. Planning to do some profiling and benchmarking on real data and will see if there are any bottlenecks in this area. Share this post Link to post
Guest Posted July 17, 2020 Just now, Mike Torrettinni said: No, not using any of it, yet. Got me good ! Share this post Link to post
Mike Torrettinni 198 Posted July 17, 2020 4 minutes ago, Kas Ob. said: Got me good ! Wait, this all happened yesterday... sorry, but I can't implement improvements so fast. What happened here, miscommunication? Share this post Link to post