Attila Kovacs 629 Posted June 20, 2020

@Mahdi Safsafi I see, thanks. Which one is the forward branch again? The if or the else section? I'm not sure anymore.
Guest Posted June 20, 2020

16 minutes ago, Attila Kovacs said:
Which one is the forward branch again?

Usually the else.

procedure Test(var A: TCPoint);
begin
  if A.X = 0 then
  begin
    A.X := 1;
    A.Y := 1;
  end
  else
  begin
    A.X := 2;
    A.Y := 2;
  end;
end;

Here the compiler emits two exit points, which is rare; the Delphi compiler only does this with very short functions.
Guest Posted June 20, 2020

While here it uses a single exit point:

begin
  if A.X = 0 then
  begin
    A.X := 1;
    A.Y := 1;
  end
  else
  begin
    A.X := 2;
    A.Y := 2;
  end;
  A.X := A.Y;
end;
Mahdi Safsafi 225 Posted June 20, 2020

16 minutes ago, Attila Kovacs said:
@Mahdi Safsafi I see, thanks. Which one is the forward branch again? The if or the else section? I'm not sure anymore.

Don't worry! I'll try to give a simple explanation 🙂

There are many kinds of jumps (direct, indirect, relative, ...). The relative jmp is the most used and the most efficient one. Relative means that the offset is relative to the Program Counter PC (PC is a register that holds the current instruction pointer; on x86 it is a protected register, on an ugly implementation such as AArch32 (ARM) it is a public register). A forward jump means the offset of the jmp is positive (hence we are jumping down). A backward jump means the offset is negative (jumping up).

# address      # opcodes      # instruction        # comment
# (in decimal)
backward_label:
00000000       85C0           test eax,eax
00000002       7407           jz forward_label     ; PC=00000002 OFFSET=7
                                                   ; dest_addr = PC + OFFSET + SizeOf(CurrentInstruction) = 2 + 7 + 2 = 11
00000004       B801000000     mov eax,$00000001
00000009       EBF5           jmp backward_label   ; PC=00000009 OFFSET=0xF5(-11)
                                                   ; dest_addr = same formula as above = 9 - 11 + 2 = 0
forward_label:
00000011       C3             ret

Now I believe forward/backward branches are clear for you. The interesting part about backward branch prediction is that the CPU assumes it is taken, because backward branches are usually used for loops:

// Pascal code:
begin
  for i := 0 to 10 do
  begin
    // dosomething ...
  end;
  // dosomething2 ...
end;

// asm version:
  xor ecx,ecx
backward_label:
  ; dosomething ...
  inc ecx
  cmp ecx, 11
  jnz backward_label   ; backward branch
state2:
  ; dosomething2 ...

If the CPU assumed that the backward branch is not taken, it would be a performance penalty! For each iteration, the CPU would waste time executing the state2 instructions, and when it realized the prediction was wrong, it would have to recover; and there are 11 iterations!
This can lead to a huge overhead if we are processing a large amount of data. On the other hand, if it assumes that the backward branch is taken, it saves a lot of time and mispredicts only once, on the final (n+1) iteration. Now I believe you understand the concept 🙂 and you can easily answer your own question 😉
stijnsanders 35 Posted June 20, 2020

Aside from the control-flow details, there's more to TMemoryStream. Especially when you use it with potentially big blobs of data, you're in danger of getting an "out of memory" error even when there is in fact still memory available. It has to do with fragmentation (something that is, in theory, less of a problem on 64-bit). I should look up the code I once created for a custom TCustomMemoryStream descendant that uses VirtualAlloc to reserve the memory to use. Then again, it's not that difficult: the memory allocation itself happens in a virtual method you can override, if I remember correctly. Windows internally has a special trick you can use when you want 2 MiB blocks or multiples thereof: it enters them differently in the virtual memory mapper, so it takes fewer steps to translate a virtual address to a physical address. If it's performance you want, you should check it out (see MEM_LARGE_PAGES).
Attila Kovacs 629 Posted June 20, 2020 (edited)

@Mahdi Safsafi
1 hour ago, Mahdi Safsafi said:
Now I believe forward/backward branches are clear for you.

No, this was my question 🙂 "As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt's work." After reading the articles again, I would say "forward branches are not taken the first time" means that no conditional forward jump is predicted taken by the predictor the first time it is seen. Am I right?

Edited June 20, 2020 by Attila Kovacs
Anders Melander 1782 Posted June 20, 2020

9 hours ago, dummzeuch said:
Why did I actually bring this up? I was looking at some code that was extensively using TMemoryStream as a buffer for various operations, where it accessed the Position and Size properties all the time, and thought about the possible overhead of these method calls. Then I looked at the implementation of GetSize and found 3 calls to Seek in there, just to eventually get the value of a field. And then I looked at various other methods of T(Custom)MemoryStream and thought WTF is this? That I used the Write method as an example above was pure chance; it was just the last method I had been looking at.

It sounds like you fell into the premature optimization trap: "optimized" something because it looked inefficient, without actually measuring whether it would matter.
Mahdi Safsafi 225 Posted June 20, 2020 (edited)

1 hour ago, Attila Kovacs said:
@Mahdi Safsafi No, this was my question 🙂 "As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt's work." After reading the articles again, I would say, "forward branches are not taken the first time" means, no conditional forward jumps are taken by the predictor for the first time. Am I right?

I see 🙂 The first time a CPU sees a branch, it uses the static predictor: almost all CPUs assume that a backward branch is going to be taken, for the reason I explained with the loop example. For a forward branch, many CPUs (but not all) assume it is not going to be taken. Some of them make a random prediction, like Core 2.

UPDATE: @Attila Kovacs I forgot to answer your if/else question. Remember what I said: the first time a branch is seen, the CPU assumes the "if" is taken because we intend to execute the if-statement. So it's the else section that is predicted not taken.

Edited June 20, 2020 by Mahdi Safsafi
Attila Kovacs 629 Posted June 21, 2020 (edited)

Ok, here is a real-world example. In the code from https://quality.embarcadero.com/browse/RSP-29731 the original nested proc was:

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  P2 := AChar;
  if AChar^ = #0 then
    Exit(True);
  Result := AChar^ = LineBreak[ALineBreakPos];
  if Result and (ALineBreakPos < LineBreakLen) then
    Result := Result and IsBreak(AChar + 1, ALineBreakPos + 1);
end;

after changing it to:

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  P2 := AChar;
  if AChar^ <> #0 then
  begin
    Result := AChar^ = LineBreak[ALineBreakPos];
    if Result and (ALineBreakPos < LineBreakLen) then
      Result := Result and IsBreak(AChar + 1, ALineBreakPos + 1);
  end
  else
    Result := True;
end;

the time for reading a 50 MB file with 227000 lines went down from around 320 ms to 300 ms, even though the two codegens are the same except for the conditional jump and the fact that the "Exit(True)" is at the bottom (Result := True). This must be OoOE, isn't it? It's impressive.

Edited June 21, 2020 by Attila Kovacs
Guest Posted June 21, 2020

3 hours ago, Attila Kovacs said:
the time for reading a 50MB file with 227000 lines went from around 320ms to 300ms, whereas the 2 codegens are the same except the conditional jump and that the "Exit(true)" is on the bottom (Result := True). This must be OoOE, isn't it? It's impressive.

Right, and I am all for any half-cycle saving. Now, if you want to make it even faster, try this:

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  Result := True;
  P2 := AChar;
  if AChar^ <> #0 then
  begin
    Result := AChar^ = LineBreak[ALineBreakPos];
    if Result and (ALineBreakPos < LineBreakLen) then
      Result := Result and IsBreak(AChar + 1, ALineBreakPos + 1);
  end;
end;

This should be even a little faster. Why? I showed you in my two examples above that an if-then-else in the middle will have two branching jumps to the exit if there is code after the if, and here you can't be sure whether the compiler will emit two exit points (saving a jump) or use one exit point. In other words, with one exit point, when the if was true and its code has been executed, there will be a jump to skip the else section (Result := True), right? By assigning Result first, we eliminate the second jump from the else and make the if-true section one piece all the way to the exit point. Can you time it, please? Assigning a value like Result might cost 1-2 cycles, but a mispredicted branch might cost up to 50 cycles or even more; it depends highly on the CPU. The better version should be like this:

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  Result := AChar^ = #0;
  P2 := AChar;
  if not Result then
  begin
    Result := AChar^ = LineBreak[ALineBreakPos];
    if Result and (ALineBreakPos < LineBreakLen) then
      Result := Result and IsBreak(AChar + 1, ALineBreakPos + 1);
  end;
end;
dummzeuch 1505 Posted June 21, 2020

11 hours ago, Anders Melander said:
It sounds like you fell into the premature optimization trap; "Optimized" something because it looked inefficient without actually measuring if it would matter.

Quite possible. But on the other hand, I had not yet arrived at the optimization step. I was still looking at the code, trying to understand what it was doing and why. And of course there is always the "I'm doing this because it's fun" part of optimizing. Btw: the best optimization for the code I actually care about (it writes JPEGs and text to an MJPEG video) is probably not using TMemoryStream at all but a simple byte buffer instead. The stream was just a convenience there.
Attila Kovacs 629 Posted June 21, 2020 (edited)

3 hours ago, Kas Ob. said:
The better version should be like this

As this is a nested proc, the generated code has no multiple exit points. Actually, it's very clean and straightforward asm, so there is no gain from your changes; rather, it's a tick slower, as you assign the result twice. My original change, however, was shocking to me: the same code, with its parts in a different order, is 6.25% faster.

Edited June 21, 2020 by Attila Kovacs
Alexander Sviridenkov 356 Posted June 21, 2020

A little bit offtopic, but in my experience 90% of performance issues in Delphi come from implicit try/except blocks. When a procedure contains a local variable of a managed type, or an implicit variable (e.g. string1+string2, Copy(), etc.), a try/except block with finalization calls is added around the whole procedure, even if the variable is used only in a small block. Moving such a block (in case it is rarely executed) into a subprocedure helps a lot.
Attila Kovacs 629 Posted June 21, 2020 (edited)

@Alexander Sviridenkov Those implicit try/except blocks are really slow, you are right. These OoOE-compliant changes are not so spectacular, but they can add up. I took a random example from htutils.pas, "AnsiStartsWith", and replaced the early exit with "if (ls > lt) and (lt > 0) then .. else ...", as this is the most likely case and hence the path that will be predicted, and I can measure around 5.5% faster code with a long "s" and a long "start" parameter. But I really needed long parameters plus many, many runs to be able to measure anything at all.

Edited June 21, 2020 by Attila Kovacs
Mahdi Safsafi 225 Posted June 21, 2020

1 hour ago, Alexander Sviridenkov said:
Little bit offtopic, but from my experience 90% of performance issues in Delphi came from implicit try except blocks. When procedure contains local variable of managed type, or implicit variable (f.e. string1+string2, copy(), etc) try except block with finalization calls is added to whole procedure even if variable is used only in small block. Moving this block (in case it is rarely executed) to subprocedure helps a lot.

This only applies to dcc32 (Win32), as it uses a stack-based exception mechanism. The exception handler executes faster, but this is known to add about 15% overhead even when there is no exception. Win64, on the other hand, uses a table-based exception mechanism. The exception handler executes a little bit slower, since it is handled exclusively by the runtime, but it does not add any overhead when there is no exception.
Guest Posted June 21, 2020

Try..finally is worse than try..except, as it is hidden with managed types; it adds cumulative overhead per declared local var (or per parameter passed without const or var). On a side note: I'm really intrigued by how managed records behave in 10.4 when they have an initializer. No one has published any assembly showing how a local record is handled. Is there a hidden try..finally? For all records, or only for records with Initialize and Finalize? How about inline variables? Where is the hidden try..finally inserted: like local vars, at the beginning and at the end, or in the middle and only when entering their begin..end? So if someone would spare a few minutes and post some small code pieces with their generated assembly, it would be nice.
Mahdi Safsafi 225 Posted June 21, 2020

@Kas Ob.
Quote
Here I'm really intrigued by how managed records behave in 10.4 when they have an initializer. No one has published any assembly showing how a local record is handled; is there a hidden try..finally?

Well, I didn't try Delphi Sydney, but I think (I'm not sure) that a managed record does not have a try/finally section, because the record isn't allocated dynamically.
Guest Posted June 21, 2020

13 minutes ago, Mahdi Safsafi said:
Well I didn't try Delphi Sydney but I think (I'm not sure) that managed record does not have try/finally section because the record isn't allocated dynamically.

From http://docwiki.embarcadero.com/RADStudio/Rio/en/Structured_Types_(Delphi):

Quote
Records are value types, so they are copied on assignment, passed by value, and allocated on the stack unless they are declared globally or explicitly allocated using the New and Dispose functions. Classes are reference types, so they are not copied on assignment, they are passed by reference, and they are allocated on the heap. Records are constructed automatically, using a default no-argument constructor, but classes must be explicitly constructed. Because records have a default no-argument constructor, any user-defined record constructor must have one or more parameters.

And from https://community.idera.com/developer-tools/b/blog/posts/custom-managed-records-coming-to-delphi-10-4:

class operator TMyRecord.Initialize(out Dest: TMyRecord);
begin
  Dest.Value := 10;
  Log('created' + IntToHex(Integer(Pointer(@Dest))));
end;

class operator TMyRecord.Finalize(var Dest: TMyRecord);
begin
  Log('destroyed' + IntToHex(Integer(Pointer(@Dest))));
end;

procedure LocalVarTest;
var
  my1: TMyRecord;
begin
  Log(my1.Value.ToString);
end;

This means that Initialize is called and also the constructor, right? If yes, then both need a cleanup counterpart, e.g. in case a managed field type is used, like a string.
Mahdi Safsafi 225 Posted June 21, 2020

Quote
This means that Initialize is called and also the constructor, right? If yes, then both need a cleanup counterpart, e.g. in case a managed field type is used, like a string.

It's the developer's responsibility to do the cleanup. As I said before, I'm not sure (I may be wrong) ... so the better way to find out is to try 😉
Stefan Glienke 2002 Posted June 22, 2020 (edited)

21 hours ago, Alexander Sviridenkov said:
Little bit offtopic, but from my experience 90% of performance issues in Delphi came from implicit try except blocks.

Yes, because try/finally blocks (not only the implicit ones!) are implemented poorly and trash the return stack buffer of the CPU all the time: https://quality.embarcadero.com/browse/RSP-27375

@Attila Kovacs The difference between the two IsBreak versions is most likely related to branch prediction and the difference between a conditional forward jump that is typically taken and one that is not. If you want to read more about it, I suggest this rather in-depth article by Matt Godbolt: https://xania.org/201602/bpu-part-one

As for the performance of this nested function, I would guess it could run even faster if it did not access anything from the outer scope, because that usually requires quite some stack lifting. Having to do that also usually prevents the compiler from using more than one exit point, as was mentioned before (it can only neatly put rets in different places when there are no registers to pop). I would guess you will not get around that for this function, but you can probably avoid repeatedly accessing the LineBreak property and the variables P2 and LineBreakLen.

Edited June 22, 2020 by Stefan Glienke
Attila Kovacs 629 Posted June 22, 2020

I was struggling over whether to write this down or not (poor dummzeuch's thread is hijacked as hell), but I can't resist, sorry. I cached the LineBreak property as Stefan suggested, but I could not cache P2 and LineBreakLen. Brain overflow. (Less opcode, less stack usage.) Then I inverted the "if" in the middle, because the most likely case is not having a line break at the current position in the string (theoretically OoOE). Then I put Exit()'s in instead of "Result :=", which resulted in 2 more exit points, which allows the compiler to use more registers instead of the stack. (Less opcode, more registers, less stack usage.) Finally it went down to around 265 ms (from 300), and I got code that is far from any straightforward thinking, looks ugly, and is harder to understand. Maybe the last Exit() is unnecessary, but that is not the point. It's interesting how tiny things can affect the speed. Originally I was starting from 320 ms, and that was already a replacement for the RTL code, which took tens of seconds to do the same thing. But I hardly think that anybody could or would get used to writing "if then else begin <actual code> end;" just because the most likely branch is to do nothing.

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  if AChar^ <> #0 then
  begin
    P2 := AChar;
    if (AChar^ <> LLineBreak[ALineBreakPos]) or (ALineBreakPos > LineBreakLen) then
      Exit(False)
    else
      Exit(IsBreak(AChar + 1, ALineBreakPos + 1));
  end
  else
    Exit(True);
end;
Mahdi Safsafi 225 Posted June 22, 2020

@Attila Kovacs
Quote
I was struggling to write this down or not, poor dummzeuch's thread is hijacked as hell, but I can't stand, sorry.

He is really a good guy ... I'd understand if he kicked us.

Quote
It's interesting, how tiny things can affect the speed. Originally I was starting from 320ms and it was a replacement for the RTL which was using 10's of seconds to do the same thing.

Indeed! BTW, AChar is dereferenced twice! Try declaring a local Ch that holds AChar^ ("var Ch: Char := AChar^"); this may improve your function a little bit.

Quote
But I barely think that anybody could/would get used to it to write "if then else begin <actual code> end;" just because the most likely branch would be to do nothing.

You're right! It is somewhat difficult to cover all of your logic that way.
Guest Posted June 22, 2020

@Attila Kovacs I would suggest trying to remove one more conditional jump from your last version. Without looking at the generated code, I know that "or" is translated by Delphi into a conditional jump. Sometimes, in some cases (not every time and not in every case), assigning a value and validating another is cheaper than comparing and jumping: an assignment is usually one or two instructions, and the jump generated for "or" is at most one or two as well (there might be a compare or not), but the mere presence of a conditional branch can hinder out-of-order execution and the ability to save cycles via the predicted branch and reordered instructions. I have used this hundreds of times with a profiler and proven results, but it is not better every time. So, would you please time this and share it with us:

function IsBreak(AChar: PChar; ALineBreakPos: integer = 1): boolean;
begin
  if AChar^ <> #0 then
  begin
    P2 := AChar;
    {$IFOPT B+}{$DEFINE BEnabled}{$ENDIF}{$B+}
    if (AChar^ <> LLineBreak[ALineBreakPos]) or (ALineBreakPos > LineBreakLen) then
      Exit(False)
    else
      Exit(IsBreak(AChar + 1, ALineBreakPos + 1));
    {$IFNDEF BEnabled}{$B-}{$ENDIF}
  end
  else
    Exit(True);
end;

This should be one conditional jump shorter, yet it might not result in shorter execution time. I just want to point out and explain the trade-off between assigning and jumping for anyone who loves this stuff. In such a case only a profiler can tell; maybe more experienced people can tell just by looking at the Pascal code.
Remy Lebeau 1393 Posted June 22, 2020

On 6/20/2020 at 7:13 AM, dummzeuch said:
Longint = integer since basically forever.

Not on all platforms. On 64-bit iOS and 64-bit Linux, Longint is 8 bytes, whereas Integer is 4 bytes.
Remy Lebeau 1393 Posted June 22, 2020

On 6/21/2020 at 7:47 AM, Kas Ob. said:
On side note: Here i really intrigued how managed record with 10.4 is when have initializer. No one published any assembly about how local record been handled, is there hidden try..finally ? for all records or only for records with Initialize and Finalize ?

Managed records have Initialize() and Finalize() operators that users can override. But that also means they have to be called automatically when the record enters and leaves scope. So, just like with other managed types, there is a hidden try..finally to ensure Finalize() is called, yes. That is not the case with unmanaged records.

On 6/21/2020 at 7:47 AM, Kas Ob. said:
How about inline variables ? Where the hidden try..finally been inserted ? like local vars in the begging and in the end or in the middle and only when entering their begin..end

Inline variables begin and end their lifetime in the same scope they are declared in, so that is where their initialization and cleanup are performed.