dummzeuch 1505 Posted June 19, 2020

function TMemoryStream.Write(const Buffer; Count: Longint): Longint;
var
  Pos: Int64;
begin
  if (FPosition >= 0) and (Count >= 0) then
  begin
    Pos := FPosition + Count;
    if Pos > 0 then
    begin
      if Pos > FSize then
      begin
        if Pos > FCapacity then
          SetCapacity(Pos);
        FSize := Pos;
      end;
      System.Move(Buffer, (PByte(FMemory) + FPosition)^, Count);
      FPosition := Pos;
      Result := Count;
      Exit;
    end;
  end;
  Result := 0;
end;

This is TMemoryStream.Write from System.Classes (which, apart from using Int64, now seems to be unchanged since at least Delphi 2007). It looks rather inefficient to me: It checks for Count >= 0 while in my opinion it should check for Count > 0, because otherwise nothing gets written anyway and Result gets set to 0 (remember, Count is 0 in that case), which is done after the if statement anyway. If it did check for Count > 0, it could simply omit the Pos > 0 check, because we already know it must hold: FPosition >= 0 and Count > 0, so FPosition + Count must also be > 0. So, my implementation would look like this:

function TMemoryStream.Write(const Buffer; Count: Integer): Longint;
var
  Pos: Int64;
begin
  if (FPosition < 0) or (Count <= 0) then
  begin
    Result := 0;
    Exit; //==>
  end;
  Pos := FPosition + Count;
  if Pos > FSize then
  begin
    if Pos > FCapacity then
      SetCapacity(Pos);
    FSize := Pos;
  end;
  System.Move(Buffer, Pointer(Longint(FMemory) + FPosition)^, Count);
  FPosition := Pos;
  Result := Count;
end;

Am I missing something? Another oddity is that the GetSize method is not overridden to simply return FSize rather than calling Seek three times. Given that there are quite a few places in the RTL / VCL (and also JCL / JVCL, Indy and of course my own code) which use TMemoryStream as a convenient way to move around some bytes, it's odd that it still has these inefficiencies. 1 Share this post Link to post
Remy Lebeau 1394 Posted June 19, 2020 21 minutes ago, dummzeuch said: If it did check for Count > 0 it could simply omit the Pos > 0 check because we already know that it must be because FPosition >= 0 and Count > 0, so FPosition + Count must also be > 0. Unless the result of FPosition + Count is higher than High(Int64), in which case the result will wrap to a negative value. Granted, that is unlikely to happen, but it is still a possibility with signed integer arithmetic. 21 minutes ago, dummzeuch said: Another oddity is that the GetSize method is not overridden to simply return FSize rather than calling Seek three times. Perhaps, but Seek() is a very low overhead operation in a memory stream, so this is not really an issue. 21 minutes ago, dummzeuch said: Given that there are quite a few places in the RTL / VCL (and also JCL / JVCL, Indy and of course my own code) which use TMemoryStream as a convenient way to move around some bytes, it's odd that it still has these inefficiencies. Then you should take this up with Embarcadero directly, since they are the only ones who can actually change it.
Mahdi Safsafi 225 Posted June 19, 2020 Theoretically what you said is true. However, in practice Count could be negative! Moreover, the CPU likes the first implementation better than yours: the probability of calling Write with Count > 0 is much higher than calling it with zero! (Out-of-order execution.)
Guest Posted June 19, 2020 @dummzeuch If you are after efficiency, then you should stick to the original structure with regard to the Exit. In Delphi there is a single exit point in every procedure, unlike in many other languages, and that is a thing I like and hate at the same time. Delphi procedures/functions internally look like this:

function TMemoryStream.Write(const Buffer; Count: Integer): Longint;
var
  Pos: Int64;
begin
  if (FPosition < 0) or (Count <= 0) then
  begin
    Result := 0;
    //Exit; //==>  // This Exit is in fact a simple GOTO @EXITPOINT
  end;
  .....
  FPosition := Pos;
  Result := Count;
@EXITPOINT:
end;

Mahdi just wrote the explanation before me, so the theory is to minimize branching and jumping and to arrange the conditional compare so that the conditional jmp/branch is not executed; in other words, prefer the code to be one continuous piece, same as in the original function. I think PByte is safer than LongInt, and Pos should be named NewPos.

On a side note: if you like those in the RTL, then just put the following line in and try to trace it with Step Into in the debugger:

TEncoding.UTF8.GetBytes('ABCD');

That is ugly and nasty stuff right there. Try to count how many if-then-raise you will encounter, then try to figure out why WideCharToMultiByte is called 3 times, against what the Microsoft API intended. Simply put, all the RTL encoding stuff is mostly counterproductive; that is the stuff in real need of being addressed. Can the encoding classes be implemented differently? Yes, and they should be, and I recommend that nobody use any line from that pile when speed is a concern or a factor.
Attila Kovacs 629 Posted June 19, 2020

if (FPosition >= 0) and (Count >= 0) then

also translates to

if not ((FPosition >= 0) and (Count >= 0)) then
  goto @exitpoint;

so I can't see any performance hit with Exit(). Also, I can't see any point in optimizing this method at all. Do you really need to save 1-2 CPU cycles in a stream operation? Strange.
Guest Posted June 20, 2020 5 hours ago, Attila Kovacs said: so I can't see any performance hit on Exit(). You will not see it, not like that, because it is very hard to measure; the magnitude may range between 0 and a few cycles (and in fact it might even measure minus a few cycles). Out-of-order execution is a complicated matter, and what I wrote above is only a fraction of how it works, because it needs the compiler's assistance to make sure the assembly code helps create a better chance for the CPU to use it. Explaining with an example would be better, and the internet is full of resources about this subject, but I found this on Stack Exchange, which has 2 nice and accurate answers: https://cs.stackexchange.com/questions/97407/difference-between-delayed-branches-and-out-of-order-execution One more fact: the compiler should be aware of these situations and utilize different registers to help the CPU execute instructions out of order, and here the Delphi compiler is helpless. Now, to see it at work you should create a test/profile case that uses low-level timing directly from the CPU itself (use the RDTSC instruction), then time that part. As we all have modern CPUs, your CPU most likely has some technology like SpeedStep, so go into the BIOS and disable it, then time again to establish a baseline; only after that will you get a better idea of whether out-of-order execution is being used there. Here we need the actual speed of the CPU as built, without those enhancements. We are not talking about slower performance; rather, we might get better performance by helping the CPU decide and execute a bunch of instructions at the same time.
One more thing about profiling: if your timing varies, meaning you do not get a 100% exact result (like 1.23456 seconds every single time), then you need to make sure that all enhanced CPU technologies are disabled first (SpeedStep, Hyper-Threading, ...). If you still get varying results, then switch to using fibers instead of threads; a fiber will not be interrupted (less interaction from the OS) and has less chance of suffering a context switch before yielding, so you will get a stable result. After that, enable those enhancements again and switch between the jump/branch variants to measure the result. In other words, unless you get the same result from each and every execution without these CPU enhancements, measuring the speed with them is invalid. 5 hours ago, Attila Kovacs said: Also, I can't see any sense for any optimization in this method. Do you really need to save 1-2 CPU cycles in a stream operation? Strange. Yes and no. 2 cycles here, with an additional few in case out-of-order execution is triggered and helped, might look negligible, but think of it differently: it is the RTL, and it runs in every Delphi application working around the globe. Simply put: why not? Imagine how many cycles have been wasted since Delphi 2007, as the OP mentioned. The RTL is not something you change every day, and it should be the best of the best; the number of cycles that could be saved is literally an unimaginable number, right? Now imagine you could squeeze 2 cycles out of most RTL procedures! On the other hand, what is the cost of having a better and faster RTL? Nothing. It is there and will be there forever, and it should be evolving in that part. Hiring or outsourcing that part to real professionals who know how things should be done is most likely cheaper than visiting a new city for a marketing festival.
On a side note: I have a simple example which shows this out-of-order execution with SIMD instructions and might raise your eyebrows. I am not sure whether to put it here now or later, after dummzeuch agrees to steer the subject away. dummzeuch seems like a nice guy; he is not Daniel, he is not Lars, those guys are traumatizing material.
Attila Kovacs 629 Posted June 20, 2020 (edited) And how do you measure the power consumption of the out-of-order execution algorithm? Isn't it possible that, in order to save some cycles in the code, this algorithm in the CPU has to consume a bit more energy? Edited June 20, 2020 by Attila Kovacs
Guest Posted June 20, 2020 4 minutes ago, Attila Kovacs said: And how do you measure the out-of-order-execution's algorithm's power consumption? Well, that is beyond me! But I have some thoughts on that: 1) I don't think it will be a huge factor. 2) Out-of-order execution is working all the time for every step (pre-determined by specific algorithms, mostly in the CPU back-end) and can only be disabled in the BIOS. This means the wasted power is simply wasted and nothing can be done about it, unless your code uses it efficiently to decrease the power consumed by the processed code; in other words, your code should help reclaim the wasted power, or even save many times more than is wasted. 3) Modern CPUs decrease the clock to save a huge amount of power, so the power wasted above is regained many times over, which means these small algorithms' power consumption has little impact. 4) If you think about those algorithms and how they work, you will be intrigued, as they work faster than the code, a few times faster indeed in some cases; that means those hardware algorithms work at a higher clock frequency than they nominally should, but about this I don't have much information. 5) Those algorithms (CPU front-end and back-end) are responsible for handling the cache, preparing the cache lines, and that on 3 levels, which is not a power-free operation! The more your code misses the cache, the more power you waste. Code is brought in as blocks, so when your code jumps forward or backward there is a chance you left the block, and execution will stall until that part (the destination) has been fetched; here again more power is wasted. On this subject: every memory access passes through specific circuits (algorithms) to determine whether the data is in the cache, needs bringing in, or should be handled in memory itself. 15 minutes ago, Attila Kovacs said: Isn't it possible that for sparing some cycles in the code, this algo in the cpu has to consume a bit more energy?
I am not sure I understand the question right, but I may have covered some part of it above. As a rule of thumb, and to clarify: faster code mostly uses less power. That is right and simple, but you should be aware of how to compare correctly. When you switch between two algorithms, like comparing a critical section against spin-lock looping in multithreaded code, that is a different thing: yes, in multithreaded applications the looping is faster and needs far less time in many cases, but you consumed a huge amount of power to achieve that speed by looping in place. Such situations should not be compared; they are an apples-and-oranges thing. Resources about this on the internet are scarce and most of the time concern only a specific small part, as those algorithms and behaviours are manufacturer-specific and even CPU-generation-specific. Here is a quote from one of those resources: https://www.realworldtech.com/sandy-bridge/3/ Quote One of the areas that Intel's microarchitects concentrate on most keenly is branch prediction. It seems like hardly a generation goes by without Intel improving the branch predictors in one fashion or another. The rationale is fairly straight forward. Many improvements that increase performance also increase the energy used; to maintain efficiency, microarchitects must ensure that a new feature gains more performance than it costs in energy or power. Last, if someone has more information about this, they are welcome to share it; also, if I am wrong, I would love to be corrected.
Guest Posted June 20, 2020 I missed pasting the rest of the quote, and the forum removes all the text when I click edit! Quote One of the areas that Intel's microarchitects concentrate on most keenly is branch prediction. It seems like hardly a generation goes by without Intel improving the branch predictors in one fashion or another. The rationale is fairly straight forward. Many improvements that increase performance also increase the energy used; to maintain efficiency, microarchitects must ensure that a new feature gains more performance than it costs in energy or power. In contrast, branch prediction is one of the few areas where improvements generally increase performance and decrease energy usage. Each mispredicted branch will flush the entire pipeline, losing the work of up to a hundred or so in-flight instructions and wasting all the energy expended on those instructions. Consequently, avoiding expensive mispredictions with better branch predictors is highly desirable and a prime focus for Intel.
Attila Kovacs 629 Posted June 20, 2020 @Kas Ob. Nice, this quote about Intel and their work. Now, how does an early exit differ from an "if then begin end;", given that the Delphi compiler, as far as I have seen until now, does not duplicate the return code but emits jumps to the end? Only the condition changes, true/false.
dummzeuch 1505 Posted June 20, 2020 (edited) OK, so I understand that there might be the possibility of an Int64 overflow here: 17 hours ago, dummzeuch said: Pos := FPosition + Count; which could result in Pos becoming negative. On the other hand, that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 bytes, or 2^33 gibibytes. Do 64-bit processors actually have the capacity to address that much (virtual) memory? Are there any other bugs in my implementation? There might also be no performance gain at all due to highly sophisticated features of the CPU. And of course: there might be other possible improvements. If I want this to be optimized, I should submit a report to Embarcadero. Why did I actually bring this up? I was looking at some code that was extensively using TMemoryStream as a buffer for various operations, where it accessed the Position and Size properties all the time, and I thought about the possible overhead of these method calls. Then I looked at the implementation of GetSize and found 3 calls to Seek in there, just to eventually get the value of a field. And then I looked at various other methods of T(Custom)MemoryStream and thought: WTF is this? That I used the Write method as an example above was pure chance; it was just the last method I had been looking at. Edited June 20, 2020 by dummzeuch
Guest Posted June 20, 2020 (edited) 2 hours ago, dummzeuch said: which could result in Pos becoming negative. On the other hand that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 or 2^33 gibibytes. Do 64 bit processors actually have the capacity of addressing that much (virtual) memory? Pos is a virtual address, not a physical one, and can easily be hitting that range, as this is up to the OS. If you remember, in 32-bit Windows many kernel DLLs are allocated close to the 2 GB range while the system might have only 1 GB of physical RAM; the same goes for 64-bit. This is documented somewhere, like here, though it does not quite show that kernel DLLs are allocated at the top; that is a thing you can see in the debugger: https://docs.microsoft.com/en-us/windows-hardware/drivers/gettingstarted/virtual-address-spaces 2 hours ago, dummzeuch said: Why did I actually bring this up? I was looking at some code that was extensively using TMemoryStream as a buffer for various operations, where it accessed the Position and Size properties all the time and thought about the possible overhead of these method calls. Then I looked at the implementation of GetSize and found 3 calls to Seek in there, just to eventually get the value of a field. And then I looked at various other methods of T(Custom)MemoryStream and thought WTF is this? That I used the Write method as an example above was pure chance, it was just the last method I had been looking at. Any optimization is good, as long as it yields a performance return (very small is also not nothing), or at least has a better chance of helping modern CPUs get a better execution time, or less jumping around causing the cache to become a bottleneck. Edit: removed a mistake talking about a different thing. Edited June 20, 2020 by Guest
Guest Posted June 20, 2020 37 minutes ago, Attila Kovacs said: Now, how does an early exit vs. an "if then begin end;" differ, as the Delphi compiler, as far I have seen until now, does not multiply the return code but results in jumps to the end. Only the condition changes, true/false. Exactly. Those jumps will be there, and they completely depend on the code, especially if they are a small portion of a method (not the entire method/function). In other words, there will be jumps; the point is to make the code with the higher chance of being executed (not skipped) run right after the conditional compare, combining with the code after that jump to give the CPU the ability, with that hint, to go on and execute the bigger piece of code. Suppose we have 10 instructions to reach the compare-and-jump, then 10 for the true case and 10 for the false case. The best way to help (hint) the CPU is to place the 10 instructions with the higher chance of being executed right after that conditional compare (if), because the CPU will jump in both cases; only in the first case does it have a higher probability of combining and enhancing the execution of those 20 instructions (plus the compare and jump) in an out-of-order manner. The less probable case will not always get that much attention from the CPU, as it may be far away, or the CPU would have to split the algorithm to calculate both branches; that is unlikely to happen soon, yet modern CPUs are getting smarter every day.
Guest Posted June 20, 2020 Here is a good old article from Intel: https://software.intel.com/content/www/us/en/develop/articles/branch-and-loop-reorganization-to-prevent-mispredicts.html C/C++ compilers already have different methods that allow you to hint the compiler about the likeliness of a conditional branch; Google will help there. https://en.cppreference.com/w/cpp/language/attributes/likely https://llvm.org/docs/BranchWeightMetadata.html
Attila Kovacs 629 Posted June 20, 2020 (edited) @Kas Ob. Ah, I see what you were referring to. 23 minutes ago, Kas Ob. said: the point is to make the most expected code to run But this was/should be the case even without knowing any implementation details of the CPU. At least for me. Yes, considering this, an early exit could screw up the CPU's out-of-order algorithm. Interesting. Edited June 20, 2020 by Attila Kovacs
Guest Posted June 20, 2020 I believe the term is flushing the pipeline; from the same article quoted above: Quote (talking about Sandy Bridge here) Each mispredicted branch will flush the entire pipeline, losing the work of up to a hundred or so in-flight instructions and wasting all the energy expended on those instructions. .... Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
Mahdi Safsafi 225 Posted June 20, 2020 Quote But this was/should be the case even without knowing any implementation detail of the CPU. OoOE is an old technology designed to reduce CPU idle time. It means the CPU is free to rearrange the order of execution of some logic, but must yield the same result. Branch prediction (BP) was needed to make that feature work smoothly. So in short, there were two important time-frames here. In the era before the P4, developers/compilers were forced to cooperate with the CPU in order to benefit from this feature: the CPU provided special opcode prefixes for branch hints (0x2E: branch not taken; 0x3E: branch taken), and compilers such as gcc provided a built-in function to support that (__builtin_expect). Developers had to use the asm version or the built-in function to benefit from BP. Obviously, this wasn't good, because only highly qualified developers (who clearly understood the technology) were able to use it; besides, they couldn't even cover all their logic (it was just impossible to cover all your if statements). CPU makers realized that and replaced the ugly 0x2E/0x3E prefixes with an automatic solution that does not require developer cooperation (> P4). Today, CPU makers are working hard to improve OoOE/BP, because it became clear that over multiple runs (not a single run) this can give high performance (they dedicate a special CPU unit just to BP). Life is easier now, but you still need to understand the concept in order to write a high-performance version of your logic. For example, the second implementation of TMemoryStream.Write has two states and spins on Result := 0. The original implementation has mostly one state and is highly optimized for Count > 0 over multiple runs.

function TMemoryStream.Write(const Buffer; Count: Integer): Longint;
var
  Pos: Int64;
begin
  // there is a high chance that the condition is false. If the CPU predicts true ... that's a waste of time.
  if (FPosition < 0) or (Count <= 0) then
  begin
    Result := 0;
    Exit; // CPU must wait to validate BP (idle).
  end
  else
  begin
    // state2:
    Pos := FPosition + Count;
    if Pos > FSize then
    begin
      if Pos > FCapacity then
        SetCapacity(Pos);
      FSize := Pos;
    end;
    System.Move(Buffer, Pointer(Longint(FMemory) + FPosition)^, Count);
    FPosition := Pos;
    Result := Count;
    // if the CPU prediction was wrong, it must recover from state2 (heavy).
  end;
end;

function TMemoryStream.Write(const Buffer; Count: Longint): Longint;
var
  Pos: Int64;
begin
  // there is a high chance that the condition is true. If the CPU predicts true ... it saved time.
  if (FPosition >= 0) and (Count >= 0) then
  begin
    Pos := FPosition + Count;
    if Pos > 0 then
    begin
      if Pos > FSize then
      begin
        if Pos > FCapacity then
          SetCapacity(Pos);
        FSize := Pos;
      end;
      System.Move(Buffer, (PByte(FMemory) + FPosition)^, Count);
      FPosition := Pos;
      Result := Count;
      Exit; // CPU may wait here to validate BP(s).
    end;
  end;
  // state2:
  Result := 0; // recovering from state2 is not heavy.
end;
Mahdi Safsafi 225 Posted June 20, 2020 Quote which could result in Pos becoming negative. On the other hand that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 or 2^33 gibibytes. Do 64 bit processors actually have the capacity of addressing that much (virtual) memory? AFAIK, there is no implementation that can handle this massive amount of data. Streams use Int64 to allow a stream to handle more than 4 GB, but not 2^33 GiB. Quote Are there any other bugs in my implementation? Why did you replace LongInt with Integer for Count? Is this a typo?
Mahdi Safsafi 225 Posted June 20, 2020 @Kas Ob. You may want to take a look at the Intel Performance Counter Monitor (PCM). I haven't tried it myself, but a friend of mine recommended it to me.
Guest Posted June 20, 2020 2 hours ago, Kas Ob. said: 2 hours ago, dummzeuch said: which could result in Pos becoming negative. On the other hand that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 or 2^33 gibibytes. Do 64 bit processors actually have the capacity of addressing that much (virtual) memory? Pos is a virtual address not a physical one, and can easily be hitting that range, as this up to the OS, if you have remember from 32 bits many Windows Kernel DLL's are allocated close to 2GB range while the system might have only 1 GB of physical RAM, same thing goes for 64bit, this might be documented somewhere like this but not quite shows that Kernel dll's are allocated on top, that is a thing you can see in the debugger https://docs.microsoft.com/en-us/windows-hardware/drivers/gettingstarted/virtual-address-spaces That was a mistake on my part; I got confused between Pos and the LongInt casting of FMemory. Please ignore it.
Attila Kovacs 629 Posted June 20, 2020 (edited) @Mahdi Safsafi Thanks for your explanation through the example with source code, really nice. I was curious how the prediction works and found that "As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt's work." and "Forward branches dominate backward branches by about 4 to 1 (whether conditional or not). About 60% of the forward conditional branches are taken, while approximately 85% of the backward conditional branches are taken (because of the prevalence of program loops). Just knowing this data about average code behavior, we could optimize our architecture for the common cases. A "Static Predictor" can just look at the offset (distance forward or backward from current PC) for conditional branches as soon as the instruction is decoded. Backward branches will be predicted to be taken, since that is the most common case. The accuracy of the static predictor will depend on the type of code being executed, as well as the coding style used by the programmer. These statistics were derived from the SPEC suite of benchmarks, and many PC software workloads will favor slightly different static behavior." However, I did not find anything about "first time they see them": in what frame of reference are they considered "first seen"? This means we can tell that at the first "if" in the example above, the prediction is "do not take the else case". Not sure what happens on the next call to the method, though. Edited June 20, 2020 by Attila Kovacs
dummzeuch 1505 Posted June 20, 2020 1 hour ago, Mahdi Safsafi said: AFAIK, there is no implementation that can handle this massive amount of data. Stream uses Int64 to allow a stream to handle > 4GB but not 2^33GB. Why you replaced LongInt with Integer for count ? Is this a typos ? Longint = integer since basically forever. Share this post Link to post
Guest Posted June 20, 2020 12 minutes ago, Attila Kovacs said: However I did not found anything about "first time they see them", in what inertia system are they considered as "first seen". This means, we can tell that at the first "if" in the example above, the prediction is "not to take the else case". Not sure what happens on the next call to the method though, Right, and the next time, most modern CPUs might already expect to take the else. But the question is: in real-life code, will that if be executed that many times in succession while the predictor still remembers to predict taking the else? I mean, the decision-memory window of the predictor can't be huge and will only cover a small portion of instructions, so when your code leaves that scope the predictor has already forgotten about that else and will take the hit again, right?
Mahdi Safsafi 225 Posted June 20, 2020 27 minutes ago, dummzeuch said: Longint = integer since basically forever. Longint does not have a fixed width on all platforms:

// Delphi 10.3
// System unit
// line 242:
{$IF SizeOf(LongInt) = 8}
  {$DEFINE LONGINT64}
  {$DEFINE LONGINTISCPPLONG}
{$ENDIF}

So I prefer to stick with the original declaration.
Mahdi Safsafi 225 Posted June 20, 2020 Quote However I did not found anything about "first time they see them", in what inertia system are they considered as "first seen". They're considered first seen when there is no previous information available. When the CPU has no prior information about a branch, it assumes that the jump is taken (because most of the time we intend to execute the code inside the "if statement"). BP uses complex algorithms for the prediction (85-90% of the predictions are correct!). Modern CPUs achieve up to 95%! Those algorithms have kept improving since the first implementations. As I said before, recent CPUs have a fully specialized CPU unit for BP. While your program is running, the CPU records all executed branches. When calling a piece of logic for the second, third, ... time, the CPU uses the previously available information to predict whether a branch is going to be taken or not. Note that this technology is widely used by many architectures (not only x86); however, implementations vary (some have a dedicated BP unit, others not; some are more capable of OoOE, others have limited support, ...).