PaulM117, posted November 8, 2023 (edited):
I have set up my project to compile with Delphi 12 Athens, released today, and with my existing Delphi 11.1 Alexandria.
1. This is an absolutely ceteris paribus comparison (same project, same code, same compiler options, same linker options).
2. I compile my application in Win64 Release on Delphi 12: 15.0 MB.
3. Immediately afterwards I compile the exact same application/project/config in Delphi 11.1: 14.7 MB.
Does anybody know what Embarcadero changed that would explain the difference? Based on all the info released on Delphi 12, I do not see any reference to core RTL changes that would explain it.
I know "it's only 0.3 MB", but the problem is that it's 0.3 MB that shouldn't be there if I am doing the exact same thing with the exact same versions of libraries, the exact same code, and the exact same compiler options. Do you think such a thing happens when new versions of Microsoft Visual Studio C++ get released? To me this is only acceptable if the announced changes for the version upgrade include a disclosure about core RTL changes, so that a reasonable person in my shoes can understand why and where the bloat is coming from. For code alone, a 0.3 MB difference seems hard to justify.
Does anyone know where this is coming from, based upon what's been disclosed so far? I did not find on the Embarcadero website today a link to a wiki-style documentation page listing all of the Delphi changes in detail from 11.3 where I can review RTL improvements. Has such a listing been released yet, as for previous versions?
Vandrovnik, posted November 8, 2023:
32 minutes ago, PaulM117 said: "Does anyone know where this is coming from, based upon what's been disclosed so far? I did not find on the Embarcadero website today a link to a wiki-style documentation page listing all of the Delphi changes in-detail from 11.3 where I can review RTL improvements. Has such a listing been released yet as for previous versions?"
List of changes: https://docwiki.embarcadero.com/RADStudio/Athens/en/New_features_and_customer_reported_issues_fixed_in_RAD_Studio_12.0
Stefan Glienke, posted November 8, 2023 (edited):
You realize that it's not the "exact same code" if you compare 11 and 12, right? The code in the RTL and the VCL or FMX (depending on which one you use) changes between those versions. It only takes one use of a class somewhere that was not used before, or the introduction of an additional list of something inside some class, and the binary size increases.
Go diff the source directories under C:\Program Files (x86)\Embarcadero\Studio\22.0 and C:\Program Files (x86)\Embarcadero\Studio\23.0 and check for yourself. Or build with a map file and then diff the map files to find these changes.
Also - and this might be marginal but contribute to the overall increase: they changed the count and index of all collection types to NativeInt, which means that some instructions involving those might be a few bytes larger on 64bit (see an x86-64 instruction reference of your choice for details). On the other hand, when I think about it, it might also save a few bytes because it does not do register widening anymore on 64bit. So take this last paragraph with a grain of salt and consider it just additional information.
I would guess a few fixes and new features here and there can easily add up to 300K more binary size - especially with the general issue the Delphi compiler has with generics - see RSP-16520.
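A minimal sketch of the map-file approach suggested above, not a finished tool. It sums segment lengths per unit so that the output for the 11.1 build and the 12 build can be compared with any text diff tool. The program name `MapUnitSizes` is made up for illustration, and the assumed line layout (a hex length as the second token and a `M=<unit>` token) should be checked against a real detailed map file from your compiler version before trusting the numbers.

```pascal
program MapUnitSizes;
{$APPTYPE CONSOLE}

// Sketch only. ASSUMPTION: lines in the "Detailed map of segments" section
// of a detailed map file look roughly like
//   0001:00000000 0000E1C8 C=CODE S=.text G=(none) M=System ACBP=A9
// i.e. the second token is a hex length and one token names the unit as
// M=<unit>. Lines that do not match this shape are skipped.

uses
  System.SysUtils, System.Classes, System.Generics.Collections;

var
  Lines: TStringList;
  Sizes: TDictionary<string, Int64>;
  Tokens, Names: TArray<string>;
  Line, Tok, ModName: string;
  Len, Sum: Int64;
begin
  if ParamCount < 1 then
  begin
    Writeln('Usage: MapUnitSizes <file.map>');
    Exit;
  end;
  Lines := TStringList.Create;
  Sizes := TDictionary<string, Int64>.Create;
  try
    Lines.LoadFromFile(ParamStr(1));
    for Line in Lines do
    begin
      Tokens := Line.Split([' '], TStringSplitOptions.ExcludeEmpty);
      // Expect "seg:offset length ..." as the first two tokens
      if (Length(Tokens) < 3) or (Pos(':', Tokens[0]) = 0) then
        Continue;
      ModName := '';
      for Tok in Tokens do
        if Tok.StartsWith('M=') then
          ModName := Tok.Substring(2);
      Len := StrToInt64Def('$' + Tokens[1], -1); // '$' prefix = hexadecimal
      if (ModName = '') or (Len < 0) then
        Continue;
      if not Sizes.TryGetValue(ModName, Sum) then
        Sum := 0;
      Sizes.AddOrSetValue(ModName, Sum + Len);
    end;
    // Sorted output keeps two runs line-comparable in a diff tool
    Names := Sizes.Keys.ToArray;
    TArray.Sort<string>(Names);
    for ModName in Names do
      Writeln(Format('%-40s %10d', [ModName, Sizes[ModName]]));
  finally
    Sizes.Free;
    Lines.Free;
  end;
end.
```

Running it once per map file and redirecting each run to a text file gives two unit-by-unit listings; units that appear in only one listing, or whose totals grew, are the places to look first.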
PaulM117, posted November 8, 2023 (edited):
Thanks both for your input. The most meaningful answer is that, as you pointed out Stefan, I could diff the RTL directories and examine them myself. I'd be curious to hear of any other RTL changes that might be significant for the developer of a performance-critical Delphi 11.1 app to know about when upgrading. The full wiki bugfix list is helpful and I am currently poring over it. Hopefully I can finally get rid of the custom FastMove routine I copied from someone's suggestion on the quality report site and assume that the built-in Move() is now as optimized.
Stefan Glienke, posted November 8, 2023 (edited):
20 minutes ago, PaulM117 said: "built-in Move() is now as optimized."
It better be - I spent quite some time on it. FWIW it was already introduced in 11.3.
David Heffernan, posted November 8, 2023:
Why do you care about these tiny differences in these tiny executables? If you care, use Delphi 5.
PaulM117, posted November 8, 2023:
29 minutes ago, David Heffernan said: "Why do you care about these tiny differences in these tiny executables? If you care use Delphi 5."
User perception and optimization of robust tactical deployment of the application to Win64 targets (minimal bandwidth/file IO usage). Admittedly, EXE size is a comparatively small factor for these stated ends when compared with runtime performance. It's necessary, though, that I at least know what is occupying the size. I need to be able to reason about the contents of the binary file and ensure there is no major waste or unknowns. Instruction cache effects could also theoretically impact performance negatively with larger EXE sizes, could they not?
David Heffernan, posted November 8, 2023 (edited):
6 minutes ago, PaulM117 said: "User perception and optimization of robust tactical deployment of application to Win64 targets (minimal bandwidth/file IO usage)."
I don't understand this sentence.
6 minutes ago, PaulM117 said: "It's necessary though that I at least know what is occupying the size."
Read the map file.
6 minutes ago, PaulM117 said: "I need to be able to reason about the contents of the binary file and ensure there is no major waste or unknowns"
This hasn't bothered you before, so why do you suddenly need to reason about this now?
6 minutes ago, PaulM117 said: "Instruction cache effects could also theoretically impact performance negatively with larger EXE sizes, could they not?"
What are these effects you talk about? Anyway, you measured the size of your executables, but now you claim that performance is what concerns you. If performance concerns you, I recommend that you measure performance. Seems like you've measured the wrong thing. Anyway, if you really cared about performance you'd surely not be using Delphi, literally the worst compiler for performance.
Stefan Glienke, posted November 8, 2023 (edited):
10 minutes ago, PaulM117 said: "Instruction cache effects could also theoretically impact performance negatively with larger EXE sizes, could they not?"
If you care about such things you are totally wrong using Delphi - especially 64bit. It makes suboptimal use of all the registers available in 64bit, it produces a crapton of conditional jumps instead of better alternatives that have existed for decades, it does not use SSE (let alone AVX), which has existed almost as long (except for some floating point stuff), it does zero optimization with respect to loop alignment, and it does not restructure binary code so that cold code does not sit in the middle of hot code.
I could go on, but these are just a few things that matter more than some 300K of binary size if you really care about squeezing the last ns out of your application.
Anders Melander, posted November 8, 2023:
57 minutes ago, David Heffernan said: "Anyway, if you really care about performance you'd surely not be using delphi, literally the worst compiler for performance."
It's possible to care about more than one single thing. Personally, I care a lot about performance but I also care about code readability, ease of UI design, and TBH, the amount of fun I have writing the code. If I only cared about performance then I probably wouldn't use Delphi.
David Heffernan, posted November 8, 2023:
4 minutes ago, Anders Melander said: "Personally, I care a lot about performance but I also care about code readability, ease of UI design, and TBH, the amount of fun I have writing the code."
I'm pretty sure that you can have all of those things in the same language.
Anders Melander, posted November 8, 2023:
2 minutes ago, David Heffernan said: "I'm pretty sure that you can have all of those things in the same language"
I'm pretty sure that depends on who "you" is. So far I haven't found one.
ŁukaszDe, posted November 9, 2023:
@PaulM117 If the size of the exe file is important to you, try UPX: https://upx.github.io. For my project, UPX turned a 42 MB exe file into a 12 MB one.
Sherlock, posted November 9, 2023:
Just to clarify: have you not seen increases or other differences between Delphi 11.1 and 11.3? I mean, there was some work done there too. Why compare "some version along the way" to the newest, and not the last version before the new release?
David Heffernan, posted November 9, 2023:
5 hours ago, ŁukaszDe said: "@PaulM117 If size of exe file is important for you, try Upx https://upx.github.io. For my project, Upx made 12MB exe file from 42 MB."
And it makes your app a target for antivirus products. And for what gain? You end up with the same executable loaded in memory. I've never understood the point of this tool.
Anders Melander, posted November 9, 2023 (edited November 10, 2023):
20 hours ago, David Heffernan said: "I've never understood the point of this tool."
It would make his project fit on 10 floppy disks instead of 35.
PaulM117, posted November 9, 2023 (edited):
19 hours ago, David Heffernan said: "This hasn't bothered you before, why do you suddenly need to reason about this now."
In September I restarted my flagship application from scratch and incrementally tracked each increase in file size, starting from a bare Win64 VCL app in Delphi 11.1. So I did have a reasonable accounting. I never bought 11.3, as my update subscription had expired.
19 hours ago, David Heffernan said: "What are these effects you talk about? Anyway, you measured the size of your executables, but now you claim that performance is what concerns. If performance concerns you I recommend that you measure performance. Seems like you've measured the wrong thing. Anyway, if you really care about performance you'd surely not be using delphi, literally the worst compiler for performance."
I do of course measure performance. This was about EXE bloat. Your posts and scattered internet writings, along with those of Dalija, Primoz, Arnaud, Remy, and others I am forgetting, have helped me tremendously through the years to learn performant and efficient coding - I am thankful for the dialogue. Let me try my best to counter your argument, which is mostly true in practice, for my specific case. It is possible to attain optimal performance in Delphi simply because we can write Win64 assembler code. Moreover:
19 hours ago, Stefan Glienke said: "If you care about such things you are totally wrong using Delphi - especially 64bit. It does suboptimal use of all the registers available in 64bit, it produces a crapton of conditional jumps instead of better alternatives that exist for decades, it does not use SSE (let alone AVX) which exists almost as long (except for some floating point stuff), it does zero optimization wrt to loop alignment, it does not restructure binary code so that some cold code does not sit in the middle of some hot code. I could go on, but these are just a few things that matter more if you really care about the least ns squeezed out of your application than some 300K of binary size."
For SSE2, I use Neslib.FastMath, which beats the MS D3DX10 DLL SSE-optimized libraries I was using previously. This is a graphics application, and it is heavily GPU bound thanks to the excellent, lean, low-level, cache-efficient coding strategies for CPU code that are possible in Delphi. I am running at 120fps with less than 2% CPU usage in Task Manager with an unfinished app, and am confident I can keep it that way. A C++ app where the programmer has not laid out memory in a cache-efficient manner will have worse performance than a correctly written Delphi Win64 app using FastMath for SSE (packing into TVector4s).
I use inline constants everywhere to pull things into registers, and I pay as much attention as I can to producing clean assembly with Object Pascal syntax. For instance, I never write for var I := 0 to Length(arr)-1 in hot paths, after I found (admittedly a few versions ago) that it did a repeat call to Length() and/or the subtraction - I always go const AHi = High(arr); for var I := 0 to AHi do, etc. If I were using C++ I would have the convenience of trusting that little stupid things like that were already taken care of for me, but the benefits of Delphi for UI design outweigh the minor inconveniences.
However, I know you are right about the overall quality of Embarcadero's compiler, which I lament. I greatly lament that Embarcadero has spent more time on Firemonkey/C++/database stuff that's irrelevant to me rather than on a highly optimized Win32/64 compiler and built-in profiling/instrumenting tools. However, this is just a matter of convenience. I still reject the idea that I can't get performance out of a Delphi Win64 application that is as good as MSVC's, only that it requires much more attention in some areas. Let me know if you disagree with that.
BTW - I did decide against the EXE compression libraries for that very reason: their apparent likelihood of triggering AV detections.
Stefan Glienke, posted November 9, 2023 (edited):
4 minutes ago, PaulM117 said: "I never write for var I := 0 to Length(arr) in hot paths, after I found (admittedly a few versions ago) that it did a repeat call to Length()"
This is false knowledge - it only does repeated calls to Length when you do a for x in some_dynamic_array do loop.
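A small sketch that illustrates the language-level side of this point: in a for-to loop the final value is evaluated once, before the loop starts. `CountedLength` is a hypothetical stand-in for Length() that counts its own calls. It says nothing about the for-in case mentioned above, and it does not replace looking at the generated code in the CPU view.

```pascal
program LoopBoundOnce;
{$APPTYPE CONSOLE}

var
  Calls: Integer = 0;
  Data: TArray<Integer>;
  Sum: Integer;

// Hypothetical helper: behaves like Length() but counts how often it runs.
function CountedLength(const A: TArray<Integer>): Integer;
begin
  Inc(Calls);
  Result := Length(A);
end;

begin
  Data := [10, 20, 30, 40];
  Sum := 0;
  // The upper bound expression is evaluated before the first iteration only
  for var I := 0 to CountedLength(Data) - 1 do
    Inc(Sum, Data[I]);
  Writeln('Sum = ', Sum, ', bound evaluated ', Calls, ' time(s)');
  // prints: Sum = 100, bound evaluated 1 time(s)
end.
```

Hoisting the bound into a const, as in const AHi = High(arr), is therefore a readability choice for for-to loops rather than something the loop semantics require.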
PaulM117, posted November 9, 2023:
Just now, Stefan Glienke said: "This is false knowledge - it only does repeated calls to Length when you do for x in some_dynamic_array do loop"
I need to recheck this behavior in 12. Maybe it was just the extra subtraction, but I swear I saw in the assembly that some ridiculously basic optimization was omitted and the output was wasteful. Probably that was 10.4. In general, things like this have cultivated a very helpful distrust of the compiler.
OK, here's another one you will probably agree with. In C++, most compilers will take simple loops linearly iterating over an array and omit the re-indexing computation by incrementing a pointer. But I always have to go var pFirstItem := PInteger(@IntArr[0]); for I := 0 to IntArrHigh do begin pFirstItem^ := something; Inc(pFirstItem); end. With this kind of attention as a common practice, I really wonder how much performance in the low-level assembly code I am missing compared to MSVC.
My belief that memory layout and cache usage efficiency are orders of magnitude more meaningful seems to be supported by recent industry experience, books, talks, articles, trends, etc. I'm part of the whole "Data Oriented Design" religion and Delphi allows me to do this well.
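For reference, the two loop shapes being compared can be sketched as follows. The procedure names `FillIndexed` and `FillByPointer` are made up for illustration, and the program only checks that both shapes produce the same result. Which one is faster is exactly the kind of question that should be settled by measuring and reading the generated assembly for a given compiler version.

```pascal
program FillLoops;
{$APPTYPE CONSOLE}

// Shape 1: plain indexing; NativeInt avoids index widening on Win64
procedure FillIndexed(var A: TArray<Integer>; Value: Integer);
begin
  for var I: NativeInt := 0 to High(A) do
    A[I] := Value;
end;

// Shape 2: walk a typed pointer through the array
procedure FillByPointer(var A: TArray<Integer>; Value: Integer);
begin
  if Length(A) = 0 then
    Exit; // @A[0] is not valid for an empty array
  var P := PInteger(@A[0]);
  for var I: NativeInt := 0 to High(A) do
  begin
    P^ := Value;
    Inc(P); // advances by SizeOf(Integer)
  end;
end;

var
  X, Y: TArray<Integer>;
  Same: Boolean;
begin
  SetLength(X, 8);
  SetLength(Y, 8);
  FillIndexed(X, 42);
  FillByPointer(Y, 42);
  Same := True;
  for var I := 0 to High(X) do
    Same := Same and (X[I] = Y[I]);
  Writeln('Both loops produced the same array: ', Same);
end.
```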
Stefan Glienke, posted November 9, 2023 (edited):
1 hour ago, PaulM117 said: "In C++, most compilers will take simple loops linearly iterating over an array and omit the re-indexing computation by incrementing a pointer. But I always have to go var pFirstItem := PInteger(@IntArr[0]); for I := 0 to IntArrHigh do begin pFirstItem^ := something; Inc(pFirstItem); end. With this kind of attention as a common practice, I really wonder how much performance in the low-level assembly code I am missing as compared to MSVC."
Also not true (anymore, it might have been decades ago). It might happen that a shifting pointer is better because of register pressure or shortage under x86, since it only requires one register as opposed to two when indexing into an array. But indexing into a memory address with an increasing or decreasing index register is always faster.
Another situation might occur when your array is a field of your class and you index into that one, because then the compiler is really stupid and re-reads the field every time and then indexes into it. But then the issue is re-reading the field and not the indexing into it. I solved this by putting the array into a local pointer variable and then indexing into that one - like here.
And yet another situation occurs on 64bit when using an Integer index variable, because then it always does an extra register widening instruction, which may not be zero cost (yes, I need to fix the code I just pointed to because it does exactly that, having i declared as Integer and not NativeInt as it should be, shoot me).
Oh, one particularly bad thing about dcc64 is that it does not really optimize some instructions in loops well. From dcc32 we know about the counting-down-to-0 behavior of a for-to loop, where it maintains two counters: the actual index variable (if you actually use that within the loop) and the counting-down-to-0 variable that it uses to control the loop. For that it usually uses the dec/jnz combination, which works well, macro fuses and all that. On win64 it does sub reg, 1 / test reg, reg / jnz, where only test and jnz fuse, which causes wasted cycles. That extra test is completely bonkers because the sub (which should actually be a dec) already sets the zero flag! See RSP-37745.
Another missed opportunity for loop optimization that affects both win32 and win64 is letting the compiler create a loop that counts from -count to -1. This is another optimization technique where you grab the position after the last element and then index into it. This way, if you don't need the index variable itself for anything other than indexing, you only need 2 registers: one points to right after the last element, and the loop just needs the nicely fusing inc reg, jnz.
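A hedged sketch of two of the manual techniques described above: copying an array field into a local pointer before the loop, and counting from -count up to 0 against a pointer just past the last element. The class `TBuffer` and its method names are hypothetical and are not the code the post links to. Whether the compiler actually emits the tighter loop for either form has to be checked in the generated assembly.

```pascal
program HoistAndNegIndex;
{$APPTYPE CONSOLE}
{$POINTERMATH ON} // allows indexing a typed pointer like an array

type
  TBuffer = class // hypothetical example class
  private
    FItems: TArray<Integer>;
  public
    constructor Create(Count: NativeInt);
    function SumHoisted: Int64;
    function SumNegativeIndex: Int64;
  end;

constructor TBuffer.Create(Count: NativeInt);
var
  I: NativeInt;
begin
  inherited Create;
  SetLength(FItems, Count);
  for I := 0 to Count - 1 do
    FItems[I] := I + 1;
end;

function TBuffer.SumHoisted: Int64;
var
  Items: PInteger;
  I: NativeInt; // NativeInt index: no widening needed on Win64
begin
  Result := 0;
  Items := Pointer(FItems); // read the field once; nil if the array is empty
  for I := 0 to High(FItems) do
    Inc(Result, Items[I]);
end;

function TBuffer.SumNegativeIndex: Int64;
var
  PastEnd: PInteger;
  I: NativeInt;
begin
  Result := 0;
  I := -NativeInt(Length(FItems));
  if I = 0 then
    Exit;
  PastEnd := Pointer(FItems);
  Inc(PastEnd, Length(FItems)); // now points just past the last element
  repeat
    Inc(Result, PastEnd[I]); // I runs from -Count up to -1
    Inc(I);
  until I = 0;
end;

var
  Buf: TBuffer;
begin
  Buf := TBuffer.Create(5);
  try
    Writeln('Hoisted sum:        ', Buf.SumHoisted);
    Writeln('Negative-index sum: ', Buf.SumNegativeIndex);
  finally
    Buf.Free;
  end;
end.
```

Both methods should report the same sum; the point of the sketch is the loop shape, with the loop counter doubling as the index so that a single increment and a zero test control the loop.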
Pat Foley, posted November 9, 2023 (edited; edit note: "Corrected more for sizeof(theOffset)"):
2 hours ago, PaulM117 said: "PInteger(@IntArr[0]"
Shouldn't that be PNativeInt(@ to allow either 32 or 64 compiling?
The first 12 compile for me was 11K bigger, but an additional recompile yielded a 45K smaller exe*. The displays show one NAN and blank "TokenWindow:bds.exe".
*Did remove unused Controls from a package though.
Tommi Prami, posted November 10, 2023:
21 hours ago, ŁukaszDe said: "@PaulM117 If size of exe file is important for you, try Upx https://upx.github.io. For my project, Upx made 12MB exe file from 42 MB."
How much overhead will this add at runtime?
David Heffernan, posted November 10, 2023:
31 minutes ago, Tommi Prami said: "How much this will add overhead at runtime?"
It's presumably a small overhead at process creation.
Kas Ob., posted November 10, 2023:
@Stefan Glienke Right on point there.
I want to add one thing about inc/dec vs add/sub 1: neither is inherently faster or slower, but there is a side effect that can sometimes make a difference in speed. The effect comes from the competition between two CPU features, out-of-order execution (OoOE) and speculative execution (SE). Older CPUs rely mostly on OoOE and little on SE; over the generations CPUs have come to depend more on SE, as it gains more than OoOE. Both share part (or all, I have no idea) of the OoOE window, the buffer that also handles the re-order buffer (ROB), while SE has an extra window/buffer to handle all branches/variations of execution. These variations come not only from jumps but also from calculations: for adc, for example, the result has two variants depending on the carry (C) flag.
And this is the real difference: inc/dec leave the C flag unchanged, while add/sub 1 overwrite it. inc/dec save bytes, so more instructions fit into the OoOE window, but because the C flag carries over from earlier instructions, SE has to keep the whole following block of instructions dependent on it, which shortens how far ahead it can execute.
This effect can be seen when the C flag is checked in a loop, and also when the loop is bigger than the OoOE window. On my Sandy Bridge that window size is 168 bytes and SE is not playing any role; on more modern CPUs that window is more than 224 bytes. I couldn't find a table or list comparing many CPUs, but I observed this with big loops (more than ~180 bytes of instructions) running on modern Xeons, where add/sub were faster by ~5%, against the same test on my Sandy Bridge, where add/sub was slower by ~1%.
Alex7691, posted January 19 (edited):
On 11/9/2023 at 9:25 AM, Stefan Glienke said: "It better be - I spent quite some time on it. FWIW it was already introduced in 11.3"
I ran some benchmark applications this week and found that the servers built with Delphi 12 are 13-16% faster than the same servers built with Delphi 10.4. I can't currently pinpoint the exact source of the performance gain, but I guess that Move() and others are playing an important part in it. Anyway, very well done! Cheers,