Mahdi Safsafi 225 Posted November 22, 2020 (edited) 3 hours ago, Stefan Glienke said: Especially since one of its selling points is "it compiles to native code" - if that native code is garbage for modern CPUs because its written like in the 90s that's kinda poor. Agree ! One thing I noticed, people complain too much about compilation time(which could be a principal factor that influences EMB decision about optimization) ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc. Edited November 22, 2020 by Mahdi Safsafi 1 Share this post Link to post
Stefan Glienke 2002 Posted November 22, 2020 5 minutes ago, Mahdi Safsafi said: One thing I noticed, people complain too much about compilation time ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc. Why can't we have both - a fast compilation generating debug friendly non optimized code and one that churns a little longer and emits those juicy optimizations. Anyhow the current slowliness in the compiler comes from sloppy code in the compiler and not because it does so many amazing things. 3 Share this post Link to post
Lars Fosdal 1791 Posted November 23, 2020 I certainly would not mind a Delphi for .NET - but it would need to plug into the rest of the .NET world, and not try to take it on alone. The last time they tried, MS did not play nice, and the other company that did an Object Pascal for .NET - well, they sort of went rogue. There is a lot of Azure and .NET in my future, and I believe that I will have to leave Delphi behind for C#. But - perhaps is Delphi better off as a compiler for native code on x86/x64/ARM64 on various platforms. Share this post Link to post
Guest Posted November 23, 2020 The thing is we already here and and Delphi IDE does and can utilize LLVM, so why on earth they neutered it, instead of using it right ! Just was looking for ideas then found some C code and decided to check how different compiler handle loop unrolling on their own, stopped checking after seeing the result of CLang 10.0 , the generated assembly made me very angry. Have a look for yourself https://godbolt.org/z/1cva96 LLVM can do magic like the above, and also can do unthinkable stuff, that should be the focus, if not for full project then just parts where developer deemed crucial, that is more useful than a linker, or shopping for $1000 of free software to bundle. Share this post Link to post
Mahdi Safsafi 225 Posted November 24, 2020 @Stefan Glienke @Kas Ob. Got some time and tried to do full simd ... The result is awesoooome !!! Can we go further ? yep (using ymm but this is not gonna be easy as Delphi doesn't support it yet) type TChar4 = array [0 .. 3] of Char; PChar4 = ^TChar4; const { Source Data Format : Imm8[1:0] } DF_UNSIGNED_BYTES = 0; DF_UNSIGNED_WORDS = 1; DF_SIGNED_BYTES = 2; DF_SIGNED_WORDS = 3; { Aggregation Operation : Imm8[3:2] } AGGREGATION_OP_EQUAL_ANY = 0 shl 2; AGGREGATION_OP_RANGES = 1 shl 2; AGGREGATION_OP_EQUAL_EACH = 2 shl 2; AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2; { Polarity : Imm8[5:4] } POLARITY_POSITIVE = 0 shl 4; POLARITY_NEGATIVE = 1 shl 4; POLARITY_MASKED_POSITIVE = 2 shl 4; POLARITY_MASKED_NEGATIVE = 3 shl 4; { Output Selection : Imm8[6] } OS_LSI = 0 shl 6; OS_MSI = 1 shl 6; OS_BIT_MASK = 0 shl 6; OS_BYTE_WORD_MASK = 1 shl 6; const [Align(16)] AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31); RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000; TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111', '????', '????', '????', '????', '????', '????', '????', '????', '????', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean; asm push ebx push esi push edi mov ebx, ecx jz @@Empty lea esi, [TableData ] mov edi, ecx and ebx, 7 // trailing shr edi, 3 // number of simd-run jz @@HandleTrailing // too small to be done using simd movdqa xmm0, [AndMaskData] movdqa xmm1, [RangeData ] @@SimdLoop: movdqu xmm3, [eax] { check if its a valid hex } pcmpistri xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI ja @@Process { either end (null terminated) or invalid char } test Word [eax + ecx *2], -1 // is NT ? => continue jnz @@InvalidChar @@Process: pand xmm3, xmm0 // and each element with 31 pxor xmm2, xmm2 { --- first four chars --- } movdqa xmm4, xmm3 { first two char } punpcklbw xmm4, xmm2 // unpack low data pcmpeqq xmm2, xmm2 // generate mask // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes db $c4, $e2, $e9, $90, $2c, $e6 // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2 movdqu [edx], xmm5 // store to result { second two char } pshufd xmm4, xmm4, $0E // move next two elements to low pcmpeqq xmm2, xmm2 // gather two TChar4 db $c4, $e2, $e9, $90, $2c, $e6 // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2 movdqu [edx+16], xmm5 { --- last four chars --- } { first two char } pxor xmm2, xmm2 punpckhbw xmm3, xmm2 // unpack high data pcmpeqq xmm2, xmm2 db $c4, $e2, $e9, $90, $2c, $de // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2 movdqu [edx+32], xmm5 { second two char } pshufd xmm3, xmm3, $0E pcmpeqq xmm2, xmm2 db $c4, $e2, $e9, $90, $2c, $de // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2 movdqu [edx+48], xmm5 add eax, 16 add edx, 16 * 4 dec edi jnz @@SimdLoop test ebx, ebx jz @@NoTrailing @@HandleTrailing: mov ecx, ebx @@LegacyLoop: movzx edi, word [eax] lea ebx, [edi - 48] cmp ebx, 9 jbe @@Copy lea ebx, [edi - 65] cmp ebx, 5 jbe @@Copy lea ebx, [edi - 97] cmp ebx, 8 ja @@InvalidChar @@Copy: and edi, 31 movq xmm0, [esi + edi * 8] movq [edx], xmm0 add eax, 2 add edx, 8 dec ecx jnz @@LegacyLoop @@NoTrailing: mov word[edx], 0 // NT. @@Empty: mov eax, True @@End: pop edi pop esi pop ebx ret @@InvalidChar: xor eax, eax jmp @@End end; function HexToBinGhost(const Value: string): string; begin SetLength(Result, Length(Value) * 4); if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then exit; raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]); end; 1 Share this post Link to post
Stefan Glienke 2002 Posted November 24, 2020 4 minutes ago, Mahdi Safsafi said: The result is awesoooome !!! Nice, now make it run on multiple cores 😅 3 Share this post Link to post
Tom F 83 Posted November 24, 2020 (edited) @David Heffernan I am surprised to see you using a pointer. I've always felt that one of Delphi's strengths was the diminished need to use pointers, since pointers introduce the risk of weird, unpredictable errors. I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ? Edited November 24, 2020 by Tom F Share this post Link to post
Guest Posted November 24, 2020 @Mahdi Safsafi I think it can be far better with SIMD 1) use such barebone to remove the extra call dependency function HexToBinBareBone(const Value:string):string; procedure ToRaiseError(const Value:string); begin raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]); end; asm test eax,eax // Empty input string -> Empty result jz @Exit push edi //.. push edx // Value : string in stack , we keep this at last so we can access it with [esp] push eax // Value : string in stack , we keep this at last so we can access it with [esp+4] mov edx , [eax-4] // get length shl edx , 2 mov eax , [esp+4] mov ecx , offset system.@UStrSetLength call ecx mov edx , [esp+4] // restore result pointer mov edx , [edx] // edx -> result[1] mov eax , [esp] .. mov [edx],$00410042 // testing .. pop edx pop eax pop edi @Exit: ret @InvalidChar: pop edx pop eax // we need eax value to be the parameter passed to show the error message right pop edi jmp ToRaiseError end; 2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking. 3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here ja @@Process { either end (null terminated) or invalid char } test Word [eax + ecx *2], -1 // is NT ? => continue jnz @@InvalidChar 4) You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512, couldn't run it, but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency 5) When i wrote my MMX version i thought how i miss the shuffle instruction that does exist with SSEx, but when i wrote XMM i forgot to use 🙂 , so the idea instead of manipulating a unicode char better to translate the Ansi equivalent (one byte) then generate the binary bit string one byte per char too, after that unpack them using shuffle to memory then interleave the zeros, such interleaving with two unpack, three SIMD instruction that will be executed in 1 cycle. 6) To convert from hex char to decimal byte value, a more simpler and faster algorithm can performed, as to subtract the ord('0') then you will have the mask value for those with 'FF' then just AND them, compare after subtraction for letters again for first char , i think you got the idea .... at last just Or them together and you will have your decimal values. I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry. Share this post Link to post
David Heffernan 2345 Posted November 24, 2020 2 hours ago, Tom F said: I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ? This code is very easy to verify that it is safe. For more complex code, that may not be the case, and so pointers could be risky. I was using pointers here for performance reasons. Copy a bunch of characters in one go rather than character by character. Others showed much better ways to do it than my initial attempt. 1 Share this post Link to post
Mahdi Safsafi 225 Posted November 25, 2020 @Kas Ob. Quote 2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking. Agree ! Just for clarification : what I did is promoting a legacy algorithm to use SIMD (the algorithm used in SIMD is the same used in the legacy version) just to demonstrate the benefic of going full SIMD. Quote 3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here No ! 1- The explicit version is slower than the implicit. You can check that yourself. 2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers). Quote You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512 No. I didn't used any AVX512 (my CPU doesn't even support it). All what I'm using is a single instruction vpgatherdq from AVX2. The rest is SSEx. Quote but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency You gave a link and apparently you didn't read it all 🙂 -1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit. -2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...). Quote 5) ... $ Nice idea. Working on byte instead of word and shuffling at the end probably will yield better result. Quote 6) ... $ Nice idea too. Quote I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry. Totally Ok ... Don't worry 🙂 Share this post Link to post
Guest Posted November 25, 2020 (edited) and now, my code can be used or not? NOTE: 100.000.000 it's enought! EOUTMemory! --- Im using 2 Array dynamic :)))) Intel i7 4770K (4 x 8 cpus at stock) RAM 16GB at 1600MHz test in DEBUG mode unit uFormMain; interface uses Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics, Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls; type TForm1 = class(TForm) btn_Creating_Hex_Randoms: TButton; btn_Convert_HexToBin_Values: TButton; mmMyMemoLog: TMemo; edtTotalArraysToHexValues: TEdit; edtCreatingHexValuesRandomic: TEdit; Label1: TLabel; edtConvertingHexValueToBin: TEdit; btn_Showing_HexValues_In_Arrays: TButton; edtShowingHexValuesAndBinInArrays: TEdit; btn_STOP_ME_PLEASE: TButton; procedure btn_Creating_Hex_RandomsClick(Sender: TObject); procedure btn_Convert_HexToBin_ValuesClick(Sender: TObject); procedure FormCreate(Sender: TObject); procedure btn_Showing_HexValues_In_ArraysClick(Sender: TObject); procedure btn_STOP_ME_PLEASEClick(Sender: TObject); private { Private declarations } public { Public declarations } end; var Form1: TForm1; implementation {$R *.dfm} uses System.Diagnostics; const lTotalHexGenerated: integer = 1000000; // 1.000.000 Hex values for tests!!! lHexChars: array [0 .. 15] of char = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'); lBinValues: array [0 .. 15] of Ansistring = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111'); var lArraysWithMyHexGenerated: array of string; lArraysWithMyBinGenerated: array of string; // lMyStopWatch: TStopwatch; lCounter : integer; function fncMyHexToBin(const lHexValue: string): string; var lEachHexChar: char; begin Result := ''; // for lEachHexChar in lHexValue do // now, ONLY HexChars in const it should be accepted, as expected! But, Im not verifying here for while! // // now is not necessary verify the exception... for while!!! // try Result := Result + lBinValues[Pos(lEachHexChar, lHexChars) - 1]; // except // case the "char" is not found, we have a "AV"! then.... doesnt matter for us! // end; end; procedure TForm1.btn_Creating_Hex_RandomsClick(Sender: TObject); var i : integer; z : integer; lNewHexString: string; begin lNewHexString := ''; // if (high(lArraysWithMyHexGenerated) = -1) then begin ShowMessage('lArraysWithMyHexGenerated = 0 values generated'); exit; end; // mmMyMemoLog.Clear; // Randomize; // lMyStopWatch := TStopwatch.StartNew; // try // for i := 0 to Pred(lTotalHexGenerated) do begin for z := 1 to 8 do lNewHexString := lNewHexString + lHexChars[(Random(high(lHexChars)) + 1)]; // lArraysWithMyHexGenerated[i] := lNewHexString; // lNewHexString := ''; // // Memo1.Lines.Add(i.ToString + ' - ' + lArraysWithMyHexGenerated[i]); end; // finally lMyStopWatch.Stop; // edtCreatingHexValuesRandomic.Text := Format('[%d] was created in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]); end; end; procedure TForm1.btn_Showing_HexValues_In_ArraysClick(Sender: TObject); var i: integer; begin if (high(lArraysWithMyHexGenerated) = -1) then begin ShowMessage('lArraysWithMyHexGenerated = 0 values generated'); exit; end; // mmMyMemoLog.Clear; // lMyStopWatch := TStopwatch.StartNew; // try // for lCounter := 0 to Pred(lTotalHexGenerated) do begin mmMyMemoLog.Lines.Add(lCounter.ToString + ' - ' + lArraysWithMyHexGenerated[lCounter] + ' - ' + lArraysWithMyBinGenerated[lCounter]); // Application.ProcessMessages; // if (lCounter = lTotalHexGenerated) then // forcing a "stop!" exit; end; // finally lMyStopWatch.Stop; // edtShowingHexValuesAndBinInArrays.Text := Format('[%d] Showing Hex and Bin on MemoLog in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]); end; end; procedure TForm1.btn_STOP_ME_PLEASEClick(Sender: TObject); begin lCounter := lTotalHexGenerated; // forcing a "stop!" end; procedure TForm1.FormCreate(Sender: TObject); begin edtTotalArraysToHexValues.Text := FormatFloat('###,###,###,###,##0', high(lArraysWithMyHexGenerated) + 1); end; procedure TForm1.btn_Convert_HexToBin_ValuesClick(Sender: TObject); var i: integer; begin if (high(lArraysWithMyHexGenerated) = -1) then begin ShowMessage('lArraysWithMyHexGenerated = 0 values generated'); exit; end; // lMyStopWatch := TStopwatch.StartNew; // try // for i := 0 to Pred(lTotalHexGenerated) do begin lArraysWithMyBinGenerated[i] := fncMyHexToBin(lArraysWithMyHexGenerated[i]); // // Memo1.Lines.Add(i.ToString + ' - ' + fncMyHexToBin(lArraysWithMyHexGenerated[i])); end; // finally lMyStopWatch.Stop; // edtConvertingHexValueToBin.Text := Format('[%d] HexString was converted in Bin in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]); end; end; initialization ReportMemoryLeaksOnShutdown := true; // SetLength(lArraysWithMyHexGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception! // SetLength(lArraysWithMyBinGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception! finalization end. hug Edited November 25, 2020 by Guest Share this post Link to post
Guest Posted November 25, 2020 6 hours ago, Mahdi Safsafi said: No ! 1- The explicit version is slower than the implicit. You can check that yourself. 2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers). It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp But 1) The gain from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction. 2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all. Just an idea that worth investigating. 6 hours ago, Mahdi Safsafi said: You gave a link and apparently you didn't read it all 🙂 -1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit. -2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...). What i am missing here? I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512, what CPU do you have ? I read that but what i missed that it the xmm version of that instruction in the reference page, my bad, sorry for that. Share this post Link to post
David Heffernan 2345 Posted November 25, 2020 4 hours ago, emailx45 said: my code can be used or not? You can extend my benchmark program in earlier posts to see how it performs. 1 Share this post Link to post
Mahdi Safsafi 225 Posted November 25, 2020 3 hours ago, Kas Ob. said: It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp But 1) The gain from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction. 2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all. Just an idea that worth investigating. 1- I told you its slower ! Using explicit version ... Will just make things crappy. 2- The purpose of xmm is not to store general purpose register ! Doing that and inside a loop is definitely an anti-pattern because you need to save and restore back the register and you need more than one instruction to just save one register which means that you are just going to ruin your cache because xmm instruction usually are long in opcodes (prefixes/vex/evex) ! 3- A possibility to negate ja @@Process to jna @@CheckInvalidCharOrNT exists and would be much friendly for the static predictor (without the need to save any register) : pcmpistri xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI jna @@CheckInvalidCharOrNT // in most time jump is not taken. @@Process: ... @@CheckInvalidCharOrNT: { either end (null terminated) or invalid char } test Word [eax + ecx *2], -1 // is NT ? => continue jnz @@InvalidChar jmp @@Process Quote I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512 Here is the one that I used AVX2: https://www.felixcloutier.com/x86/vpgatherdq:vpgatherqq SandyBridge does not support AVX2. Quote what CPU do you have ? Not better than yours ! A Haswell ! Share this post Link to post
Stefan Glienke 2002 Posted November 25, 2020 Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code. https://xania.org/201602/bpu-part-one Share this post Link to post
Mahdi Safsafi 225 Posted November 25, 2020 2 hours ago, Stefan Glienke said: Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code. https://xania.org/201602/bpu-part-one Very interesting (specially the results). Although Intel in its guideline recommends : Quote Branch Prediction Optimization: Arrange code to be consistent with the static branch prediction algorithm. Share this post Link to post
Stefan Glienke 2002 Posted November 25, 2020 (edited) 24 minutes ago, Mahdi Safsafi said: Branch Prediction Optimization: Arrange code to be consistent with the static branch prediction algorithm. Exactly - that's why I just recently rearranged some of my code from this pattern (which I personally like very much for its readability and less indention): if not check then SomeErrorMessageStuff/raise/exit Usual stuff; to: if check then begin do usual stuff optionally exit end else Error stuff With some noticable improvements. Not only will the common part be on the fallthrough but also it avoids unnecessary register preserving when you have an error raising subroutine that will never return which the compiler does not know of. I wish Delphi would have something like noreturn Edited November 25, 2020 by Stefan Glienke 1 Share this post Link to post
Mahdi Safsafi 225 Posted November 26, 2020 @Stefan Glienke Quote (which I personally like very much for its readability and less indention) Bad habit dies hard 🙂 Quote With some noticable improvements. Indeed ! Many may consider it as a premature optimization ... but in fact in real app, it should give a noticeable difference. Mostly you can't judge micro-optimization using a toy/simple benchmark because those do not tell the truth all the time. Since you're interested in this subject ... I'm going to spin it a little bit and introduce switch/case statement. Look how Delphi compiler generated an ugly (anti-pattern) code. And see how the same msvc compiler generated a friendly codes ! // Delphi {$O+} function foo(I: Integer): Integer; begin case I of 0: Exit(random(255)); 1: Exit(3); 2: Exit(4); 3: Exit(5); 4: Exit(6); 5: Exit(7); else Exit(0); end; end; // --- asm --- { 0041C564 55 push ebp 0041C565 8BEC mov ebp,esp Test.dpr.14: case I of 0041C567 83F805 cmp eax,$05 0041C56A 774E jnbe $0041c5ba 0041C56C FF248573C54100 jmp dword ptr [eax*4+$41c573] // unaligned jmp-table 0041C573 8BC5 mov eax,ebp // compiler inserted jmp-table just after the branch !!! 0041C575 41 inc ecx 0041C576 0097C541009E add [edi-$61ffbe3b],dl 0041C57C C54100 lds eax,[ecx+$00] 0041C57F A5 movsd 0041C580 C54100 lds eax,[ecx+$00] 0041C583 AC lodsb 0041C584 C54100 lds eax,[ecx+$00] 0041C587 B3C5 mov bl,$c5 0041C589 41 inc ecx 0041C58A 00B8FF000000 add [eax+$000000ff],bh 0041C590 E81F82FEFF call Random 0041C595 5D pop ebp 0041C596 C3 ret Test.dpr.18: Exit(3); 0041C597 B803000000 mov eax,$00000003 0041C59C 5D pop ebp 0041C59D C3 ret Test.dpr.20: Exit(4); 0041C59E B804000000 mov eax,$00000004 0041C5A3 5D pop ebp 0041C5A4 C3 ret Test.dpr.22: Exit(5); 0041C5A5 B805000000 mov eax,$00000005 0041C5AA 5D pop ebp 0041C5AB C3 ret Test.dpr.24: Exit(6); 0041C5AC B806000000 mov eax,$00000006 0041C5B1 5D pop ebp 0041C5B2 C3 ret Test.dpr.26: Exit(7); 0041C5B3 B807000000 mov eax,$00000007 0041C5B8 5D pop ebp 0041C5B9 C3 ret Test.dpr.28: Exit(0); 0041C5BA 33C0 xor eax,eax Test.dpr.30: end; 0041C5BC 5D pop ebp 0041C5BD C3 ret } // godbolt msvc x86 int foo(int i) { #define Exit(x) return x switch (i) { case 0: Exit(rand()); // just to prevent c optimization case 1: Exit(3); case 2: Exit(4); case 3: Exit(5); case 4: Exit(6); case 5: Exit(7); default: Exit(0); } } // --- asm --- /* int foo(int) PROC ; foo push ebp mov ebp, esp push ecx mov eax, DWORD PTR _i$[ebp] mov DWORD PTR tv64[ebp], eax cmp DWORD PTR tv64[ebp], 5 ja SHORT $LN10@foo mov ecx, DWORD PTR tv64[ebp] jmp DWORD PTR $LN12@foo[ecx*4] // aligned jmp-table // fall through !!! $LN4@foo: call _rand jmp SHORT $LN1@foo $LN5@foo: mov eax, 3 jmp SHORT $LN1@foo $LN6@foo: mov eax, 4 jmp SHORT $LN1@foo $LN7@foo: mov eax, 5 jmp SHORT $LN1@foo $LN8@foo: mov eax, 6 jmp SHORT $LN1@foo $LN9@foo: mov eax, 7 jmp SHORT $LN1@foo $LN10@foo: xor eax, eax $LN1@foo: mov esp, ebp pop ebp ret 0 npad 2 // alignement // compiler inseted jmp-table at the end and its aligned !: $LN12@foo: DD $LN4@foo DD $LN5@foo DD $LN6@foo DD $LN7@foo DD $LN8@foo DD $LN9@foo int foo(int) ENDP ; foo */ Share this post Link to post
Alexander Elagin 143 Posted November 26, 2020 FreePascal version of the same function in mode O1 (quick optimization): function foo(I: Integer): Integer; begin case I of 0: Exit(random(255)); 1: Exit(3); 2: Exit(4); 3: Exit(5); 4: Exit(6); 5: Exit(7); else Exit(0); end; end; FPC_Test.lpr:50 begin 00000001000018C0 55 push %rbp 00000001000018C1 4889e5 mov %rsp,%rbp 00000001000018C4 488d6424d0 lea -0x30(%rsp),%rsp 00000001000018C9 894df8 mov %ecx,-0x8(%rbp) FPC_Test.lpr:51 case I of 00000001000018CC 89c8 mov %ecx,%eax 00000001000018CE 85c9 test %ecx,%ecx 00000001000018D0 0f8c6e000000 jl 0x100001944 <FOO+132> 00000001000018D6 85c0 test %eax,%eax 00000001000018D8 741e je 0x1000018f8 <FOO+56> 00000001000018DA 83e801 sub $0x1,%eax 00000001000018DD 7429 je 0x100001908 <FOO+72> 00000001000018DF 83e801 sub $0x1,%eax 00000001000018E2 7430 je 0x100001914 <FOO+84> 00000001000018E4 83e801 sub $0x1,%eax 00000001000018E7 7437 je 0x100001920 <FOO+96> 00000001000018E9 83e801 sub $0x1,%eax 00000001000018EC 743e je 0x10000192c <FOO+108> 00000001000018EE 83e801 sub $0x1,%eax 00000001000018F1 7445 je 0x100001938 <FOO+120> 00000001000018F3 eb4f jmp 0x100001944 <FOO+132> 00000001000018F5 0f1f00 nopl (%rax) FPC_Test.lpr:52 0: Exit(random(255)); 00000001000018F8 b9ff000000 mov $0xff,%ecx 00000001000018FD e8bed70000 callq 0x10000f0c0 <SYSTEM_$$_RANDOM$LONGINT$$LONGINT> 0000000100001902 8945f4 mov %eax,-0xc(%rbp) 0000000100001905 eb45 jmp 0x10000194c <FOO+140> 0000000100001907 90 nop FPC_Test.lpr:53 1: Exit(3); 0000000100001908 c745f403000000 movl $0x3,-0xc(%rbp) 000000010000190F eb3b jmp 0x10000194c <FOO+140> 0000000100001911 0f1f00 nopl (%rax) FPC_Test.lpr:54 2: Exit(4); 0000000100001914 c745f404000000 movl $0x4,-0xc(%rbp) 000000010000191B eb2f jmp 0x10000194c <FOO+140> 000000010000191D 0f1f00 nopl (%rax) FPC_Test.lpr:55 3: Exit(5); 0000000100001920 c745f405000000 movl $0x5,-0xc(%rbp) 0000000100001927 eb23 jmp 0x10000194c <FOO+140> 0000000100001929 0f1f00 nopl (%rax) FPC_Test.lpr:56 4: Exit(6); 000000010000192C c745f406000000 movl $0x6,-0xc(%rbp) 0000000100001933 eb17 jmp 0x10000194c <FOO+140> 0000000100001935 0f1f00 nopl (%rax) FPC_Test.lpr:57 5: Exit(7); 0000000100001938 c745f407000000 movl $0x7,-0xc(%rbp) 000000010000193F eb0b jmp 0x10000194c <FOO+140> 0000000100001941 0f1f00 nopl (%rax) FPC_Test.lpr:59 Exit(0); 0000000100001944 c745f400000000 movl $0x0,-0xc(%rbp) 000000010000194B 90 nop FPC_Test.lpr:61 end; 000000010000194C 8b45f4 mov -0xc(%rbp),%eax 000000010000194F 90 nop 0000000100001950 488d6500 lea 0x0(%rbp),%rsp 0000000100001954 5d pop %rbp 0000000100001955 c3 retq Judge it yourself Share this post Link to post
David Heffernan 2345 Posted November 26, 2020 13 hours ago, Stefan Glienke said: Exactly - that's why I just recently rearranged some of my code from this pattern (which I personally like very much for its readability and less indention): if not check then SomeErrorMessageStuff/raise/exit Usual stuff; to: if check then begin do usual stuff optionally exit end else Error stuff With some noticable improvements. Not only will the common part be on the fallthrough but also it avoids unnecessary register preserving when you have an error raising subroutine that will never return which the compiler does not know of. I wish Delphi would have something like noreturn This is not something we should be happy about though. The compiler should be doing this for us. Share this post Link to post
David Heffernan 2345 Posted November 26, 2020 On 11/25/2020 at 2:21 AM, emailx45 said: my code can be used or not? On 11/25/2020 at 7:19 AM, David Heffernan said: You can extend my benchmark program in earlier posts to see how it performs. When you do this, you find that your code is around more than 10 times slower than even the slowest versions in the benchmark, and that's before we've added code to handle errors which is currently missing from your code. Share this post Link to post
Guest Posted November 26, 2020 8 minutes ago, David Heffernan said: The compiler should be doing this for us. Can't agree more, only one problem though, How to open ticket for the compiler to ask it for better job ? and may be to point few internet resources to read. Share this post Link to post
David Heffernan 2345 Posted November 26, 2020 1 hour ago, Kas Ob. said: Can't agree more, only one problem though, How to open ticket for the compiler to ask it for better job ? and may be to point few internet resources to read. Submit a report to QualityPortal Share this post Link to post
Guest Posted November 26, 2020 1 minute ago, David Heffernan said: Submit a report to QualityPortal I was trying to be funny, like holding the compiler from his ear and point to some assembly code, and say bad boy !, bad boy! Share this post Link to post
Mahdi Safsafi 225 Posted November 26, 2020 (edited) 2 hours ago, Alexander Elagin said: FreePascal version of the same function in mode O1 (quick optimization): In the example you gave FPC didn't used a jump table ! instead it lowered the case statement into if statement. But adding more case, will make FPC use jump-table. I tested that my self on x86 (O4). Here is some comparison for jump-table: MSVC && FPC: - Both generate a friendly code for branch predictor and the fall through executes the first case. - Both generate aligned jump-table. - MSVC inserted the table at the function end. While FPC inserted the table in the const data region. - The first case likely is going to be in the same cache line as the jmp instruction. Delphi: - Generated an ugly pattern for the branch predictor and the fall through points to random instructions(because the table was immediately inserted after the jmp instruction -decompiled data-). Things can get worse if those random instructions contain data that gets decompiled into branch !!! - The table is not aligned ! This is just ugly and means that the data (jump location) can be split into two cache line !!! - The first case is unlikely going to be in the same cache line as the jmp instruction. Bravo FPC team ! 🙂 Edited November 26, 2020 by Mahdi Safsafi 1 Share this post Link to post