Lars Fosdal 1792 Posted November 21, 2020 I don't know, David, which is why I ask. Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 13 minutes ago, Lars Fosdal said: I don't know, David, which is why I ask. I think the way I phrased the question should help your intuition! 😉 It's always worth double checking, but a lookup table involves working the answer out before you compile. I would imagine it is hard for runtime code to ever beat that. Share this post Link to post
Guest Posted November 21, 2020 9 minutes ago, Lars Fosdal said: I don't know, David, which is why I ask. I will dissect it !, Here modified Mahdi version with lookup table to be handling one char, all build are 32bit with XE8 procedure HexToBin2(const HexValue: Char; Res: PChar); type TChar4 = array[0..3] of Char; PChar4 = ^TChar4; const Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); var P: PChar4 absolute Res; begin case HexValue of '0'..'9': P^ := Table1[HexValue]; 'a'..'f': P^ := Table2[HexValue]; 'A'..'F': P^ := Table2[Chr(Ord(HexValue) xor $20)]; else //raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; end; and here its assembly details Block Throughput: 7.26 Cycles Throughput Bottleneck: FrontEnd Loop Count: 22 Port Binding In Cycles Per Iteration: -------------------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------------------------- | Cycles | 1.8 0.0 | 1.7 | 5.0 5.0 | 5.0 4.0 | 6.0 | 1.7 | 1.7 | 5.0 | -------------------------------------------------------------------------------------------------- | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------------------------------- | 1* | | | | | | | | | mov edx, eax | 1 | | 0.3 | | | | 0.7 | | | add edx, 0xffffffd0 | 1* | | | | | | | | | sub dx, 0xa | 0*F | | | | | | | | | jb 0x16 | 1 | 0.3 | 0.5 | | | | 0.3 | | | add edx, 0xfffffff9 | 1* | | | | | | | | | sub dx, 0x6 | 0*F | | | | | | | | | jb 0x43 | 1 | 0.3 | 0.3 | | | | 0.5 | | | add edx, 0xffffffe6 | 1* | | | | | | | | | sub dx, 0x6 | 0*F | | | | | | | | | jb 0x1f | 1 | 0.3 | | | | | | 0.7 | | jmp 0x55 | 1 | | | 1.0 1.0 | | | | | | mov edx, dword ptr [rbp-0x4] | 1* | | | | | | | | | movzx eax, ax | 1 | | | | 1.0 1.0 | | | | | mov ecx, dword ptr [rax*8+0x41e480] | 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdx], ecx | 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [rax*8+0x41e484] | 2^ | | | | 1.0 | 1.0 | | | | mov dword ptr [rdx+0x4], ecx | 1 | 0.7 | | | | | | 0.3 | | jmp 0x3a | 1 | | | 1.0 1.0 | | | | | | mov edx, dword ptr [rbp-0x4] | 1* | | | | | | | | | movzx eax, ax | 1 | | | | 1.0 1.0 | | | | | mov ecx, dword ptr [rax*8+0x41e348] | 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdx], ecx | 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [rax*8+0x41e34c] | 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdx+0x4], ecx | 1 | 0.3 | | | | | | 0.7 | | jmp 0x1f | 1 | | | | 1.0 1.0 | | | | | mov edx, dword ptr [rbp-0x4] | 1 | | 0.7 | | | | 0.3 | | | xor ax, 0x20 | 1* | | | | | | | | | movzx eax, ax | 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [rax*8+0x41e348] | 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdx], ecx | 1 | | | | 1.0 1.0 | | | | | mov ecx, dword ptr [rax*8+0x41e34c] | 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdx+0x4], ecx Total Num Of Uops: 35 Here my base assembly instructions version Block Throughput: 7.21 Cycles Throughput Bottleneck: FrontEnd Loop Count: 22 Port Binding In Cycles Per Iteration: -------------------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------------------------- | Cycles | 5.1 0.0 | 4.8 | 1.3 0.6 | 1.4 0.4 | 3.0 | 4.9 | 5.2 | 1.3 | -------------------------------------------------------------------------------------------------- DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3) F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion occurred # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected X - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------------------------------- | 3^# | | 0.1 | | 0.4 | 1.0 | | | 0.6 | push rdi | 1* | | | | | | | | | mov edi, edx | 1* | | | | | | | | | movzx eax, ax | 1 | 0.3 | 0.5 | | | | 0.2 | | | mov ecx, 0x39 | 1 | 0.2 | 0.3 | | | | 0.5 | | | sub ecx, eax | 1 | 0.3 | | | | | | 0.7 | | sar ecx, 0x1f | 1 | 0.2 | 0.5 | | | | 0.3 | | | and ecx, 0x27 | 1 | 0.3 | 0.2 | | | | 0.5 | | | neg ecx | 1 | 0.2 | 0.3 | | | | 0.2 | 0.3 | | add eax, ecx | 1 | 0.3 | 0.2 | | | | 0.3 | 0.2 | | add eax, 0xffffffd0 | 1* | | | | | | | | | xor ecx, ecx | 1 | 0.3 | 0.2 | | | | 0.3 | 0.2 | | mov dx, 0x1 | 1 | 0.2 | 0.3 | | | | 0.2 | 0.3 | | test al, 0x4 | 1 | 0.3 | | | | | | 0.7 | | cmovnz cx, dx | 1 | 0.7 | | | | | | 0.3 | | shl ecx, 0x10 | 1 | | 0.7 | | | | 0.3 | | | test al, 0x8 | 1 | 0.3 | | | | | | 0.7 | | cmovnz cx, dx | 1 | | 0.3 | | | | 0.7 | | | add ecx, 0x300030 | 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.4 | mov dword ptr [rdi], ecx | 1* | | | | | | | | | xor ecx, ecx | 1 | 0.3 | 0.2 | | | | 0.5 | | | test al, 0x1 | 1 | 0.2 | | | | | | 0.8 | | cmovnz cx, dx | 1 | 0.8 | | | | | | 0.2 | | shl ecx, 0x10 | 1 | | 0.8 | | | | 0.2 | | | test al, 0x2 | 1 | 0.2 | | | | | | 0.8 | | cmovnz cx, dx | 1 | | 0.2 | | | | 0.8 | | | add ecx, 0x300030 | 2^ | | | 0.4 | 0.3 | 1.0 | | | 0.3 | mov dword ptr [rdi+0x4], ecx | 2^ | | | 0.6 0.6 | 0.4 0.4 | | | | | pop rdi Total Num Of Uops: 33 And this is my MMX one Block Throughput: 8.84 Cycles Throughput Bottleneck: Backend Loop Count: 22 Port Binding In Cycles Per Iteration: -------------------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------------------------- | Cycles | 9.0 0.0 | 4.5 | 1.0 1.0 | 1.0 1.0 | 1.0 | 9.0 | 4.5 | 1.0 | -------------------------------------------------------------------------------------------------- | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------------------------------- | 1* | | | | | | | | | movzx eax, ax | 1 | | 0.5 | | | | | 0.5 | | mov ecx, 0x39 | 1 | | 0.5 | | | | | 0.5 | | sub ecx, eax | 1 | | | | | | | 1.0 | | sar ecx, 0x1f | 1 | | 1.0 | | | | | | | and ecx, 0x27 | 1 | | 0.5 | | | | | 0.5 | | neg ecx | 1 | | 0.5 | | | | | 0.5 | | add eax, ecx | 1 | | 0.5 | | | | | 0.5 | | add eax, 0xffffffd0 | 1 | | | | | | 1.0 | | | movd mmx0, eax | 1 | 1.0 | | | | | | | | pxor mmx1, mmx1 | 1 | | | | | | 1.0 | | | punpckldq mmx0, mmx0 | 3 | | 0.5 | | | | 2.0 | 0.5 | | packssdw mmx0, mmx0 | 2^ | 1.0 | | 1.0 1.0 | | | | | | pand mmx0, qword ptr [rip+0x41e600] | 1 | 1.0 | | | | | | | | pcmpeqw mmx0, mmx1 | 2^ | 1.0 | | | 1.0 1.0 | | | | | psubw mmx0, qword ptr [rip+0x41e608] | 1 | | | | | | 1.0 | | | pshufw mmx0, mmx0, 0x1b | 2^ | | | | | 1.0 | | | 1.0 | movq qword ptr [rdx], mmx0 | 10 | 5.0 | 0.5 | | | | 4.0 | 0.5 | | emms Total Num Of Uops: 32 Analysis Notes: Backend allocation was stalled due to unavailable allocation resources. You can see the toll of using EMMS and without emms Block Throughput: 6.00 Cycles Throughput Bottleneck: Backend Loop Count: 22 Port Binding In Cycles Per Iteration: -------------------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------------------------- | Cycles | 4.0 0.0 | 4.0 | 1.0 1.0 | 1.0 1.0 | 1.0 | 5.0 | 4.0 | 1.0 | -------------------------------------------------------------------------------------------------- DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3) F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion occurred # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected X - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------------------------------- | 1* | | | | | | | | | movzx eax, ax | 1 | | 1.0 | | | | | | | mov ecx, 0x39 | 1 | | | | | | | 1.0 | | sub ecx, eax | 1 | | | | | | | 1.0 | | sar ecx, 0x1f | 1 | | 1.0 | | | | | | | and ecx, 0x27 | 1 | | 1.0 | | | | | | | neg ecx | 1 | | | | | | | 1.0 | | add eax, ecx | 1 | | 1.0 | | | | | | | add eax, 0xffffffd0 | 1 | | | | | | 1.0 | | | movd mmx0, eax | 1 | 1.0 | | | | | | | | pxor mmx1, mmx1 | 1 | | | | | | 1.0 | | | punpckldq mmx0, mmx0 | 3 | | | | | | 2.0 | 1.0 | | packssdw mmx0, mmx0 | 2^ | 1.0 | | 1.0 1.0 | | | | | | pand mmx0, qword ptr [rip+0x41e600] | 1 | 1.0 | | | | | | | | pcmpeqw mmx0, mmx1 | 2^ | 1.0 | | | 1.0 1.0 | | | | | psubw mmx0, qword ptr [rip+0x41e608] | 1 | | | | | | 1.0 | | | pshufw mmx0, mmx0, 0x1b | 2^ | | | | | 1.0 | | | 1.0 | movq qword ptr [rdx], mmx0 Total Num Of Uops: 22 Analysis Notes: Backend allocation was stalled due to unavailable allocation resources. Share this post Link to post
Guest Posted November 21, 2020 2 minutes ago, David Heffernan said: I would imagine it is hard for runtime code to ever beat that. I think, i just did. Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 On 11/19/2020 at 4:12 PM, Mahdi Safsafi said: @David Heffernan Few remarks about your code if you don't mind : 1- Its pointless to use string when characters are fixed in size ... Simply use static array of X char. 2- Its also pointless to calculate index when you already used a case ... Simply declare your array using char-range. In your case, compiler generated additional instructions to compute the index. function HexToBin2(const HexValue: string): string; type TChar4 = array [0 .. 3] of Char; PChar4 = ^TChar4; const Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); var HexDigit: Char; P: PChar4; begin SetLength(Result, Length(HexValue) * 4); P := PChar4(Result); for HexDigit in HexValue do begin case HexDigit of '0' .. '9': P^ := Table1[HexDigit]; 'a' .. 'f': P^ := Table2[HexDigit]; 'A' .. 'F': P^ := Table2[Chr(Ord(HexDigit) xor $20)]; else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(P); end; end; My benchmarking suggests that point 1 has no impact on performance, but point 2 does. Actually the suggestion made by @Kas Ob. to replace the Move in my code with PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^ is all you need to match the performance of your version. Which makes a lot of sense to me. Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 (edited) 40 minutes ago, Kas Ob. said: I think, i just did. I made this benchmark program, that wraps your single digit asm into a function that handles string input, and included error checking. If you skip the error checking then your asm code is faster. But if you include error checking, not so. How would you go about integrating your single digit code into a string level function with error checking? Also, I'd love to have a high level explanation as to why my reasoning that a lookup table should not be beaten is wrong. Or is it perhaps that the implementation of the lookup table could be done better? {$APPTYPE CONSOLE} uses System.SysUtils, System.Classes, System.Diagnostics; //******************************************** // Constants to tune the benchmark performance const HexStringLength = 64; InputDataCount = 512; IterationCount = 20000; //***************************************************************** // David Heffernan's routine, with Move replace by Int64 assignment {$DEFINE ReplaceMoveWithInt64Assign} function HexToBin(const HexValue: string): string; const BinaryValues: array [0..15] of string = ( '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111' ); var HexDigit: Char; HexDigitValue: Integer; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9': HexDigitValue := Ord(HexDigit) - Ord('0'); 'a'..'f': HexDigitValue := 10 + Ord(HexDigit) - Ord('a'); 'A'..'F': HexDigitValue := 10 + Ord(HexDigit) - Ord('A'); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; {$IFDEF ReplaceMoveWithInt64Assign} PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^; {$ELSE} Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char)); {$ENDIF} Inc(Ptr, 4); end; end; //********************* // Kas Ob's asm version procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar); asm push edi mov edi,edx // Get the decimal value of one Hex Char (= half byte) movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar xor ecx,ecx mov dx,$1 test al,4 cmovne cx,dx shl ecx,16 test al,8 cmovne cx,dx add ecx,$00300030 mov [edi],ecx xor ecx,ecx test al,1 cmovne cx,dx shl ecx,16 test al,2 cmovne cx,dx add ecx,$00300030 mov [edi+4],ecx pop edi end; function HexToBinAsm(const HexValue: string): string; var HexDigit: Char; HexDigitValue: Integer; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9','a'..'f','A'..'F': CharToBin_ASM32(HexDigit, Ptr); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(Ptr, 4); end; end; //*************** // Benchmark code function RandomHexDigit: Char; var Ordinal: Integer; begin Ordinal := Random(16); case Ordinal of 0..9: Result := Chr(Ord('0') + Ordinal); 10..15: Result := Chr(Ord('a') + Ordinal - 10); else raise Exception.Create(''); end; end; function RandomHexString(Length: Integer): string; var Index: Integer; begin SetLength(Result, Length); for Index := 1 to Length do Result[Index] := RandomHexDigit; end; procedure Benchmark; var Index, Iteration: Integer; sw: TStopwatch; HexStrings: TStringList; binaryString: string; begin HexStrings := TStringList.Create; try for Index := 0 to InputDataCount-1 do HexStrings.Add(RandomHexString(HexStringLength)); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to HexStrings.Count-1 do begin binaryString := HexToBin(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln(sw.ElapsedMilliseconds); finally HexStrings.Free; end; end; begin try Randomize; Benchmark; except on E: Exception do Writeln(E.ClassName, ': ', E.Message); end; Readln; end. EDIT: It turns out that this benchmark is possibly useless at the moment, because HexToBinAsm doesn't work correctly (almost certainly my fault). I need to work out what I've done wrong. EDIT2: No, it's fine, I was mistaken, HexToBinAsm above works correctly. Edited November 21, 2020 by David Heffernan Share this post Link to post
Guest Posted November 21, 2020 The XMM version is way better and faster than them all, 6 cycles in all, while MMX instruction are doing the work of full byte, means converting two HEX character with these instruction will be the same speed. procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar); const DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00); DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF); DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80); asm movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar movd xmm0, eax pxor xmm1, xmm1 movdqu xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE punpckldq xmm0, xmm0 packssdw xmm0, xmm0 pand xmm0, dqword ptr[DEC_TO_BIN_WORD_MASK] pcmpeqw xmm0, xmm1 psubw xmm0, xmm2 movdqu xmm3,DEC_TO_BIN_REVERSE_MASK PSHUFB xmm0, xmm3 // reverse the result movq qword ptr[HexBuffer], xmm0 end; Block Throughput: 6.05 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations) Port Binding In Cycles Per Iteration: --------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | --------------------------------------------------------------------------------------- | Cycles | 3.0 0.0 | 3.0 | 1.5 1.5 | 1.5 1.5 | 1.0 | 4.0 | 4.0 | 1.0 | --------------------------------------------------------------------------------------- | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 1* | | | | | | | | | | movzx eax, ax | 1 | 0.1 | 0.1 | | | | | 0.9 | | | mov ecx, 0x39 | 1 | 0.1 | 0.1 | | | | | 0.9 | | | sub ecx, eax | 1 | 0.3 | | | | | | 0.7 | | | sar ecx, 0x1f | 1 | 0.2 | 0.5 | | | | | 0.3 | | | and ecx, 0x27 | 1 | 0.1 | 0.3 | | | | | 0.6 | | | neg ecx | 1 | 0.3 | 0.6 | | | | | 0.2 | | | add eax, ecx | 1 | 0.3 | 0.3 | | | | | 0.5 | | | add eax, 0xffffffd0 | 1 | | | | | | 1.0 | | | | movd xmm0, eax | 1* | | | | | | | | | | pxor xmm1, xmm1 | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqu xmm2, xmmword ptr [rip+0x41e610] | 1 | | | | | | 1.0 | | | | punpckldq xmm0, xmm0 | 1 | | | | | | 1.0 | | | | packssdw xmm0, xmm0 | 2^ | 0.6 | 0.4 | 0.5 0.5 | 0.5 0.5 | | | | | | pand xmm0, xmmword ptr [rip+0x41e600] | 1 | 0.5 | 0.5 | | | | | | | | pcmpeqw xmm0, xmm1 | 1 | 0.6 | 0.4 | | | | | | | | psubw xmm0, xmm2 | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqu xmm3, xmmword ptr [rip+0x41e620] | 1 | | | | | | 1.0 | | | | pshufb xmm0, xmm3 | 2^ | | | | | 1.0 | | | 1.0 | CP | movq qword ptr [rdx], xmm0 Total Num Of Uops: 21 Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 UPDATED BENCHMARK: {$APPTYPE CONSOLE} uses System.SysUtils, System.Classes, System.Diagnostics; //******************************************** // Constants to tune the benchmark performance const HexStringLength = 64; InputDataCount = 512; IterationCount = 20000; //***************************************************************** // David Heffernan's routine, with Move replace by Int64 assignment {$DEFINE ReplaceMoveWithInt64Assign} function HexToBin(const HexValue: string): string; const BinaryValues: array [0..15] of string = ( '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111' ); var HexDigit: Char; HexDigitValue: Integer; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9': HexDigitValue := Ord(HexDigit) - Ord('0'); 'a'..'f': HexDigitValue := 10 + Ord(HexDigit) - Ord('a'); 'A'..'F': HexDigitValue := 10 + Ord(HexDigit) - Ord('A'); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; {$IFDEF ReplaceMoveWithInt64Assign} PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^; {$ELSE} Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char)); {$ENDIF} Inc(Ptr, 4); end; end; //*********************** // Kas Ob's asm32 version procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar); asm push edi mov edi,edx // Get the decimal value of one Hex Char (= half byte) movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar xor ecx,ecx mov dx,$1 test al,4 cmovne cx,dx shl ecx,16 test al,8 cmovne cx,dx add ecx,$00300030 mov [edi],ecx xor ecx,ecx test al,1 cmovne cx,dx shl ecx,16 test al,2 cmovne cx,dx add ecx,$00300030 mov [edi+4],ecx pop edi end; function HexToBinAsm32(const HexValue: string): string; var HexDigit: Char; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9','a'..'f','A'..'F': CharToBin_ASM32(HexDigit, Ptr); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(Ptr, 4); end; end; //********************* // Kas Ob's xmm version procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar); const DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00); DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF); DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80); asm movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar movd xmm0, eax pxor xmm1, xmm1 movdqu xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE punpckldq xmm0, xmm0 packssdw xmm0, xmm0 pand xmm0, dqword ptr[DEC_TO_BIN_WORD_MASK] pcmpeqw xmm0, xmm1 psubw xmm0, xmm2 movdqu xmm3,DEC_TO_BIN_REVERSE_MASK PSHUFB xmm0, xmm3 // reverse the result movq qword ptr[HexBuffer], xmm0 end; function HexToBinXmm(const HexValue: string): string; var HexDigit: Char; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9','a'..'f','A'..'F': CharToBin_XMM(HexDigit, Ptr); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(Ptr, 4); end; end; //*************** // Benchmark code function RandomHexDigit: Char; var Ordinal: Integer; begin Ordinal := Random(16); case Ordinal of 0..9: Result := Chr(Ord('0') + Ordinal); 10..15: Result := Chr(Ord('a') + Ordinal - 10); else raise Exception.Create(''); end; end; function RandomHexString(Length: Integer): string; var Index: Integer; begin SetLength(Result, Length); for Index := 1 to Length do Result[Index] := RandomHexDigit; end; procedure Benchmark; var Index, Iteration: Integer; sw: TStopwatch; HexStrings: TStringList; binaryString: string; begin HexStrings := TStringList.Create; try for Index := 0 to InputDataCount-1 do HexStrings.Add(RandomHexString(HexStringLength)); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to HexStrings.Count-1 do begin binaryString := HexToBin(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('Pascal lookup: ', sw.ElapsedMilliseconds); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to HexStrings.Count-1 do begin binaryString := HexToBinAsm32(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('asm32: ', sw.ElapsedMilliseconds); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to HexStrings.Count-1 do begin binaryString := HexToBinXmm(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('xmm: ', sw.ElapsedMilliseconds); finally HexStrings.Free; end; end; begin try Randomize; Benchmark; except on E: Exception do Writeln(E.ClassName, ': ', E.Message); end; Readln; end. Compiled on XE7, 32 bit, release configuration OUTPUT: Pascal lookup: 3743 asm32: 4396 EAccessViolation: Access violation at address 004D77D3 in module 'Project14.exe'. Read of address FFFFFFFF The AV is HexToBinXmm. Perhaps I misunderstood something. Share this post Link to post
Mike Torrettinni 198 Posted November 21, 2020 My benchmark shows that Mahdi's HexToBin2 is still faster then HexToBinAsm32. Actually, the longer the Hex string, the bigger difference between them - at 1024 hex chars string, HexToBinAsm32 double's the execution time: cHexRange = 16777216; cHexLen = 6; HexToBin2 = 818; HexToBinAsm32 = 1208; Diff: HexToBin2 = -32% cHexRange = 800000; cHexLen = 1024; HexToBin2 = 2035; HexToBinAsm32 = 4506; Diff: HexToBin2 = -54% You don't get the similar difference? Share this post Link to post
Stefan Glienke 2002 Posted November 21, 2020 3 hours ago, David Heffernan said: My benchmarking suggests that point 1 has no impact on performance, but point 2 does. Certainly not in a microbenchmark where probably everything fits into L1 cache. Not so in a real program where that array of string might be placed in different locations than the strings it contains. 2 Share this post Link to post
Mahdi Safsafi 225 Posted November 21, 2020 3 hours ago, David Heffernan said: My benchmarking suggests that point 1 has no impact on performance, but point 2 does. Benchmark does not always tell the full story 🙂 Unlike my code, Yours isn't cache friendly const BinaryValues: array [0..15] of string = ( '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111' ); Let's see what happens for x64 : You need 16 * SizeOf(Pointer) to store string reference = 16 * 8 = 128 bytes = 2 Cache Line. You need 16 * (4 * SizeOf(Char) + 2 byte(null terminated) + SizeOf(StrRec)) to store data = 16 *(4 * 2 + 2 + 16) = 416 bytes = 7 Cache Line. N.B: I supposed that compiler did a great job by placing your data in a continuous region. -------------------------------------------------------- You ended up consuming 416 + 128 = 544 bytes. You ended up consuming 9 Cache Line. type TChar4 = array[0..3] of Char; PChar4 = ^TChar4; const Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); I only need 16 * (4 * SizeOf(Char)) to store data = 16 * (4 * 2) = 128 bytes = 2 Cache Line. 3 Share this post Link to post
Guest Posted November 21, 2020 2 hours ago, David Heffernan said: The AV is HexToBinXmm. Perhaps I misunderstood something. My mistake i fixed one and forgot about the second, the compiler fooled/failed me, it is aligned access, to fix it do this Quote //movdqu xmm3,DEC_TO_BIN_REVERSE_MASK // with these movdqu xmm3,DEC_TO_BIN_WORD_MASK pand xmm0, xmm3 3 hours ago, David Heffernan said: I'd love to have a high level explanation as to why my reasoning that a lookup table should not be beaten is wrong. Or is it perhaps that the implementation of the lookup table could be done better? Will try to write for the second time, as the browser for some reason refreshed and erased what i wrote. Now to answer as simple and clear way as i can, your assumption (like many others) on how CPU works is outdated, i am talking about how instructions are executed, now to be more precise how they are decoded then executed, long story short, the CPU's that executed instructions in serial way by reading one by one are almost vanished from our time, CPU technology now grab a block of code, by code i mean block of data/bytes to execute it, most likely 16 bytes, then decode them in parallel at the same time then execute, so some will block as they need result, comparison, calculating address, access memory, access memory within the same cache line, access memory on the same page... many many things CPU utilize many tricks by cheating, and they introduced many technologies in that matter from out-of-order-execution to branch prediction.... Now we established that let see what happen with lookup tables vs calculating, in many cases lookup table are faster but in this case we should look at this instruction (which is essential in all lookup tables) mov ecx, dword ptr [rax*8+0x41e348] This in fact include slow microop which include multiplication and addition, hence this will take full cycle even when when reported as microfused with the following one mov dword ptr [rdx], ecx it is slow and will block all the block execution until both executed, the big problem in the second one is 2 full microops including 0|13|mov ecx, dword ptr [rax*8+0x41e480] : | | | | | 0|13| TYPE_LOAD (1 uops) : s---deeeew----R-------p | | | 0|14|mov dword ptr [rdx], ecx : | | | | | 0|14| TYPE_STOREDATA (1 uops) : A--------dw----R-------p | | | 0|14| TYPE_STOREADDRESS (1 uops) : A-------deeeew----R-------p| | | The general assumption that these will be have same execution time always is very wrong, and let me say it it will change based on the block of the instructions been decoded, and to be sure if you went and disabled all CPU enhancement, then lookup table will have higher chance to be faster. Now on other hand the inplace calculation is using only registers making the CPU have its full chance to decode instruction into small microops mostly are 1 for the first part the CPU managed to even execute them in fraction of a cycle, and with now memory access higher throughput can be achieved. But again it depends on the case, an example i have seen an AES encryption that use very big lookup tables that hindered the speed greatly, while without one it is slower than small table, one big table was affecting the cache that it run 5 times slower. In that Delphi generated code for the lookup table the slowness was coming form the jumps (branching) at first place then from access memory in both read and write. One more thing Intel itself doesn't recommend many things but refer to this https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf it does have very valuable information and recommendation for compiler builder and for software programmers These might be in interest for reading 3.4.1.4 Inlining, Calls and Returns 3.5.1.8 Clearing Registers and Dependency Breaking Idioms 3.6.4 Alignment 3.6.5.1 Store-to-Load-Forwarding Restriction on Size and Alignment (,, in particular) Quote Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data. 3.6.5.2 Store-forwarding Restriction on Data Availability anyway there is many great information and at first might shock you, because they against general assumption. I am no selling anything here, it is intel 🙂 Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 (edited) 13 minutes ago, Kas Ob. said: movdqu xmm3,DEC_TO_BIN_WORD_MASK pand xmm0, xmm3 I'm still seeing an AV here even with that change. It would be really nice to get this fixed and then I can update my benchmark program again. Edited November 21, 2020 by David Heffernan Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 41 minutes ago, Mahdi Safsafi said: Unlike my code, Yours isn't cache friendly I understand, thanks Share this post Link to post
Guest Posted November 21, 2020 procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar); const DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00); DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF); DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80); asm movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar movd xmm0, eax pxor xmm1, xmm1 movdqu xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE movdqu xmm3, DEC_TO_BIN_WORD_MASK movdqu xmm4,DEC_TO_BIN_REVERSE_MASK punpckldq xmm0, xmm0 packssdw xmm0, xmm0 pand xmm0, xmm3 pcmpeqw xmm0, xmm1 psubw xmm0, xmm2 PSHUFB xmm0, xmm4 // reverse the result movq qword ptr[HexBuffer], xmm0 end; And there are the result on Sandy Bridge Quote Pascal lookup: 4629 asm32: 5931 xmm: 5312 Notice that the XMM version is doing double the work so with some tweaks the conversion from Dec to Bin will be perform twice the speed out of the box, but will lose around 2 cycle per iteration for the extra char Hex to Dec conversion. Share this post Link to post
Stefan Glienke 2002 Posted November 21, 2020 (edited) Kinda pointless to limit the power of sse to handling single characters instead of simply processing multiple characters at once - especially since you now have a call inside the loop slowing stuff down significantly plus having to move the same stuff over and over into xmm2-4. Using simd should be 2-10times faster than the regular 1 char in a loop implementation Edited November 21, 2020 by Stefan Glienke 2 Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 (edited) Updated benchmark code: {$APPTYPE CONSOLE} uses System.SysUtils, System.Classes, System.Diagnostics; //******************************************** // Constants to tune the benchmark performance const HexStringLength = 64; InputDataCount = 512; IterationCount = 20000; //***************************************************************** // David Heffernan's routine, with Move replace by Int64 assignment {$DEFINE ReplaceMoveWithInt64Assign} function HexToBinHeff(const HexValue: string): string; const BinaryValues: array [0..15] of string = ( '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111' ); var HexDigit: Char; HexDigitValue: Integer; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9': HexDigitValue := Ord(HexDigit) - Ord('0'); 'a'..'f': HexDigitValue := 10 + Ord(HexDigit) - Ord('a'); 'A'..'F': HexDigitValue := 10 + Ord(HexDigit) - Ord('A'); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; {$IFDEF ReplaceMoveWithInt64Assign} PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^; {$ELSE} Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char)); {$ENDIF} Inc(Ptr, 4); end; end; //************************ // Mahdi Safsafi's routine {$DEFINE ReplaceMoveWithInt64Assign} function HexToBinMahdi(const HexValue: string): string; type TChar4 = array [0 .. 3] of Char; PChar4 = ^TChar4; const Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); var HexDigit: Char; P: PChar4; begin SetLength(Result, Length(HexValue) * 4); P := PChar4(Result); for HexDigit in HexValue do begin case HexDigit of '0' .. '9': P^ := Table1[HexDigit]; 'a' .. 'f': P^ := Table2[HexDigit]; 'A' .. 'F': P^ := Table2[Chr(Ord(HexDigit) xor $20)]; else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(P); end; end; {$IFDEF CPUX86} //*********************** // Kas Ob's asm32 version procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar); asm push edi mov edi,edx // Get the decimal value of one Hex Char (= half byte) movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar xor ecx,ecx mov dx,$1 test al,4 cmovne cx,dx shl ecx,16 test al,8 cmovne cx,dx add ecx,$00300030 mov [edi],ecx xor ecx,ecx test al,1 cmovne cx,dx shl ecx,16 test al,2 cmovne cx,dx add ecx,$00300030 mov [edi+4],ecx pop edi end; function HexToBinAsm32(const HexValue: string): string; var HexDigit: Char; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9','a'..'f','A'..'F': CharToBin_ASM32(HexDigit, Ptr); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(Ptr, 4); end; end; //********************* // Kas Ob's xmm version procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar); const DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00); DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF); DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80); asm movzx eax, HexChar mov ecx, 57 sub ecx, eax sar ecx, 31 and ecx, 39 neg ecx add eax, ecx add eax, - 48 // Produce 4 Chars presenting 4 bits of HexChar movd xmm0, eax pxor xmm1, xmm1 movdqu xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE movdqu xmm3, DEC_TO_BIN_WORD_MASK movdqu xmm4,DEC_TO_BIN_REVERSE_MASK punpckldq xmm0, xmm0 packssdw xmm0, xmm0 pand xmm0, xmm3 pcmpeqw xmm0, xmm1 psubw xmm0, xmm2 PSHUFB xmm0, xmm4 // reverse the result movq qword ptr[HexBuffer], xmm0 end; function HexToBinXmm(const HexValue: string): string; var HexDigit: Char; Ptr: PChar; begin SetLength(Result, Length(HexValue) * 4); Ptr := Pointer(Result); for HexDigit in HexValue do begin case HexDigit of '0'..'9','a'..'f','A'..'F': CharToBin_XMM(HexDigit, Ptr); else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(Ptr, 4); end; end; {$ENDIF} //*************** // Benchmark code function RandomHexDigit: Char; var Ordinal: Integer; begin Ordinal := Random(16); case Ordinal of 0..9: Result := Chr(Ord('0') + Ordinal); 10..15: Result := Chr(Ord('a') + Ordinal - 10); else raise Exception.Create(''); end; end; function RandomHexString(Length: Integer): string; var Index: Integer; begin SetLength(Result, Length); for Index := 1 to Length do Result[Index] := RandomHexDigit; end; procedure TestCorrectness; var Index: Integer; HexStr, BinStr: string; begin for Index := 0 to $fffff do begin HexStr := IntToHex(Index, 6); BinStr := HexToBinHeff(HexStr); if BinStr<>HexToBinMahdi(HexStr) then raise Exception.Create('incorrect implementation'); if BinStr<>HexToBinAsm32(HexStr) then raise Exception.Create('incorrect implementation'); if BinStr<>HexToBinXmm(HexStr) then raise Exception.Create('incorrect implementation'); end; end; procedure Benchmark; var Index, Iteration: Integer; sw: TStopwatch; HexStrings: TArray<string>; binaryString: string; begin SetLength(HexStrings, InputDataCount); for Index := 0 to InputDataCount-1 do HexStrings[Index] := RandomHexString(HexStringLength); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to InputDataCount-1 do begin binaryString := HexToBinHeff(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('Pascal lookup (Heff): ', sw.ElapsedMilliseconds); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to InputDataCount-1 do begin binaryString := HexToBinMahdi(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('Pascal lookup (Mahdi): ', sw.ElapsedMilliseconds); {$IFDEF CPUX86} sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to InputDataCount-1 do begin binaryString := HexToBinAsm32(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('asm32: ', sw.ElapsedMilliseconds); sw := TStopwatch.StartNew; for Iteration := 0 to IterationCount-1 do for Index := 0 to InputDataCount-1 do begin binaryString := HexToBinXmm(HexStrings[Index]); binaryString := ''; // force a reallocation of the string every iteration of this loop end; Writeln('xmm: ', sw.ElapsedMilliseconds); {$ENDIF} end; begin try Randomize; TestCorrectness; Benchmark; except on E: Exception do Writeln(E.ClassName, ': ', E.Message); end; Readln; end. OUTPUT: Pascal lookup (Heff): 3717 Pascal lookup (Mahdi): 3443 asm32: 4472 xmm: 4038 Although the output varies from run to run ofc. I think Stefan is right though. If you want to benefit from SIMD you need to process bigger chunks than a single hex digit at a time. Edited November 21, 2020 by David Heffernan Share this post Link to post
Guest Posted November 21, 2020 For XMM i used many registers while in fact only one is needed, if consts can be guaranteed to be 16 byte aligned, and that include the xmm1 which is the zero and can a const, and for that matter the four of these consts are exactly 64 byte means one cache line will be accessed in fixed time in an branchless algorithm, the CPU will do magic in loops with such code, hence we on 32 bit have 8 registers to use in parallel and they will execute at the same time, means 8X speed, then again it is doing work for 2 char, then the speed will be 16X, while on 64bit the registers are double means 32X, and this is only XMM, i am not talking about YMM and ZMM, (SSE3 and SSE4) Share this post Link to post
Guest Posted November 21, 2020 Forgot to mention that for enough like 4 chars then we can remove the base assembly instructions and replace it with SIMD in Hex to Dec conversion too this will boost the speed ever further. Share this post Link to post
Lars Fosdal 1792 Posted November 21, 2020 5 hours ago, David Heffernan said: I think the way I phrased the question should help your intuition! 😉 It's always worth double checking, but a lookup table involves working the answer out before you compile. I would imagine it is hard for runtime code to ever beat that. Well, corner cases apart - it seems that it did? Share this post Link to post
Bernard 18 Posted November 21, 2020 Speed up the HexToBinMahdi function by adding a 3rd table and removing any calculations function HexToBinMahdi(const HexValue: string): string; type TChar4 = array [0 .. 3] of Char; PChar4 = ^TChar4; const Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); Table3: array ['A' .. 'F'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); var HexDigit: Char; P: PChar4; begin SetLength(Result, Length(HexValue) * 4); P := PChar4(Result); for HexDigit in HexValue do begin case HexDigit of '0' .. '9': P^ := Table1[HexDigit]; 'a' .. 'f': P^ := Table2[HexDigit]; 'A' .. 'F': P^ := Table3[HexDigit]; else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]); end; Inc(P); end; end; Pascal lookup (Heff): 5777 Pascal lookup (Mahdi): 5948 Pascal lookup (Mahdi 2): 5414 asm32: 6800 xmm: 6068 Share this post Link to post
Anders Melander 1783 Posted November 21, 2020 54 minutes ago, Stefan Glienke said: Kinda pointless to limit the power of sse to handling single characters instead of simply processing multiple characters at once That was my thought too but then again, apart from the challenge of it, it's kinda pointless trying to squeeze every last microsecond out of an ASCII HEX to ASCII BIN conversion. It would be more relevant for HEX to binary conversion. 35 minutes ago, Lars Fosdal said: Well, corner cases apart - it seems that it did? Maybe in this case. You can find plenty of examples out there which demonstrates that a LUT isn't always faster. For example: What's the fastest way to convert hex to integer in C++? Share this post Link to post
David Heffernan 2345 Posted November 21, 2020 49 minutes ago, Lars Fosdal said: Well, corner cases apart - it seems that it did? Not in the code produced so far. Small numbers are better here. Share this post Link to post
Clément 148 Posted November 21, 2020 Hi, Just would like to say thank you for this thread!!!! If I might just add my 0.00002c..... The "for in" syntax is not optimization friendly ... I ran the benchmark in my very slow machine with the following modification in Mahdi code: function HexToBinMahdi2(const HexValue: string): string; type TChar4 = array [0 .. 3] of Char; PChar4 = ^TChar4; const Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001'); Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111'); var HexDigit: Char; P: PChar4; begin SetLength(Result, Length(HexValue) * 4); P := PChar4(Result); for var i: integer := low(HexValue) to high( HexValue ) do begin case HexValue[i] of '0' .. '9': P^ := Table1[HexValue[i]]; 'a' .. 'f': P^ := Table2[HexValue[i]]; 'A' .. 'F': P^ := Table2[Chr(Ord(HexValue[i]) xor $20)]; else raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexValue[i], HexValue]); end; Inc(P); end; end; Pascal lookup (Heff): 5391 Pascal lookup (Mahdi): 5072 Pascal lookup (Mahdi2): 4184 asm32: 8067 xmm: 6369 I know the test is about the conversion, but since "for in" is very slow, it might help straight things up a little extra bit Share this post Link to post
Stefan Glienke 2002 Posted November 21, 2020 (edited) 7 minutes ago, Clément said: The "for in" syntax is not optimization friendly i was just going to comment on that - a for in loop on a string is causing the compiler to do an LStrAsg to a local variable and iterates that one which causes a costly implicit try finally and UstrClr in the epilogue. Also with all that microbenchmarking - please consider the compiler might place code for various implementations good or bad in terms of their layout within cache lines. We had that topic already some while ago where one implementation was simply faster because the hot loop fit into one cache line while another one or even rearranging of code caused it to span two cache lines affecting the results negatively. Edited November 21, 2020 by Stefan Glienke Share this post Link to post