Hex2Binary

Lars Fosdal · November 21, 2020

I don't know, David, which is why I ask.

David Heffernan · November 21, 2020

13 minutes ago, Lars Fosdal said:

I don't know, David, which is why I ask.

I think the way I phrased the question should help your intuition! 😉

It's always worth double checking, but a lookup table involves working the answer out before you compile. I would imagine it is hard for runtime code to ever beat that.

November 21, 2020

9 minutes ago, Lars Fosdal said:

I don't know, David, which is why I ask.

I will dissect it !,

Here modified Mahdi version with lookup table to be handling one char, all build are 32bit with XE8

procedure HexToBin2(const HexValue: Char; Res: PChar);
type
  TChar4 = array[0..3] of Char;
  PChar4 = ^TChar4;
const
  Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
  Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
var
  P: PChar4 absolute Res;
begin
  case HexValue of
    '0'..'9':
      P^ := Table1[HexValue];
    'a'..'f':
      P^ := Table2[HexValue];
    'A'..'F':
      P^ := Table2[Chr(Ord(HexValue) xor $20)];
  else
    //raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
  end;
end;

and here its assembly details

Block Throughput: 7.26 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.8     0.0  |  1.7  |  5.0     5.0  |  5.0     4.0  |  6.0  |  1.7  |  1.7  |  5.0  |
--------------------------------------------------------------------------------------------------

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | mov edx, eax
|   1      |             | 0.3  |             |             |      | 0.7  |      |      | add edx, 0xffffffd0
|   1*     |             |      |             |             |      |      |      |      | sub dx, 0xa
|   0*F    |             |      |             |             |      |      |      |      | jb 0x16
|   1      | 0.3         | 0.5  |             |             |      | 0.3  |      |      | add edx, 0xfffffff9
|   1*     |             |      |             |             |      |      |      |      | sub dx, 0x6
|   0*F    |             |      |             |             |      |      |      |      | jb 0x43
|   1      | 0.3         | 0.3  |             |             |      | 0.5  |      |      | add edx, 0xffffffe6
|   1*     |             |      |             |             |      |      |      |      | sub dx, 0x6
|   0*F    |             |      |             |             |      |      |      |      | jb 0x1f
|   1      | 0.3         |      |             |             |      |      | 0.7  |      | jmp 0x55
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov edx, dword ptr [rbp-0x4]
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e480]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdx], ecx
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e484]
|   2^     |             |      |             | 1.0         | 1.0  |      |      |      | mov dword ptr [rdx+0x4], ecx
|   1      | 0.7         |      |             |             |      |      | 0.3  |      | jmp 0x3a
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov edx, dword ptr [rbp-0x4]
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e348]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdx], ecx
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e34c]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdx+0x4], ecx
|   1      | 0.3         |      |             |             |      |      | 0.7  |      | jmp 0x1f
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov edx, dword ptr [rbp-0x4]
|   1      |             | 0.7  |             |             |      | 0.3  |      |      | xor ax, 0x20
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e348]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdx], ecx
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov ecx, dword ptr [rax*8+0x41e34c]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdx+0x4], ecx
Total Num Of Uops: 35

Here my base assembly instructions version

Block Throughput: 7.21 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  5.1     0.0  |  4.8  |  1.3     0.6  |  1.4     0.4  |  3.0  |  4.9  |  5.2  |  1.3  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   3^#    |             | 0.1  |             | 0.4         | 1.0  |      |      | 0.6  | push rdi
|   1*     |             |      |             |             |      |      |      |      | mov edi, edx
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      | 0.3         | 0.5  |             |             |      | 0.2  |      |      | mov ecx, 0x39
|   1      | 0.2         | 0.3  |             |             |      | 0.5  |      |      | sub ecx, eax
|   1      | 0.3         |      |             |             |      |      | 0.7  |      | sar ecx, 0x1f
|   1      | 0.2         | 0.5  |             |             |      | 0.3  |      |      | and ecx, 0x27
|   1      | 0.3         | 0.2  |             |             |      | 0.5  |      |      | neg ecx
|   1      | 0.2         | 0.3  |             |             |      | 0.2  | 0.3  |      | add eax, ecx
|   1      | 0.3         | 0.2  |             |             |      | 0.3  | 0.2  |      | add eax, 0xffffffd0
|   1*     |             |      |             |             |      |      |      |      | xor ecx, ecx
|   1      | 0.3         | 0.2  |             |             |      | 0.3  | 0.2  |      | mov dx, 0x1
|   1      | 0.2         | 0.3  |             |             |      | 0.2  | 0.3  |      | test al, 0x4
|   1      | 0.3         |      |             |             |      |      | 0.7  |      | cmovnz cx, dx
|   1      | 0.7         |      |             |             |      |      | 0.3  |      | shl ecx, 0x10
|   1      |             | 0.7  |             |             |      | 0.3  |      |      | test al, 0x8
|   1      | 0.3         |      |             |             |      |      | 0.7  |      | cmovnz cx, dx
|   1      |             | 0.3  |             |             |      | 0.7  |      |      | add ecx, 0x300030
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.4  | mov dword ptr [rdi], ecx
|   1*     |             |      |             |             |      |      |      |      | xor ecx, ecx
|   1      | 0.3         | 0.2  |             |             |      | 0.5  |      |      | test al, 0x1
|   1      | 0.2         |      |             |             |      |      | 0.8  |      | cmovnz cx, dx
|   1      | 0.8         |      |             |             |      |      | 0.2  |      | shl ecx, 0x10
|   1      |             | 0.8  |             |             |      | 0.2  |      |      | test al, 0x2
|   1      | 0.2         |      |             |             |      |      | 0.8  |      | cmovnz cx, dx
|   1      |             | 0.2  |             |             |      | 0.8  |      |      | add ecx, 0x300030
|   2^     |             |      | 0.4         | 0.3         | 1.0  |      |      | 0.3  | mov dword ptr [rdi+0x4], ecx
|   2^     |             |      | 0.6     0.6 | 0.4     0.4 |      |      |      |      | pop rdi
Total Num Of Uops: 33

And this is my MMX one

Block Throughput: 8.84 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  9.0     0.0  |  4.5  |  1.0     1.0  |  1.0     1.0  |  1.0  |  9.0  |  4.5  |  1.0  |
--------------------------------------------------------------------------------------------------

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | mov ecx, 0x39
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | sub ecx, eax
|   1      |             |      |             |             |      |      | 1.0  |      | sar ecx, 0x1f
|   1      |             | 1.0  |             |             |      |      |      |      | and ecx, 0x27
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | neg ecx
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | add eax, ecx
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | add eax, 0xffffffd0
|   1      |             |      |             |             |      | 1.0  |      |      | movd mmx0, eax
|   1      | 1.0         |      |             |             |      |      |      |      | pxor mmx1, mmx1
|   1      |             |      |             |             |      | 1.0  |      |      | punpckldq mmx0, mmx0
|   3      |             | 0.5  |             |             |      | 2.0  | 0.5  |      | packssdw mmx0, mmx0
|   2^     | 1.0         |      | 1.0     1.0 |             |      |      |      |      | pand mmx0, qword ptr [rip+0x41e600]
|   1      | 1.0         |      |             |             |      |      |      |      | pcmpeqw mmx0, mmx1
|   2^     | 1.0         |      |             | 1.0     1.0 |      |      |      |      | psubw mmx0, qword ptr [rip+0x41e608]
|   1      |             |      |             |             |      | 1.0  |      |      | pshufw mmx0, mmx0, 0x1b
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movq qword ptr [rdx], mmx0
|  10      | 5.0         | 0.5  |             |             |      | 4.0  | 0.5  |      | emms
Total Num Of Uops: 32
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.

You can see the toll of using EMMS

and without emms

Block Throughput: 6.00 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  4.0     0.0  |  4.0  |  1.0     1.0  |  1.0     1.0  |  1.0  |  5.0  |  4.0  |  1.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | movzx eax, ax
|   1      |             | 1.0  |             |             |      |      |      |      | mov ecx, 0x39
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, eax
|   1      |             |      |             |             |      |      | 1.0  |      | sar ecx, 0x1f
|   1      |             | 1.0  |             |             |      |      |      |      | and ecx, 0x27
|   1      |             | 1.0  |             |             |      |      |      |      | neg ecx
|   1      |             |      |             |             |      |      | 1.0  |      | add eax, ecx
|   1      |             | 1.0  |             |             |      |      |      |      | add eax, 0xffffffd0
|   1      |             |      |             |             |      | 1.0  |      |      | movd mmx0, eax
|   1      | 1.0         |      |             |             |      |      |      |      | pxor mmx1, mmx1
|   1      |             |      |             |             |      | 1.0  |      |      | punpckldq mmx0, mmx0
|   3      |             |      |             |             |      | 2.0  | 1.0  |      | packssdw mmx0, mmx0
|   2^     | 1.0         |      | 1.0     1.0 |             |      |      |      |      | pand mmx0, qword ptr [rip+0x41e600]
|   1      | 1.0         |      |             |             |      |      |      |      | pcmpeqw mmx0, mmx1
|   2^     | 1.0         |      |             | 1.0     1.0 |      |      |      |      | psubw mmx0, qword ptr [rip+0x41e608]
|   1      |             |      |             |             |      | 1.0  |      |      | pshufw mmx0, mmx0, 0x1b
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movq qword ptr [rdx], mmx0
Total Num Of Uops: 22
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.

November 21, 2020

2 minutes ago, David Heffernan said:

I would imagine it is hard for runtime code to ever beat that.

I think, i just did.

David Heffernan · November 21, 2020

On 11/19/2020 at 4:12 PM, Mahdi Safsafi said:

@David Heffernan Few remarks about your code if you don't mind :

1- Its pointless to use string when characters are fixed in size ... Simply use static array of X char.

2- Its also pointless to calculate index when you already used a case ... Simply declare your array using char-range. In your case, compiler generated additional instructions to compute the index.



function HexToBin2(const HexValue: string): string;
type
  TChar4 = array [0 .. 3] of Char;
  PChar4 = ^TChar4;
const
  Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
  Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
var
  HexDigit: Char;
  P: PChar4;
begin
  SetLength(Result, Length(HexValue) * 4);
  P := PChar4(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
      '0' .. '9':
        P^ := Table1[HexDigit];
      'a' .. 'f':
        P^ := Table2[HexDigit];
      'A' .. 'F':
        P^ := Table2[Chr(Ord(HexDigit) xor $20)];
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(P);
  end;
end;

My benchmarking suggests that point 1 has no impact on performance, but point 2 does.

Actually the suggestion made by @Kas Ob. to replace the Move in my code with

PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^

is all you need to match the performance of your version. Which makes a lot of sense to me.

David Heffernan · November 21, 2020

40 minutes ago, Kas Ob. said:

I think, i just did.

I made this benchmark program, that wraps your single digit asm into a function that handles string input, and included error checking. If you skip the error checking then your asm code is faster. But if you include error checking, not so. How would you go about integrating your single digit code into a string level function with error checking?

Also, I'd love to have a high level explanation as to why my reasoning that a lookup table should not be beaten is wrong. Or is it perhaps that the implementation of the lookup table could be done better?

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.Diagnostics;

//********************************************
// Constants to tune the benchmark performance

const
  HexStringLength = 64;
  InputDataCount = 512;
  IterationCount = 20000;

//*****************************************************************
// David Heffernan's routine, with Move replace by Int64 assignment

{$DEFINE ReplaceMoveWithInt64Assign}
function HexToBin(const HexValue: string): string;
const
  BinaryValues: array [0..15] of string = (
    '0000', '0001', '0010', '0011',
    '0100', '0101', '0110', '0111',
    '1000', '1001', '1010', '1011',
    '1100', '1101', '1110', '1111'
  );
var
  HexDigit: Char;
  HexDigitValue: Integer;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9':
      HexDigitValue := Ord(HexDigit) - Ord('0');
    'a'..'f':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('a');
    'A'..'F':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('A');
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
{$IFDEF ReplaceMoveWithInt64Assign}
    PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^;
{$ELSE}
    Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char));
{$ENDIF}
    Inc(Ptr, 4);
  end;
end;

//*********************
// Kas Ob's asm version

procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar);
asm
        push edi
        mov edi,edx
        //      Get the decimal value of one Hex Char (= half byte)
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        xor     ecx,ecx
        mov     dx,$1
        test    al,4
        cmovne  cx,dx
        shl     ecx,16
        test    al,8
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi],ecx
        xor     ecx,ecx
        test    al,1
        cmovne  cx,dx
        shl     ecx,16
        test    al,2
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi+4],ecx
        pop edi
end;

function HexToBinAsm(const HexValue: string): string;
var
  HexDigit: Char;
  HexDigitValue: Integer;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9','a'..'f','A'..'F':
      CharToBin_ASM32(HexDigit, Ptr);
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(Ptr, 4);
  end;
end;

//***************
// Benchmark code

function RandomHexDigit: Char;
var
  Ordinal: Integer;
begin
  Ordinal := Random(16);
  case Ordinal of
  0..9:
    Result := Chr(Ord('0') + Ordinal);
  10..15:
    Result := Chr(Ord('a') + Ordinal - 10);
  else
    raise Exception.Create('');
  end;
end;

function RandomHexString(Length: Integer): string;
var
  Index: Integer;
begin
  SetLength(Result, Length);
  for Index := 1 to Length do
    Result[Index] := RandomHexDigit;
end;

procedure Benchmark;
var
  Index, Iteration: Integer;
  sw: TStopwatch;
  HexStrings: TStringList;
  binaryString: string;
begin
  HexStrings := TStringList.Create;
  try
    for Index := 0 to InputDataCount-1 do
      HexStrings.Add(RandomHexString(HexStringLength));
    sw := TStopwatch.StartNew;
    for Iteration := 0 to IterationCount-1 do
      for Index := 0 to HexStrings.Count-1 do
      begin
        binaryString := HexToBin(HexStrings[Index]);
        binaryString := ''; // force a reallocation of the string every iteration of this loop
      end;
    Writeln(sw.ElapsedMilliseconds);
  finally
    HexStrings.Free;
  end;
end;

begin
  try
    Randomize;
    Benchmark;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.

EDIT: It turns out that this benchmark is possibly useless at the moment, because HexToBinAsm doesn't work correctly (almost certainly my fault). I need to work out what I've done wrong.
EDIT2: No, it's fine, I was mistaken, HexToBinAsm above works correctly.

Edited November 21, 2020 by David Heffernan

November 21, 2020

The XMM version is way better and faster than them all, 6 cycles in all, while MMX instruction are doing the work of full byte, means converting two HEX character with these instruction will be the same speed.

procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar);
const
  DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00);
  DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF);
  DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80);
asm
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        movd    xmm0, eax
        pxor    xmm1, xmm1
        movdqu  xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE
        punpckldq xmm0, xmm0
        packssdw xmm0, xmm0
        pand    xmm0, dqword ptr[DEC_TO_BIN_WORD_MASK]
        pcmpeqw xmm0, xmm1
        psubw   xmm0, xmm2
        movdqu xmm3,DEC_TO_BIN_REVERSE_MASK
        PSHUFB   xmm0, xmm3                   // reverse the result
        movq    qword ptr[HexBuffer], xmm0
end;

Block Throughput: 6.05 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 3.0    0.0  | 3.0  | 1.5    1.5  | 1.5    1.5  | 1.0  | 4.0  | 4.0  | 1.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1*   |           |     |           |           |     |     |     |     |    | movzx eax, ax
|   1    | 0.1       | 0.1 |           |           |     |     | 0.9 |     |    | mov ecx, 0x39
|   1    | 0.1       | 0.1 |           |           |     |     | 0.9 |     |    | sub ecx, eax
|   1    | 0.3       |     |           |           |     |     | 0.7 |     |    | sar ecx, 0x1f
|   1    | 0.2       | 0.5 |           |           |     |     | 0.3 |     |    | and ecx, 0x27
|   1    | 0.1       | 0.3 |           |           |     |     | 0.6 |     |    | neg ecx
|   1    | 0.3       | 0.6 |           |           |     |     | 0.2 |     |    | add eax, ecx
|   1    | 0.3       | 0.3 |           |           |     |     | 0.5 |     |    | add eax, 0xffffffd0
|   1    |           |     |           |           |     | 1.0 |     |     |    | movd xmm0, eax
|   1*   |           |     |           |           |     |     |     |     |    | pxor xmm1, xmm1
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqu xmm2, xmmword ptr [rip+0x41e610]
|   1    |           |     |           |           |     | 1.0 |     |     |    | punpckldq xmm0, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     |    | packssdw xmm0, xmm0
|   2^   | 0.6       | 0.4 | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | pand xmm0, xmmword ptr [rip+0x41e600]
|   1    | 0.5       | 0.5 |           |           |     |     |     |     |    | pcmpeqw xmm0, xmm1
|   1    | 0.6       | 0.4 |           |           |     |     |     |     |    | psubw xmm0, xmm2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqu xmm3, xmmword ptr [rip+0x41e620]
|   1    |           |     |           |           |     | 1.0 |     |     |    | pshufb xmm0, xmm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | movq qword ptr [rdx], xmm0
Total Num Of Uops: 21

David Heffernan · November 21, 2020

UPDATED BENCHMARK:

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.Diagnostics;

//********************************************
// Constants to tune the benchmark performance

const
  HexStringLength = 64;
  InputDataCount = 512;
  IterationCount = 20000;

//*****************************************************************
// David Heffernan's routine, with Move replace by Int64 assignment

{$DEFINE ReplaceMoveWithInt64Assign}
function HexToBin(const HexValue: string): string;
const
  BinaryValues: array [0..15] of string = (
    '0000', '0001', '0010', '0011',
    '0100', '0101', '0110', '0111',
    '1000', '1001', '1010', '1011',
    '1100', '1101', '1110', '1111'
  );
var
  HexDigit: Char;
  HexDigitValue: Integer;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9':
      HexDigitValue := Ord(HexDigit) - Ord('0');
    'a'..'f':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('a');
    'A'..'F':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('A');
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
{$IFDEF ReplaceMoveWithInt64Assign}
    PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^;
{$ELSE}
    Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char));
{$ENDIF}
    Inc(Ptr, 4);
  end;
end;

//***********************
// Kas Ob's asm32 version

procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar);
asm
        push edi
        mov edi,edx
        //      Get the decimal value of one Hex Char (= half byte)
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        xor     ecx,ecx
        mov     dx,$1
        test    al,4
        cmovne  cx,dx
        shl     ecx,16
        test    al,8
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi],ecx
        xor     ecx,ecx
        test    al,1
        cmovne  cx,dx
        shl     ecx,16
        test    al,2
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi+4],ecx
        pop edi
end;

function HexToBinAsm32(const HexValue: string): string;
var
  HexDigit: Char;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9','a'..'f','A'..'F':
      CharToBin_ASM32(HexDigit, Ptr);
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(Ptr, 4);
  end;
end;

//*********************
// Kas Ob's xmm version

procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar);
const
  DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00);
  DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF);
  DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80);
asm
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        movd    xmm0, eax
        pxor    xmm1, xmm1
        movdqu  xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE
        punpckldq xmm0, xmm0
        packssdw xmm0, xmm0
        pand    xmm0, dqword ptr[DEC_TO_BIN_WORD_MASK]
        pcmpeqw xmm0, xmm1
        psubw   xmm0, xmm2
        movdqu xmm3,DEC_TO_BIN_REVERSE_MASK
        PSHUFB   xmm0, xmm3                   // reverse the result
        movq    qword ptr[HexBuffer], xmm0
end;

function HexToBinXmm(const HexValue: string): string;
var
  HexDigit: Char;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9','a'..'f','A'..'F':
      CharToBin_XMM(HexDigit, Ptr);
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(Ptr, 4);
  end;
end;

//***************
// Benchmark code

function RandomHexDigit: Char;
var
  Ordinal: Integer;
begin
  Ordinal := Random(16);
  case Ordinal of
  0..9:
    Result := Chr(Ord('0') + Ordinal);
  10..15:
    Result := Chr(Ord('a') + Ordinal - 10);
  else
    raise Exception.Create('');
  end;
end;

function RandomHexString(Length: Integer): string;
var
  Index: Integer;
begin
  SetLength(Result, Length);
  for Index := 1 to Length do
    Result[Index] := RandomHexDigit;
end;

procedure Benchmark;
var
  Index, Iteration: Integer;
  sw: TStopwatch;
  HexStrings: TStringList;
  binaryString: string;
begin
  HexStrings := TStringList.Create;
  try
    for Index := 0 to InputDataCount-1 do
      HexStrings.Add(RandomHexString(HexStringLength));

    sw := TStopwatch.StartNew;
    for Iteration := 0 to IterationCount-1 do
      for Index := 0 to HexStrings.Count-1 do
      begin
        binaryString := HexToBin(HexStrings[Index]);
        binaryString := ''; // force a reallocation of the string every iteration of this loop
      end;
    Writeln('Pascal lookup: ', sw.ElapsedMilliseconds);

    sw := TStopwatch.StartNew;
    for Iteration := 0 to IterationCount-1 do
      for Index := 0 to HexStrings.Count-1 do
      begin
        binaryString := HexToBinAsm32(HexStrings[Index]);
        binaryString := ''; // force a reallocation of the string every iteration of this loop
      end;
    Writeln('asm32: ', sw.ElapsedMilliseconds);

    sw := TStopwatch.StartNew;
    for Iteration := 0 to IterationCount-1 do
      for Index := 0 to HexStrings.Count-1 do
      begin
        binaryString := HexToBinXmm(HexStrings[Index]);
        binaryString := ''; // force a reallocation of the string every iteration of this loop
      end;
    Writeln('xmm: ', sw.ElapsedMilliseconds);
  finally
    HexStrings.Free;
  end;
end;

begin
  try
    Randomize;
    Benchmark;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.

Compiled on XE7, 32 bit, release configuration

OUTPUT:

Pascal lookup: 3743
asm32: 4396
EAccessViolation: Access violation at address 004D77D3 in module 'Project14.exe'. Read of address FFFFFFFF

The AV is HexToBinXmm. Perhaps I misunderstood something.

Mike Torrettinni · November 21, 2020

My benchmark shows that Mahdi's HexToBin2 is still faster then HexToBinAsm32. Actually, the longer the Hex string, the bigger difference between them - at 1024 hex chars string, HexToBinAsm32 double's the execution time:

cHexRange = 16777216; cHexLen = 6;

HexToBin2 = 818;

HexToBinAsm32 = 1208;

Diff: HexToBin2 = -32%

cHexRange = 800000; cHexLen = 1024;

HexToBin2 = 2035;

HexToBinAsm32 = 4506;

Diff: HexToBin2 = -54%

You don't get the similar difference?

Stefan Glienke · November 21, 2020

3 hours ago, David Heffernan said:

My benchmarking suggests that point 1 has no impact on performance, but point 2 does.

Certainly not in a microbenchmark where probably everything fits into L1 cache.

Not so in a real program where that array of string might be placed in different locations than the strings it contains.

Mahdi Safsafi · November 21, 2020

3 hours ago, David Heffernan said:

My benchmarking suggests that point 1 has no impact on performance, but point 2 does.

Benchmark does not always tell the full story 🙂

Unlike my code, Yours isn't cache friendly

const
  BinaryValues: array [0..15] of string = (
    '0000', '0001', '0010', '0011',
    '0100', '0101', '0110', '0111',
    '1000', '1001', '1010', '1011',
    '1100', '1101', '1110', '1111'
  );

Let's see what happens for x64 :

You need 16 * SizeOf(Pointer) to store string reference = 16 * 8 = 128 bytes = 2 Cache Line.

You need 16 * (4 * SizeOf(Char) + 2 byte(null terminated) + SizeOf(StrRec)) to store data = 16 *(4 * 2 + 2 + 16) = 416 bytes = 7 Cache Line.

N.B: I supposed that compiler did a great job by placing your data in a continuous region.

--------------------------------------------------------

You ended up consuming 416 + 128 = 544 bytes.

You ended up consuming 9 Cache Line.

type
  TChar4 = array[0..3] of Char;
  PChar4 = ^TChar4;
const
  Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
  Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');

I only need 16 * (4 * SizeOf(Char)) to store data = 16 * (4 * 2) = 128 bytes = 2 Cache Line.

November 21, 2020

2 hours ago, David Heffernan said:

The AV is HexToBinXmm. Perhaps I misunderstood something.

My mistake i fixed one and forgot about the second, the compiler fooled/failed me, it is aligned access, to fix it do this

Quote

//movdqu xmm3,DEC_TO_BIN_REVERSE_MASK
// with these
movdqu xmm3,DEC_TO_BIN_WORD_MASK
pand xmm0, xmm3

3 hours ago, David Heffernan said:

I'd love to have a high level explanation as to why my reasoning that a lookup table should not be beaten is wrong. Or is it perhaps that the implementation of the lookup table could be done better?

Will try to write for the second time, as the browser for some reason refreshed and erased what i wrote.

Now to answer as simple and clear way as i can, your assumption (like many others) on how CPU works is outdated, i am talking about how instructions are executed, now to be more precise how they are decoded then executed, long story short, the CPU's that executed instructions in serial way by reading one by one are almost vanished from our time, CPU technology now grab a block of code, by code i mean block of data/bytes to execute it, most likely 16 bytes, then decode them in parallel at the same time then execute, so some will block as they need result, comparison, calculating address, access memory, access memory within the same cache line, access memory on the same page... many many things

CPU utilize many tricks by cheating, and they introduced many technologies in that matter from out-of-order-execution to branch prediction....

Now we established that let see what happen with lookup tables vs calculating, in many cases lookup table are faster but in this case we should look at this instruction (which is essential in all lookup tables)

mov ecx, dword ptr [rax*8+0x41e348]

This in fact include slow microop which include multiplication and addition, hence this will take full cycle even when when reported as microfused with the following one

mov dword ptr [rdx], ecx

it is slow and will block all the block execution until both executed, the big problem in the second one is 2 full microops including

 0|13|mov ecx, dword ptr [rax*8+0x41e480]               :          |         |         |         |         |       
 0|13|    TYPE_LOAD (1 uops)                            :  s---deeeew----R-------p     |         |         |      
 0|14|mov dword ptr [rdx], ecx                          :          |         |         |         |         |      
 0|14|    TYPE_STOREDATA (1 uops)                       :  A--------dw----R-------p    |         |         |    
 0|14|    TYPE_STOREADDRESS (1 uops)                    :   A-------deeeew----R-------p|         |         |

The general assumption that these will be have same execution time always is very wrong, and let me say it it will change based on the block of the instructions been decoded, and to be sure if you went and disabled all CPU enhancement, then lookup table will have higher chance to be faster.

Now on other hand the inplace calculation is using only registers making the CPU have its full chance to decode instruction into small microops mostly are 1 for the first part the CPU managed to even execute them in fraction of a cycle, and with now memory access higher throughput can be achieved.

But again it depends on the case, an example i have seen an AES encryption that use very big lookup tables that hindered the speed greatly, while without one it is slower than small table, one big table was affecting the cache that it run 5 times slower.

In that Delphi generated code for the lookup table the slowness was coming form the jumps (branching) at first place then from access memory in both read and write.

One more thing Intel itself doesn't recommend many things but refer to this

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

it does have very valuable information and recommendation for compiler builder and for software programmers

These might be in interest for reading

3.4.1.4 Inlining, Calls and Returns

3.5.1.8 Clearing Registers and Dependency Breaking Idioms

3.6.4 Alignment

3.6.5.1 Store-to-Load-Forwarding Restriction on Size and Alignment (,, in particular)

Quote

Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.

Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.

3.6.5.2 Store-forwarding Restriction on Data Availability

anyway there is many great information and at first might shock you, because they against general assumption.

I am no selling anything here, it is intel 🙂

David Heffernan · November 21, 2020

13 minutes ago, Kas Ob. said:

movdqu xmm3,DEC_TO_BIN_WORD_MASK
pand xmm0, xmm3

I'm still seeing an AV here even with that change. It would be really nice to get this fixed and then I can update my benchmark program again.

Edited November 21, 2020 by David Heffernan

David Heffernan · November 21, 2020

41 minutes ago, Mahdi Safsafi said:

Unlike my code, Yours isn't cache friendly

I understand, thanks

November 21, 2020

procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar);
const
  DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00);
  DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF);
  DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80);
asm
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        movd    xmm0, eax
        pxor    xmm1, xmm1
        movdqu  xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE
        movdqu  xmm3, DEC_TO_BIN_WORD_MASK
        movdqu  xmm4,DEC_TO_BIN_REVERSE_MASK
        punpckldq xmm0, xmm0
        packssdw xmm0, xmm0
        pand    xmm0, xmm3
        pcmpeqw xmm0, xmm1
        psubw   xmm0, xmm2
        PSHUFB  xmm0, xmm4                   // reverse the result
        movq    qword ptr[HexBuffer], xmm0
end;

And there are the result on Sandy Bridge

Quote

Pascal lookup: 4629
asm32: 5931
xmm: 5312

Notice that the XMM version is doing double the work so with some tweaks the conversion from Dec to Bin will be perform twice the speed out of the box, but will lose around 2 cycle per iteration for the extra char Hex to Dec conversion.

Stefan Glienke · November 21, 2020

Kinda pointless to limit the power of sse to handling single characters instead of simply processing multiple characters at once - especially since you now have a call inside the loop slowing stuff down significantly plus having to move the same stuff over and over into xmm2-4.

Using simd should be 2-10times faster than the regular 1 char in a loop implementation

Edited November 21, 2020 by Stefan Glienke

David Heffernan · November 21, 2020

Updated benchmark code:

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.Diagnostics;

//********************************************
// Constants to tune the benchmark performance

const
  HexStringLength = 64;
  InputDataCount = 512;
  IterationCount = 20000;

//*****************************************************************
// David Heffernan's routine, with Move replace by Int64 assignment

{$DEFINE ReplaceMoveWithInt64Assign}
function HexToBinHeff(const HexValue: string): string;
const
  BinaryValues: array [0..15] of string = (
    '0000', '0001', '0010', '0011',
    '0100', '0101', '0110', '0111',
    '1000', '1001', '1010', '1011',
    '1100', '1101', '1110', '1111'
  );
var
  HexDigit: Char;
  HexDigitValue: Integer;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9':
      HexDigitValue := Ord(HexDigit) - Ord('0');
    'a'..'f':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('a');
    'A'..'F':
      HexDigitValue := 10 + Ord(HexDigit) - Ord('A');
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
{$IFDEF ReplaceMoveWithInt64Assign}
    PInt64(Ptr)^ := PInt64(BinaryValues[HexDigitValue])^;
{$ELSE}
    Move(Pointer(BinaryValues[HexDigitValue])^, Ptr^, 4 * SizeOf(Char));
{$ENDIF}
    Inc(Ptr, 4);
  end;
end;

//************************
// Mahdi Safsafi's routine

{$DEFINE ReplaceMoveWithInt64Assign}
function HexToBinMahdi(const HexValue: string): string;
type
  TChar4 = array [0 .. 3] of Char;
  PChar4 = ^TChar4;
const
  Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
  Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
var
  HexDigit: Char;
  P: PChar4;
begin
  SetLength(Result, Length(HexValue) * 4);
  P := PChar4(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
      '0' .. '9':
        P^ := Table1[HexDigit];
      'a' .. 'f':
        P^ := Table2[HexDigit];
      'A' .. 'F':
        P^ := Table2[Chr(Ord(HexDigit) xor $20)];
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(P);
  end;
end;

{$IFDEF CPUX86}
//***********************
// Kas Ob's asm32 version

procedure CharToBin_ASM32(HexChar: Char; HexBuffer: PChar);
asm
        push edi
        mov edi,edx
        //      Get the decimal value of one Hex Char (= half byte)
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        xor     ecx,ecx
        mov     dx,$1
        test    al,4
        cmovne  cx,dx
        shl     ecx,16
        test    al,8
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi],ecx
        xor     ecx,ecx
        test    al,1
        cmovne  cx,dx
        shl     ecx,16
        test    al,2
        cmovne  cx,dx
        add     ecx,$00300030
        mov     [edi+4],ecx
        pop edi
end;

function HexToBinAsm32(const HexValue: string): string;
var
  HexDigit: Char;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9','a'..'f','A'..'F':
      CharToBin_ASM32(HexDigit, Ptr);
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(Ptr, 4);
  end;
end;

//*********************
// Kas Ob's xmm version

procedure CharToBin_XMM(HexChar: Char; HexBuffer: PChar);
const
  DEC_TO_BIN_WORD_MASK: array[0..7] of UInt16 = ($01, $02,$00,$00, $04, $08, $00, $00);
  DEC_TO_BIN_FF_TO_CHARONE_DISTANCE: array[0..7] of UInt16 = ($FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF, $FFCF);
  DEC_TO_BIN_REVERSE_MASK: array[0..15] of Byte = (10, $80, 8, $80, 2, $80, 0, $80, $80, $80, $80, $80, $80, $80, $80, $80);
asm
        movzx   eax, HexChar
        mov     ecx, 57
        sub     ecx, eax
        sar     ecx, 31
        and     ecx, 39
        neg     ecx
        add     eax, ecx
        add     eax,  - 48
        //      Produce 4 Chars presenting 4 bits of HexChar
        movd    xmm0, eax
        pxor    xmm1, xmm1
        movdqu  xmm2, DEC_TO_BIN_FF_TO_CHARONE_DISTANCE
        movdqu  xmm3, DEC_TO_BIN_WORD_MASK
        movdqu  xmm4,DEC_TO_BIN_REVERSE_MASK
        punpckldq xmm0, xmm0
        packssdw xmm0, xmm0
        pand    xmm0, xmm3
        pcmpeqw xmm0, xmm1
        psubw   xmm0, xmm2
        PSHUFB  xmm0, xmm4                   // reverse the result
        movq    qword ptr[HexBuffer], xmm0
end;

function HexToBinXmm(const HexValue: string): string;
var
  HexDigit: Char;
  Ptr: PChar;
begin
  SetLength(Result, Length(HexValue) * 4);
  Ptr := Pointer(Result);
  for HexDigit in HexValue do
  begin
    case HexDigit of
    '0'..'9','a'..'f','A'..'F':
      CharToBin_XMM(HexDigit, Ptr);
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(Ptr, 4);
  end;
end;
{$ENDIF}

//***************
// Benchmark code

function RandomHexDigit: Char;
var
  Ordinal: Integer;
begin
  Ordinal := Random(16);
  case Ordinal of
  0..9:
    Result := Chr(Ord('0') + Ordinal);
  10..15:
    Result := Chr(Ord('a') + Ordinal - 10);
  else
    raise Exception.Create('');
  end;
end;

function RandomHexString(Length: Integer): string;
var
  Index: Integer;
begin
  SetLength(Result, Length);
  for Index := 1 to Length do
    Result[Index] := RandomHexDigit;
end;

procedure TestCorrectness;
var
  Index: Integer;
  HexStr, BinStr: string;
begin
  for Index := 0 to $fffff do
  begin
    HexStr := IntToHex(Index, 6);
    BinStr := HexToBinHeff(HexStr);

    if BinStr<>HexToBinMahdi(HexStr) then
      raise Exception.Create('incorrect implementation');
    if BinStr<>HexToBinAsm32(HexStr) then
      raise Exception.Create('incorrect implementation');
    if BinStr<>HexToBinXmm(HexStr) then
      raise Exception.Create('incorrect implementation');
  end;
end;

procedure Benchmark;
var
  Index, Iteration: Integer;
  sw: TStopwatch;
  HexStrings: TArray<string>;
  binaryString: string;
begin
  SetLength(HexStrings, InputDataCount);
  for Index := 0 to InputDataCount-1 do
    HexStrings[Index] := RandomHexString(HexStringLength);

  sw := TStopwatch.StartNew;
  for Iteration := 0 to IterationCount-1 do
    for Index := 0 to InputDataCount-1 do
    begin
      binaryString := HexToBinHeff(HexStrings[Index]);
      binaryString := ''; // force a reallocation of the string every iteration of this loop
    end;
  Writeln('Pascal lookup (Heff): ', sw.ElapsedMilliseconds);

  sw := TStopwatch.StartNew;
  for Iteration := 0 to IterationCount-1 do
    for Index := 0 to InputDataCount-1 do
    begin
      binaryString := HexToBinMahdi(HexStrings[Index]);
      binaryString := ''; // force a reallocation of the string every iteration of this loop
    end;
  Writeln('Pascal lookup (Mahdi): ', sw.ElapsedMilliseconds);

{$IFDEF CPUX86}
  sw := TStopwatch.StartNew;
  for Iteration := 0 to IterationCount-1 do
    for Index := 0 to InputDataCount-1 do
    begin
      binaryString := HexToBinAsm32(HexStrings[Index]);
      binaryString := ''; // force a reallocation of the string every iteration of this loop
    end;
  Writeln('asm32: ', sw.ElapsedMilliseconds);

  sw := TStopwatch.StartNew;
  for Iteration := 0 to IterationCount-1 do
    for Index := 0 to InputDataCount-1 do
    begin
      binaryString := HexToBinXmm(HexStrings[Index]);
      binaryString := ''; // force a reallocation of the string every iteration of this loop
    end;
  Writeln('xmm: ', sw.ElapsedMilliseconds);
{$ENDIF}
end;

begin
  try
    Randomize;
    TestCorrectness;
    Benchmark;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.

OUTPUT:

Pascal lookup (Heff): 3717
Pascal lookup (Mahdi): 3443
asm32: 4472
xmm: 4038

Although the output varies from run to run ofc.

I think Stefan is right though. If you want to benefit from SIMD you need to process bigger chunks than a single hex digit at a time.

Edited November 21, 2020 by David Heffernan

November 21, 2020

For XMM i used many registers while in fact only one is needed, if consts can be guaranteed to be 16 byte aligned, and that include the xmm1 which is the zero and can a const, and for that matter the four of these consts are exactly 64 byte means one cache line will be accessed in fixed time in an branchless algorithm, the CPU will do magic in loops with such code, hence we on 32 bit have 8 registers to use in parallel and they will execute at the same time, means 8X speed, then again it is doing work for 2 char, then the speed will be 16X, while on 64bit the registers are double means 32X, and this is only XMM, i am not talking about YMM and ZMM, (SSE3 and SSE4)

November 21, 2020

Forgot to mention that for enough like 4 chars then we can remove the base assembly instructions and replace it with SIMD in Hex to Dec conversion too this will boost the speed ever further.

Lars Fosdal · November 21, 2020

5 hours ago, David Heffernan said:

I think the way I phrased the question should help your intuition! 😉

It's always worth double checking, but a lookup table involves working the answer out before you compile. I would imagine it is hard for runtime code to ever beat that.

Well, corner cases apart - it seems that it did?

Bernard · November 21, 2020

Speed up the HexToBinMahdi function by adding a 3rd table and removing any calculations

function HexToBinMahdi(const HexValue: string): string;
type
TChar4 = array [0 .. 3] of Char;
PChar4 = ^TChar4;
const
Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
Table3: array ['A' .. 'F'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
var
HexDigit: Char;
P: PChar4;
begin
SetLength(Result, Length(HexValue) * 4);
P := PChar4(Result);
for HexDigit in HexValue do
begin
    case HexDigit of
      '0' .. '9':
        P^ := Table1[HexDigit];
      'a' .. 'f':
        P^ := Table2[HexDigit];
      'A' .. 'F':
        P^ := Table3[HexDigit];
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexDigit, HexValue]);
    end;
    Inc(P);
end;
end;

Pascal lookup (Heff): 5777
Pascal lookup (Mahdi): 5948
Pascal lookup (Mahdi 2): 5414
asm32: 6800
xmm: 6068

Anders Melander · November 21, 2020

54 minutes ago, Stefan Glienke said:

Kinda pointless to limit the power of sse to handling single characters instead of simply processing multiple characters at once

That was my thought too but then again, apart from the challenge of it, it's kinda pointless trying to squeeze every last microsecond out of an ASCII HEX to ASCII BIN conversion. It would be more relevant for HEX to binary conversion.

35 minutes ago, Lars Fosdal said:

Well, corner cases apart - it seems that it did?

Maybe in this case. You can find plenty of examples out there which demonstrates that a LUT isn't always faster. For example: What's the fastest way to convert hex to integer in C++?

David Heffernan · November 21, 2020

49 minutes ago, Lars Fosdal said:

Well, corner cases apart - it seems that it did?

Not in the code produced so far. Small numbers are better here.

Clément · November 21, 2020

Hi,

Just would like to say thank you for this thread!!!!

If I might just add my 0.00002c.....

The "for in" syntax is not optimization friendly ... I ran the benchmark in my very slow machine with the following modification in Mahdi code:

function HexToBinMahdi2(const HexValue: string): string;
type
  TChar4 = array [0 .. 3] of Char;
  PChar4 = ^TChar4;
const
  Table1: array ['0' .. '9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
  Table2: array ['a' .. 'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');
var
  HexDigit: Char;
  P: PChar4;
begin
  SetLength(Result, Length(HexValue) * 4);
  P := PChar4(Result);
  for var i: integer := low(HexValue) to high( HexValue ) do
  begin
    case HexValue[i] of
      '0' .. '9':
        P^ := Table1[HexValue[i]];
      'a' .. 'f':
        P^ := Table2[HexValue[i]];
      'A' .. 'F':
        P^ := Table2[Chr(Ord(HexValue[i]) xor $20)];
    else
      raise EConvertError.CreateFmt('Invalid hex digit ''%s'' found in ''%s''', [HexValue[i], HexValue]);
    end;
    Inc(P);
  end;
end;

Pascal lookup (Heff): 5391
Pascal lookup (Mahdi): 5072
Pascal lookup (Mahdi2): 4184
asm32: 8067
xmm: 6369

I know the test is about the conversion, but since "for in" is very slow, it might help straight things up a little extra bit

Stefan Glienke · November 21, 2020

7 minutes ago, Clément said:

The "for in" syntax is not optimization friendly

i was just going to comment on that - a for in loop on a string is causing the compiler to do an LStrAsg to a local variable and iterates that one which causes a costly implicit try finally and UstrClr in the epilogue.

Also with all that microbenchmarking - please consider the compiler might place code for various implementations good or bad in terms of their layout within cache lines.

We had that topic already some while ago where one implementation was simply faster because the hot loop fit into one cache line while another one or even rearranging of code caused it to span two cache lines affecting the results negatively.

Edited November 21, 2020 by Stefan Glienke

Sign In

Hex2Binary

Recommended Posts

Lars Fosdal 1877

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Guest

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Mike Torrettinni 199

Share this post

Link to post

Stefan Glienke 2148

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Guest

Share this post

Link to post

Stefan Glienke 2148

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Guest

Share this post

Link to post

Guest

Share this post

Link to post

Lars Fosdal 1877

Share this post

Link to post

Bernard 20

Share this post

Link to post

Anders Melander 2060

Share this post

Link to post

David Heffernan 2463

Share this post

Link to post

Clément 153

Share this post

Link to post

Stefan Glienke 2148

Share this post

Link to post

Create an account or sign in to comment

Create an account