Hex2Binary

Mahdi Safsafi · November 22, 2020

3 hours ago, Stefan Glienke said:

Especially since one of its selling points is "it compiles to native code" - if that native code is garbage for modern CPUs because its written like in the 90s that's kinda poor.

Agree !

One thing I noticed, people complain too much about compilation time(which could be a principal factor that influences EMB decision about optimization) ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc.

Edited November 22, 2020 by Mahdi Safsafi

Stefan Glienke · November 22, 2020

5 minutes ago, Mahdi Safsafi said:

One thing I noticed, people complain too much about compilation time ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc.

Why can't we have both - a fast compilation generating debug friendly non optimized code and one that churns a little longer and emits those juicy optimizations.

Anyhow the current slowliness in the compiler comes from sloppy code in the compiler and not because it does so many amazing things.

Lars Fosdal · November 23, 2020

I certainly would not mind a Delphi for .NET - but it would need to plug into the rest of the .NET world, and not try to take it on alone.

The last time they tried, MS did not play nice, and the other company that did an Object Pascal for .NET - well, they sort of went rogue.

There is a lot of Azure and .NET in my future, and I believe that I will have to leave Delphi behind for C#.

But - perhaps is Delphi better off as a compiler for native code on x86/x64/ARM64 on various platforms.

November 23, 2020

The thing is we already here and and Delphi IDE does and can utilize LLVM, so why on earth they neutered it, instead of using it right !

Just was looking for ideas then found some C code and decided to check how different compiler handle loop unrolling on their own, stopped checking after seeing the result of CLang 10.0 , the generated assembly made me very angry.

Have a look for yourself https://godbolt.org/z/1cva96

LLVM can do magic like the above, and also can do unthinkable stuff, that should be the focus, if not for full project then just parts where developer deemed crucial, that is more useful than a linker, or shopping for $1000 of free software to bundle.

Mahdi Safsafi · November 24, 2020

@Stefan Glienke @Kas Ob. Got some time and tried to do full simd ... The result is awesoooome !!!
Can we go further ? yep (using ymm but this is not gonna be easy as Delphi doesn't support it yet)

type
  TChar4 = array [0 .. 3] of Char;
  PChar4 = ^TChar4;

const
  { Source Data Format : Imm8[1:0] }
  DF_UNSIGNED_BYTES = 0;
  DF_UNSIGNED_WORDS = 1;
  DF_SIGNED_BYTES = 2;
  DF_SIGNED_WORDS = 3;
  { Aggregation Operation : Imm8[3:2] }
  AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
  AGGREGATION_OP_RANGES = 1 shl 2;
  AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
  AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
  { Polarity : Imm8[5:4] }
  POLARITY_POSITIVE = 0 shl 4;
  POLARITY_NEGATIVE = 1 shl 4;
  POLARITY_MASKED_POSITIVE = 2 shl 4;
  POLARITY_MASKED_NEGATIVE = 3 shl 4;
  { Output Selection : Imm8[6] }
  OS_LSI = 0 shl 6;
  OS_MSI = 1 shl 6;
  OS_BIT_MASK = 0 shl 6;
  OS_BYTE_WORD_MASK = 1 shl 6;

const
  [Align(16)]
  AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31);
  RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000;
  TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111', '????', '????', '????', '????', '????', '????', '????',
    '????', '????', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');

function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean;
asm
  push        ebx
  push        esi
  push        edi
  mov         ebx, ecx
  jz          @@Empty
  lea         esi, [TableData  ]
  mov         edi, ecx
  and         ebx, 7                          // trailing
  shr         edi, 3                          // number of simd-run
  jz          @@HandleTrailing                // too small to be done using simd

  movdqa      xmm0, [AndMaskData]
  movdqa      xmm1, [RangeData  ]

@@SimdLoop:
  movdqu      xmm3, [eax]
  { check if its a valid hex }
  pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
  ja          @@Process
  { either end (null terminated) or invalid char }
  test        Word [eax + ecx *2], -1         // is NT ? => continue
  jnz         @@InvalidChar
@@Process:
  pand        xmm3, xmm0                      // and each element with 31
  pxor        xmm2, xmm2
  { --- first four chars --- }
  movdqa      xmm4, xmm3
  { first two char }
  punpcklbw   xmm4, xmm2                      // unpack low data
  pcmpeqq     xmm2, xmm2                      // generate mask
  // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes
  db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
  movdqu      [edx], xmm5                     // store to result
  { second two char }
  pshufd      xmm4, xmm4, $0E                 // move next two elements to low
  pcmpeqq     xmm2, xmm2
  // gather two TChar4
  db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
  movdqu      [edx+16], xmm5

  { --- last four chars --- }
  { first two char }
  pxor        xmm2, xmm2
  punpckhbw   xmm3, xmm2                      // unpack high data
  pcmpeqq     xmm2, xmm2
  db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
  movdqu      [edx+32], xmm5
  { second two char }
  pshufd      xmm3, xmm3, $0E
  pcmpeqq     xmm2, xmm2
  db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
  movdqu      [edx+48], xmm5
  add         eax, 16
  add         edx, 16 * 4
  dec         edi
  jnz         @@SimdLoop

  test        ebx, ebx
  jz          @@NoTrailing

@@HandleTrailing:
  mov         ecx, ebx

@@LegacyLoop:
  movzx     edi, word [eax]
  lea       ebx, [edi - 48]
  cmp       ebx, 9
  jbe       @@Copy
  lea       ebx, [edi - 65]
  cmp       ebx, 5
  jbe       @@Copy
  lea       ebx, [edi - 97]
  cmp       ebx, 8
  ja        @@InvalidChar
@@Copy:
  and       edi, 31
  movq      xmm0,  [esi + edi * 8]
  movq      [edx], xmm0
  add       eax, 2
  add       edx, 8
  dec       ecx
  jnz       @@LegacyLoop

@@NoTrailing:
  mov       word[edx], 0                      // NT.
@@Empty:
  mov       eax, True

@@End:
  pop        edi
  pop        esi
  pop        ebx
  ret

@@InvalidChar:
  xor         eax, eax
  jmp         @@End
end;

function HexToBinGhost(const Value: string): string;
begin
  SetLength(Result, Length(Value) * 4);
  if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then
    exit;
  raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
end;

Stefan Glienke · November 24, 2020

4 minutes ago, Mahdi Safsafi said:

The result is awesoooome !!!

Nice, now make it run on multiple cores 😅

Tom F · November 24, 2020

@David Heffernan I am surprised to see you using a pointer. I've always felt that one of Delphi's strengths was the diminished need to use pointers, since pointers introduce the risk of weird, unpredictable errors.

I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ?

Edited November 24, 2020 by Tom F

November 24, 2020

@Mahdi Safsafi I think it can be far better with SIMD

1) use such barebone to remove the extra call dependency

function HexToBinBareBone(const Value:string):string;
  procedure ToRaiseError(const Value:string);
  begin
    raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
  end;
asm    
  
  test eax,eax              //  Empty input string -> Empty result
  jz @Exit

  push edi
  //.. 
  push edx                  //  Value : string in stack , we keep this at last so we can access it with [esp]
  push eax                  //  Value : string in stack , we keep this at last so we can access it with [esp+4]  
  
  mov edx , [eax-4]         // get length
  shl edx , 2     
  mov eax , [esp+4]
  mov ecx , offset system.@UStrSetLength  
  call ecx
  mov edx , [esp+4]         // restore result pointer
  mov edx , [edx]           // edx -> result[1]
  mov eax , [esp]
  
  ..
  mov [edx],$00410042 // testing
  ..
  
  pop edx
  pop eax
  pop edi     
@Exit:  
  ret

@InvalidChar:
  pop edx
  pop eax                    // we need eax value to be the parameter passed to show the error message right
  pop edi
  jmp ToRaiseError 
end;

2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.
3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

  ja          @@Process
  { either end (null terminated) or invalid char }
  test        Word [eax + ecx *2], -1         // is NT ? => continue
  jnz         @@InvalidChar

4) You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512, couldn't run it, but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

5) When i wrote my MMX version i thought how i miss the shuffle instruction that does exist with SSEx, but when i wrote XMM i forgot to use 🙂 , so the idea instead of manipulating a unicode char better to translate the Ansi equivalent (one byte) then generate the binary bit string one byte per char too, after that unpack them using shuffle to memory then interleave the zeros, such interleaving with two unpack, three SIMD instruction that will be executed in 1 cycle.

6) To convert from hex char to decimal byte value, a more simpler and faster algorithm can performed, as to subtract the ord('0') then you will have the mask value for those with 'FF' then just AND them, compare after subtraction for letters again for first char , i think you got the idea .... at last just Or them together and you will have your decimal values.

I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

David Heffernan · November 24, 2020

2 hours ago, Tom F said:

I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ?

This code is very easy to verify that it is safe. For more complex code, that may not be the case, and so pointers could be risky.

I was using pointers here for performance reasons. Copy a bunch of characters in one go rather than character by character. Others showed much better ways to do it than my initial attempt.

Mahdi Safsafi · November 25, 2020

@Kas Ob.

Quote

2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.

Agree ! Just for clarification : what I did is promoting a legacy algorithm to use SIMD (the algorithm used in SIMD is the same used in the legacy version) just to demonstrate the benefic of going full SIMD.

Quote

3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

No !
1- The explicit version is slower than the implicit. You can check that yourself.
2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers).

Quote

You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512

No. I didn't used any AVX512 (my CPU doesn't even support it). All what I'm using is a single instruction vpgatherdq from AVX2. The rest is SSEx.

Quote

but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

You gave a link and apparently you didn't read it all 🙂

-1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit.
-2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...).

Quote

5) ... $

Nice idea. Working on byte instead of word and shuffling at the end probably will yield better result.

Quote

6) ... $

Nice idea too.

Quote

I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

Totally Ok ... Don't worry 🙂

November 25, 2020

and now, my code can be used or not? NOTE: 100.000.000 it's enought! EOUTMemory! --- Im using 2 Array dynamic :))))

Intel i7 4770K (4 x 8 cpus at stock)
RAM 16GB at 1600MHz
test in DEBUG mode

unit uFormMain;

interface

uses
  Winapi.Windows,
  Winapi.Messages,
  System.SysUtils,
  System.Variants,
  System.Classes,
  Vcl.Graphics,
  Vcl.Controls,
  Vcl.Forms,
  Vcl.Dialogs,
  Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    btn_Creating_Hex_Randoms: TButton;
    btn_Convert_HexToBin_Values: TButton;
    mmMyMemoLog: TMemo;
    edtTotalArraysToHexValues: TEdit;
    edtCreatingHexValuesRandomic: TEdit;
    Label1: TLabel;
    edtConvertingHexValueToBin: TEdit;
    btn_Showing_HexValues_In_Arrays: TButton;
    edtShowingHexValuesAndBinInArrays: TEdit;
    btn_STOP_ME_PLEASE: TButton;
    procedure btn_Creating_Hex_RandomsClick(Sender: TObject);
    procedure btn_Convert_HexToBin_ValuesClick(Sender: TObject);
    procedure FormCreate(Sender: TObject);
    procedure btn_Showing_HexValues_In_ArraysClick(Sender: TObject);
    procedure btn_STOP_ME_PLEASEClick(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

uses
  System.Diagnostics;

const
  lTotalHexGenerated: integer               = 1000000; // 1.000.000 Hex values for tests!!!
  lHexChars: array [0 .. 15] of char        = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F');
  lBinValues: array [0 .. 15] of Ansistring = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111');

var
  lArraysWithMyHexGenerated: array of string;
  lArraysWithMyBinGenerated: array of string;
  //
  lMyStopWatch: TStopwatch;
  lCounter    : integer;

function fncMyHexToBin(const lHexValue: string): string;
var
  lEachHexChar: char;
begin
  Result := '';
  //
  for lEachHexChar in lHexValue do // now, ONLY HexChars in const it should be accepted, as expected! But, Im not verifying here for while!
    //
    // now is not necessary verify the exception... for while!!!
    // try
    Result := Result + lBinValues[Pos(lEachHexChar, lHexChars) - 1];
  // except // case the "char" is not found, we have a "AV"! then.... doesnt matter for us!
  // end;
end;

procedure TForm1.btn_Creating_Hex_RandomsClick(Sender: TObject);
var
  i            : integer;
  z            : integer;
  lNewHexString: string;
begin
  lNewHexString := '';
  //
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  mmMyMemoLog.Clear;
  //
  Randomize;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for i := 0 to Pred(lTotalHexGenerated) do
    begin
      for z           := 1 to 8 do
        lNewHexString := lNewHexString + lHexChars[(Random(high(lHexChars)) + 1)];
      //
      lArraysWithMyHexGenerated[i] := lNewHexString;
      //
      lNewHexString := '';
      //
      // Memo1.Lines.Add(i.ToString + ' - ' + lArraysWithMyHexGenerated[i]);
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtCreatingHexValuesRandomic.Text := Format('[%d] was created in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

procedure TForm1.btn_Showing_HexValues_In_ArraysClick(Sender: TObject);
var
  i: integer;
begin
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  mmMyMemoLog.Clear;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for lCounter := 0 to Pred(lTotalHexGenerated) do
    begin
      mmMyMemoLog.Lines.Add(lCounter.ToString + ' - ' + lArraysWithMyHexGenerated[lCounter] + ' - ' + lArraysWithMyBinGenerated[lCounter]);
      //
      Application.ProcessMessages;
      //
      if (lCounter = lTotalHexGenerated) then // forcing a "stop!"
        exit;
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtShowingHexValuesAndBinInArrays.Text := Format('[%d] Showing Hex and Bin on MemoLog in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

procedure TForm1.btn_STOP_ME_PLEASEClick(Sender: TObject);
begin
  lCounter := lTotalHexGenerated; // forcing a "stop!"
end;

procedure TForm1.FormCreate(Sender: TObject);
begin
  edtTotalArraysToHexValues.Text := FormatFloat('###,###,###,###,##0', high(lArraysWithMyHexGenerated) + 1);
end;

procedure TForm1.btn_Convert_HexToBin_ValuesClick(Sender: TObject);
var
  i: integer;
begin
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for i := 0 to Pred(lTotalHexGenerated) do
    begin
      lArraysWithMyBinGenerated[i] := fncMyHexToBin(lArraysWithMyHexGenerated[i]);
      //
      // Memo1.Lines.Add(i.ToString + ' - ' + fncMyHexToBin(lArraysWithMyHexGenerated[i]));
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtConvertingHexValueToBin.Text := Format('[%d] HexString was converted in Bin in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

initialization

ReportMemoryLeaksOnShutdown := true;
//
SetLength(lArraysWithMyHexGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception!
//
SetLength(lArraysWithMyBinGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception!

finalization

end.

hug

Edited November 25, 2020 by Guest

November 25, 2020

6 hours ago, Mahdi Safsafi said:

No !
1- The explicit version is slower than the implicit. You can check that yourself.
2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers).

It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

But

1) The gain from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction.

2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all.

Just an idea that worth investigating.

6 hours ago, Mahdi Safsafi said:

You gave a link and apparently you didn't read it all 🙂

-1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit.
-2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...).

What i am missing here?

I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512, what CPU do you have ?

I read that but what i missed that it the xmm version of that instruction in the reference page, my bad, sorry for that.

David Heffernan · November 25, 2020

4 hours ago, emailx45 said:

my code can be used or not?

You can extend my benchmark program in earlier posts to see how it performs.

Mahdi Safsafi · November 25, 2020

3 hours ago, Kas Ob. said:

It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

But

1) The gain from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction.

2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all.

Just an idea that worth investigating.

1- I told you its slower ! Using explicit version ... Will just make things crappy.

2- The purpose of xmm is not to store general purpose register ! Doing that and inside a loop is definitely an anti-pattern because you need to save and restore back the register and you need more than one instruction to just save one register which means that you are just going to ruin your cache because xmm instruction usually are long in opcodes (prefixes/vex/evex) !

3- A possibility to negate ja @@Process to jna @@CheckInvalidCharOrNT exists and would be much friendly for the static predictor (without the need to save any register) :

  pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
  jna          @@CheckInvalidCharOrNT         // in most time jump is not taken. 
@@Process:
  ...
  
 
@@CheckInvalidCharOrNT:
  { either end (null terminated) or invalid char }
  test        Word [eax + ecx *2], -1         // is NT ? => continue
  jnz         @@InvalidChar
  jmp         @@Process

Quote

I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512

Here is the one that I used AVX2: https://www.felixcloutier.com/x86/vpgatherdq:vpgatherqq

SandyBridge does not support AVX2.

Quote

what CPU do you have ?

Not better than yours ! A Haswell !

Stefan Glienke · November 25, 2020

Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code.

https://xania.org/201602/bpu-part-one

Mahdi Safsafi · November 25, 2020

2 hours ago, Stefan Glienke said:

Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code.

https://xania.org/201602/bpu-part-one

Very interesting (specially the results). Although Intel in its guideline recommends :

Quote

Branch Prediction Optimization:

Arrange code to be consistent with the static branch prediction algorithm.

Stefan Glienke · November 25, 2020

24 minutes ago, Mahdi Safsafi said:

Branch Prediction Optimization:

Arrange code to be consistent with the static branch prediction algorithm.

Exactly - that's why I just recently rearranged some of my code from this pattern (which I personally like very much for its readability and less indention):

if not check then
  SomeErrorMessageStuff/raise/exit
Usual stuff;

to:

if check then
begin
  do usual stuff
  optionally exit
end
else
  Error stuff

With some noticable improvements.

Not only will the common part be on the fallthrough but also it avoids unnecessary register preserving when you have an error raising subroutine that will never return which the compiler does not know of.

I wish Delphi would have something like noreturn

Edited November 25, 2020 by Stefan Glienke

Mahdi Safsafi · November 26, 2020

@Stefan Glienke

Quote

(which I personally like very much for its readability and less indention)

Bad habit dies hard 🙂

Quote

With some noticable improvements.

Indeed ! Many may consider it as a premature optimization ... but in fact in real app, it should give a noticeable difference. Mostly you can't judge micro-optimization using a toy/simple benchmark because those do not tell the truth all the time.

Since you're interested in this subject ... I'm going to spin it a little bit and introduce switch/case statement.
Look how Delphi compiler generated an ugly (anti-pattern) code. And see how the same msvc compiler generated a friendly codes !

// Delphi
{$O+}
function foo(I: Integer): Integer;
begin
  case I of
    0: Exit(random(255));
    1: Exit(3);
    2: Exit(4);
    3: Exit(5);
    4: Exit(6);
    5: Exit(7);
  else
    Exit(0);
  end;
end;
// --- asm ---
{
0041C564 55               push ebp
0041C565 8BEC             mov ebp,esp
Test.dpr.14: case I of
0041C567 83F805           cmp eax,$05
0041C56A 774E             jnbe $0041c5ba
0041C56C FF248573C54100   jmp dword ptr [eax*4+$41c573]  // unaligned jmp-table
0041C573 8BC5             mov eax,ebp                    // compiler inserted jmp-table just after the branch !!!
0041C575 41               inc ecx
0041C576 0097C541009E     add [edi-$61ffbe3b],dl
0041C57C C54100           lds eax,[ecx+$00]
0041C57F A5               movsd 
0041C580 C54100           lds eax,[ecx+$00]
0041C583 AC               lodsb 
0041C584 C54100           lds eax,[ecx+$00]
0041C587 B3C5             mov bl,$c5
0041C589 41               inc ecx
0041C58A 00B8FF000000     add [eax+$000000ff],bh
0041C590 E81F82FEFF       call Random
0041C595 5D               pop ebp
0041C596 C3               ret 
Test.dpr.18: Exit(3);
0041C597 B803000000       mov eax,$00000003
0041C59C 5D               pop ebp
0041C59D C3               ret 
Test.dpr.20: Exit(4);
0041C59E B804000000       mov eax,$00000004
0041C5A3 5D               pop ebp
0041C5A4 C3               ret 
Test.dpr.22: Exit(5);
0041C5A5 B805000000       mov eax,$00000005
0041C5AA 5D               pop ebp
0041C5AB C3               ret 
Test.dpr.24: Exit(6);
0041C5AC B806000000       mov eax,$00000006
0041C5B1 5D               pop ebp
0041C5B2 C3               ret 
Test.dpr.26: Exit(7);
0041C5B3 B807000000       mov eax,$00000007
0041C5B8 5D               pop ebp
0041C5B9 C3               ret 
Test.dpr.28: Exit(0);
0041C5BA 33C0             xor eax,eax
Test.dpr.30: end;
0041C5BC 5D               pop ebp
0041C5BD C3               ret 
}

//  godbolt msvc x86
int foo(int i)
{
#define Exit(x) return x
    switch (i) {
    case 0: Exit(rand()); // just to prevent c optimization
    case 1: Exit(3);
    case 2: Exit(4);
    case 3: Exit(5);
    case 4: Exit(6);
    case 5: Exit(7);
    default: Exit(0);
    }
}
// --- asm ---
/*
int foo(int) PROC                             ; foo
        push    ebp
        mov     ebp, esp
        push    ecx
        mov     eax, DWORD PTR _i$[ebp]
        mov     DWORD PTR tv64[ebp], eax
        cmp     DWORD PTR tv64[ebp], 5
        ja      SHORT $LN10@foo
        mov     ecx, DWORD PTR tv64[ebp]
        jmp     DWORD PTR $LN12@foo[ecx*4]  // aligned jmp-table
		// fall through !!!
$LN4@foo:
        call    _rand
        jmp     SHORT $LN1@foo
$LN5@foo:
        mov     eax, 3
        jmp     SHORT $LN1@foo
$LN6@foo:
        mov     eax, 4
        jmp     SHORT $LN1@foo
$LN7@foo:
        mov     eax, 5
        jmp     SHORT $LN1@foo
$LN8@foo:
        mov     eax, 6
        jmp     SHORT $LN1@foo
$LN9@foo:
        mov     eax, 7
        jmp     SHORT $LN1@foo
$LN10@foo:
        xor     eax, eax
$LN1@foo:
        mov     esp, ebp
        pop     ebp
        ret     0
        npad    2     // alignement 
		
// compiler inseted jmp-table at the end and its aligned !: 
$LN12@foo:
        DD      $LN4@foo
        DD      $LN5@foo
        DD      $LN6@foo
        DD      $LN7@foo
        DD      $LN8@foo
        DD      $LN9@foo
int foo(int) ENDP                                 ; foo
*/

Alexander Elagin · November 26, 2020

FreePascal version of the same function in mode O1 (quick optimization):

function foo(I: Integer): Integer;
begin
  case I of
    0: Exit(random(255));
    1: Exit(3);
    2: Exit(4);
    3: Exit(5);
    4: Exit(6);
    5: Exit(7);
  else
    Exit(0);
  end;
end;

FPC_Test.lpr:50                           begin
00000001000018C0 55                       push   %rbp
00000001000018C1 4889e5                   mov    %rsp,%rbp
00000001000018C4 488d6424d0               lea    -0x30(%rsp),%rsp
00000001000018C9 894df8                   mov    %ecx,-0x8(%rbp)
FPC_Test.lpr:51                           case I of
00000001000018CC 89c8                     mov    %ecx,%eax
00000001000018CE 85c9                     test   %ecx,%ecx
00000001000018D0 0f8c6e000000             jl     0x100001944 <FOO+132>
00000001000018D6 85c0                     test   %eax,%eax
00000001000018D8 741e                     je     0x1000018f8 <FOO+56>
00000001000018DA 83e801                   sub    $0x1,%eax
00000001000018DD 7429                     je     0x100001908 <FOO+72>
00000001000018DF 83e801                   sub    $0x1,%eax
00000001000018E2 7430                     je     0x100001914 <FOO+84>
00000001000018E4 83e801                   sub    $0x1,%eax
00000001000018E7 7437                     je     0x100001920 <FOO+96>
00000001000018E9 83e801                   sub    $0x1,%eax
00000001000018EC 743e                     je     0x10000192c <FOO+108>
00000001000018EE 83e801                   sub    $0x1,%eax
00000001000018F1 7445                     je     0x100001938 <FOO+120>
00000001000018F3 eb4f                     jmp    0x100001944 <FOO+132>
00000001000018F5 0f1f00                   nopl   (%rax)
FPC_Test.lpr:52                           0: Exit(random(255));
00000001000018F8 b9ff000000               mov    $0xff,%ecx
00000001000018FD e8bed70000               callq  0x10000f0c0 <SYSTEM_$$_RANDOM$LONGINT$$LONGINT>
0000000100001902 8945f4                   mov    %eax,-0xc(%rbp)
0000000100001905 eb45                     jmp    0x10000194c <FOO+140>
0000000100001907 90                       nop
FPC_Test.lpr:53                           1: Exit(3);
0000000100001908 c745f403000000           movl   $0x3,-0xc(%rbp)
000000010000190F eb3b                     jmp    0x10000194c <FOO+140>
0000000100001911 0f1f00                   nopl   (%rax)
FPC_Test.lpr:54                           2: Exit(4);
0000000100001914 c745f404000000           movl   $0x4,-0xc(%rbp)
000000010000191B eb2f                     jmp    0x10000194c <FOO+140>
000000010000191D 0f1f00                   nopl   (%rax)
FPC_Test.lpr:55                           3: Exit(5);
0000000100001920 c745f405000000           movl   $0x5,-0xc(%rbp)
0000000100001927 eb23                     jmp    0x10000194c <FOO+140>
0000000100001929 0f1f00                   nopl   (%rax)
FPC_Test.lpr:56                           4: Exit(6);
000000010000192C c745f406000000           movl   $0x6,-0xc(%rbp)
0000000100001933 eb17                     jmp    0x10000194c <FOO+140>
0000000100001935 0f1f00                   nopl   (%rax)
FPC_Test.lpr:57                           5: Exit(7);
0000000100001938 c745f407000000           movl   $0x7,-0xc(%rbp)
000000010000193F eb0b                     jmp    0x10000194c <FOO+140>
0000000100001941 0f1f00                   nopl   (%rax)
FPC_Test.lpr:59                           Exit(0);
0000000100001944 c745f400000000           movl   $0x0,-0xc(%rbp)
000000010000194B 90                       nop
FPC_Test.lpr:61                           end;
000000010000194C 8b45f4                   mov    -0xc(%rbp),%eax
000000010000194F 90                       nop
0000000100001950 488d6500                 lea    0x0(%rbp),%rsp
0000000100001954 5d                       pop    %rbp
0000000100001955 c3                       retq

Judge it yourself

David Heffernan · November 26, 2020

13 hours ago, Stefan Glienke said:
Exactly - that's why I just recently rearranged some of my code from this pattern (which I personally like very much for its readability and less indention):
if not check then
  SomeErrorMessageStuff/raise/exit
Usual stuff;
to:
if check then
begin
  do usual stuff
  optionally exit
end
else
  Error stuff
With some noticable improvements.

Not only will the common part be on the fallthrough but also it avoids unnecessary register preserving when you have an error raising subroutine that will never return which the compiler does not know of.

I wish Delphi would have something like noreturn

This is not something we should be happy about though. The compiler should be doing this for us.

David Heffernan · November 26, 2020

On 11/25/2020 at 2:21 AM, emailx45 said:

my code can be used or not?

On 11/25/2020 at 7:19 AM, David Heffernan said:

You can extend my benchmark program in earlier posts to see how it performs.

When you do this, you find that your code is around more than 10 times slower than even the slowest versions in the benchmark, and that's before we've added code to handle errors which is currently missing from your code.

November 26, 2020

8 minutes ago, David Heffernan said:

The compiler should be doing this for us.

Can't agree more, only one problem though,

How to open ticket for the compiler to ask it for better job ? and may be to point few internet resources to read.

David Heffernan · November 26, 2020

1 hour ago, Kas Ob. said:

Can't agree more, only one problem though,

How to open ticket for the compiler to ask it for better job ? and may be to point few internet resources to read.

Submit a report to QualityPortal

November 26, 2020

1 minute ago, David Heffernan said:

Submit a report to QualityPortal

I was trying to be funny, like holding the compiler from his ear and point to some assembly code, and say bad boy !, bad boy!

Mahdi Safsafi · November 26, 2020

2 hours ago, Alexander Elagin said:

FreePascal version of the same function in mode O1 (quick optimization):

In the example you gave FPC didn't used a jump table ! instead it lowered the case statement into if statement. But adding more case, will make FPC use jump-table. I tested that my self on x86 (O4). Here is some comparison for jump-table:

MSVC && FPC:

- Both generate a friendly code for branch predictor and the fall through executes the first case.

- Both generate aligned jump-table.

- MSVC inserted the table at the function end. While FPC inserted the table in the const data region.

- The first case likely is going to be in the same cache line as the jmp instruction.

Delphi:

- Generated an ugly pattern for the branch predictor and the fall through points to random instructions(because the table was immediately inserted after the jmp instruction -decompiled data-). Things can get worse if those random instructions contain data that gets decompiled into branch !!!

- The table is not aligned ! This is just ugly and means that the data (jump location) can be split into two cache line !!!

- The first case is unlikely going to be in the same cache line as the jmp instruction.

Bravo FPC team ! 🙂

Edited November 26, 2020 by Mahdi Safsafi

Sign In

Hex2Binary

Recommended Posts

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2143

Share this post

Link to post

Lars Fosdal 1866

Share this post

Link to post

Guest

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2143

Share this post

Link to post

Tom F 85

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2448

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Guest

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2448

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2143

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Stefan Glienke 2143

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Alexander Elagin 143

Share this post

Link to post

David Heffernan 2448

Share this post

Link to post

David Heffernan 2448

Share this post

Link to post

Guest

Share this post

Link to post

David Heffernan 2448

Share this post

Link to post

Guest

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

Create an account or sign in to comment

Create an account