Jump to content

Recommended Posts

3 hours ago, Stefan Glienke said:

Especially since one of its selling points is "it compiles to native code" - if that native code is garbage for modern CPUs because its written like in the 90s that's kinda poor.

Agree ! 

One thing I noticed, people complain too much about compilation time(which could be a principal factor that influences EMB decision about optimization) ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc.

 

Edited by Mahdi Safsafi
  • Like 2

Share this post


Link to post
5 minutes ago, Mahdi Safsafi said:

One thing I noticed, people complain too much about compilation time ! I clearly understood that as a fast compilation time makes us all happy. But this shouldn't come at a cost of emitting a poor code. If someone likes a fastest compiler ... Know for sure that clients also like fastest app! A well generated code as general means things gets done quickly, for server it means less power consumption (saving bill). On mobile, friendly for battery life (happy client), ... etc.

Why can't we have both - a fast compilation generating debug friendly non optimized code and one that churns a little longer and emits those juicy optimizations.

 

Anyhow the current slowliness in the compiler comes from sloppy code in the compiler and not because it does so many amazing things.

  • Like 4

Share this post


Link to post

I certainly would not mind a Delphi for .NET - but it would need to plug into the rest of the .NET world, and not try to take it on alone.

The last time they tried, MS did not play nice, and the other company that did an Object Pascal for .NET - well, they sort of went rogue.

 

There is a lot of Azure and .NET in my future, and I believe that I will have to leave Delphi behind for C#.  

 

But - perhaps is Delphi better off as a compiler for native code on x86/x64/ARM64 on various platforms.

  • Like 1

Share this post


Link to post

The thing is we already here and and Delphi IDE does and can utilize LLVM, so why on earth they neutered it, instead of using it right !

Just was looking for ideas then found some C code and decided to check how different compiler handle loop unrolling on their own, stopped checking after seeing the result of CLang 10.0 , the generated assembly made me very angry.

Have a look for yourself https://godbolt.org/z/1cva96

 

LLVM can do magic like the above, and also can do unthinkable stuff, that should be the focus, if not for full project then just parts where developer deemed crucial, that is more useful than a linker, or shopping for $1000 of free software to bundle.

  • Like 1

Share this post


Link to post

@Stefan Glienke @Kas Ob. Got some time and tried to do full simd ... The result is awesoooome !!!
Can we go further ? yep (using ymm but this is not gonna be easy as Delphi doesn't support it yet) 

type
  TChar4 = array [0 .. 3] of Char;
  PChar4 = ^TChar4;

const
  { Source Data Format : Imm8[1:0] }
  DF_UNSIGNED_BYTES = 0;
  DF_UNSIGNED_WORDS = 1;
  DF_SIGNED_BYTES = 2;
  DF_SIGNED_WORDS = 3;
  { Aggregation Operation : Imm8[3:2] }
  AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
  AGGREGATION_OP_RANGES = 1 shl 2;
  AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
  AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
  { Polarity : Imm8[5:4] }
  POLARITY_POSITIVE = 0 shl 4;
  POLARITY_NEGATIVE = 1 shl 4;
  POLARITY_MASKED_POSITIVE = 2 shl 4;
  POLARITY_MASKED_NEGATIVE = 3 shl 4;
  { Output Selection : Imm8[6] }
  OS_LSI = 0 shl 6;
  OS_MSI = 1 shl 6;
  OS_BIT_MASK = 0 shl 6;
  OS_BYTE_WORD_MASK = 1 shl 6;

const
  [Align(16)]
  AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31);
  RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000;
  TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111', '????', '????', '????', '????', '????', '????', '????',
    '????', '????', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');

function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean;
asm
  push        ebx
  push        esi
  push        edi
  mov         ebx, ecx
  jz          @@Empty
  lea         esi, [TableData  ]
  mov         edi, ecx
  and         ebx, 7                          // trailing
  shr         edi, 3                          // number of simd-run
  jz          @@HandleTrailing                // too small to be done using simd

  movdqa      xmm0, [AndMaskData]
  movdqa      xmm1, [RangeData  ]

@@SimdLoop:
  movdqu      xmm3, [eax]
  { check if its a valid hex }
  pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
  ja          @@Process
  { either end (null terminated) or invalid char }
  test        Word [eax + ecx *2], -1         // is NT ? => continue
  jnz         @@InvalidChar
@@Process:
  pand        xmm3, xmm0                      // and each element with 31
  pxor        xmm2, xmm2
  { --- first four chars --- }
  movdqa      xmm4, xmm3
  { first two char }
  punpcklbw   xmm4, xmm2                      // unpack low data
  pcmpeqq     xmm2, xmm2                      // generate mask
  // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes
  db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
  movdqu      [edx], xmm5                     // store to result
  { second two char }
  pshufd      xmm4, xmm4, $0E                 // move next two elements to low
  pcmpeqq     xmm2, xmm2
  // gather two TChar4
  db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
  movdqu      [edx+16], xmm5

  { --- last four chars --- }
  { first two char }
  pxor        xmm2, xmm2
  punpckhbw   xmm3, xmm2                      // unpack high data
  pcmpeqq     xmm2, xmm2
  db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
  movdqu      [edx+32], xmm5
  { second two char }
  pshufd      xmm3, xmm3, $0E
  pcmpeqq     xmm2, xmm2
  db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
  movdqu      [edx+48], xmm5
  add         eax, 16
  add         edx, 16 * 4
  dec         edi
  jnz         @@SimdLoop

  test        ebx, ebx
  jz          @@NoTrailing

@@HandleTrailing:
  mov         ecx, ebx

@@LegacyLoop:
  movzx     edi, word [eax]
  lea       ebx, [edi - 48]
  cmp       ebx, 9
  jbe       @@Copy
  lea       ebx, [edi - 65]
  cmp       ebx, 5
  jbe       @@Copy
  lea       ebx, [edi - 97]
  cmp       ebx, 8
  ja        @@InvalidChar
@@Copy:
  and       edi, 31
  movq      xmm0,  [esi + edi * 8]
  movq      [edx], xmm0
  add       eax, 2
  add       edx, 8
  dec       ecx
  jnz       @@LegacyLoop

@@NoTrailing:
  mov       word[edx], 0                      // NT.
@@Empty:
  mov       eax, True

@@End:
  pop        edi
  pop        esi
  pop        ebx
  ret

@@InvalidChar:
  xor         eax, eax
  jmp         @@End
end;

function HexToBinGhost(const Value: string): string;
begin
  SetLength(Result, Length(Value) * 4);
  if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then
    exit;
  raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
end;

 

  • Like 1

Share this post


Link to post
4 minutes ago, Mahdi Safsafi said:

The result is awesoooome !!!

Nice, now make it run on multiple cores 😅

  • Haha 3

Share this post


Link to post

@David Heffernan I am surprised to see you using a pointer.  I've always felt that one of Delphi's strengths was the diminished need to use pointers, since pointers introduce the risk of weird, unpredictable errors.

I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ?

 

Edited by Tom F

Share this post


Link to post

@Mahdi Safsafi I think it can be far better with SIMD

1) use such barebone to remove the extra call dependency
 

function HexToBinBareBone(const Value:string):string;
  procedure ToRaiseError(const Value:string);
  begin
    raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
  end;
asm    
  
  test eax,eax              //  Empty input string -> Empty result
  jz @Exit

  push edi
  //.. 
  push edx                  //  Value : string in stack , we keep this at last so we can access it with [esp]
  push eax                  //  Value : string in stack , we keep this at last so we can access it with [esp+4]  
  
  mov edx , [eax-4]         // get length
  shl edx , 2     
  mov eax , [esp+4]
  mov ecx , offset system.@UStrSetLength  
  call ecx
  mov edx , [esp+4]         // restore result pointer
  mov edx , [edx]           // edx -> result[1]
  mov eax , [esp]
  
  ..
  mov [edx],$00410042 // testing
  ..
  
  pop edx
  pop eax
  pop edi     
@Exit:  
  ret

@InvalidChar:
  pop edx
  pop eax                    // we need eax value to be the parameter passed to show the error message right
  pop edi
  jmp ToRaiseError 
end;

2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.
3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

  ja          @@Process
  { either end (null terminated) or invalid char }
  test        Word [eax + ecx *2], -1         // is NT ? => continue
  jnz         @@InvalidChar

4) You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512, couldn't run it, but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here  https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

5) When i wrote my MMX version i thought how i miss the shuffle instruction that does exist with SSEx, but when i wrote XMM i forgot to use 🙂 , so the idea instead of manipulating a unicode char better to translate the Ansi equivalent (one byte) then generate the binary bit string one byte per char too, after that unpack them using shuffle to memory then interleave the zeros, such interleaving with two unpack, three SIMD instruction that will be executed in 1 cycle.

 

6) To convert from hex char to decimal byte value, a more simpler and faster algorithm can performed, as to subtract the ord('0') then you will have the mask value for those with 'FF' then just AND them, compare after subtraction for letters again for first char , i think you got the idea .... at last just Or them together and you will have your decimal values. 

I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

Share this post


Link to post
2 hours ago, Tom F said:

I'm not criticizing, but rather asking why you choose to use pointers here. Is it because you wanted to make it as fast as possible, or do you feel pointers are clearer and sufficiently safe or ?

This code is very easy to verify that it is safe. For more complex code, that may not be the case, and so pointers could be risky. 

 

I was using pointers here for performance reasons. Copy a bunch of characters in one go rather than character by character. Others showed much better ways to do it than my initial attempt. 

Share this post


Link to post

@Kas Ob. 

Quote

2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.

Agree ! Just for clarification : what I did is promoting a legacy algorithm to use SIMD (the algorithm used in SIMD is the same used in the legacy version) just to demonstrate the benefic of going full SIMD.

Quote

 3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

No !
1- The explicit version is slower than the implicit. You can check that yourself.
2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers).

Quote

You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512

No. I didn't used any AVX512 (my CPU doesn't even support it). All what I'm using is a single instruction vpgatherdq from AVX2. The rest is SSEx.

Quote

 but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here  https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

You gave a link and apparently you didn't read it all 🙂 

-1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit.
-2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...).

Quote

5) ... $

Nice idea. Working on byte instead of word and shuffling at the end probably will yield better result.

Quote

6) ... $

Nice idea too.

Quote

I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

Totally Ok ... Don't worry 🙂 

Share this post


Link to post

and now, my code can be used or not?  NOTE: 100.000.000 it's enought! EOUTMemory!  --- Im using 2 Array dynamic :))))

  • Intel i7 4770K (4 x 8 cpus at stock)
  • RAM 16GB at 1600MHz
  • test in DEBUG mode

 

image.thumb.png.8054dab947f2dc88d823b89d9c12ec68.png      image.thumb.png.7af3eb1f1299daf8968b902d1d764f38.png

 

unit uFormMain;

interface

uses
  Winapi.Windows,
  Winapi.Messages,
  System.SysUtils,
  System.Variants,
  System.Classes,
  Vcl.Graphics,
  Vcl.Controls,
  Vcl.Forms,
  Vcl.Dialogs,
  Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    btn_Creating_Hex_Randoms: TButton;
    btn_Convert_HexToBin_Values: TButton;
    mmMyMemoLog: TMemo;
    edtTotalArraysToHexValues: TEdit;
    edtCreatingHexValuesRandomic: TEdit;
    Label1: TLabel;
    edtConvertingHexValueToBin: TEdit;
    btn_Showing_HexValues_In_Arrays: TButton;
    edtShowingHexValuesAndBinInArrays: TEdit;
    btn_STOP_ME_PLEASE: TButton;
    procedure btn_Creating_Hex_RandomsClick(Sender: TObject);
    procedure btn_Convert_HexToBin_ValuesClick(Sender: TObject);
    procedure FormCreate(Sender: TObject);
    procedure btn_Showing_HexValues_In_ArraysClick(Sender: TObject);
    procedure btn_STOP_ME_PLEASEClick(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

uses
  System.Diagnostics;

const
  lTotalHexGenerated: integer               = 1000000; // 1.000.000 Hex values for tests!!!
  lHexChars: array [0 .. 15] of char        = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F');
  lBinValues: array [0 .. 15] of Ansistring = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111');

var
  lArraysWithMyHexGenerated: array of string;
  lArraysWithMyBinGenerated: array of string;
  //
  lMyStopWatch: TStopwatch;
  lCounter    : integer;

function fncMyHexToBin(const lHexValue: string): string;
var
  lEachHexChar: char;
begin
  Result := '';
  //
  for lEachHexChar in lHexValue do // now, ONLY HexChars in const it should be accepted, as expected! But, Im not verifying here for while!
    //
    // now is not necessary verify the exception... for while!!!
    // try
    Result := Result + lBinValues[Pos(lEachHexChar, lHexChars) - 1];
  // except // case the "char" is not found, we have a "AV"! then.... doesnt matter for us!
  // end;
end;

procedure TForm1.btn_Creating_Hex_RandomsClick(Sender: TObject);
var
  i            : integer;
  z            : integer;
  lNewHexString: string;
begin
  lNewHexString := '';
  //
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  mmMyMemoLog.Clear;
  //
  Randomize;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for i := 0 to Pred(lTotalHexGenerated) do
    begin
      for z           := 1 to 8 do
        lNewHexString := lNewHexString + lHexChars[(Random(high(lHexChars)) + 1)];
      //
      lArraysWithMyHexGenerated[i] := lNewHexString;
      //
      lNewHexString := '';
      //
      // Memo1.Lines.Add(i.ToString + ' - ' + lArraysWithMyHexGenerated[i]);
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtCreatingHexValuesRandomic.Text := Format('[%d] was created in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

procedure TForm1.btn_Showing_HexValues_In_ArraysClick(Sender: TObject);
var
  i: integer;
begin
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  mmMyMemoLog.Clear;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for lCounter := 0 to Pred(lTotalHexGenerated) do
    begin
      mmMyMemoLog.Lines.Add(lCounter.ToString + ' - ' + lArraysWithMyHexGenerated[lCounter] + ' - ' + lArraysWithMyBinGenerated[lCounter]);
      //
      Application.ProcessMessages;
      //
      if (lCounter = lTotalHexGenerated) then // forcing a "stop!"
        exit;
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtShowingHexValuesAndBinInArrays.Text := Format('[%d] Showing Hex and Bin on MemoLog in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

procedure TForm1.btn_STOP_ME_PLEASEClick(Sender: TObject);
begin
  lCounter := lTotalHexGenerated; // forcing a "stop!"
end;

procedure TForm1.FormCreate(Sender: TObject);
begin
  edtTotalArraysToHexValues.Text := FormatFloat('###,###,###,###,##0', high(lArraysWithMyHexGenerated) + 1);
end;

procedure TForm1.btn_Convert_HexToBin_ValuesClick(Sender: TObject);
var
  i: integer;
begin
  if (high(lArraysWithMyHexGenerated) = -1) then
  begin
    ShowMessage('lArraysWithMyHexGenerated = 0 values generated');
    exit;
  end;
  //
  lMyStopWatch := TStopwatch.StartNew;
  //
  try
    //
    for i := 0 to Pred(lTotalHexGenerated) do
    begin
      lArraysWithMyBinGenerated[i] := fncMyHexToBin(lArraysWithMyHexGenerated[i]);
      //
      // Memo1.Lines.Add(i.ToString + ' - ' + fncMyHexToBin(lArraysWithMyHexGenerated[i]));
    end;
    //
  finally
    lMyStopWatch.Stop;
    //
    edtConvertingHexValueToBin.Text := Format('[%d] HexString was converted in Bin in [%d]ms', [lTotalHexGenerated, lMyStopWatch.ElapsedMilliseconds]);
  end;
end;

initialization

ReportMemoryLeaksOnShutdown := true;
//
SetLength(lArraysWithMyHexGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception!
//
SetLength(lArraysWithMyBinGenerated, lTotalHexGenerated); // if low memory = EOutOfMemory exception!

finalization

end.

 

hug

Edited by emailx45

Share this post


Link to post
6 hours ago, Mahdi Safsafi said:

No !
1- The explicit version is slower than the implicit. You can check that yourself.
2- The explicit uses 3 register (eax,edx,ecx) ! while the implicit uses only ecx ! This can be a serious drawback on x86 (we have less registers).

It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

But 

1) The gain  from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction.

2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all.

Just an idea that worth investigating.

 

6 hours ago, Mahdi Safsafi said:

You gave a link and apparently you didn't read it all 🙂 

-1 The instructions that were spotted to cause frequency decrease are those called heavy instructions(AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit). While scalar and all 128-bit instructions are fine (do not cause frequency decrease) ! Mine doesn't use AVX512, doesn't use any FP instruction, and all are 128 bit.
-2 Keep in mind that CPU can decrease frequency even if you don't use any (SSEx/AVX/AVX512) for different reason (thermal condition, power consumption, ...).

What i am missing here?

I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512, what CPU do you have ?

I read that but what i missed that it the xmm version of that instruction in the reference page, my bad, sorry for that.

 

 

Share this post


Link to post
4 hours ago, emailx45 said:

my code can be used or not?

You can extend my benchmark program in earlier posts to see how it performs. 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×