
Mahdi Safsafi


Posts posted by Mahdi Safsafi


  1. 9 hours ago, Lars Fosdal said:

    We are making the plunge from SVN to git (on GitHub) and would like some input. Note that although there are other forms of git hosting than GitHub, we don't have a choice in the matter, so recommendations of other forms of hosting are off topic.

    GitHub just got a cool update : dark mode && discussions 🙂 


  2. 4 minutes ago, Anders Melander said:

    Easy to say when you have already enjoyed the benefit of that backward compatibility.

    Who didn't? And why can't we have both? For example, Python 2 and Python 3, or setting a deadline for an old feature, ...

    My point is that if a breaking change is good in the long run ... then we should adopt it rather than keep the ugly legacy behavior.


  3. 9 minutes ago, Stefan Glienke said:

    - the world moved on -

    Sadly we don't move.

    Quote

    "if it ain't broke it might be ok" (ok, I am exaggerating here).

    The fear of compatibility issues and breaking changes is itself an evil! An example: the Exit routine in Delphi was implemented as a function-like construct instead of a true keyword just to make some lazy developers happy! I know a ton of tools/compilers that broke compatibility for good (LLVM did this many, many times; JavaScript did as well and completely changed some of its core logic; Perl, ...). Sometimes a breaking change is a must-have and does much good in the long run. 


  4. 8 hours ago, Anders Melander said:

    Also it seems they've opted to keep the map file info separate from the application. I guess that makes sense on mobile due to the size. Not so much on desktop.

    There was something called the "PE overlay" that was popular back in the stone age, when memory was measured in MB! If someone had to bundle something (a file, a picture, ...) with an exe/dll without shipping the file separately, he used the overlay technique instead of a resource, because a resource gets mapped automatically by the loader into main memory, while an overlay gets mapped only on the developer's command (say an exception occurred: now you map the map-file from the exe on disk into memory and finally produce the log). The idea was that the developer appended the file at the end of the executable; since the size recorded in the PE headers doesn't cover that trailing data, the loader won't map the trailing (file) part into memory, and the file stays on disk. 

    Today the overlay technique is used widely in the malware industry and by PE signatures/certificates to store some metadata (BTW, Delphi uses the overlay for the signature/certificate).

    I wonder whether this also applies to mobile (I have a strong feeling that it does).
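    The mechanics are easy to sketch. Here is a minimal Python illustration (not Delphi, and just my sketch, not any shipped tool) that finds where the overlay begins in a 32-bit PE by walking the section headers; `overlay_offset` is a hypothetical helper name, and a real parser needs far more validation:

```python
import struct

def overlay_offset(pe: bytes) -> int:
    """Offset where the overlay (appended data) begins: the end of the
    last section's raw data. Everything past it is never mapped."""
    # e_lfanew: offset of the PE header, stored at 0x3C in the DOS header
    pe_off = struct.unpack_from("<I", pe, 0x3C)[0]
    assert pe[pe_off:pe_off + 4] == b"PE\0\0"
    num_sections = struct.unpack_from("<H", pe, pe_off + 6)[0]
    opt_size = struct.unpack_from("<H", pe, pe_off + 20)[0]
    sect = pe_off + 24 + opt_size               # first section header
    end = 0
    for i in range(num_sections):
        # SizeOfRawData and PointerToRawData sit at +16 of each 40-byte header
        raw_size, raw_ptr = struct.unpack_from("<II", pe, sect + 40 * i + 16)
        end = max(end, raw_ptr + raw_size)
    return end
```

    Given the file's total size, `len(pe) - overlay_offset(pe)` then tells you how many overlay bytes (map file, certificate, ...) ride along with the executable.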


  5. 13 hours ago, Stefan Glienke said:

    And on win32 those try/finally have a significant effect even worse than a heap allocation at times because they completely trash a part of the CPUs branch prediction mechanism - see RSP-27375

    Yes, that's right! I've seen your proposal as well, and I have a better one that solves its issues (the Step Over problem / replicated code) and is even slightly faster!

    To begin: the issue arises because there is a mismatch between a call and a ret instruction (a ret that doesn't correspond to a call). In your proposal, you introduced a call to fix the issue, but that also introduced the Step Over issue!

    Here is my proposal: if we jumped in without a call instruction, then we simply return without a ret instruction. How? We do a lazy stack pop (add esp, 4) to remove the return address from the stack, then jump back to that return address (jmp [esp - 4]).
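    To make the trick concrete, here is a toy Python model of that two-instruction payload (my own illustration, not part of the patch below): memory and esp are tracked separately, so you can see that the slot read by jmp [esp-4] still holds the return address even though it has logically been popped:

```python
def lazy_return(mem, esp):
    """Model of the payload:  add esp, 4  ;  jmp dword ptr [esp-4].
    The return address is removed from the logical stack without executing
    a ret, so calls and rets stay balanced for the return-address predictor."""
    esp += 4                 # add esp, 4: pop without reading
    target = mem[esp - 4]    # jmp [esp-4]: the slot just below esp is intact
    return target, esp

# a caller's "call" pushed return address 0x00401000 at stack slot 0x0018FF00
stack_memory = {0x0018FF00: 0x00401000}
target, esp = lazy_return(stack_memory, 0x0018FF00)
```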

    Program Test;
    
    {$APPTYPE CONSOLE}
    {$R *.res}
    {$O+,W-}
    
    uses
      Diagnostics, Windows;
    
    {$DEFINE PATCH_TRY_FINALLY}
    {$DEFINE REPLACE_RET_WITH_JMP}
    
    procedure Test;
    var
      i: Integer;
    begin
      i := 0;
      try
        Inc(i);
        asm
          nop
          nop
        end;
      finally
        Dec(i);
        Dec(i);
        Dec(i);
    {$IFDEF REPLACE_RET_WITH_JMP}
        {
          payload :
          ---------
          add  esp, 4      // remove return address from the stack
          jmp  [esp - 4]   // jmp back (return address)
        }
        Dec(i);
        Dec(i);
    {$ENDIF}
      end;
      if i = 0 then;
    end;
    
    procedure PatchTryFinally1(address: Pointer);
    const
      jmp: array [0 .. 14] of Byte = ($33, $C0, $5A, $59, $59, $64, $89, $10, $E8, $02, $00, $00, $00, $EB, $00);
    var
      n: NativeUInt;
      target: Pointer;
      offset: Byte;
    begin
      target := PPointer(PByte(address) + 11)^;
      offset := PByte(target) - (PByte(address) + 10) - 5;
    
      WriteProcessMemory(GetCurrentProcess, address, @jmp, SizeOf(jmp), n);
      WriteProcessMemory(GetCurrentProcess, PByte(address) + SizeOf(jmp) - 1, @offset, 1, n);
      FlushInstructionCache(GetCurrentProcess, address, SizeOf(jmp));
    end;
    
    procedure PatchTryFinally2(address: Pointer);
    const
      Data: array [0 .. 6] of Byte = ($83, $C4, $04, $FF, $64, $24, $FC);
    var
      n: NativeUInt;
    begin
      WriteProcessMemory(GetCurrentProcess, address, @Data, SizeOf(Data), n);
    end;
    
    procedure PatchTryFinally(address: Pointer);
    begin
    {$IFDEF REPLACE_RET_WITH_JMP}
      PatchTryFinally2(PByte(@Test) + $32);
    {$ELSE}
      PatchTryFinally1(PByte(@Test) + 26);
    {$ENDIF}
    end;
    
    var
      i: Integer;
      sw: TStopwatch;
    
    begin
    {$IFDEF PATCH_TRY_FINALLY}
      PatchTryFinally(PByte(@Test));
    {$ENDIF}
      sw := TStopwatch.StartNew;
      Sleep(1);
      sw.ElapsedMilliseconds;
    
      sw := TStopwatch.StartNew;
      for i := 1 to 100000000 do
        Test;
    
      Writeln(sw.ElapsedMilliseconds);
      Readln;
    end.

     

    • Like 4

  6. 2 hours ago, Alexander Elagin said:

    FreePascal version of the same function in mode O1 (quick optimization):

    In the example you gave, FPC didn't use a jump table! Instead it lowered the case statement into if statements. But adding more cases makes FPC use a jump table. I tested that myself on x86 (O4). Here is a comparison of the jump tables:

    MSVC && FPC:

    - Both generate branch-predictor-friendly code, and the fall-through executes the first case.

    - Both generate an aligned jump table.

    - MSVC inserted the table at the end of the function, while FPC inserted it in the const data region.

    - The first case is likely to be in the same cache line as the jmp instruction. 

    Delphi:

    - Generated an ugly pattern for the branch predictor, and the fall-through points to random instructions (because the table was inserted immediately after the jmp instruction and gets decoded as code). Things get worse if those bytes happen to decode into branches!!!

    - The table is not aligned! This is just ugly and means that an entry (jump target) can be split across two cache lines!!!

    - The first case is unlikely to be in the same cache line as the jmp instruction. 

    Bravo FPC  team ! 🙂 
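    A quick back-of-the-envelope check of the unaligned-table claim (my own sketch, assuming 64-byte cache lines and using an unaligned address of the kind Delphi emits, $41C573, versus a hypothetical aligned placement):

```python
CACHE_LINE = 64

def straddles(addr: int, size: int) -> bool:
    """True if the object at [addr, addr+size) crosses a cache-line boundary."""
    return addr // CACHE_LINE != (addr + size - 1) // CACHE_LINE

unaligned_table = 0x41C573   # table emitted right after the jmp, not aligned
aligned_table   = 0x41C580   # what a 64-byte-aligned table would look like

# which 4-byte entries of a 6-entry jump table get split across two lines?
split = [i for i in range(6) if straddles(unaligned_table + 4 * i, 4)]
```

    With the unaligned placement, one entry lands across a cache-line boundary, so taking that case costs two line fills; the aligned table never splits an entry.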

    • Like 1

  7. @Stefan Glienke

    Quote

    (which I personally like very much for its readability and less indention)

    Bad habits die hard 🙂

    Quote

    With some noticable improvements.

    Indeed! Many may consider it premature optimization ... but in a real app it should give a noticeable difference. Mostly you can't judge a micro-optimization using a toy/simple benchmark, because those do not tell the truth all the time.

    Since you're interested in this subject ... I'm going to spin it a little and introduce the switch/case statement.
    Look how the Delphi compiler generated ugly (anti-pattern) code, and see how the MSVC compiler generated friendly code for the same function!
     

    // Delphi
    {$O+}
    function foo(I: Integer): Integer;
    begin
      case I of
        0: Exit(random(255));
        1: Exit(3);
        2: Exit(4);
        3: Exit(5);
        4: Exit(6);
        5: Exit(7);
      else
        Exit(0);
      end;
    end;
    // --- asm ---
    {
    0041C564 55               push ebp
    0041C565 8BEC             mov ebp,esp
    Test.dpr.14: case I of
    0041C567 83F805           cmp eax,$05
    0041C56A 774E             jnbe $0041c5ba
    0041C56C FF248573C54100   jmp dword ptr [eax*4+$41c573]  // unaligned jmp-table
    0041C573 8BC5             mov eax,ebp                    // compiler inserted jmp-table just after the branch !!!
    0041C575 41               inc ecx
    0041C576 0097C541009E     add [edi-$61ffbe3b],dl
    0041C57C C54100           lds eax,[ecx+$00]
    0041C57F A5               movsd 
    0041C580 C54100           lds eax,[ecx+$00]
    0041C583 AC               lodsb 
    0041C584 C54100           lds eax,[ecx+$00]
    0041C587 B3C5             mov bl,$c5
    0041C589 41               inc ecx
    0041C58A 00B8FF000000     add [eax+$000000ff],bh
    0041C590 E81F82FEFF       call Random
    0041C595 5D               pop ebp
    0041C596 C3               ret 
    Test.dpr.18: Exit(3);
    0041C597 B803000000       mov eax,$00000003
    0041C59C 5D               pop ebp
    0041C59D C3               ret 
    Test.dpr.20: Exit(4);
    0041C59E B804000000       mov eax,$00000004
    0041C5A3 5D               pop ebp
    0041C5A4 C3               ret 
    Test.dpr.22: Exit(5);
    0041C5A5 B805000000       mov eax,$00000005
    0041C5AA 5D               pop ebp
    0041C5AB C3               ret 
    Test.dpr.24: Exit(6);
    0041C5AC B806000000       mov eax,$00000006
    0041C5B1 5D               pop ebp
    0041C5B2 C3               ret 
    Test.dpr.26: Exit(7);
    0041C5B3 B807000000       mov eax,$00000007
    0041C5B8 5D               pop ebp
    0041C5B9 C3               ret 
    Test.dpr.28: Exit(0);
    0041C5BA 33C0             xor eax,eax
    Test.dpr.30: end;
    0041C5BC 5D               pop ebp
    0041C5BD C3               ret 
    }
    
    //  godbolt msvc x86
    int foo(int i)
    {
    #define Exit(x) return x
        switch (i) {
        case 0: Exit(rand()); // just to prevent c optimization
        case 1: Exit(3);
        case 2: Exit(4);
        case 3: Exit(5);
        case 4: Exit(6);
        case 5: Exit(7);
        default: Exit(0);
        }
    }
    // --- asm ---
    /*
    int foo(int) PROC                             ; foo
            push    ebp
            mov     ebp, esp
            push    ecx
            mov     eax, DWORD PTR _i$[ebp]
            mov     DWORD PTR tv64[ebp], eax
            cmp     DWORD PTR tv64[ebp], 5
            ja      SHORT $LN10@foo
            mov     ecx, DWORD PTR tv64[ebp]
            jmp     DWORD PTR $LN12@foo[ecx*4]  // aligned jmp-table
    		// fall through !!!
    $LN4@foo:
            call    _rand
            jmp     SHORT $LN1@foo
    $LN5@foo:
            mov     eax, 3
            jmp     SHORT $LN1@foo
    $LN6@foo:
            mov     eax, 4
            jmp     SHORT $LN1@foo
    $LN7@foo:
            mov     eax, 5
            jmp     SHORT $LN1@foo
    $LN8@foo:
            mov     eax, 6
            jmp     SHORT $LN1@foo
    $LN9@foo:
            mov     eax, 7
            jmp     SHORT $LN1@foo
    $LN10@foo:
            xor     eax, eax
    $LN1@foo:
            mov     esp, ebp
            pop     ebp
            ret     0
        npad    2     // alignment 
    		
// compiler inserted the jmp-table at the end, and it's aligned!: 
    $LN12@foo:
            DD      $LN4@foo
            DD      $LN5@foo
            DD      $LN6@foo
            DD      $LN7@foo
            DD      $LN8@foo
            DD      $LN9@foo
    int foo(int) ENDP                                 ; foo
    */

     


  8. 2 hours ago, Stefan Glienke said:

    Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code.

    https://xania.org/201602/bpu-part-one

    Very interesting (especially the results). Although Intel's guidelines recommend:

    Quote

    Branch Prediction Optimization:

    Arrange code to be consistent with the static branch prediction algorithm.

     


  9. 3 hours ago, Kas Ob. said:

    It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

    But 

    1) The gain  from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction.

    2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all.

    Just an idea that worth investigating.

     

    1- I told you it's slower! Using the explicit version will just make things crappy. 

    2- The purpose of xmm registers is not to store general-purpose registers! Doing that inside a loop is definitely an anti-pattern, because you need to save and then restore the register, and you need more than one instruction just to save a single register, which means you ruin your instruction cache, since xmm instructions usually have long opcodes (prefixes/VEX/EVEX)!

    3- There is a possibility to negate ja @@Process into jna @@CheckInvalidCharOrNT, which would be much friendlier to the static predictor (without the need to save any register):

      pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
      jna          @@CheckInvalidCharOrNT         // in most time jump is not taken. 
    @@Process:
      ...
      
     
    @@CheckInvalidCharOrNT:
      { either end (null terminated) or invalid char }
      test        Word [eax + ecx *2], -1         // is NT ? => continue
      jnz         @@InvalidChar
      jmp         @@Process

     

    Quote

    I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512

    Here is the one that I used, from AVX2: https://www.felixcloutier.com/x86/vpgatherdq:vpgatherqq

    SandyBridge does not support AVX2.

    Quote

    what CPU do you have ?

    Not better than yours ! A Haswell !

     


  10. @Kas Ob. 

    Quote

    2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.

    Agreed! Just for clarification: what I did was promote a legacy algorithm to SIMD (the SIMD version uses the same algorithm as the legacy version) just to demonstrate the benefit of going full SIMD.

    Quote

     3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

    No!
    1- The explicit version is slower than the implicit one. You can check that yourself.
    2- The explicit version uses 3 registers (eax, edx, ecx) while the implicit uses only ecx! This can be a serious drawback on x86 (we have fewer registers).

    Quote

    You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512

    No, I didn't use any AVX512 (my CPU doesn't even support it). All I'm using is a single AVX2 instruction, vpgatherdq. The rest is SSEx.

    Quote

     but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here  https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

    You gave a link and apparently didn't read it all 🙂 

    -1 The instructions that were spotted causing a frequency decrease are the so-called heavy instructions (AVX512 instructions, some 256-bit instructions, and those that execute on the FP unit), while scalar and all 128-bit instructions are fine (they do not cause a frequency decrease)! Mine uses no AVX512, no FP instructions, and everything is 128-bit.
    -2 Keep in mind that the CPU can decrease its frequency even if you don't use any SSEx/AVX/AVX512, for different reasons (thermal conditions, power consumption, ...).

    Quote

    5) ... $

    Nice idea. Working on bytes instead of words and shuffling at the end would probably yield a better result.

    Quote

    6) ... $

    Nice idea too.

    Quote

    I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

    Totally Ok ... Don't worry 🙂 


  11. @Stefan Glienke @Kas Ob. Got some time and tried to go full SIMD ... The result is awesoooome!!!
    Can we go further? Yep (using ymm, but that's not going to be easy, as Delphi doesn't support it yet) 

    type
      TChar4 = array [0 .. 3] of Char;
      PChar4 = ^TChar4;
    
    const
      { Source Data Format : Imm8[1:0] }
      DF_UNSIGNED_BYTES = 0;
      DF_UNSIGNED_WORDS = 1;
      DF_SIGNED_BYTES = 2;
      DF_SIGNED_WORDS = 3;
      { Aggregation Operation : Imm8[3:2] }
      AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
      AGGREGATION_OP_RANGES = 1 shl 2;
      AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
      AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
      { Polarity : Imm8[5:4] }
      POLARITY_POSITIVE = 0 shl 4;
      POLARITY_NEGATIVE = 1 shl 4;
      POLARITY_MASKED_POSITIVE = 2 shl 4;
      POLARITY_MASKED_NEGATIVE = 3 shl 4;
      { Output Selection : Imm8[6] }
      OS_LSI = 0 shl 6;
      OS_MSI = 1 shl 6;
      OS_BIT_MASK = 0 shl 6;
      OS_BYTE_WORD_MASK = 1 shl 6;
    
    const
      [Align(16)]
      AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31);
      RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000;
      TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111', '????', '????', '????', '????', '????', '????', '????',
        '????', '????', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
    
    function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean;
    asm
      push        ebx
      push        esi
      push        edi
      mov         ebx, ecx
      test        ecx, ecx                        // Len = 0 ?
      jz          @@Empty
      lea         esi, [TableData  ]
      mov         edi, ecx
      and         ebx, 7                          // trailing
      shr         edi, 3                          // number of simd-run
      jz          @@HandleTrailing                // too small to be done using simd
    
      movdqa      xmm0, [AndMaskData]
      movdqa      xmm1, [RangeData  ]
    
    @@SimdLoop:
      movdqu      xmm3, [eax]
      { check if its a valid hex }
      pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
      ja          @@Process
      { either end (null terminated) or invalid char }
      test        Word [eax + ecx *2], -1         // is NT ? => continue
      jnz         @@InvalidChar
    @@Process:
      pand        xmm3, xmm0                      // and each element with 31
      pxor        xmm2, xmm2
      { --- first four chars --- }
      movdqa      xmm4, xmm3
      { first two char }
      punpcklbw   xmm4, xmm2                      // unpack low data
      pcmpeqq     xmm2, xmm2                      // generate mask
      // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes
      db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
      movdqu      [edx], xmm5                     // store to result
      { second two char }
      pshufd      xmm4, xmm4, $0E                 // move next two elements to low
      pcmpeqq     xmm2, xmm2
      // gather two TChar4
      db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
      movdqu      [edx+16], xmm5
    
      { --- last four chars --- }
      { first two char }
      pxor        xmm2, xmm2
      punpckhbw   xmm3, xmm2                      // unpack high data
      pcmpeqq     xmm2, xmm2
      db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
      movdqu      [edx+32], xmm5
      { second two char }
      pshufd      xmm3, xmm3, $0E
      pcmpeqq     xmm2, xmm2
      db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
      movdqu      [edx+48], xmm5
      add         eax, 16
      add         edx, 16 * 4
      dec         edi
      jnz         @@SimdLoop
    
      test        ebx, ebx
      jz          @@NoTrailing
    
    @@HandleTrailing:
      mov         ecx, ebx
    
    @@LegacyLoop:
      movzx     edi, word [eax]
      lea       ebx, [edi - 48]
      cmp       ebx, 9
      jbe       @@Copy
      lea       ebx, [edi - 65]
      cmp       ebx, 5
      jbe       @@Copy
      lea       ebx, [edi - 97]
      cmp       ebx, 5                            // 'a'..'f' only
      ja        @@InvalidChar
    @@Copy:
      and       edi, 31
      movq      xmm0,  [esi + edi * 8]
      movq      [edx], xmm0
      add       eax, 2
      add       edx, 8
      dec       ecx
      jnz       @@LegacyLoop
    
    @@NoTrailing:
      mov       word[edx], 0                      // NT.
    @@Empty:
      mov       eax, True
    
    @@End:
      pop        edi
      pop        esi
      pop        ebx
      ret
    
    @@InvalidChar:
      xor         eax, eax
      jmp         @@End
    end;
    
    function HexToBinGhost(const Value: string): string;
    begin
      SetLength(Result, Length(Value) * 4);
      if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then
        exit;
      raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
    end;
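    The reason TableData can be indexed with `and 31` is that '0'..'9', 'a'..'f' and 'A'..'F' all collapse into 0..25 under that mask ('0' and 31 = 16, 'a' and 31 = 'A' and 31 = 1). Here is a small Python sketch of the same table trick (my illustration mirroring the Delphi table above, not the SIMD code):

```python
# indices 0 and 7..15 are never hit for valid hex digits
TABLE = (['????', '1010', '1011', '1100', '1101', '1110', '1111']  # 'a'..'f' at 1..6
         + ['????'] * 9                                            # unused 7..15
         + ['0000', '0001', '0010', '0011', '0100',                # '0'..'9' at 16..25
            '0101', '0110', '0111', '1000', '1001'])

def hex_to_bin(s: str) -> str:
    """Expand each hex digit to its 4-bit binary string via ord(c) & 31."""
    if not all(c in '0123456789abcdefABCDEF' for c in s):
        raise ValueError(f"Invalid hex digit found in '{s}'")
    return ''.join(TABLE[ord(c) & 31] for c in s)
```

    One table lookup per character, no per-digit branching on the three ranges: that is exactly what the `pand xmm3, xmm0` (and with 31) plus gather does, eight characters at a time.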

     

    • Like 1

  12. 3 hours ago, Stefan Glienke said:

    Especially since one of its selling points is "it compiles to native code" - if that native code is garbage for modern CPUs because its written like in the 90s that's kinda poor.

    Agree ! 

    One thing I noticed: people complain a lot about compilation time (which could be a principal factor influencing EMB's decisions about optimization)! I clearly understand that, as fast compilation makes us all happy. But it shouldn't come at the cost of emitting poor code. If someone likes the fastest compiler ... know for sure that clients also like the fastest app! Well-generated code generally means things get done quickly; for servers it means less power consumption (saving on the bill); on mobile it's friendly to battery life (happy client), ... etc.

     

    • Like 1

  13. 11 hours ago, Mike Torrettinni said:

    No need to go beyond Delphi's capabilities. This whole topic was great to follow, all this knowledge! 🙂

    Of course! I'm just going to underline a serious problem. In the past, no one bothered to understand optimization (even great developers/companies didn't), because CPUs were evolving so fast and each new generation beat the previous one by a large margin (people were just upgrading their CPUs and seeing performance x2, x3). But today we are reaching a dead end (Moore's Law), and the difference between a new generation and the previous one isn't really significant! So the effort is shifting from CPUs to compilers/parallel programming. In the past few years, we have seen the emergence of LLVM as a powerful compiler infrastructure, and the big whales have started to collaborate with each other more than ever.

    In a nutshell, tomorrow's problem is optimization. The way many have chosen to deal with it is improving the compiler. And I really think that Delphi should join this race ASAP. 

    • Like 5

  14. @Stefan Glienke SIMD instructions are very powerful, but they come with their own issues (portability, alignment for some instructions, ...). Some compilers have great facilities to hide those issues, but neither the Delphi compiler nor the RTL helps (we don't even have AVX) 😥 I really wish that at some point Delphi would support SIMD through intrinsics or some vector types.

    • Like 1

  15. Partially SIMDed, without trailing handling ... and it beats them all 🙂

    const
      { Source Data Format : Imm8[1:0] }
      DF_UNSIGNED_BYTES = 0;
      DF_UNSIGNED_WORDS = 1;
      DF_SIGNED_BYTES = 2;
      DF_SIGNED_WORDS = 3;
      { Aggregation Operation : Imm8[3:2] }
      AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
      AGGREGATION_OP_RANGES = 1 shl 2;
      AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
      AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
      { Polarity : Imm8[5:4] }
      POLARITY_POSITIVE = 0 shl 4;
      POLARITY_NEGATIVE = 1 shl 4;
      POLARITY_MASKED_POSITIVE = 2 shl 4;
      POLARITY_MASKED_NEGATIVE = 3 shl 4;
      { Output Selection : Imm8[6] }
      OS_LSI = 0 shl 6;
      OS_MSI = 1 shl 6;
      OS_BIT_MASK = 0 shl 6;
      OS_BYTE_WORD_MASK = 1 shl 6;
    
    const
      [Align(16)]
      Range: array [0 .. 7] of Char = '09afAF' + #00;
    
    function IsValidHex(P: Pointer): Boolean;
    asm
      movdqa    xmm1, [Range]
      sub       eax, 16
    @@SimdLoop:
      add       eax, 16
      movdqu    xmm2, [eax]
      pcmpistri xmm1, xmm2, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE
      ja        @@SimdLoop
      test      Word [eax + ecx *2], -1
      setz      al
    end;
    
    function HexToBinMahdiOneShot(const HexValue: string): string;
    type
      TChar4 = array [0 .. 3] of Char;
      PChar4 = ^TChar4;
    const
      Table: array [0 .. 25] of TChar4 = ('0000', '1010', '1011', '1100', '1101', '1110', '1111', '0000', '0000', '0000', '0000', '0000', '0000', '0000', '0000',
        '0000', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
    var
      P: PChar4;
      I, Len: Integer;
    begin
      SetLength(Result, Length(HexValue) * 4);
      P := PChar4(Result);
      if IsValidHex(Pointer(HexValue)) then
      begin
        Len := Length(HexValue);
        for I := 1 to Len do
        begin
          P^ := Table[Ord(HexValue[I]) and 31];
          Inc(P);
        end;
      end
      else
        raise EConvertError.CreateFmt('Invalid hex : %s', [HexValue]);
    end;

     

    • Like 1

  16. 3 hours ago, David Heffernan said:

    My benchmarking suggests that point 1 has no impact on performance, but point 2 does.

    A benchmark does not always tell the full story 🙂 

    Unlike my code, yours isn't cache friendly:

    const
      BinaryValues: array [0..15] of string = (
        '0000', '0001', '0010', '0011',
        '0100', '0101', '0110', '0111',
        '1000', '1001', '1010', '1011',
        '1100', '1101', '1110', '1111'
      );

    Let's see what happens on x64:

    You need 16 * SizeOf(Pointer) to store the string references = 16 * 8 = 128 bytes = 2 cache lines.

    You need 16 * (4 * SizeOf(Char) + 2 bytes (null terminator) + SizeOf(StrRec)) to store the data = 16 * (4 * 2 + 2 + 16) = 416 bytes = 7 cache lines.

    N.B.: I assumed the compiler did a great job and placed your data in a contiguous region.

    --------------------------------------------------------

    You end up consuming 416 + 128 = 544 bytes.

    You end up consuming 9 cache lines.

    type
      TChar4 = array[0..3] of Char;
      PChar4 = ^TChar4;
    const
      Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
      Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');

    I only need 16 * (4 * SizeOf(Char)) to store the data = 16 * (4 * 2) = 128 bytes = 2 cache lines.
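    The arithmetic above is easy to double-check (a sketch assuming, as the post does, 64-byte cache lines and line-aligned contiguous placement):

```python
import math

CACHE_LINE = 64

def cache_lines(nbytes: int) -> int:
    """Cache lines covered by a contiguous, line-aligned block of nbytes."""
    return math.ceil(nbytes / CACHE_LINE)

# string-table version: 16 pointers + 16 heap strings
# (4 chars * 2 bytes + 2-byte terminator + 16-byte StrRec header each)
ptr_bytes  = 16 * 8                   # 128
data_bytes = 16 * (4 * 2 + 2 + 16)    # 416

# inline TChar4 tables: only the character data itself
inline_bytes = 16 * (4 * 2)           # 128
```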

     

    • Like 3

  17. 18 minutes ago, Attila Kovacs said:

    Too bad there is no multiple inheritance.

    multiple inheritance = multiple trouble 

    Quote

    Is it possible to hook a constructor on startup?

    Yep, if you install the hook before the object gets created.
