
Mahdi Safsafi


Posts posted by Mahdi Safsafi


  1. 9 hours ago, Lars Fosdal said:

    We are making the plunge from SVN to git (on GitHub) and would like some input. Note that although there are other forms of git hosting than GitHub, we don't have a choice in the matter, so recommendations of other forms of hosting are off topic.

    GitHub just got a cool update : dark mode && discussions 🙂 


  2. 4 minutes ago, Anders Melander said:

    Easy to say when you have already enjoyed the benefit of that backward compatibility.

    Who didn't? And why can't we have both? For example, Python 2 and Python 3, or setting a deadline for an old feature, ...

    My point is that if a breaking change is good in the long run ... then we should adopt it rather than keep the ugly legacy behavior.


  3. 9 minutes ago, Stefan Glienke said:

    - the world moved on -

    Sadly we don't move.

    Quote

    "if it ain't broke it might be ok" (ok, I am exaggerating here).

    The fear of compatibility issues and breaking changes is itself an evil! An example: the Exit routine in Delphi was implemented as a function-like construct instead of a true keyword just to make some lazy developers happy! I know a ton of tools/compilers that broke compatibility for good (LLVM did this many, many times; JavaScript did as well and completely changed some of its core logic; Perl, ...). Sometimes a breaking change is a must-have and does much good in the long run. 


  4. 8 hours ago, Anders Melander said:

    Also it seems they've opted to keep the map file info separate from the application. I guess that makes sense on mobile due to the size. Not so much on desktop.

    There was something called the "PE overlay" that was popular back in the stone age, when memory was measured in MB! If someone had to bundle something (a file, a picture, ...) with an exe/dll without shipping the file separately, he used the overlay technique instead of a resource, because a resource gets mapped automatically by the loader into main memory, while an overlay gets mapped only on the developer's command (say an exception occurred: now you map the map-file from the exe on disk into memory and finally produce the log). The idea was that the developer appended the file at the end of the executable; since the size recorded in the PE headers doesn't cover that trailing data, the loader won't map the trailing (file) part into memory, and the file stays on disk. 

    Today the overlay technique is used widely in the malware industry and by PE signatures/certificates to store some metadata (BTW, Delphi uses the overlay for the signature/certificate).

    I wonder whether this also applies to mobile (I have a strong feeling that it does).
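    The mechanics are easy to sketch. Here is a minimal Python illustration (not Delphi, and just my sketch, not any shipped tool) that finds where the overlay begins in a 32-bit PE by walking the section headers; `overlay_offset` is a hypothetical helper name, and a real parser needs far more validation:

```python
import struct

def overlay_offset(pe: bytes) -> int:
    """Offset where the overlay (appended data) begins: the end of the
    last section's raw data. Everything past it is never mapped."""
    # e_lfanew: offset of the PE header, stored at 0x3C in the DOS header
    pe_off = struct.unpack_from("<I", pe, 0x3C)[0]
    assert pe[pe_off:pe_off + 4] == b"PE\0\0"
    num_sections = struct.unpack_from("<H", pe, pe_off + 6)[0]
    opt_size = struct.unpack_from("<H", pe, pe_off + 20)[0]
    sect = pe_off + 24 + opt_size               # first section header
    end = 0
    for i in range(num_sections):
        # SizeOfRawData and PointerToRawData sit at +16 of each 40-byte header
        raw_size, raw_ptr = struct.unpack_from("<II", pe, sect + 40 * i + 16)
        end = max(end, raw_ptr + raw_size)
    return end
```

    Given the file's total size, `len(pe) - overlay_offset(pe)` then tells you how many overlay bytes (map file, certificate, ...) ride along with the executable.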


  5. 13 hours ago, Stefan Glienke said:

    And on win32 those try/finally have a significant effect even worse than a heap allocation at times because they completely trash a part of the CPUs branch prediction mechanism - see RSP-27375

    Yes, that's right! I've seen your proposal as well, and I have a better one that solves its issues (the Step Over problem / replicated code) and is even slightly faster!

    To begin: the issue arises because there is a mismatch between a call and a ret instruction (a ret that doesn't correspond to a call). In your proposal, you introduced a call to fix the issue, but that also introduced the Step Over issue!

    Here is my proposal: if we jumped in without a call instruction, then we simply return without a ret instruction. How? We do a lazy stack pop (add esp, 4) to remove the return address from the stack, then jump back to that return address (jmp [esp - 4]).
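    To make the trick concrete, here is a toy Python model of that two-instruction payload (my own illustration, not part of the patch below): memory and esp are tracked separately, so you can see that the slot read by jmp [esp-4] still holds the return address even though it has logically been popped:

```python
def lazy_return(mem, esp):
    """Model of the payload:  add esp, 4  ;  jmp dword ptr [esp-4].
    The return address is removed from the logical stack without executing
    a ret, so calls and rets stay balanced for the return-address predictor."""
    esp += 4                 # add esp, 4: pop without reading
    target = mem[esp - 4]    # jmp [esp-4]: the slot just below esp is intact
    return target, esp

# a caller's "call" pushed return address 0x00401000 at stack slot 0x0018FF00
stack_memory = {0x0018FF00: 0x00401000}
target, esp = lazy_return(stack_memory, 0x0018FF00)
```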

    Program Test;
    
    {$APPTYPE CONSOLE}
    {$R *.res}
    {$O+,W-}
    
    uses
      Diagnostics, Windows;
    
    {$DEFINE PATCH_TRY_FINALLY}
    {$DEFINE REPLACE_RET_WITH_JMP}
    
    procedure Test;
    var
      i: Integer;
    begin
      i := 0;
      try
        Inc(i);
        asm
          nop
          nop
        end;
      finally
        Dec(i);
        Dec(i);
        Dec(i);
    {$IFDEF REPLACE_RET_WITH_JMP}
        {
          payload :
          ---------
          add  esp, 4      // remove return address from the stack
          jmp  [esp - 4]   // jmp back (return address)
        }
        Dec(i);
        Dec(i);
    {$ENDIF}
      end;
      if i = 0 then;
    end;
    
    procedure PatchTryFinally1(address: Pointer);
    const
      jmp: array [0 .. 14] of Byte = ($33, $C0, $5A, $59, $59, $64, $89, $10, $E8, $02, $00, $00, $00, $EB, $00);
    var
      n: NativeUInt;
      target: Pointer;
      offset: Byte;
    begin
      target := PPointer(PByte(address) + 11)^;
      offset := PByte(target) - (PByte(address) + 10) - 5;
    
      WriteProcessMemory(GetCurrentProcess, address, @jmp, SizeOf(jmp), n);
      WriteProcessMemory(GetCurrentProcess, PByte(address) + SizeOf(jmp) - 1, @offset, 1, n);
      FlushInstructionCache(GetCurrentProcess, address, SizeOf(jmp));
    end;
    
    procedure PatchTryFinally2(address: Pointer);
    const
      Data: array [0 .. 6] of Byte = ($83, $C4, $04, $FF, $64, $24, $FC);
    var
      n: NativeUInt;
    begin
      WriteProcessMemory(GetCurrentProcess, address, @Data, SizeOf(Data), n);
    end;
    
    procedure PatchTryFinally(address: Pointer);
    begin
    {$IFDEF REPLACE_RET_WITH_JMP}
      PatchTryFinally2(PByte(@Test) + $32);
    {$ELSE}
      PatchTryFinally1(PByte(@Test) + 26);
    {$ENDIF}
    end;
    
    var
      i: Integer;
      sw: TStopwatch;
    
    begin
    {$IFDEF PATCH_TRY_FINALLY}
      PatchTryFinally(PByte(@Test));
    {$ENDIF}
      sw := TStopwatch.StartNew;
      Sleep(1);
      sw.ElapsedMilliseconds;
    
      sw := TStopwatch.StartNew;
      for i := 1 to 100000000 do
        Test;
    
      Writeln(sw.ElapsedMilliseconds);
      Readln;
    end.

     

    • Like 4

  6. 2 hours ago, Alexander Elagin said:

    FreePascal version of the same function in mode O1 (quick optimization):

    In the example you gave, FPC didn't use a jump table! Instead it lowered the case statement into if statements. But adding more cases makes FPC use a jump table. I tested that myself on x86 (O4). Here is a comparison of the jump tables:

    MSVC && FPC:

    - Both generate branch-predictor-friendly code, and the fall-through executes the first case.

    - Both generate an aligned jump table.

    - MSVC inserted the table at the end of the function, while FPC inserted it in the const data region.

    - The first case is likely to be in the same cache line as the jmp instruction. 

    Delphi:

    - Generated an ugly pattern for the branch predictor, and the fall-through points to random instructions (because the table was inserted immediately after the jmp instruction and gets decoded as code). Things get worse if those bytes happen to decode into branches!!!

    - The table is not aligned! This is just ugly and means that an entry (jump target) can be split across two cache lines!!!

    - The first case is unlikely to be in the same cache line as the jmp instruction. 

    Bravo FPC  team ! 🙂 
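    A quick back-of-the-envelope check of the unaligned-table claim (my own sketch, assuming 64-byte cache lines and using an unaligned address of the kind Delphi emits, $41C573, versus a hypothetical aligned placement):

```python
CACHE_LINE = 64

def straddles(addr: int, size: int) -> bool:
    """True if the object at [addr, addr+size) crosses a cache-line boundary."""
    return addr // CACHE_LINE != (addr + size - 1) // CACHE_LINE

unaligned_table = 0x41C573   # table emitted right after the jmp, not aligned
aligned_table   = 0x41C580   # what a 64-byte-aligned table would look like

# which 4-byte entries of a 6-entry jump table get split across two lines?
split = [i for i in range(6) if straddles(unaligned_table + 4 * i, 4)]
```

    With the unaligned placement, one entry lands across a cache-line boundary, so taking that case costs two line fills; the aligned table never splits an entry.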

    • Like 1

  7. @Stefan Glienke

    Quote

    (which I personally like very much for its readability and less indention)

    Bad habits die hard 🙂

    Quote

    With some noticable improvements.

    Indeed! Many may consider it premature optimization ... but in a real app it should give a noticeable difference. Mostly you can't judge a micro-optimization using a toy/simple benchmark, because those do not tell the truth all the time.

    Since you're interested in this subject ... I'm going to spin it a little and introduce the switch/case statement.
    Look how the Delphi compiler generated ugly (anti-pattern) code, and see how the MSVC compiler generated friendly code for the same function!
     

    // Delphi
    {$O+}
    function foo(I: Integer): Integer;
    begin
      case I of
        0: Exit(random(255));
        1: Exit(3);
        2: Exit(4);
        3: Exit(5);
        4: Exit(6);
        5: Exit(7);
      else
        Exit(0);
      end;
    end;
    // --- asm ---
    {
    0041C564 55               push ebp
    0041C565 8BEC             mov ebp,esp
    Test.dpr.14: case I of
    0041C567 83F805           cmp eax,$05
    0041C56A 774E             jnbe $0041c5ba
    0041C56C FF248573C54100   jmp dword ptr [eax*4+$41c573]  // unaligned jmp-table
    0041C573 8BC5             mov eax,ebp                    // compiler inserted jmp-table just after the branch !!!
    0041C575 41               inc ecx
    0041C576 0097C541009E     add [edi-$61ffbe3b],dl
    0041C57C C54100           lds eax,[ecx+$00]
    0041C57F A5               movsd 
    0041C580 C54100           lds eax,[ecx+$00]
    0041C583 AC               lodsb 
    0041C584 C54100           lds eax,[ecx+$00]
    0041C587 B3C5             mov bl,$c5
    0041C589 41               inc ecx
    0041C58A 00B8FF000000     add [eax+$000000ff],bh
    0041C590 E81F82FEFF       call Random
    0041C595 5D               pop ebp
    0041C596 C3               ret 
    Test.dpr.18: Exit(3);
    0041C597 B803000000       mov eax,$00000003
    0041C59C 5D               pop ebp
    0041C59D C3               ret 
    Test.dpr.20: Exit(4);
    0041C59E B804000000       mov eax,$00000004
    0041C5A3 5D               pop ebp
    0041C5A4 C3               ret 
    Test.dpr.22: Exit(5);
    0041C5A5 B805000000       mov eax,$00000005
    0041C5AA 5D               pop ebp
    0041C5AB C3               ret 
    Test.dpr.24: Exit(6);
    0041C5AC B806000000       mov eax,$00000006
    0041C5B1 5D               pop ebp
    0041C5B2 C3               ret 
    Test.dpr.26: Exit(7);
    0041C5B3 B807000000       mov eax,$00000007
    0041C5B8 5D               pop ebp
    0041C5B9 C3               ret 
    Test.dpr.28: Exit(0);
    0041C5BA 33C0             xor eax,eax
    Test.dpr.30: end;
    0041C5BC 5D               pop ebp
    0041C5BD C3               ret 
    }
    
    //  godbolt msvc x86
    int foo(int i)
    {
    #define Exit(x) return x
        switch (i) {
        case 0: Exit(rand()); // just to prevent c optimization
        case 1: Exit(3);
        case 2: Exit(4);
        case 3: Exit(5);
        case 4: Exit(6);
        case 5: Exit(7);
        default: Exit(0);
        }
    }
    // --- asm ---
    /*
    int foo(int) PROC                             ; foo
            push    ebp
            mov     ebp, esp
            push    ecx
            mov     eax, DWORD PTR _i$[ebp]
            mov     DWORD PTR tv64[ebp], eax
            cmp     DWORD PTR tv64[ebp], 5
            ja      SHORT $LN10@foo
            mov     ecx, DWORD PTR tv64[ebp]
            jmp     DWORD PTR $LN12@foo[ecx*4]  // aligned jmp-table
    		// fall through !!!
    $LN4@foo:
            call    _rand
            jmp     SHORT $LN1@foo
    $LN5@foo:
            mov     eax, 3
            jmp     SHORT $LN1@foo
    $LN6@foo:
            mov     eax, 4
            jmp     SHORT $LN1@foo
    $LN7@foo:
            mov     eax, 5
            jmp     SHORT $LN1@foo
    $LN8@foo:
            mov     eax, 6
            jmp     SHORT $LN1@foo
    $LN9@foo:
            mov     eax, 7
            jmp     SHORT $LN1@foo
    $LN10@foo:
            xor     eax, eax
    $LN1@foo:
            mov     esp, ebp
            pop     ebp
            ret     0
        npad    2     // alignment 
    		
// compiler inserted the jmp-table at the end, and it's aligned!: 
    $LN12@foo:
            DD      $LN4@foo
            DD      $LN5@foo
            DD      $LN6@foo
            DD      $LN7@foo
            DD      $LN8@foo
            DD      $LN9@foo
    int foo(int) ENDP                                 ; foo
    */

     


  8. 2 hours ago, Stefan Glienke said:

    Keep in mind the slightly different tendencies to branch predict on different CPUs when microbenchmarking cold code.

    https://xania.org/201602/bpu-part-one

    Very interesting (especially the results). Although Intel's guidelines recommend:

    Quote

    Branch Prediction Optimization:

    Arrange code to be consistent with the static branch prediction algorithm.

     


  9. 3 hours ago, Kas Ob. said:

    It is slower https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

    But 

    1) The gain  from removing the two conditional branching will totally worth it, because they are forward and one will be taken almost always, in other words that "ja @@Process" is anti pattern, and bad for branch prediction.

    2) You can save these register in XMM registers, this is faster than pushing them on stack, or on any memory at all.

    Just an idea that worth investigating.

     

    1- I told you it's slower! Using the explicit version will just make things crappy. 

    2- The purpose of xmm registers is not to store general-purpose registers! Doing that inside a loop is definitely an anti-pattern, because you need to save and then restore the register, and you need more than one instruction just to save a single register, which means you ruin your instruction cache, since xmm instructions usually have long opcodes (prefixes/VEX/EVEX)!

    3- There is a possibility to negate ja @@Process into jna @@CheckInvalidCharOrNT, which would be much friendlier to the static predictor (without the need to save any register):

      pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
      jna          @@CheckInvalidCharOrNT         // in most time jump is not taken. 
    @@Process:
      ...
      
     
    @@CheckInvalidCharOrNT:
      { either end (null terminated) or invalid char }
      test        Word [eax + ecx *2], -1         // is NT ? => continue
      jnz         @@InvalidChar
      jmp         @@Process

     

    Quote

    I have SandyBridge and it raise exception on vpgatherdq , and looking here https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq ,it is promoted as AVX512

    Here is the one that I used, from AVX2: https://www.felixcloutier.com/x86/vpgatherdq:vpgatherqq

    SandyBridge does not support AVX2.

    Quote

    what CPU do you have ?

    Not better than yours ! A Haswell !

     


  10. @Kas Ob. 

    Quote

    2) You build it to use lookup table while using SIMD is most valued benefit is to reduce memory access ( aka lookup table), so you are far better by calculating generating the 0 and 1 by masking.

    Agreed! Just for clarification: what I did was promote a legacy algorithm to SIMD (the SIMD version uses the same algorithm as the legacy version) just to demonstrate the benefit of going full SIMD.

    Quote

     3) This should be removed, why are you trying to check for NT while you have the length in of the string, so may be switching to explicit string compare instead of implicit will help here

    No!
    1- The explicit version is slower than the implicit one. You can check that yourself.
    2- The explicit version uses 3 registers (eax, edx, ecx) while the implicit uses only ecx! This can be a serious drawback on x86 (we have fewer registers).

    Quote

    You used AVX512 instruction, my CPU is old and doesn't have either AVX2 or AVX512

    No, I didn't use any AVX512 (my CPU doesn't even support it). All I'm using is a single AVX2 instruction, vpgatherdq. The rest is SSEx.

    Quote

     but keep in mind AVX2 and AVX512 does have huge hidden drawback, i think Stefan was pointing to with some may be hidden sarcasm intentionally or not, AVX2 and AVX512 in multi core concurrent usage can literally melt the CPU !, so the CPU will decrease its frequency, such decrease is not momentarily so the impact of over using AVX512 and AVX2 can be huge, to dig this further, start here  https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

    You gave a link and apparently didn't read it all 🙂 

    -1 The instructions that were spotted causing a frequency decrease are the so-called heavy instructions (AVX512 instructions, some 256-bit instructions, and those that execute on the FP unit), while scalar and all 128-bit instructions are fine (they do not cause a frequency decrease)! Mine uses no AVX512, no FP instructions, and everything is 128-bit.
    -2 Keep in mind that the CPU can decrease its frequency even if you don't use any SSEx/AVX/AVX512, for different reasons (thermal conditions, power consumption, ...).

    Quote

    5) ... $

    Nice idea. Working on bytes instead of words and shuffling at the end would probably yield a better result.

    Quote

    6) ... $

    Nice idea too.

    Quote

    I am sleepy now and can't open my eyes so i might missed something or wrote something wrong, for that i am sorry.

    Totally Ok ... Don't worry 🙂 


  11. @Stefan Glienke @Kas Ob. Got some time and tried to go full SIMD ... The result is awesoooome!!!
    Can we go further? Yep (using ymm, but that's not going to be easy, as Delphi doesn't support it yet) 

    type
      TChar4 = array [0 .. 3] of Char;
      PChar4 = ^TChar4;
    
    const
      { Source Data Format : Imm8[1:0] }
      DF_UNSIGNED_BYTES = 0;
      DF_UNSIGNED_WORDS = 1;
      DF_SIGNED_BYTES = 2;
      DF_SIGNED_WORDS = 3;
      { Aggregation Operation : Imm8[3:2] }
      AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
      AGGREGATION_OP_RANGES = 1 shl 2;
      AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
      AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
      { Polarity : Imm8[5:4] }
      POLARITY_POSITIVE = 0 shl 4;
      POLARITY_NEGATIVE = 1 shl 4;
      POLARITY_MASKED_POSITIVE = 2 shl 4;
      POLARITY_MASKED_NEGATIVE = 3 shl 4;
      { Output Selection : Imm8[6] }
      OS_LSI = 0 shl 6;
      OS_MSI = 1 shl 6;
      OS_BIT_MASK = 0 shl 6;
      OS_BYTE_WORD_MASK = 1 shl 6;
    
    const
      [Align(16)]
      AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31);
      RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000;
      TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111', '????', '????', '????', '????', '????', '????', '????',
        '????', '????', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
    
    function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean;
    asm
      push        ebx
      push        esi
      push        edi
      mov         ebx, ecx
      test        ecx, ecx                        // Len = 0 ?
      jz          @@Empty
      lea         esi, [TableData  ]
      mov         edi, ecx
      and         ebx, 7                          // trailing
      shr         edi, 3                          // number of simd-run
      jz          @@HandleTrailing                // too small to be done using simd
    
      movdqa      xmm0, [AndMaskData]
      movdqa      xmm1, [RangeData  ]
    
    @@SimdLoop:
      movdqu      xmm3, [eax]
      { check if its a valid hex }
      pcmpistri   xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
      ja          @@Process
      { either end (null terminated) or invalid char }
      test        Word [eax + ecx *2], -1         // is NT ? => continue
      jnz         @@InvalidChar
    @@Process:
      pand        xmm3, xmm0                      // and each element with 31
      pxor        xmm2, xmm2
      { --- first four chars --- }
      movdqa      xmm4, xmm3
      { first two char }
      punpcklbw   xmm4, xmm2                      // unpack low data
      pcmpeqq     xmm2, xmm2                      // generate mask
      // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes
      db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
      movdqu      [edx], xmm5                     // store to result
      { second two char }
      pshufd      xmm4, xmm4, $0E                 // move next two elements to low
      pcmpeqq     xmm2, xmm2
      // gather two TChar4
      db          $c4, $e2, $e9, $90, $2c, $e6    // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
      movdqu      [edx+16], xmm5
    
      { --- last four chars --- }
      { first two char }
      pxor        xmm2, xmm2
      punpckhbw   xmm3, xmm2                      // unpack high data
      pcmpeqq     xmm2, xmm2
      db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
      movdqu      [edx+32], xmm5
      { second two char }
      pshufd      xmm3, xmm3, $0E
      pcmpeqq     xmm2, xmm2
      db          $c4, $e2, $e9, $90, $2c, $de    // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
      movdqu      [edx+48], xmm5
      add         eax, 16
      add         edx, 16 * 4
      dec         edi
      jnz         @@SimdLoop
    
      test        ebx, ebx
      jz          @@NoTrailing
    
    @@HandleTrailing:
      mov         ecx, ebx
    
    @@LegacyLoop:
      movzx     edi, word [eax]
      lea       ebx, [edi - 48]
      cmp       ebx, 9
      jbe       @@Copy
      lea       ebx, [edi - 65]
      cmp       ebx, 5
      jbe       @@Copy
      lea       ebx, [edi - 97]
      cmp       ebx, 5                            // 'a'..'f' only
      ja        @@InvalidChar
    @@Copy:
      and       edi, 31
      movq      xmm0,  [esi + edi * 8]
      movq      [edx], xmm0
      add       eax, 2
      add       edx, 8
      dec       ecx
      jnz       @@LegacyLoop
    
    @@NoTrailing:
      mov       word[edx], 0                      // NT.
    @@Empty:
      mov       eax, True
    
    @@End:
      pop        edi
      pop        esi
      pop        ebx
      ret
    
    @@InvalidChar:
      xor         eax, eax
      jmp         @@End
    end;
    
    function HexToBinGhost(const Value: string): string;
    begin
      SetLength(Result, Length(Value) * 4);
      if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then
        exit;
      raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
    end;
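    The reason TableData can be indexed with `and 31` is that '0'..'9', 'a'..'f' and 'A'..'F' all collapse into 0..25 under that mask ('0' and 31 = 16, 'a' and 31 = 'A' and 31 = 1). Here is a small Python sketch of the same table trick (my illustration mirroring the Delphi table above, not the SIMD code):

```python
# indices 0 and 7..15 are never hit for valid hex digits
TABLE = (['????', '1010', '1011', '1100', '1101', '1110', '1111']  # 'a'..'f' at 1..6
         + ['????'] * 9                                            # unused 7..15
         + ['0000', '0001', '0010', '0011', '0100',                # '0'..'9' at 16..25
            '0101', '0110', '0111', '1000', '1001'])

def hex_to_bin(s: str) -> str:
    """Expand each hex digit to its 4-bit binary string via ord(c) & 31."""
    if not all(c in '0123456789abcdefABCDEF' for c in s):
        raise ValueError(f"Invalid hex digit found in '{s}'")
    return ''.join(TABLE[ord(c) & 31] for c in s)
```

    One table lookup per character, no per-digit branching on the three ranges: that is exactly what the `pand xmm3, xmm0` (and with 31) plus gather does, eight characters at a time.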

     

    • Like 1

  12. 3 hours ago, Stefan Glienke said:

    Especially since one of its selling points is "it compiles to native code" - if that native code is garbage for modern CPUs because its written like in the 90s that's kinda poor.

    Agree ! 

    One thing I noticed: people complain a lot about compilation time (which could be a principal factor influencing EMB's decisions about optimization)! I clearly understand that, as fast compilation makes us all happy. But it shouldn't come at the cost of emitting poor code. If someone likes the fastest compiler ... know for sure that clients also like the fastest app! Well-generated code generally means things get done quickly; for servers it means less power consumption (saving on the bill); on mobile it's friendly to battery life (happy client), ... etc.

     

    • Like 1

  13. 11 hours ago, Mike Torrettinni said:

    No need to go beyond Delphi's capabilities. This whole topic was great to follow, all this knowledge! 🙂

    Of course! I'm just going to underline a serious problem. In the past, no one bothered to understand optimization (even great developers/companies didn't), because CPUs were evolving so fast and each new generation beat the previous one by a large margin (people were just upgrading their CPUs and seeing performance x2, x3). But today we are reaching a dead end (Moore's Law), and the difference between a new generation and the previous one isn't really significant! So the effort is shifting from CPUs to compilers/parallel programming. In the past few years, we have seen the emergence of LLVM as a powerful compiler infrastructure, and the big whales have started to collaborate with each other more than ever.

    In a nutshell, tomorrow's problem is optimization. The way many have chosen to deal with it is improving the compiler. And I really think that Delphi should join this race ASAP. 

    • Like 5

  14. @Stefan Glienke SIMD instructions are very powerful, but they come with their own issues (portability, alignment for some instructions, ...). Some compilers have great facilities to hide those issues, but neither the Delphi compiler nor the RTL helps (we don't even have AVX) 😥 I really wish that at some point Delphi would support SIMD through intrinsics or some vector types.

    • Like 1

  15. Partially SIMDed, without trailing handling ... and it beats them all 🙂

    const
      { Source Data Format : Imm8[1:0] }
      DF_UNSIGNED_BYTES = 0;
      DF_UNSIGNED_WORDS = 1;
      DF_SIGNED_BYTES = 2;
      DF_SIGNED_WORDS = 3;
      { Aggregation Operation : Imm8[3:2] }
      AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
      AGGREGATION_OP_RANGES = 1 shl 2;
      AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
      AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
      { Polarity : Imm8[5:4] }
      POLARITY_POSITIVE = 0 shl 4;
      POLARITY_NEGATIVE = 1 shl 4;
      POLARITY_MASKED_POSITIVE = 2 shl 4;
      POLARITY_MASKED_NEGATIVE = 3 shl 4;
      { Output Selection : Imm8[6] }
      OS_LSI = 0 shl 6;
      OS_MSI = 1 shl 6;
      OS_BIT_MASK = 0 shl 6;
      OS_BYTE_WORD_MASK = 1 shl 6;
    
    const
      [Align(16)]
      Range: array [0 .. 7] of Char = '09afAF' + #00;
    
    function IsValidHex(P: Pointer): Boolean;
    asm
      movdqa    xmm1, [Range]
      sub       eax, 16
    @@SimdLoop:
      add       eax, 16
      movdqu    xmm2, [eax]
      pcmpistri xmm1, xmm2, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE
      ja        @@SimdLoop
      test      Word [eax + ecx *2], -1
      setz      al
    end;
    
    function HexToBinMahdiOneShot(const HexValue: string): string;
    type
      TChar4 = array [0 .. 3] of Char;
      PChar4 = ^TChar4;
    const
      Table: array [0 .. 25] of TChar4 = ('0000', '1010', '1011', '1100', '1101', '1110', '1111', '0000', '0000', '0000', '0000', '0000', '0000', '0000', '0000',
        '0000', '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
    var
      P: PChar4;
      I, Len: Integer;
    begin
      SetLength(Result, Length(HexValue) * 4);
      P := PChar4(Result);
      if IsValidHex(Pointer(HexValue)) then
      begin
        Len := Length(HexValue);
        for I := 1 to Len do
        begin
          P^ := Table[Ord(HexValue[I]) and 31];
          Inc(P);
        end;
      end
      else
        raise EConvertError.CreateFmt('Invalid hex : %s', [HexValue]);
    end;

     

    • Like 1

  16. 3 hours ago, David Heffernan said:

    My benchmarking suggests that point 1 has no impact on performance, but point 2 does.

    A benchmark does not always tell the full story 🙂 

    Unlike my code, yours isn't cache friendly:

    const
      BinaryValues: array [0..15] of string = (
        '0000', '0001', '0010', '0011',
        '0100', '0101', '0110', '0111',
        '1000', '1001', '1010', '1011',
        '1100', '1101', '1110', '1111'
      );

    Let's see what happens on x64:

    You need 16 * SizeOf(Pointer) to store the string references = 16 * 8 = 128 bytes = 2 cache lines.

    You need 16 * (4 * SizeOf(Char) + 2 bytes (null terminator) + SizeOf(StrRec)) to store the data = 16 * (4 * 2 + 2 + 16) = 416 bytes = 7 cache lines.

    N.B.: I assumed the compiler did a great job and placed your data in a contiguous region.

    --------------------------------------------------------

    You end up consuming 416 + 128 = 544 bytes.

    You end up consuming 9 cache lines.

    type
      TChar4 = array[0..3] of Char;
      PChar4 = ^TChar4;
    const
      Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
      Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');

    I only need 16 * (4 * SizeOf(Char)) to store the data = 16 * (4 * 2) = 128 bytes = 2 cache lines.
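    The arithmetic above is easy to double-check (a sketch assuming, as the post does, 64-byte cache lines and line-aligned contiguous placement):

```python
import math

CACHE_LINE = 64

def cache_lines(nbytes: int) -> int:
    """Cache lines covered by a contiguous, line-aligned block of nbytes."""
    return math.ceil(nbytes / CACHE_LINE)

# string-table version: 16 pointers + 16 heap strings
# (4 chars * 2 bytes + 2-byte terminator + 16-byte StrRec header each)
ptr_bytes  = 16 * 8                   # 128
data_bytes = 16 * (4 * 2 + 2 + 16)    # 416

# inline TChar4 tables: only the character data itself
inline_bytes = 16 * (4 * 2)           # 128
```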

     

    • Like 3

  17. 18 minutes ago, Attila Kovacs said:

    Too bad there is no multiple inheritance.

    multiple inheritance = multiple trouble 

    Quote

    Is it possible to hook a constructor on startup?

    Yep, if you install the hook before the object gets created.
