Mahdi Safsafi

Everything posted by Mahdi Safsafi

  1. Mahdi Safsafi

    git and Delphi tooling?

    GitHub just got a cool update : dark mode && discussions 🙂
  2. Mahdi Safsafi

    git and Delphi tooling?

    And if it wasn't leaked, someone may reverse it : http://ReactOs.org
  3. Mahdi Safsafi

    git and Delphi tooling?

    I tried GitHub Desktop and didn't like it. I don't know about other tools, but I've been using TortoiseGit for years now and I really like it. The things I love most about it are the great shell integration and the spell checking when I write a comment.
  4. Mahdi Safsafi

    In Case You Didn't Know

    There was a thread on SO that summarized Pascal's funniest syntax, but unfortunately I couldn't find it. Instead I found this: http://delphi.org/2014/02/hidden-features-in-the-delphi-object-pascal-language/
  5. Who didn't? Hey, why can't we have both? An example: Python 2 and Python 3, or setting a deadline for an old feature, ... My point is that if a breaking change is good in the long run ... then we should adopt it rather than keep the ugly one.
  6. Sadly, we don't move. The fear of compatibility issues and breaking changes is an evil in itself! An example: the exit routine (Delphi) was implemented as a function-like routine instead of a true keyword just to make some lazy developers happy! I know tons of tools/compilers that broke compatibility for good reasons (LLVM did this many, many times; JavaScript did as well and completely changed some of its core logic; Perl; ...). Sometimes a breaking change is a must-have and is much better in the long run.
  7. Mahdi Safsafi

    Catch details on AV

    There was something called a "PE overlay" that was originally very popular in the stone age, when memory was measured in MB! If someone had to include something (a file, a picture, ...) in an exe/dll without shipping the file separately, they used the overlay technique instead of a resource. A resource gets mapped into main memory automatically by the loader, whereas an overlay is mapped into memory only when the developer asks for it (say an exception occurred: now you map the map file from the exe on disk into memory and finally produce the log). The idea is that the developer appends the file to the end of the executable; since the size recorded in the PE doesn't correspond to the actual file size, the loader won't map the trailing file and it remains on disk. Today the overlay technique is widely used in the malware industry, and PE signatures/certificates use it to store some metadata (BTW, Delphi uses an overlay for the signature/certificate). I wonder whether this also applies to mobile (I have a strong feeling that it does).
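    The overlay idea above can be sketched in a few lines. This is a minimal Python model (not Delphi), assuming the usual PE layout in which the overlay starts right after the last byte of raw section data. The helper names overlay_offset and make_fake_pe are hypothetical, and the "executable" built here is a PE-shaped blob for illustration only, not a loadable image.

```python
import struct

def overlay_offset(image: bytes) -> int:
    """Return the file offset where data past the last section (the overlay) starts."""
    # e_lfanew, at offset 0x3C of the DOS header, points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    coff = pe_offset + 4                        # COFF file header (20 bytes)
    (num_sections,) = struct.unpack_from("<H", image, coff + 2)
    (opt_size,) = struct.unpack_from("<H", image, coff + 16)
    table = coff + 20 + opt_size                # section table follows the optional header
    end = table + 40 * num_sections             # the headers belong to the image too
    for i in range(num_sections):
        # SizeOfRawData and PointerToRawData sit at offsets 16 and 20 of each entry.
        raw_size, raw_ptr = struct.unpack_from("<II", image, table + 40 * i + 16)
        if raw_size:
            end = max(end, raw_ptr + raw_size)
    return end

def make_fake_pe(section_data: bytes) -> bytes:
    """Build a tiny PE-shaped blob with one section and no optional header (demo only)."""
    pe_offset = 0x40
    dos = bytearray(pe_offset)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, pe_offset)
    raw_ptr = pe_offset + 4 + 20 + 40           # section data placed right after the headers
    coff = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 0, 0)
    section = struct.pack("<8sIIIIIIHHI", b".text", len(section_data), 0x1000,
                          len(section_data), raw_ptr, 0, 0, 0, 0, 0)
    return bytes(dos) + b"PE\0\0" + coff + section + section_data

exe = make_fake_pe(b"\x90" * 16)
shipped = exe + b"MAP-FILE-CONTENTS"            # file appended after the image
start = overlay_offset(shipped)
payload = shipped[start:]                       # read on demand, never mapped by the loader
print(start, payload)
```

    Reading the payload this way only touches the file on disk, which is the point of the technique: nothing is mapped until the program decides it needs the data.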
  8. Yes, that's right! I've seen your proposal as well, and I have a better one that solves its issues (Step Over / replicating the code) and is slightly faster! To begin with, the issue arises because of a mismatch between a call and a ret instruction (a ret instruction that doesn't correspond to a call instruction). In your proposal, you introduced a call to fix the issue, but that also introduced the Step Over issue! Here is my proposal: if we jumped in without using a call instruction, then we simply return without using a ret instruction. How? We do a lazy stack pop (add esp, 4) to remove the return address from the stack, then we jump back to the return address (jmp [esp - 4]).

        Program Test;

        {$APPTYPE CONSOLE}
        {$R *.res}
        {$O+,W-}

        uses
          Diagnostics, Windows;

        {$DEFINE PATCH_TRY_FINALLY}
        {$DEFINE REPLACE_RET_WITH_JMP}

        procedure Test;
        var
          i: Integer;
        begin
          i := 0;
          try
            Inc(i);
            asm
              nop
              nop
            end;
          finally
            Dec(i);
            Dec(i);
            Dec(i);
        {$IFDEF REPLACE_RET_WITH_JMP}
            {
              payload :
              ---------
              add esp, 4     // remove return address from the stack
              jmp [esp - 4]  // jmp back (return address)
            }
            // the two Dec(i) below only reserve room for the patched payload
            Dec(i);
            Dec(i);
        {$ENDIF}
          end;
          if i = 0 then;
        end;

        // Variant 1: patch the finally entry so it is reached through a call.
        procedure PatchTryFinally1(address: Pointer);
        const
          jmp: array [0 .. 14] of Byte = ($33, $C0, $5A, $59, $59, $64, $89, $10, $E8, $02, $00, $00, $00, $EB, $00);
        var
          n: NativeUInt;
          target: Pointer;
          offset: Byte;
        begin
          target := PPointer(PByte(address) + 11)^;
          offset := PByte(target) - (PByte(address) + 10) - 5;
          WriteProcessMemory(GetCurrentProcess, address, @jmp, SizeOf(jmp), n);
          WriteProcessMemory(GetCurrentProcess, PByte(address) + SizeOf(jmp) - 1, @offset, 1, n);
          FlushInstructionCache(GetCurrentProcess, address, SizeOf(jmp));
        end;

        // Variant 2: replace the ret with "add esp, 4 / jmp [esp - 4]".
        procedure PatchTryFinally2(address: Pointer);
        const
          // 83 C4 04 = add esp, 4 ; FF 64 24 FC = jmp [esp - 4]
          Data: array [0 .. 6] of Byte = ($83, $C4, $04, $FF, $64, $24, $FC);
        var
          n: NativeUInt;
        begin
          WriteProcessMemory(GetCurrentProcess, address, @Data, SizeOf(Data), n);
        end;

        procedure PatchTryFinally(address: Pointer);
        begin
          // the offsets into Test depend on the code the compiler generated for it
        {$IFDEF REPLACE_RET_WITH_JMP}
          PatchTryFinally2(PByte(@Test) + $32);
        {$ELSE}
          PatchTryFinally1(PByte(@Test) + 26);
        {$ENDIF}
        end;

        var
          i: Integer;
          sw: TStopwatch;
        begin
        {$IFDEF PATCH_TRY_FINALLY}
          PatchTryFinally(PByte(@Test));
        {$ENDIF}
          sw := TStopwatch.StartNew;
          Sleep(1);
          sw.ElapsedMilliseconds;
          sw := TStopwatch.StartNew;
          for i := 1 to 100000000 do
            Test;
          Writeln(sw.ElapsedMilliseconds);
          Readln;
        end.
  9. Do you use any DLLs other than system DLLs?
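    To see why the payload above is a drop-in replacement for ret, here is a minimal Python model of the two sequences. It only tracks ESP and the value loaded into EIP; the stack address and return address are made-up values, and the model says nothing about timing or branch prediction, which is the actual motivation in the post.

```python
def do_ret(stack: dict, esp: int):
    """ret: load EIP from [esp], then release the slot."""
    eip = stack[esp]
    return eip, esp + 4

def do_pop_and_jmp(stack: dict, esp: int):
    """add esp, 4 ; jmp [esp - 4]: release the slot first, then jump through it."""
    esp += 4
    eip = stack[esp - 4]
    return eip, esp

# Hypothetical state: the return address 0x0041C5A0 is stored at ESP = 0x0018FF00.
stack = {0x0018FF00: 0x0041C5A0}
after_ret = do_ret(stack, 0x0018FF00)
after_jmp = do_pop_and_jmp(stack, 0x0018FF00)
print(after_ret == after_jmp)
```

    Both sequences end with the same EIP and the same ESP; the only architectural difference is that the second one is not a ret, so it is not paired with any call.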
  10. Mahdi Safsafi

    Drag an Drop

    https://github.com/RRUZ/vcl-styles-utils/tree/master/Tools/Vcl Styles Equalizer (Tool)
  11. Mahdi Safsafi


    @Stefan Glienke You're unbelievable man ! The next time I'm going to demonstrate something ... I'll make sure that Stefan will not find a link between the input and the output (hmm TRNGs)
  12. Mahdi Safsafi


    In the example you gave, FPC didn't use a jump table! Instead it lowered the case statement into if statements. But adding more cases will make FPC use a jump table. I tested that myself on x86 (O4). Here is a comparison of the jump tables:

    MSVC && FPC:
    - Both generate code that is friendly to the branch predictor, and the fall-through executes the first case.
    - Both generate an aligned jump table.
    - MSVC inserted the table at the end of the function, while FPC inserted the table in the const data region.
    - The first case is likely to be in the same cache line as the jmp instruction.

    Delphi:
    - Generated an ugly pattern for the branch predictor, and the fall-through points to random instructions (because the table was inserted immediately after the jmp instruction, so its data gets decoded as code). Things can get worse if those random instructions contain data that gets decoded as a branch!!!
    - The table is not aligned! This is just ugly and means that the data (a jump location) can be split across two cache lines!!!
    - The first case is unlikely to be in the same cache line as the jmp instruction.

    Bravo FPC team! 🙂
  13. Mahdi Safsafi


    @Stefan Glienke Bad habits die hard 🙂 Indeed! Many may consider it a premature optimization ... but in a real app it should make a noticeable difference. Mostly you can't judge a micro-optimization using a toy/simple benchmark, because those do not tell the truth all the time. Since you're interested in this subject ... I'm going to spin it a little bit and introduce the switch/case statement. Look how the Delphi compiler generated ugly (anti-pattern) code, and see how the MSVC compiler generated friendly code for the same function!

        // Delphi
        {$O+}
        function foo(I: Integer): Integer;
        begin
          case I of
            0: Exit(random(255));
            1: Exit(3);
            2: Exit(4);
            3: Exit(5);
            4: Exit(6);
            5: Exit(7);
          else
            Exit(0);
          end;
        end;

        // --- asm ---
        {
        0041C564 55               push ebp
        0041C565 8BEC             mov ebp,esp
        Test.dpr.14: case I of
        0041C567 83F805           cmp eax,$05
        0041C56A 774E             jnbe $0041c5ba
        0041C56C FF248573C54100   jmp dword ptr [eax*4+$41c573] // unaligned jmp-table
        0041C573 8BC5             mov eax,ebp // compiler inserted jmp-table just after the branch !!!
        0041C575 41               inc ecx
        0041C576 0097C541009E     add [edi-$61ffbe3b],dl
        0041C57C C54100           lds eax,[ecx+$00]
        0041C57F A5               movsd
        0041C580 C54100           lds eax,[ecx+$00]
        0041C583 AC               lodsb
        0041C584 C54100           lds eax,[ecx+$00]
        0041C587 B3C5             mov bl,$c5
        0041C589 41               inc ecx
        0041C58A 00B8FF000000     add [eax+$000000ff],bh
        0041C590 E81F82FEFF       call Random
        0041C595 5D               pop ebp
        0041C596 C3               ret
        Test.dpr.18: Exit(3);
        0041C597 B803000000       mov eax,$00000003
        0041C59C 5D               pop ebp
        0041C59D C3               ret
        Test.dpr.20: Exit(4);
        0041C59E B804000000       mov eax,$00000004
        0041C5A3 5D               pop ebp
        0041C5A4 C3               ret
        Test.dpr.22: Exit(5);
        0041C5A5 B805000000       mov eax,$00000005
        0041C5AA 5D               pop ebp
        0041C5AB C3               ret
        Test.dpr.24: Exit(6);
        0041C5AC B806000000       mov eax,$00000006
        0041C5B1 5D               pop ebp
        0041C5B2 C3               ret
        Test.dpr.26: Exit(7);
        0041C5B3 B807000000       mov eax,$00000007
        0041C5B8 5D               pop ebp
        0041C5B9 C3               ret
        Test.dpr.28: Exit(0);
        0041C5BA 33C0             xor eax,eax
        Test.dpr.30: end;
        0041C5BC 5D               pop ebp
        0041C5BD C3               ret
        }

        // godbolt msvc x86
        int foo(int i) {
        #define Exit(x) return x
          switch (i) {
            case 0: Exit(rand()); // just to prevent c optimization
            case 1: Exit(3);
            case 2: Exit(4);
            case 3: Exit(5);
            case 4: Exit(6);
            case 5: Exit(7);
            default: Exit(0);
          }
        }

        // --- asm ---
        /*
        int foo(int) PROC ; foo
          push ebp
          mov ebp, esp
          push ecx
          mov eax, DWORD PTR _i$[ebp]
          mov DWORD PTR tv64[ebp], eax
          cmp DWORD PTR tv64[ebp], 5
          ja SHORT $LN10@foo
          mov ecx, DWORD PTR tv64[ebp]
          jmp DWORD PTR $LN12@foo[ecx*4] // aligned jmp-table
          // fall through !!!
        $LN4@foo:
          call _rand
          jmp SHORT $LN1@foo
        $LN5@foo:
          mov eax, 3
          jmp SHORT $LN1@foo
        $LN6@foo:
          mov eax, 4
          jmp SHORT $LN1@foo
        $LN7@foo:
          mov eax, 5
          jmp SHORT $LN1@foo
        $LN8@foo:
          mov eax, 6
          jmp SHORT $LN1@foo
        $LN9@foo:
          mov eax, 7
          jmp SHORT $LN1@foo
        $LN10@foo:
          xor eax, eax
        $LN1@foo:
          mov esp, ebp
          pop ebp
          ret 0
          npad 2 // alignment
          // compiler inserted jmp-table at the end and it's aligned !:
        $LN12@foo:
          DD $LN4@foo
          DD $LN5@foo
          DD $LN6@foo
          DD $LN7@foo
          DD $LN8@foo
          DD $LN9@foo
        int foo(int) ENDP ; foo
        */
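    The "random instructions" after the jmp in the Delphi listing are really the jump table. As a rough check, the Python sketch below reinterprets the 24 bytes starting at 0041C573 (copied from the listing) as six little-endian dwords and tests each entry against a cache-line boundary. The 64-byte line size is an assumption about the target CPU, and straddles_line is a hypothetical helper.

```python
import struct

TABLE_ADDRESS = 0x0041C573      # first byte after the indirect jmp in the Delphi listing
TABLE_BYTES = bytes.fromhex(    # the bytes the debugger decoded as mov/inc/add/lds/...
    "8BC54100" "97C54100" "9EC54100" "A5C54100" "ACC54100" "B3C54100"
)
CACHE_LINE = 64                 # assumed cache-line size in bytes

# Six 32-bit little-endian case targets.
targets = struct.unpack("<6I", TABLE_BYTES)

def straddles_line(address: int, size: int = 4) -> bool:
    """True when [address, address + size) crosses a cache-line boundary."""
    return address // CACHE_LINE != (address + size - 1) // CACHE_LINE

split_entries = [i for i in range(len(targets))
                 if straddles_line(TABLE_ADDRESS + 4 * i)]

for i, target in enumerate(targets):
    print(f"case {i}: jump to {target:08X}")
print("table 4-byte aligned:", TABLE_ADDRESS % 4 == 0)
print("entries split across two lines:", split_entries)
```

    The decoded targets match the case labels in the listing (0041C597 for Exit(3), 0041C59E for Exit(4), and so on), and under the 64-byte assumption one of the six entries does cross a line boundary.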
  14. Mahdi Safsafi


    Very interesting (especially the results). Although Intel, in its guidelines, recommends:
  15. Mahdi Safsafi


    1- I told you it's slower! Using the explicit version ... will just make things crappy.
    2- The purpose of xmm registers is not to store general-purpose registers! Doing that inside a loop is definitely an anti-pattern, because you need to save and restore the register, and you need more than one instruction just to save one register. That means you are going to ruin your cache, because xmm instructions usually have long opcodes (prefixes/vex/evex)!
    3- It is possible to negate ja @@Process into jna @@CheckInvalidCharOrNT, which would be much friendlier for the static predictor (without the need to save any register):

        pcmpistri xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
        jna @@CheckInvalidCharOrNT // most of the time the jump is not taken.
        @@Process:
          ...
        @@CheckInvalidCharOrNT:
          { either end (null terminated) or invalid char }
          test Word [eax + ecx *2], -1 // is NT ? => continue
          jnz @@InvalidChar
          jmp @@Process

    Here is the one that I used from AVX2: https://www.felixcloutier.com/x86/vpgatherdq:vpgatherqq
    Sandy Bridge does not support AVX2.
    Not better than yours!
    A Haswell!
  16. Mahdi Safsafi


    @Kas Ob. Agreed! Just for clarification: what I did was promote a legacy algorithm to use SIMD (the algorithm used in the SIMD version is the same as in the legacy version), just to demonstrate the benefit of going full SIMD.
    No!
    1- The explicit version is slower than the implicit one. You can check that yourself.
    2- The explicit version uses 3 registers (eax, edx, ecx), while the implicit one uses only ecx! This can be a serious drawback on x86 (we have fewer registers).
    No. I didn't use any AVX512 (my CPU doesn't even support it). All I'm using is a single instruction, vpgatherdq, from AVX2. The rest is SSEx.
    You gave a link and apparently you didn't read it all 🙂
    -1 The instructions that were found to cause a frequency decrease are the so-called heavy instructions (AVX512 instructions+++, some 256-bit instructions and those that execute on the FP unit), while scalar and all 128-bit instructions are fine (they do not cause a frequency decrease)! Mine doesn't use AVX512, doesn't use any FP instruction, and everything is 128-bit.
    -2 Keep in mind that the CPU can decrease its frequency even if you don't use any of them (SSEx/AVX/AVX512), for different reasons (thermal conditions, power consumption, ...).
    Nice idea. Working on bytes instead of words and shuffling at the end will probably yield a better result.
    Nice idea too.
    Totally OK ... Don't worry 🙂
  17. Mahdi Safsafi


    @Stefan Glienke @Kas Ob. I got some time and tried to go full SIMD ... The result is awesoooome!!! Can we go further? Yep (using ymm, but this is not going to be easy as Delphi doesn't support it yet).

        type
          TChar4 = array [0 .. 3] of Char;
          PChar4 = ^TChar4;

        const
          { Source Data Format : Imm8[1:0] }
          DF_UNSIGNED_BYTES = 0;
          DF_UNSIGNED_WORDS = 1;
          DF_SIGNED_BYTES = 2;
          DF_SIGNED_WORDS = 3;
          { Aggregation Operation : Imm8[3:2] }
          AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
          AGGREGATION_OP_RANGES = 1 shl 2;
          AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
          AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
          { Polarity : Imm8[5:4] }
          POLARITY_POSITIVE = 0 shl 4;
          POLARITY_NEGATIVE = 1 shl 4;
          POLARITY_MASKED_POSITIVE = 2 shl 4;
          POLARITY_MASKED_NEGATIVE = 3 shl 4;
          { Output Selection : Imm8[6] }
          OS_LSI = 0 shl 6;
          OS_MSI = 1 shl 6;
          OS_BIT_MASK = 0 shl 6;
          OS_BYTE_WORD_MASK = 1 shl 6;

        const
          [Align(16)]
          AndMaskData: array [0 .. 7] of SmallInt = (31, 31, 31, 31, 31, 31, 31, 31);
          RangeData: array [0 .. 7] of WideChar = '09afAF' + #0000;
          // indexed by (Ord(ch) and 31): 1..6 = A..F / a..f, 16..25 = 0..9
          TableData: array [0 .. 25] of TChar4 = ('????', '1010', '1011', '1100', '1101', '1110', '1111',
            '????', '????', '????', '????', '????', '????', '????', '????', '????',
            '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');

        // eax = P (source), edx = Q (destination), ecx = Len
        function _HexToBinGhost(P, Q: PChar; Len: Integer): Boolean;
        asm
          push ebx
          push esi
          push edi
          mov ebx, ecx
          test ecx, ecx          // mov does not set flags; ZF = (Len = 0)
          jz @@Empty
          lea esi, [TableData ]
          mov edi, ecx
          and ebx, 7             // trailing
          shr edi, 3             // number of simd-run
          jz @@HandleTrailing    // too small to be done using simd
          movdqa xmm0, [AndMaskData]
          movdqa xmm1, [RangeData ]
        @@SimdLoop:
          movdqu xmm3, [eax]
          { check if its a valid hex }
          pcmpistri xmm1, xmm3, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE or OS_LSI
          ja @@Process
          { either end (null terminated) or invalid char }
          test Word [eax + ecx *2], -1 // is NT ? => continue
          jnz @@InvalidChar
        @@Process:
          pand xmm3, xmm0        // and each element with 31
          pxor xmm2, xmm2
          { --- first four chars --- }
          movdqa xmm4, xmm3
          { first two char }
          punpcklbw xmm4, xmm2   // unpack low data
          pcmpeqq xmm2, xmm2     // generate mask
          // gather two TChar4 = 2*SizeOf(TChar4) = 16 bytes
          db $c4, $e2, $e9, $90, $2c, $e6 // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
          movdqu [edx], xmm5     // store to result
          { second two char }
          pshufd xmm4, xmm4, $0E // move next two elements to low
          pcmpeqq xmm2, xmm2
          // gather two TChar4
          db $c4, $e2, $e9, $90, $2c, $e6 // vpgatherdq xmm5,QWORD PTR [esi+xmm4*8],xmm2
          movdqu [edx+16], xmm5
          { --- last four chars --- }
          { first two char }
          pxor xmm2, xmm2
          punpckhbw xmm3, xmm2   // unpack high data
          pcmpeqq xmm2, xmm2
          db $c4, $e2, $e9, $90, $2c, $de // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
          movdqu [edx+32], xmm5
          { second two char }
          pshufd xmm3, xmm3, $0E
          pcmpeqq xmm2, xmm2
          db $c4, $e2, $e9, $90, $2c, $de // vpgatherdq xmm5,QWORD PTR [esi+xmm3*8],xmm2
          movdqu [edx+48], xmm5
          add eax, 16
          add edx, 16 * 4
          dec edi
          jnz @@SimdLoop
          test ebx, ebx
          jz @@NoTrailing
        @@HandleTrailing:
          mov ecx, ebx
        @@LegacyLoop:
          movzx edi, word [eax]
          lea ebx, [edi - 48]    // '0'..'9'
          cmp ebx, 9
          jbe @@Copy
          lea ebx, [edi - 65]    // 'A'..'F'
          cmp ebx, 5
          jbe @@Copy
          lea ebx, [edi - 97]    // 'a'..'f'
          cmp ebx, 5
          ja @@InvalidChar
        @@Copy:
          and edi, 31
          movq xmm0, [esi + edi * 8]
          movq [edx], xmm0
          add eax, 2
          add edx, 8
          dec ecx
          jnz @@LegacyLoop
        @@NoTrailing:
          mov word[edx], 0       // NT.
        @@Empty:
          mov eax, True
        @@End:
          pop edi
          pop esi
          pop ebx
          ret
        @@InvalidChar:
          xor eax, eax
          jmp @@End
        end;

        function HexToBinGhost(const Value: string): string;
        begin
          SetLength(Result, Length(Value) * 4);
          if _HexToBinGhost(Pointer(Value), Pointer(Result), Length(Value)) then
            exit;
          raise EConvertError.CreateFmt('Invalid hex digit found in ''%s''', [Value]);
        end;
  18. Mahdi Safsafi


    Agreed! One thing I've noticed is that people complain too much about compilation time (which could be a principal factor influencing EMB's decisions about optimization)! I understand that completely, as a fast compilation time makes us all happy. But this shouldn't come at the cost of emitting poor code. If someone likes a fast compiler ... they should know for sure that clients also like fast apps! Well-generated code generally means things get done quickly; for servers it means less power consumption (a smaller bill); on mobile it means better battery life (happy clients), etc.
  19. Mahdi Safsafi


    Of course! I'm just going to underline a serious problem. In the past, no one bothered to understand optimization (even great developers/companies didn't), because CPUs were evolving so fast and each new generation beat the previous one by a large margin (people just upgraded their CPUs and saw performance go up x2, x3). But today we are reaching a dead end (Moore's Law), and the difference between a new generation and the previous one isn't really significant! So today the effort is shifting from CPUs to compilers/parallel programming. In the past few years, we have seen the emergence of LLVM as a powerful compiler infrastructure, and the big whales have started to communicate with each other more than ever. In a nutshell, tomorrow's problem is optimization. The way many choose to deal with it is by improving the compiler. And I really think that Delphi should join this race ASAP.
  20. Mahdi Safsafi


    @Mike Torrettinni It can be optimized further by using full SIMD for all the operations, but unfortunately this requires some instructions that are not supported by Delphi, nor by some CPUs.
  21. Mahdi Safsafi


    @Stefan Glienke SIMD instructions are very powerful, but they come with their own issues too (portability, alignment requirements for some instructions, ...). Some compilers have great facilities to deal with those issues. But neither the Delphi compiler nor the RTL helps (we don't even have AVX) 😥 I really wish that at some point Delphi would support SIMD through intrinsics or some vector types.
  22. Mahdi Safsafi


    Partially SIMDed, without trailing ... and it beat them all 🙂

        const
          { Source Data Format : Imm8[1:0] }
          DF_UNSIGNED_BYTES = 0;
          DF_UNSIGNED_WORDS = 1;
          DF_SIGNED_BYTES = 2;
          DF_SIGNED_WORDS = 3;
          { Aggregation Operation : Imm8[3:2] }
          AGGREGATION_OP_EQUAL_ANY = 0 shl 2;
          AGGREGATION_OP_RANGES = 1 shl 2;
          AGGREGATION_OP_EQUAL_EACH = 2 shl 2;
          AGGREGATION_OP_EQUAL_ORDERED = 3 shl 2;
          { Polarity : Imm8[5:4] }
          POLARITY_POSITIVE = 0 shl 4;
          POLARITY_NEGATIVE = 1 shl 4;
          POLARITY_MASKED_POSITIVE = 2 shl 4;
          POLARITY_MASKED_NEGATIVE = 3 shl 4;
          { Output Selection : Imm8[6] }
          OS_LSI = 0 shl 6;
          OS_MSI = 1 shl 6;
          OS_BIT_MASK = 0 shl 6;
          OS_BYTE_WORD_MASK = 1 shl 6;

        const
          [Align(16)]
          Range: array [0 .. 7] of Char = '09afAF' + #00;

        // Scans 8 chars per iteration until the first char outside 0-9 / a-f / A-F;
        // the string is valid only if that char is the null terminator.
        function IsValidHex(P: Pointer): Boolean;
        asm
          movdqa xmm1, [Range]
          sub eax, 16
        @@SimdLoop:
          add eax, 16
          movdqu xmm2, [eax]
          pcmpistri xmm1, xmm2, DF_UNSIGNED_WORDS or AGGREGATION_OP_RANGES or POLARITY_NEGATIVE
          ja @@SimdLoop
          test Word [eax + ecx *2], -1
          setz al
        end;

        function HexToBinMahdiOneShot(const HexValue: string): string;
        type
          TChar4 = array [0 .. 3] of Char;
          PChar4 = ^TChar4;
        const
          // indexed by (Ord(ch) and 31): 1..6 = A..F / a..f, 16..25 = 0..9
          Table: array [0 .. 25] of TChar4 = ('0000', '1010', '1011', '1100', '1101', '1110', '1111',
            '0000', '0000', '0000', '0000', '0000', '0000', '0000', '0000', '0000',
            '0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
        var
          P: PChar4;
          I, Len: Integer;
        begin
          SetLength(Result, Length(HexValue) * 4);
          P := PChar4(Result);
          if IsValidHex(Pointer(HexValue)) then
          begin
            Len := Length(HexValue);
            for I := 1 to Len do
            begin
              P^ := Table[Ord(HexValue[I]) and 31];
              Inc(P);
            end;
          end
          else
            raise EConvertError.CreateFmt('Invalid hex : %s', [HexValue]);
        end;
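    The lookup trick in HexToBinMahdiOneShot relies on the low five bits of the character code: 'A'..'F' and 'a'..'f' both land on indexes 1..6, and '0'..'9' land on 16..25. The Python sketch below models only that indexing with a plain scalar validity check; it is not a model of the pcmpistri loop, and the function names are hypothetical.

```python
# Same 26 entries as the Delphi Table constant, indexed by (ord(ch) & 31).
TABLE = (
    "0000", "1010", "1011", "1100", "1101", "1110", "1111",          # 0, then A..F / a..f
    "0000", "0000", "0000", "0000", "0000", "0000", "0000", "0000",  # 7..14 unused
    "0000",                                                          # 15 unused
    "0000", "0001", "0010", "0011", "0100",                          # '0'..'4' -> 16..20
    "0101", "0110", "0111", "1000", "1001",                          # '5'..'9' -> 21..25
)

def is_valid_hex(text: str) -> bool:
    """Scalar stand-in for the SIMD range check: 0-9, a-f, A-F only."""
    return all("0" <= c <= "9" or "a" <= c <= "f" or "A" <= c <= "F" for c in text)

def hex_to_bin(text: str) -> str:
    """Validate first, then translate each digit with one table lookup."""
    if not is_valid_hex(text):
        raise ValueError(f"Invalid hex : {text}")
    return "".join(TABLE[ord(c) & 31] for c in text)

print(hex_to_bin("1aF"))
```

    Validating up front is what makes the unused slots harmless: an invalid character never reaches the table, so no bounds or case checks are needed inside the conversion loop.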
  23. Mahdi Safsafi


    A benchmark does not always tell the full story 🙂 Unlike my code, yours isn't cache friendly:

        const
          BinaryValues: array [0..15] of string = (
            '0000', '0001', '0010', '0011',
            '0100', '0101', '0110', '0111',
            '1000', '1001', '1010', '1011',
            '1100', '1101', '1110', '1111'
          );

    Let's see what happens on x64:
    - You need 16 * SizeOf(Pointer) to store the string references = 16 * 8 = 128 bytes = 2 cache lines.
    - You need 16 * (4 * SizeOf(Char) + 2 bytes (null terminator) + SizeOf(StrRec)) to store the data = 16 * (4 * 2 + 2 + 16) = 416 bytes = 7 cache lines.
    N.B.: I assumed that the compiler did a great job by placing your data in a contiguous region.
    --------------------------------------------------------
    You end up consuming 416 + 128 = 544 bytes.
    You end up consuming 9 cache lines.

        type
          TChar4 = array[0..3] of Char;
          PChar4 = ^TChar4;
        const
          Table1: array['0'..'9'] of TChar4 = ('0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001');
          Table2: array['a'..'f'] of TChar4 = ('1010', '1011', '1100', '1101', '1110', '1111');

    I only need 16 * (4 * SizeOf(Char)) to store the data = 16 * (4 * 2) = 128 bytes = 2 cache lines.
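    The footprint arithmetic above can be replayed in a few lines of Python. The sizes are the ones stated in the post for x64 (8-byte pointers, 2-byte Char, 16-byte StrRec); the 64-byte cache line is an assumption, as is the best-case layout in which each region is contiguous and counted from the start of a line.

```python
CACHE_LINE = 64                 # assumed cache-line size in bytes
POINTER, CHAR, STRREC = 8, 2, 16  # x64 sizes used in the post

def lines(n_bytes: int) -> int:
    """Cache lines touched by a contiguous, line-aligned region (ceiling division)."""
    return -(-n_bytes // CACHE_LINE)

# array [0..15] of string: 16 references plus 16 heap-style payloads.
refs = 16 * POINTER                         # string references
data = 16 * (4 * CHAR + 2 + STRREC)         # chars + null terminator + StrRec header
strings_total = refs + data

# array of TChar4: the characters are stored inline, nothing else.
flat_total = 16 * (4 * CHAR)

print("string table:", strings_total, "bytes,", lines(strings_total), "lines")
print("TChar4 table:", flat_total, "bytes,", lines(flat_total), "lines")
```

    Under these assumptions the string-based table needs 544 bytes against 128 for the inline one, which is where the 9 versus 2 cache lines in the post come from.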
  24. Mahdi Safsafi

    Default value

    Multiple inheritance = multiple troubles.
    Yep, if you install it before it gets created.