
Mahdi Safsafi

Posts posted by Mahdi Safsafi


  1. The "@" operator requires a variable/constant/...

     Here is one way to do it:

    program Example; // hypothetical wrapper so the snippet compiles standalone

    uses
      System.Generics.Collections; // for TStack<T>

    type
      PMyRec = ^TMyRec;

      TMyRec = record
        s: string;
      public
        function Address: PMyRec; inline;
      end;

    { TMyRec }

    function TMyRec.Address: PMyRec;
    begin
      Result := @Self;
    end;

    var
      stk: TStack<TMyRec>;
      rec: TMyRec;
      prec: PMyRec;

    begin
      stk := TStack<TMyRec>.Create;
      rec.s := 'Hello';
      stk.Push(rec);
      prec := stk.Peek().Address();
      Writeln(prec.s);
      prec.s := 'Goodbye';
      stk.Free;
    end.

     


  2. Quote

    This means that Initialize is called and also the constructor, right? If yes, then both need a cleanup counterpart, e.g. in case a managed field type such as a string is used.

     

    It's the developer's responsibility to perform the cleanup.

    As I said before, I'm not sure (I may be wrong)... so the best way to find out is to try 😉


  3. @Kas Ob.
     

    Quote

    Here I'm really intrigued by how a managed record behaves in 10.4 when it has an initializer.

    No one has published any assembly showing how a local record is handled; is there a hidden try..finally?

    Well, I didn't try Delphi Sydney, but I think (I'm not sure) that a managed record does not have a try/finally section, because the record isn't allocated dynamically. 
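    Since the better way to find out is to try: here is a minimal sketch of how one could observe it (assumes Delphi 10.4+, where custom managed records with the Initialize/Finalize class operators were introduced; TManagedRec and Test are hypothetical names):

```pascal
type
  TManagedRec = record
    Value: Integer;
    class operator Initialize(out Dest: TManagedRec);
    class operator Finalize(var Dest: TManagedRec);
  end;

class operator TManagedRec.Initialize(out Dest: TManagedRec);
begin
  // Runs automatically when the local comes into scope.
  Dest.Value := 42;
  Writeln('Initialize');
end;

class operator TManagedRec.Finalize(var Dest: TManagedRec);
begin
  // Runs automatically when the local goes out of scope.
  Writeln('Finalize');
end;

procedure Test;
var
  R: TManagedRec; // a local managed record
begin
  Writeln(R.Value);
end;
```

    Put a breakpoint inside Test and open the CPU view: any hidden try/finally the compiler emits around the local would show up in the generated assembly.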


  4.  

    1 hour ago, Alexander Sviridenkov said:

    A little bit off-topic, but in my experience 90% of performance issues in Delphi come from implicit try/except blocks.

    When a procedure contains a local variable of a managed type, or an implicit variable (e.g. string1+string2, Copy(), etc.), a try/except block with finalization calls is added around the whole procedure, even if the variable is used only in a small block. Moving this block (in case it is rarely executed) to a subprocedure helps a lot.

    This only applies to dcc32 (Win32), as it uses a stack-based exception mechanism. There the handler itself executes faster, but this is known to add about 15% overhead even when no exception occurs.

    On the other hand, Win64 uses a table-based exception mechanism. The exception handler executes a little more slowly, since it is handled exclusively by the runtime, but it adds no overhead when there is no exception.
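    The technique Alexander describes can be sketched like this (FastPath/RarePath are hypothetical names; the point is that the implicit finalization frame for the managed local is confined to the nested routine):

```pascal
procedure FastPath(Flag: Boolean);

  // The implicit exception frame needed to finalize S is generated
  // for this nested routine only, not for FastPath itself.
  procedure RarePath;
  var
    S: string; // managed local -> implicit frame lives here
  begin
    S := 'rarely ' + 'executed';
    Writeln(S);
  end;

begin
  // Hot path: no managed locals here, so on dcc32 no stack-based
  // exception frame has to be registered on every call to FastPath.
  if Flag then
    RarePath;
end;
```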
     


  5. 1 hour ago, Attila Kovacs said:

    @Mahdi Safsafi 

     

    No, this was my question 🙂

    "As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt’s work."

     

    After reading the articles again, I would say that "forward branches are not taken the first time" means no conditional forward jump is predicted taken by the predictor the first time it is seen.

    Am I right?

    I see 🙂 

    The first time a CPU sees a branch, it uses a static predictor:

    Almost all CPUs assume that a backward branch is going to be taken, for the reason I explained with the loop example.

    For a forward branch, many CPUs (but not all) assume that it is not going to be taken. Some of them make a random prediction, like Core 2.

     

    UPDATE:

    @Attila Kovacs I forgot to answer your if/else question. Remember what I said: the first time a branch is seen, the CPU assumes the "if" is taken, because we usually intend to execute the if-statement. So it's the else section that is not going to be taken.


  6. 16 minutes ago, Attila Kovacs said:

    @Mahdi Safsafi I see, thx. Which one is the forward branch again? The if or the else section? I'm not sure anymore.

    Don't worry! I'll try to give a simple explanation 🙂 

    There are many kinds of jumps (direct, indirect, relative, ...). The relative jmp is the most used and the most efficient one. Relative means that the offset is relative to the Program Counter, PC (PC is the register that holds the current instruction pointer; on x86 it is a protected register, while on an ugly implementation such as aarch32 (ARM) it is a public register). A forward jump means the offset of the jmp is positive (hence we are jumping down); a backward jump means the offset is negative (jumping up).

    # address        # opcodes    # instruction      # comment
    # (in decimal)
    backward_label:
    00000000          85C0        test eax,eax
    00000002          7407        jz forward_label   ;   PC=00000002 OFFSET=7  ; dest_addr = PC + OFFSET + SizeOf(CurrentInstruction) =  2 + 7 + 2 = 11
    00000004          B801000000  mov eax,$00000001
    00000009          EBF5        jmp backward_label ;   PC=00000009 OFFSET=0xF5(-11) : dest_addr = Same_Formula_Above = 9 - 11 + 2 = 0
    forward_label:
    00000011          C3          ret 

    Now I believe forward/backward branches are clear to you. The interesting part about backward-branch prediction is that the CPU assumes the branch is taken, because backward branches are usually used for loops:

    // pascal code:
    begin
      for i := 0 to 10 do
      begin
        // dosomething ...
      end;
      // dosomething2 ...
    end;
    
    // asm version:
    xor ecx,ecx
    backward_label:
    ; dosomething ...
    inc ecx
    cmp ecx, 11
    jnz backward_label ;  backward branch
    
    state2: 
    ; dosomething2 ...
    
    //------------------------------------------------------------------------------------------
    If the CPU assumes that the backward branch is not taken, that's a performance penalty!
    For each iteration, the CPU wastes time executing the state2 instructions, and when it realizes that the prediction was wrong,
    it has to recover! There are 11 iterations!!! This can add huge overhead if we are processing a large amount of data.
    On the other hand, if it assumes that the backward branch is taken, it saves a lot of time and mispredicts only once, when the loop finally exits.

    Now I believe you understand the concept 🙂 and can easily answer your own question 😉


  7. Quote

    However, I did not find anything about "first time they see them"; in what inertial system are they considered "first seen"?

    They're considered first seen when no previous information is available. When the CPU has no prior information about a branch, it assumes that the "if" path is taken (because most of the time we intend to execute the code inside the if statement).

    BP uses complex algorithms for prediction (85-90% of predictions are correct!). Modern CPUs achieve up to 95%! Those algorithms have kept improving ever since. As I said before, recent CPUs have a fully specialized unit just for BP. While your program is running, the CPU records the executed branches. When the same logic runs a second, third, ... time, the CPU uses the previously recorded information to predict whether a branch is going to be taken or not.

    Note that this technology is widely used by many architectures (not only x86). However, implementations vary (some have a dedicated BP unit, others don't; some are more capable of OoOE, others have limited support, ...).


  8. 27 minutes ago, dummzeuch said:

    Longint = integer since basically forever.

    Longint does not have a fixed width on all platforms.

    // Delphi 10.3
    // System unit
    // line 242:
    {$IF SizeOf(LongInt) = 8}
      {$DEFINE LONGINT64}
      {$DEFINE LONGINTISCPPLONG}
    {$ENDIF}

     So I prefer to stick with the original declaration.
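    A quick sketch to check this on any target yourself (hedged: on Windows targets this prints 4, while on 64-bit POSIX targets the LONGINT64 define above makes LongInt 8 bytes, matching C/C++ long there):

```pascal
program LongIntWidth; // hypothetical wrapper program

begin
  // Integer is 4 bytes on every Delphi target.
  Writeln('SizeOf(Integer) = ', SizeOf(Integer));
  // LongInt is 4 bytes on Windows but 8 bytes where LONGINT64
  // is defined (64-bit POSIX targets).
  Writeln('SizeOf(LongInt) = ', SizeOf(LongInt));
end.
```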


  9. Quote

    which could result in Pos becoming negative. On the other hand that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 or 2^33 gibibytes. Do 64 bit processors actually have the capacity of addressing that much (virtual) memory?

    AFAIK, there is no implementation that can handle this massive amount of data. A stream uses Int64 so that it can handle more than 4 GB, not 2^33 GiB. 

    Quote

    Are there any other bugs in my implementation?

    Why did you replace LongInt with Integer for Count? Is this a typo?


  10. Quote

    But this was/should be the case even without knowing any implementation detail of the CPU.

    OoOE is an old technology designed to reduce CPU idle time: the CPU is free to rearrange the order in which it executes a piece of logic, but it must produce the same result. Branch prediction (BP) was needed to make that feature work smoothly. So in short, there were two important eras here. In the first, developers/compilers were forced to cooperate with the CPU in order to benefit from the feature: the CPU provided special opcode prefixes for branches, called prediction hints (0x2E: branch not taken; 0x3E: branch taken), and compilers such as gcc provided a built-in function to support that (__builtin_expect). Developers had to use the asm version or the builtin to benefit from BP. Obviously, this wasn't good, because only highly qualified developers (who clearly understood the technology) were able to use it; besides, they couldn't possibly cover all their logic (it was just impossible to annotate every if statement). CPU makers realized that and replaced the ugly 0x2E/0x3E prefixes with an automatic solution that does not require developer cooperation (after the P4 era). Today, CPU makers are working hard to improve OoOE/BP, because it is clear that over multiple runs (not a single run) this yields high performance (there is a dedicated CPU unit just for BP). Life is easier now, but you still need to understand the concept in order to write a high-performance version of your logic. For example, the second implementation of TMemoryStream.Write below has two states and funnels the common case through an extra branch before Result := 0; the original implementation has essentially one state and is highly optimized for Count > 0 over multiple runs. 

    
    function TMemoryStream.Write(const Buffer; Count: Integer): Longint;
    var
      Pos: Int64;
    begin
      // There is a high chance that this condition is false; if the CPU
      // predicts it true, that's a waste of time.
      if (FPosition < 0) or (Count <= 0) then
      begin
        Result := 0;
        Exit; // The CPU must wait to validate the prediction (idle).
      end
      else
      begin
        // state2:
        Pos := FPosition + Count;
        if Pos > FSize then
        begin
          if Pos > FCapacity then
            SetCapacity(Pos);
          FSize := Pos;
        end;
        System.Move(Buffer, Pointer(Longint(FMemory) + FPosition)^, Count);
        FPosition := Pos;
        Result := Count;
        // If the CPU prediction was wrong, it must recover from state2 (heavy).
      end;
    end;
    
    function TMemoryStream.Write(const Buffer; Count: Longint): Longint;
    var
      Pos: Int64;
    begin
      // There is a high chance that this condition is true; if the CPU
      // predicts it true, time is saved.
      if (FPosition >= 0) and (Count >= 0) then
      begin
        Pos := FPosition + Count;
        if Pos > 0 then
        begin
          if Pos > FSize then
          begin
            if Pos > FCapacity then
              SetCapacity(Pos);
            FSize := Pos;
          end;
          System.Move(Buffer, (PByte(FMemory) + FPosition)^, Count);
          FPosition := Pos;
          Result := Count;
          Exit; // The CPU may wait here to validate the prediction(s).
        end;
      end;
      // state2:
      Result := 0; // Recovering from state2 is not heavy.
    end;

     


  11. Hello,

    Take a look at the code/outputs below:

    
    type
      TMyClass = class
        FldEnum: (A, B, C);
        FldSet: set of (D, E, F);
        FldSubRange: 5 .. 10;

        FldRec: record
          FA: Integer;
          FB: Integer;
        end;

        FldInteger: Integer;
        FldString: string;
        FldArray: array [0 .. 2] of Integer;
        FldList: TList<(G, H, I)>;

        FldArrayOfRec: array [0 .. 2] of record
          A: Char;
          B: Char;
        end;
      end;

    type
      TMyClass2<T> = class(TMyClass)
        FldEnum: (A2, B2, C2);
        FldSet: set of (D2, E2, F2);
        FldSubRange: 5 .. 10;

        FldRec: record
          FA: Integer;
          FB: Integer;
        end;

        FldInteger: Integer;
        FldString: string;
        FldArray: array [0 .. 2] of Integer;
        FldList: TList<(G2, H2, I2)>;

        FldArrayOfRec: array [0 .. 2] of record
          A: Char;
          B: Char;
        end;
      end;
    
    procedure ShowRtti(AObj: TObject);
    var
      LCtx: TRttiContext;
      LType: TRttiType;
      LField: TRttiField;
      LFieldType: TRttiType;
    begin
      LCtx := TRttiContext.Create();
      LType := LCtx.GetType(AObj.ClassInfo);
      Writeln('------------ RTTI for ', AObj.ToString, ' ------------');
      for LField in LType.GetFields() do
      begin
        LFieldType := LField.FieldType;
        if Assigned(LFieldType) then
        begin
          Writeln(LField.Name:15, ' -> ', LFieldType.Name);
        end;
      end;
      Writeln('');
      LCtx.Free();
    end;
    
    var
      Obj1: TMyClass;
      Obj2: TMyClass2<Integer>;
    begin
      Obj1 := TMyClass.Create();
      Obj2 := TMyClass2<Integer>.Create();
      ShowRtti(Obj1);
      ShowRtti(Obj2);
      Obj1.Free();
      Obj2.Free();
      Readln;
    end.
    
    // --- outputs ---
    {
    ------------ RTTI for TMyClass ------------
            FldEnum -> :TMyClass.:1
             FldRec -> :TMyClass.:3
         FldInteger -> Integer
          FldString -> string
            FldList -> TList<Project1.:TMyClass.:4>
    
    ------------ RTTI for TMyClass2<System.Integer> ------------
            FldEnum -> TMyClass2<System.Integer>.:1
             FldSet -> TMyClass2<System.Integer>.:3
        FldSubRange -> TMyClass2<System.Integer>.:4
             FldRec -> TMyClass2<System.Integer>.:5
         FldInteger -> Integer
          FldString -> string
           FldArray -> TMyClass2<System.Integer>.:7
            FldList -> TList<Project1.TMyClass2<System.Integer>.:8>
      FldArrayOfRec -> TMyClass2<System.Integer>.:11
            FldEnum -> :TMyClass.:1
             FldRec -> :TMyClass.:3
         FldInteger -> Integer
          FldString -> string
            FldList -> TList<Project1.:TMyClass.:4>
    }

    As you can see, for TMyClass, some unnamed types (record, enum) have associated RTTI, but types such as subranges, sets and arrays don't! On the other hand, all fields of TMyClass2 have associated RTTI. 

    This is definitely a bug, as the compiler should accept only one behavior (either enable RTTI for all unnamed types or disable it for all)... but the question I'm asking is: what is the correct behavior? In other words, should an unnamed type have RTTI or not?  

    All typed languages I'm familiar with solved this by not allowing unnamed/anonymous types. C/C++ are an exception! They allow both unnamed and anonymous types, but they don't have an RTTI system (at least not an advanced system like Delphi's). So it's kind of hard to know what the correct behavior is when there is no reference around. BTW, I'd love to see how FPC handles it.

    In my opinion, the compiler should generate RTTI for unnamed types... but when I think more deeply I say no! It is an unnamed type (most likely meant to be anonymous): you declared it implicitly, so why should you expect explicit RTTI in return?

    Please guys, I'm not asking for a workaround/good practice/historical reason... just focus on the question 🙂 
     


  12. Quote

    The question was about TArray<T> where it does not matter at all rather than a few unnoticeable microseconds at compile time.

    Just for clarification, I used two different terms, little overhead and noticeable overhead, to distinguish between two different usages of TArray<T>. 

     

    Quote

    You are right however when talking about types that have executable code (and possibly a significant amount of typeinfo) as the compiler always emits all code of a generic type into each and every dcu that is using it as in your example with Unit1 and Unit2.

    However it does not need to emit into Unit3.dcu because that one is just referencing the type that already fully resides in Unit2.

    You definitely understood my example :classic_smile: In fact, for Unit3 the compiler only emitted the interface for the alias type, without an implementation (no machine-code generation). For Unit1 and Unit2 it emitted both the interface and the implementation (generated code). 


  13. Yes, there is a disadvantage to using TArray<TItem> instead of TItems. In the same unit where the class is declared, whenever the compiler finds an explicit generic type (TArray<TItem>), it must do extra work: matching arguments, checking constraints, ... In a large unit that uses generics massively, this may add a little overhead. TItems, on the other hand, works as a cache (the compiler does not need to check constraints again, for example).
    Using TArray<TItem> from another unit adds a noticeable overhead, as the compiler must instantiate the type in situ for that unit. In fact, do the following test yourself: 

    unit Unit1;
    
    interface
    
    uses
      System.SysUtils,
      System.Generics.Collections,
      System.Classes;
    
    type
      TObject<T> = class
        a: T;
        b: T;
        procedure foo(a, b: T);
      end;
    
      TListOfInteger = TList<TObject<Integer>>;
    implementation
    
    { TObject<T> }
    
    procedure TObject<T>.foo(a, b: T);
    begin
    
    end;
    
    end.
    
    // ---------------------------------------
    unit Unit2;
    
    interface
    
    uses
    
      System.SysUtils,
      System.Generics.Collections,
      System.Classes, Unit1;
    
    type
      TListOfInteger2 = TList<TObject<Integer>>;
    
    implementation
    
    end.
    
    //-----------------------------------------
    unit Unit3;
    
    interface
    
    uses
    
      System.SysUtils,
      System.Generics.Collections,
      System.Classes, Unit1;
    
    type
      TListOfInteger3 = TListOfInteger; // alias
    
    implementation
    
    end.

    Now, check the sizes of Unit1.dcu, Unit2.dcu and Unit3.dcu.
    One final thing: TItems is also friendlier to type and to read!
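    For completeness, the aliasing pattern under discussion looks like this (TItem/TItems are hypothetical names; TArray<T> itself is declared in the System unit):

```pascal
type
  TItem = record
    Id: Integer;
  end;

  // Declared once in the unit that owns TItem. Other units that
  // reference TItems reuse this instantiation instead of
  // re-instantiating TArray<TItem> in situ.
  TItems = TArray<TItem>;
```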


  14. 32 minutes ago, Stefan Glienke said:

    Those languages, to my knowledge, also don't have something like the initialization part of a unit, which might cause a chicken-and-egg problem.

    You're absolutely right, Stefan.

    In particular, the awesome D language has initialization and finalization sections, but it implements them in a very sexy way:

    1. Static constructors (initialization) are executed to initialize a module (unit)'s state. Static destructors (finalization) terminate a module's state.
    2. A module may have multiple static constructors and static destructors. The static constructors are run in lexical order; the static destructors are run in reverse lexical order.
    3. Non-shared static constructors and destructors are run whenever threads are created or destroyed, including for the main thread.
    4. Shared static constructors are run once before main() is called. Shared static destructors are run after the main() function returns.

    But as you said, it can be a source of a lot of bugs.
