
Mahdi Safsafi

Posts posted by Mahdi Safsafi


  1. The "@" operator requires a variable/constant/...

     Here is one way to do it:

    program Example; // hypothetical wrapper so the snippet compiles standalone

    uses
      System.Generics.Collections; // for TStack<T>

    type
      PMyRec = ^TMyRec;

      TMyRec = record
        s: string;
      public
        function Address: PMyRec; inline;
      end;

    { TMyRec }

    function TMyRec.Address: PMyRec;
    begin
      Result := @Self;
    end;

    var
      stk: TStack<TMyRec>;
      rec: TMyRec;
      prec: PMyRec;

    begin
      stk := TStack<TMyRec>.Create;
      rec.s := 'Hello';
      stk.Push(rec);
      prec := stk.Peek().Address();
      Writeln(prec.s);
      prec.s := 'Goodbye';
      stk.Free;
    end.

     


  2. Quote

    This means that Initialize is called and also the constructor, right? If yes, then both need a cleanup counterpart, e.g. in case a managed field type such as a string is used.

     

    It's the developer's responsibility to perform the cleanup.

    As I said before, I'm not sure (I may be wrong)... so the best way to find out is to try 😉


  3. @Kas Ob.
     

    Quote

    Here I'm really intrigued by how a managed record behaves in 10.4 when it has an initializer.

    No one has published any assembly showing how a local record is handled; is there a hidden try..finally?

    Well, I didn't try Delphi Sydney, but I think (I'm not sure) that a managed record does not have a try/finally section, because the record isn't allocated dynamically. 
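    Since the better way to find out is to try: here is a minimal sketch of how one could observe it (assumes Delphi 10.4+, where custom managed records with the Initialize/Finalize class operators were introduced; TManagedRec and Test are hypothetical names):

```pascal
type
  TManagedRec = record
    Value: Integer;
    class operator Initialize(out Dest: TManagedRec);
    class operator Finalize(var Dest: TManagedRec);
  end;

class operator TManagedRec.Initialize(out Dest: TManagedRec);
begin
  // Runs automatically when the local comes into scope.
  Dest.Value := 42;
  Writeln('Initialize');
end;

class operator TManagedRec.Finalize(var Dest: TManagedRec);
begin
  // Runs automatically when the local goes out of scope.
  Writeln('Finalize');
end;

procedure Test;
var
  R: TManagedRec; // a local managed record
begin
  Writeln(R.Value);
end;
```

    Put a breakpoint inside Test and open the CPU view: any hidden try/finally the compiler emits around the local would show up in the generated assembly.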


  4.  

    1 hour ago, Alexander Sviridenkov said:

    A little bit off-topic, but in my experience 90% of performance issues in Delphi come from implicit try/except blocks.

    When a procedure contains a local variable of a managed type, or an implicit variable (e.g. string1+string2, Copy(), etc.), a try/except block with finalization calls is added around the whole procedure, even if the variable is used only in a small block. Moving this block (in case it is rarely executed) to a subprocedure helps a lot.

    This only applies to dcc32 (Win32), as it uses a stack-based exception mechanism. There the handler itself executes faster, but this is known to add about 15% overhead even when no exception occurs.

    On the other hand, Win64 uses a table-based exception mechanism. The exception handler executes a little more slowly, since it is handled exclusively by the runtime, but it adds no overhead when there is no exception.
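    The technique Alexander describes can be sketched like this (FastPath/RarePath are hypothetical names; the point is that the implicit finalization frame for the managed local is confined to the nested routine):

```pascal
procedure FastPath(Flag: Boolean);

  // The implicit exception frame needed to finalize S is generated
  // for this nested routine only, not for FastPath itself.
  procedure RarePath;
  var
    S: string; // managed local -> implicit frame lives here
  begin
    S := 'rarely ' + 'executed';
    Writeln(S);
  end;

begin
  // Hot path: no managed locals here, so on dcc32 no stack-based
  // exception frame has to be registered on every call to FastPath.
  if Flag then
    RarePath;
end;
```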
     


  5. 1 hour ago, Attila Kovacs said:

    @Mahdi Safsafi 

     

    No, this was my question 🙂

    "As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt’s work."

     

    After reading the articles again, I would say that "forward branches are not taken the first time" means no conditional forward jump is predicted taken by the predictor the first time it is seen.

    Am I right?

    I see 🙂 

    The first time a CPU sees a branch, it uses a static predictor:

    Almost all CPUs assume that a backward branch is going to be taken, for the reason I explained with the loop example.

    For a forward branch, many CPUs (but not all) assume that it is not going to be taken. Some of them make a random prediction, like Core 2.

     

    UPDATE:

    @Attila Kovacs I forgot to answer your if/else question. Remember what I said: the first time a branch is seen, the CPU assumes the "if" is taken, because we usually intend to execute the if-statement. So it's the else section that is not going to be taken.


  6. 16 minutes ago, Attila Kovacs said:

    @Mahdi Safsafi I see, thx. Which one is the forward branch again? The if or the else section? I'm not sure anymore.

    Don't worry! I'll try to give a simple explanation 🙂 

    There are many kinds of jumps (direct, indirect, relative, ...). The relative jmp is the most used and the most efficient one. Relative means that the offset is relative to the Program Counter, PC (PC is the register that holds the current instruction pointer; on x86 it is a protected register, while on an ugly implementation such as aarch32 (ARM) it is a public register). A forward jump means the offset of the jmp is positive (hence we are jumping down); a backward jump means the offset is negative (jumping up).

    # address        # opcodes    # instruction      # comment
    # (in decimal)
    backward_label:
    00000000          85C0        test eax,eax
    00000002          7407        jz forward_label   ;   PC=00000002 OFFSET=7  ; dest_addr = PC + OFFSET + SizeOf(CurrentInstruction) =  2 + 7 + 2 = 11
    00000004          B801000000  mov eax,$00000001
    00000009          EBF5        jmp backward_label ;   PC=00000009 OFFSET=0xF5(-11) : dest_addr = Same_Formula_Above = 9 - 11 + 2 = 0
    forward_label:
    00000011          C3          ret 

    Now I believe forward/backward branches are clear to you. The interesting part about backward-branch prediction is that the CPU assumes the branch is taken, because backward branches are usually used for loops:

    // pascal code:
    begin
      for i := 0 to 10 do
      begin
        // dosomething ...
      end;
      // dosomething2 ...
    end;
    
    // asm version:
    xor ecx,ecx
    backward_label:
    ; dosomething ...
    inc ecx
    cmp ecx, 11
    jnz backward_label ;  backward branch
    
    state2: 
    ; dosomething2 ...
    
    //------------------------------------------------------------------------------------------
    If the CPU assumes that the backward branch is not taken, that's a performance penalty!
    For each iteration, the CPU wastes time executing the state2 instructions, and when it realizes that the prediction was wrong,
    it has to recover! There are 11 iterations!!! This can add huge overhead if we are processing a large amount of data.
    On the other hand, if it assumes that the backward branch is taken, it saves a lot of time and mispredicts only once, when the loop finally exits.

    Now I believe you understand the concept 🙂 and can easily answer your own question 😉


  7. Quote

    However, I did not find anything about "first time they see them"; in what inertial system are they considered "first seen"?

    They're considered first seen when no previous information is available. When the CPU has no prior information about a branch, it assumes that the "if" path is taken (because most of the time we intend to execute the code inside the if statement).

    BP uses complex algorithms for prediction (85-90% of predictions are correct!). Modern CPUs achieve up to 95%! Those algorithms have kept improving ever since. As I said before, recent CPUs have a fully specialized unit just for BP. While your program is running, the CPU records the executed branches. When the same logic runs a second, third, ... time, the CPU uses the previously recorded information to predict whether a branch is going to be taken or not.

    Note that this technology is widely used by many architectures (not only x86). However, implementations vary (some have a dedicated BP unit, others don't; some are more capable of OoOE, others have limited support, ...).


  8. 27 minutes ago, dummzeuch said:

    Longint = integer since basically forever.

    Longint does not have a fixed width on all platforms.

    // Delphi 10.3
    // System unit
    // line 242:
    {$IF SizeOf(LongInt) = 8}
      {$DEFINE LONGINT64}
      {$DEFINE LONGINTISCPPLONG}
    {$ENDIF}

     So I prefer to stick with the original declaration.
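    A quick sketch to check this on any target yourself (hedged: on Windows targets this prints 4, while on 64-bit POSIX targets the LONGINT64 define above makes LongInt 8 bytes, matching C/C++ long there):

```pascal
program LongIntWidth; // hypothetical wrapper program

begin
  // Integer is 4 bytes on every Delphi target.
  Writeln('SizeOf(Integer) = ', SizeOf(Integer));
  // LongInt is 4 bytes on Windows but 8 bytes where LONGINT64
  // is defined (64-bit POSIX targets).
  Writeln('SizeOf(LongInt) = ', SizeOf(LongInt));
end.
```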


  9. Quote

    which could result in Pos becoming negative. On the other hand that would mean a stream with about MaxInt64 bytes of data, which is quite a lot: 9,223,372,036,854,775,807 or 2^33 gibibytes. Do 64 bit processors actually have the capacity of addressing that much (virtual) memory?

    AFAIK, there is no implementation that can handle this massive amount of data. A stream uses Int64 so that it can handle more than 4 GB, not 2^33 GiB. 

    Quote

    Are there any other bugs in my implementation?

    Why did you replace LongInt with Integer for Count? Is this a typo?


  10. Quote

    But this was/should be the case even without knowing any implementation detail of the CPU.

    OoOE is an old technology designed to reduce CPU idle time: the CPU is free to rearrange the order in which it executes a piece of logic, but it must produce the same result. Branch prediction (BP) was needed to make that feature work smoothly. So in short, there were two important eras here. In the first, developers/compilers were forced to cooperate with the CPU in order to benefit from the feature: the CPU provided special opcode prefixes for branches, called prediction hints (0x2E: branch not taken; 0x3E: branch taken), and compilers such as gcc provided a built-in function to support that (__builtin_expect). Developers had to use the asm version or the builtin to benefit from BP. Obviously, this wasn't good, because only highly qualified developers (who clearly understood the technology) were able to use it; besides, they couldn't possibly cover all their logic (it was just impossible to annotate every if statement). CPU makers realized that and replaced the ugly 0x2E/0x3E prefixes with an automatic solution that does not require developer cooperation (after the P4 era). Today, CPU makers are working hard to improve OoOE/BP, because it is clear that over multiple runs (not a single run) this yields high performance (there is a dedicated CPU unit just for BP). Life is easier now, but you still need to understand the concept in order to write a high-performance version of your logic. For example, the second implementation of TMemoryStream.Write below has two states and funnels the common case through an extra branch before Result := 0; the original implementation has essentially one state and is highly optimized for Count > 0 over multiple runs. 

    
    function TMemoryStream.Write(const Buffer; Count: Integer): Longint;
    var
      Pos: Int64;
    begin
      // There is a high chance that this condition is false; if the CPU
      // predicts it true, that's a waste of time.
      if (FPosition < 0) or (Count <= 0) then
      begin
        Result := 0;
        Exit; // The CPU must wait to validate the prediction (idle).
      end
      else
      begin
        // state2:
        Pos := FPosition + Count;
        if Pos > FSize then
        begin
          if Pos > FCapacity then
            SetCapacity(Pos);
          FSize := Pos;
        end;
        System.Move(Buffer, Pointer(Longint(FMemory) + FPosition)^, Count);
        FPosition := Pos;
        Result := Count;
        // If the CPU prediction was wrong, it must recover from state2 (heavy).
      end;
    end;
    
    function TMemoryStream.Write(const Buffer; Count: Longint): Longint;
    var
      Pos: Int64;
    begin
      // There is a high chance that this condition is true; if the CPU
      // predicts it true, time is saved.
      if (FPosition >= 0) and (Count >= 0) then
      begin
        Pos := FPosition + Count;
        if Pos > 0 then
        begin
          if Pos > FSize then
          begin
            if Pos > FCapacity then
              SetCapacity(Pos);
            FSize := Pos;
          end;
          System.Move(Buffer, (PByte(FMemory) + FPosition)^, Count);
          FPosition := Pos;
          Result := Count;
          Exit; // The CPU may wait here to validate the prediction(s).
        end;
      end;
      // state2:
      Result := 0; // Recovering from state2 is not heavy.
    end;

     


  11. Hello,

    Take a look at the code/outputs below:

    
    type
      TMyClass = class
        FldEnum: (A, B, C);
        FldSet: set of (D, E, F);
        FldSubRange: 5 .. 10;

        FldRec: record
          FA: Integer;
          FB: Integer;
        end;

        FldInteger: Integer;
        FldString: string;
        FldArray: array [0 .. 2] of Integer;
        FldList: TList<(G, H, I)>;

        FldArrayOfRec: array [0 .. 2] of record
          A: Char;
          B: Char;
        end;
      end;

    type
      TMyClass2<T> = class(TMyClass)
        FldEnum: (A2, B2, C2);
        FldSet: set of (D2, E2, F2);
        FldSubRange: 5 .. 10;

        FldRec: record
          FA: Integer;
          FB: Integer;
        end;

        FldInteger: Integer;
        FldString: string;
        FldArray: array [0 .. 2] of Integer;
        FldList: TList<(G2, H2, I2)>;

        FldArrayOfRec: array [0 .. 2] of record
          A: Char;
          B: Char;
        end;
      end;
    
    procedure ShowRtti(AObj: TObject);
    var
      LCtx: TRttiContext;
      LType: TRttiType;
      LField: TRttiField;
      LFieldType: TRttiType;
    begin
      LCtx := TRttiContext.Create();
      LType := LCtx.GetType(AObj.ClassInfo);
      Writeln('------------ RTTI for ', AObj.ToString, ' ------------');
      for LField in LType.GetFields() do
      begin
        LFieldType := LField.FieldType;
        if Assigned(LFieldType) then
        begin
          Writeln(LField.Name:15, ' -> ', LFieldType.Name);
        end;
      end;
      Writeln('');
      LCtx.Free();
    end;
    
    var
      Obj1: TMyClass;
      Obj2: TMyClass2<Integer>;
    begin
      Obj1 := TMyClass.Create();
      Obj2 := TMyClass2<Integer>.Create();
      ShowRtti(Obj1);
      ShowRtti(Obj2);
      Obj1.Free();
      Obj2.Free();
      Readln;
    end.
    
    // --- outputs ---
    {
    ------------ RTTI for TMyClass ------------
            FldEnum -> :TMyClass.:1
             FldRec -> :TMyClass.:3
         FldInteger -> Integer
          FldString -> string
            FldList -> TList<Project1.:TMyClass.:4>
    
    ------------ RTTI for TMyClass2<System.Integer> ------------
            FldEnum -> TMyClass2<System.Integer>.:1
             FldSet -> TMyClass2<System.Integer>.:3
        FldSubRange -> TMyClass2<System.Integer>.:4
             FldRec -> TMyClass2<System.Integer>.:5
         FldInteger -> Integer
          FldString -> string
           FldArray -> TMyClass2<System.Integer>.:7
            FldList -> TList<Project1.TMyClass2<System.Integer>.:8>
      FldArrayOfRec -> TMyClass2<System.Integer>.:11
            FldEnum -> :TMyClass.:1
             FldRec -> :TMyClass.:3
         FldInteger -> Integer
          FldString -> string
            FldList -> TList<Project1.:TMyClass.:4>
    }

    As you can see, for TMyClass, some unnamed types (record, enum) have associated RTTI, but types such as subranges, sets and arrays don't! On the other hand, all fields of TMyClass2 have associated RTTI. 

    This is definitely a bug, as the compiler should accept only one behavior (either enable RTTI for all unnamed types or disable it for all)... but the question I'm asking is: what is the correct behavior? In other words, should an unnamed type have RTTI or not?  

    All typed languages I'm familiar with solved this by not allowing unnamed/anonymous types. C/C++ are an exception! They allow both unnamed and anonymous types, but they don't have an RTTI system (at least not an advanced system like Delphi's). So it's kind of hard to know what the correct behavior is when there is no reference around. BTW, I'd love to see how FPC handles it.

    In my opinion, the compiler should generate RTTI for unnamed types... but when I think more deeply I say no! It is an unnamed type (most likely meant to be anonymous): you declared it implicitly, so why should you expect explicit RTTI in return?

    Please guys, I'm not asking for a workaround/good practice/historical reason... just focus on the question 🙂 
     


  12. Quote

    The question was about TArray<T> where it does not matter at all rather than a few unnoticeable microseconds at compile time.

    Just for clarification, I used two different terms, little overhead and noticeable overhead, to distinguish between two different usages of TArray<T>. 

     

    Quote

    You are right however when talking about types that have executable code (and possibly a significant amount of typeinfo) as the compiler always emits all code of a generic type into each and every dcu that is using it as in your example with Unit1 and Unit2.

    However it does not need to emit into Unit3.dcu because that one is just referencing the type that already fully resides in Unit2.

    You definitely understood my example :classic_smile: In fact, for Unit3 the compiler only emitted the interface for the alias type, without an implementation (no machine-code generation). For Unit1 and Unit2 it emitted both the interface and the implementation (generated code). 


  13. Yes, there is a disadvantage to using TArray<TItem> instead of TItems. In the same unit where the class is declared, whenever the compiler finds an explicit generic type (TArray<TItem>), it must do extra work: matching arguments, checking constraints, ... In a large unit that uses generics massively, this may add a little overhead. TItems, on the other hand, works as a cache (the compiler does not need to check constraints again, for example).
    Using TArray<TItem> from another unit adds a noticeable overhead, as the compiler must instantiate the type in situ for that unit. In fact, do the following test yourself: 

    unit Unit1;
    
    interface
    
    uses
      System.SysUtils,
      System.Generics.Collections,
      System.Classes;
    
    type
      TObject<T> = class
        a: T;
        b: T;
        procedure foo(a, b: T);
      end;
    
      TListOfInteger = TList<TObject<Integer>>;
    implementation
    
    { TObject<T> }
    
    procedure TObject<T>.foo(a, b: T);
    begin
    
    end;
    
    end.
    
    // ---------------------------------------
    unit Unit2;
    
    interface
    
    uses
    
      System.SysUtils,
      System.Generics.Collections,
      System.Classes, Unit1;
    
    type
      TListOfInteger2 = TList<TObject<Integer>>;
    
    implementation
    
    end.
    
    //-----------------------------------------
    unit Unit3;
    
    interface
    
    uses
    
      System.SysUtils,
      System.Generics.Collections,
      System.Classes, Unit1;
    
    type
      TListOfInteger3 = TListOfInteger; // alias
    
    implementation
    
    end.

    Now, check the sizes of Unit1.dcu, Unit2.dcu and Unit3.dcu.
    One final thing: TItems is also friendlier to type and to read!
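    For completeness, the aliasing pattern under discussion looks like this (TItem/TItems are hypothetical names; TArray<T> itself is declared in the System unit):

```pascal
type
  TItem = record
    Id: Integer;
  end;

  // Declared once in the unit that owns TItem. Other units that
  // reference TItems reuse this instantiation instead of
  // re-instantiating TArray<TItem> in situ.
  TItems = TArray<TItem>;
```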


  14. 32 minutes ago, Stefan Glienke said:

    Those languages, to my knowledge, also don't have something like the initialization part of a unit, which might cause a chicken-and-egg problem.

    You're absolutely right, Stefan.

    In particular, the awesome D language has initialization and finalization sections, but it implements them in a very sexy way:

    1. Static constructors (initialization) are executed to initialize a module (unit)'s state. Static destructors (finalization) terminate a module's state.
    2. A module may have multiple static constructors and static destructors. The static constructors are run in lexical order; the static destructors are run in reverse lexical order.
    3. Non-shared static constructors and destructors are run whenever threads are created or destroyed, including for the main thread.
    4. Shared static constructors are run once before main() is called. Shared static destructors are run after the main() function returns.

    But as you said, it can be a source of a lot of bugs.
