Mike Torrettinni

Micro optimization - effect of defined and not used local variables

and "IF" you use it that way:

  • RAD Studio 10.3.3 Arch sample
  • aStr.Length  --> using the helper class for strings




var
  Form1: TForm1;

implementation

{$R *.dfm}

var // const OR
  bFlag: boolean = true; // change it if needed

function ProcessStringOLD(const aStr: string): string;
begin
  Result := aStr;
  // Result := Result + ' new value '; // NO NEED for any "local" var!
  if (Result.Length = 1) then
    Result := aStr;
end;

function ProcessStringNew(const aStr: string): string;
begin
  if bFlag then
    Result := ProcessStringOLD(aStr);
  Form1.Caption := Result;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  // ...
end;

initialization // is executed before many units (yours)

  bFlag := false;

end.







2 hours ago, David Heffernan said:

Is this code a bottleneck in your program? If not write it in the way that is easiest to read. How many times have I said that to you? Why would you choose to make your code hard to read for no benefit? 

If this change is an actual improvement and it becomes a template for similar functions, it could have a noticeable effect on the overall project. Of course, if in the end the result is not worth it, then the change will not be implemented.

2 hours ago, Bill Meyer said:

Or as a friend advised me when I started learning to code, on my wood-burning CPU: First make it work, then worry about performance.

Very good, thanks! This is a years-old function that works as expected. Maybe it's time for an improvement, but only if there is a benefit at the end. I'm not trying a change for the sake of a change.

54 minutes ago, emailx45 said:

initialization // is executed before many unit (yours) 
  bFlag := false;


Thank you. Interesting suggestion, but I don't use unit initialization sections (I think only in 1). I have 'prepare on start' unit/methods that handle/control project behavior, executed before first form is shown.

4 hours ago, Mike Torrettinni said:

Thank you. Interesting suggestion, but I don't use unit initialization sections (I think only in 1). I have 'prepare on start' unit/methods that handle/control project behavior, executed before first form is shown.

This section is routinely used to initialize things, register classes, etc. in Object Pascal.

You can have one in any unit, and it is executed according to the order in which the units are used in your project: always before any other units without it, with the matching finalization section running at the end, when your app closes.


Delphi uses it in many units (~895 units in the RAD Studio 10.3.3 Arch source code).

  • It is not just an "adornment" in the code, but a proper, important section, alongside "interface" and "implementation"!
  • It is widely used in "FireDAC", for example!


17 hours ago, Rollo62 said:

You could remove the flag, by the use of a PointerVariable as pointer to function.

Would that not potentially incur a cache miss, if the pointer points to a "remote" function?
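
For readers who have not used procedural variables, Rollo62's suggestion can be sketched roughly as below. This is only an illustration under assumptions: TProcessFunc, ProcessPlain, ProcessTrimmed and ProcessString are hypothetical names, not code from this thread. The flag test disappears, but every call becomes an indirect call through the variable, which is what the cache-miss question is about.

```pascal
program FuncPointerDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical signature shared by all interchangeable implementations.
  TProcessFunc = function(const aStr: string): string;

function ProcessPlain(const aStr: string): string;
begin
  Result := aStr;
end;

function ProcessTrimmed(const aStr: string): string;
begin
  Result := Trim(aStr);
end;

var
  // Assigned once at startup; callers no longer test a flag on every call.
  ProcessString: TProcessFunc;

begin
  ProcessString := ProcessTrimmed;
  Writeln('[', ProcessString('  abc  '), ']');

  ProcessString := ProcessPlain;
  Writeln('[', ProcessString('  abc  '), ']');
end.
```

Whether the indirect call is cheaper than testing a Boolean depends on the call site and the CPU, so it would need to be measured in the real program.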

16 hours ago, Kas Ob. said:

move these local managed types vars to be private fields even when each one of them is not used outside one method, here you can recycle them

This makes the object unusable for multi-threading because it is unnecessarily stateful.

7 hours ago, Mike Torrettinni said:

If this change is an actual improvement

Measure your program and find out. 


If all you do is micro benchmarks then likely all you will achieve is to make your code harder to read and develop, and your program runs no faster. 


Do you know where the bottlenecks are in your program? 

7 minutes ago, A.M. Hoornweg said:

This makes the object unusable for multi-threading because it is unnecessarily stateful.

While it can, the impact is manageable most of the time. See, you have an object: if the object doesn't use any fields of its own, then you are free to use it safely in a multithreaded way, as long as it is not calling unsafe outside code (objects, functions). So by introducing such an approach, moving a local var to an object field, then yes, it should be protected against parallel usage. But, and this is a big but, when was the last time you saw an object without local field that had been used in multithreading!? Those are rare.

If the object does have fields, then it is already protected, and that field (converted from a local) would not make a big difference.


It is an approach to squeeze out some juice. It is not ideal and it does add complexity, and it comes with drawbacks like the one you are pointing to, but how is this different from any algorithm we use on a daily basis?

14 minutes ago, Kas Ob. said:

when was the last time you saw an object without local field that had been used in multithreading

What exactly is a "local field" ?         


Do you mean a private field of a class (a member of an instantiated object, located on the heap) , or do you mean a local variable of a procedure or method (located on the stack) ?



13 minutes ago, A.M. Hoornweg said:

What exactly is a "local field" ?     

just "fields" instead of "local field"


the right wording of the question is 

when was the last time you saw an object without fields that had been used in multithreading?

Some hints about performance on REAL bottlenecks:


It covers, among other things, the tip of using a sub-function if you have some temporary managed variables (like string).


The associated code, proving the slide assumptions, is available at https://synopse.info/files/slides/EKON22_2_High_Performance_Pascal_Code_On_Servers.zip
Worth a look to understand how it works in practice.
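
The sub-function tip can be sketched roughly as follows. This is a minimal illustration, not code taken from the slides: Accumulate, ReportNegative and LastError are hypothetical names. The idea is that the managed temporary (and the implicit try/finally the compiler adds for it) lives only in the rarely executed helper, so the frequently called function carries no managed locals at all.

```pascal
program SubFunctionTip;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  LastError: string;

// Rare path: the managed temporary and its hidden cleanup live here.
procedure ReportNegative(Value: Integer);
var
  Msg: string;
begin
  Msg := 'negative value: ' + IntToStr(Value);
  LastError := Msg;
end;

// Hot path: no managed locals, so no implicit try/finally frame.
function Accumulate(Total: Int64; Value: Integer): Int64;
begin
  if Value < 0 then
  begin
    ReportNegative(Value);
    Result := Total;
  end
  else
    Result := Total + Value;
end;

const
  Values: array[0..3] of Integer = (1, 2, -3, 4);
var
  i: Integer;
  Total: Int64;
begin
  Total := 0;
  for i := Low(Values) to High(Values) do
    Total := Accumulate(Total, Values[i]);
  Writeln(Total);      // 7
  Writeln(LastError);  // negative value: -3
end.
```

Note that a string expression such as the concatenation above would create an implicit managed temporary even without the Msg variable, which is why the whole expression is moved into the helper.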


But remember:
"Premature Optimization is the Root of All Evil!" (DK)

1 minute ago, Kas Ob. said:

just "fields" instead of "local field"


the right wording of the question is 

when was the last time you saw an object without fields that had been used in multithreading?

All the time.   I am especially fond of classes that have only class methods. They basically act as namespaces.  




10 minutes ago, A.M. Hoornweg said:

All the time.   I am especially fond of classes that have only class methods. They basically act as namespaces.  

Fine, that means you are free to use the first option: add an extra, otherwise unused parameter, moving the allocation of the managed-type variable out of the function that is called intensively in a loop and into the caller.


I didn't suggest that you or anyone should use that everywhere, but if you want to speed up a loop that calls a function with such variables, then there is a workaround.

I find myself using TStringList very often; it is a great tool that I can't live without. But when it comes to fast, intensive data processing, I found that recycling the list yields better performance: such usage removes the Create and Free, leaving me only to call Clear on exit, which the skipped destructor would otherwise have done.
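
A minimal sketch of the recycling idea described above, under assumptions: TLineCounter, FScratch and CountFields are hypothetical names made up for illustration. The scratch list is created once with the owning object and only cleared after each use. As pointed out earlier in the thread, this makes the instance stateful, so one instance should not be shared between threads without protection.

```pascal
program RecycleListDemo;
{$APPTYPE CONSOLE}

uses
  Classes;

type
  // Hypothetical class: the scratch list is a private field, created once.
  TLineCounter = class
  private
    FScratch: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    function CountFields(const aLine: string): Integer;
  end;

constructor TLineCounter.Create;
begin
  inherited Create;
  FScratch := TStringList.Create;
end;

destructor TLineCounter.Destroy;
begin
  FScratch.Free;
  inherited;
end;

function TLineCounter.CountFields(const aLine: string): Integer;
begin
  // No Create/Free per call: reuse the list and clear it on exit.
  FScratch.CommaText := aLine;
  Result := FScratch.Count;
  FScratch.Clear;
end;

var
  Counter: TLineCounter;
begin
  Counter := TLineCounter.Create;
  try
    Writeln(Counter.CountFields('a,b,c'));  // 3
    Writeln(Counter.CountFields('x,y'));    // 2
  finally
    Counter.Free;
  end;
end.
```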

1 hour ago, Kas Ob. said:

Fine, that means you are free to use the first option: add an extra, otherwise unused parameter, moving the allocation of the managed-type variable out of the function that is called intensively in a loop and into the caller.


I didn't suggest that you or anyone should use that everywhere, but if you want to speed up a loop that calls a function with such variables, then there is a workaround.

I find myself using TStringList very often; it is a great tool that I can't live without. But when it comes to fast, intensive data processing, I found that recycling the list yields better performance: such usage removes the Create and Free, leaving me only to call Clear on exit, which the skipped destructor would otherwise have done.

Fair enough.   You're using the stringlist as an internally shared object, just for saving some time by not having to create/destroy one whenever you need one.


You could take that concept one step further by creating a global stringlist pool (a singleton) from which you can request an available tStringlist whenever you need one.  That pool could be shared among many objects and you could even make it threadsafe if you want.




Procedure tMyobject.DoSomething;

VAR ts:tStringlist;
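
The rest of the fragment above is missing from this copy of the thread. As a rough, self-contained sketch of what such a pool might look like (an assumption, not the poster's code; TStringListPool, Acquire and Release are hypothetical names), a critical section guards a list of idle instances:

```pascal
program StringListPoolDemo;
{$APPTYPE CONSOLE}

uses
  Classes, SyncObjs;

type
  // Hypothetical thread-safe pool of reusable TStringList instances.
  TStringListPool = class
  private
    FLock: TCriticalSection;
    FIdle: TList;  // idle TStringList instances
  public
    constructor Create;
    destructor Destroy; override;
    function Acquire: TStringList;
    procedure Release(aList: TStringList);
  end;

constructor TStringListPool.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FIdle := TList.Create;
end;

destructor TStringListPool.Destroy;
var
  i: Integer;
begin
  for i := 0 to FIdle.Count - 1 do
    TObject(FIdle[i]).Free;
  FIdle.Free;
  FLock.Free;
  inherited;
end;

function TStringListPool.Acquire: TStringList;
begin
  Result := nil;
  FLock.Acquire;
  try
    if FIdle.Count > 0 then
    begin
      Result := TStringList(FIdle[FIdle.Count - 1]);
      FIdle.Delete(FIdle.Count - 1);
    end;
  finally
    FLock.Release;
  end;
  if Result = nil then
    Result := TStringList.Create;  // pool was empty: create a new one
end;

procedure TStringListPool.Release(aList: TStringList);
begin
  aList.Clear;  // hand back an empty list
  FLock.Acquire;
  try
    FIdle.Add(aList);
  finally
    FLock.Release;
  end;
end;

var
  Pool: TStringListPool;
  ts, ts2: TStringList;
begin
  Pool := TStringListPool.Create;
  try
    ts := Pool.Acquire;
    try
      ts.Add('hello');
      Writeln(ts.Count);   // 1
    finally
      Pool.Release(ts);
    end;
    ts2 := Pool.Acquire;   // the same instance comes back, already cleared
    Writeln(ts2 = ts);     // TRUE
    Writeln(ts2.Count);    // 0
    Pool.Release(ts2);
  finally
    Pool.Free;
  end;
end.
```

Acquiring from the pool is not free either (a lock plus a try/finally), so whether it beats a plain Create/Free would have to be measured.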









The problem of having local variables of managed data types such as strings is that Delphi needs to guarantee that no memory leaks occur.  So there's always a hidden Try/Finally block in such methods that will "finalize" the managed variables and release any allocated heap space. That takes time to execute, even if there's no further "code" in the method.
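
A small micro benchmark along these lines might look like the sketch below. It is an illustration only: AddPlain and AddWithUnusedString are hypothetical names, and whether the compiler still emits the initialization and cleanup for a string that is never used may depend on the compiler version and target, so no timings are claimed here.

```pascal
program ManagedLocalBench;
{$APPTYPE CONSOLE}

uses
  Diagnostics;

const
  Iterations = 100000000;

function AddPlain(a, b: Integer): Integer;
begin
  Result := a + b;
end;

function AddWithUnusedString(a, b: Integer): Integer;
var
  t: string;  // declared but never used, as in the thread title
begin
  Result := a + b;
end;

var
  i, Sum: Integer;
  sw: TStopwatch;
begin
  Sum := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to Iterations do
    Sum := AddPlain(Sum, 1);
  Writeln('plain:         ', sw.ElapsedMilliseconds, ' ms');

  Sum := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to Iterations do
    Sum := AddWithUnusedString(Sum, 1);
  Writeln('managed local: ', sw.ElapsedMilliseconds, ' ms');
end.
```

Comparing the generated code of the two functions in the CPU view is probably more reliable than the timings themselves.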












Good suggestions! At this moment this was a test of micro benchmarking, and if similar concept is applied to multiple methods, it might bring some more than micro improvements. Of course this is not a 'let me test this quickly in 1h and know the results'... it will take time and results might not be what I was hoping for, or I might be surprised and it turns out to be big overall improvement. 🙂

On 11/26/2020 at 7:08 PM, A.M. Hoornweg said:

The problem of having local variables of managed data types such as strings is that Delphi needs to guarantee that no memory leaks occur.  So there's always a hidden Try/Finally block in such methods that will "finalize" the managed variables and release any allocated heap space. That takes time to execute, even if there's no further "code" in the method.


I came across this trying to duplicate C++'s std::next_permutation using Delphi.  I went through various modifications, and had left a string declaration in where it was no longer needed:

procedure reverse(var s:AnsiString; const a,x:word);  inline;
var
    i,j : word;
    //t   : string;
begin                          //  x is one past the end of string
   if  a  = x-1 then exit;
   j     := ( x-a ) shr 1;     //  trunc((x-a)/2);
   for i := 1 to j do
            swapCh( s[a-1+i] , s[x-i] );
end;

All permutations of 12 chars = 479,001,600.  C++ = 2s.  Commenting out the string reduced the Delphi code from 9s to 6s.  (I haven't been back to it since then.)
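
For context, the algorithm behind std::next_permutation can be sketched in Delphi roughly as below. This is not pmcgee's implementation (which is not shown in the thread); NextPermutation and ReverseRange are hypothetical names, and the code favours clarity over speed.

```pascal
program NextPermDemo;
{$APPTYPE CONSOLE}

// Reverses s[a..b] (1-based, inclusive) in place.
procedure ReverseRange(var s: AnsiString; a, b: Integer);
var
  c: AnsiChar;
begin
  while a < b do
  begin
    c := s[a]; s[a] := s[b]; s[b] := c;
    Inc(a);
    Dec(b);
  end;
end;

// Rearranges s into the next lexicographic permutation.
// Returns False (leaving s sorted ascending) after the last permutation.
function NextPermutation(var s: AnsiString): Boolean;
var
  i, j: Integer;
  c: AnsiChar;
begin
  // Find the rightmost position i with s[i] < s[i+1].
  i := Length(s) - 1;
  while (i >= 1) and (s[i] >= s[i + 1]) do
    Dec(i);
  if i < 1 then
  begin
    ReverseRange(s, 1, Length(s));
    Exit(False);
  end;
  // Find the rightmost element greater than s[i] and swap the two.
  j := Length(s);
  while s[j] <= s[i] do
    Dec(j);
  c := s[i]; s[i] := s[j]; s[j] := c;
  // Put the tail back into ascending order.
  ReverseRange(s, i + 1, Length(s));
  Result := True;
end;

var
  s: AnsiString;
  n: Integer;
begin
  s := 'abcd';
  n := 1;
  while NextPermutation(s) do
    Inc(n);
  Writeln(n);  // 24 = 4!
  Writeln(s);  // abcd
end.
```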


On 11/26/2020 at 12:08 PM, A.M. Hoornweg said:

there's always a hidden Try/Finally block in such methods that will "finalize" the managed variables and release any allocated heap space. That takes time to execute, even if there's no further "code" in the method.

And on win32 those try/finally have a significant effect even worse than a heap allocation at times because they completely trash a part of the CPUs branch prediction mechanism - see RSP-27375

@pmcgee That swapCh caught my eye and signaled something wrong, would you care to share its implementation ?


I think it can be faster but that depends on that swapChar, s is var string and the compiler in many cases will introduce an overhead for handling it and passing it further.

Since you care about the speed of the 12! permutations, can you check whether this is faster:

procedure reverse(var s:AnsiString; const a,x:word);  inline;
var
    i,j : word;
    //t   : string;
    tmpChar: Byte;
    SBytes: pByte absolute s;  // note: zero-based, so SBytes[k] is s[k+1]
begin                          //  x is one past the end of string
   if  a  = x-1 then exit;
   j     := ( x-a ) shr 1;     //  trunc((x-a)/2);
   for i := 1 to j do
   begin
       tmpChar := SBytes[a - 1 + i];
       SBytes[a - 1 + i] := SBytes[x - i];
       SBytes[x - i] := tmpChar;
       //swapCh( s[a-1+i] , s[x-i] );
   end;
end;

Didn't run the code, i hope it is right.

5 minutes ago, Kas Ob. said:

@pmcgee That swapCh caught my eye and signaled something wrong, would you care to share its implementation ?


I think it can be faster but that depends on that swapChar, s is var string and the compiler in many cases will introduce an overhead for handling it and passing it further.

Since you care about the speed of the 12! permutations, can you check whether this is faster:

   for i := 1 to j do
   begin
       tmpChar := SBytes[a - 1 + i];
       SBytes[a - 1 + i] := SBytes[x - i];
       SBytes[x - i] := tmpChar;
       //swapCh( s[a-1+i] , s[x-i] );
   end;


I had tried a couple things there ... this was a small improvement.  I haven't pulled apart the assembly code yet.  It's just an ongoing interest.    It'll be fun to try it with char/byte array.

procedure swapByte( a:Pbyte ; b:Pbyte );    inline;
begin
    if a <> b then begin
       a^ := a^ + b^;
       b^ := a^ - b^;
       a^ := a^ - b^ ;
    end;
end;

procedure swapChar( var a : Ansichar; var b : Ansichar );   inline;
var c :  Ansichar;
begin
    if a<>b then begin
       c := a; a := b; b := c;
    end;
end;


@pmcgee Thank you for sharing.


I tried this


{$R *.res}

uses
  Windows;  // GetTickCount

{
procedure swapByte( a:Pbyte ; b:Pbyte );    inline;
begin
    if a <> b then begin
       a^ := a^ + b^;
       b^ := a^ - b^;
       a^ := a^ - b^ ;
    end;
end;  }

procedure swapChar(var a: Ansichar; var b: Ansichar); inline;
var
  c: Ansichar;
begin
  if a <> b then
  begin
    c := a;
    a := b;
    b := c;
  end;
end;

procedure reverse(var s: AnsiString; const a, x: word); inline;
var
  i, j: word;
  //t   : string;
begin                          //  x is one past the end of string
  if a = x - 1 then
    exit;
  j := (x - a) shr 1;     //  trunc((x-a)/2);
  for i := 1 to j do
    swapChar(s[a - 1 + i], s[x - i]);
end;

procedure reverse2(var s: AnsiString; const a, x: word); inline;
var
  i, j: word;
  tmpChar: Byte;
  SBytes: pByte absolute s;
begin
  if a = x - 1 then
    exit;
  j := (x - a) shr 1;     //  trunc((x-a)/2);
  for i := 1 to j do
  begin
    tmpChar := SBytes[a - 1 + i];
    SBytes[a - 1 + i] := SBytes[x - i];
    SBytes[x - i] := tmpChar;
    //swapCh(s[a - 1 + i], s[x - i]);
  end;
end;

var
  st: AnsiString;
  i: Integer;
  D: Uint64;

begin
  st := '1234567890ab';
  D := GetTickCount;
  for i := 1 to 479001600 do           // 12!
    reverse(st, 1, 12);
  D := GetTickCount - D;
  Writeln(D);

  st := '1234567890ab';
  D := GetTickCount;
  for i := 1 to 479001600 do
    reverse2(st, 1, 12);
  D := GetTickCount - D;
  Writeln(D);
end.


The result



What am i missing here ?

13 hours ago, Stefan Glienke said:

And on win32 those try/finally have a significant effect even worse than a heap allocation at times because they completely trash a part of the CPUs branch prediction mechanism - see RSP-27375

Yes, that's right! I've seen your proposal as well, and I have a better one that solves the issues with yours (Step Over / replicating the code) and is slightly faster!

To begin, the issue arises because there is a mismatch between call and ret instructions (a ret instruction that doesn't correspond to a call instruction). In your proposal, you introduced a call to fix the issue, but that also introduced the Step Over issue!

Here is my proposal: if we jumped in without using a call instruction, then we simply return without using a ret instruction. How? We do a lazy stack pop (add esp, 4) to remove the return address from the stack, then we jump back to the return address (jmp [esp - 4]).

Program Test;

{$R *.res}

uses
  Diagnostics, Windows;


procedure Test;
  i: Integer;
  i := 0;
      payload :
      add  esp, 4      // remove return address from the stack
      jmp  [esp - 4]   // jmp back (return address)
  if i = 0 then;

procedure PatchTryFinally1(address: Pointer);
const
  jmp: array [0 .. 14] of Byte = ($33, $C0, $5A, $59, $59, $64, $89, $10, $E8, $02, $00, $00, $00, $EB, $00);
var
  n: NativeUInt;
  target: Pointer;
  offset: Byte;
begin
  target := PPointer(PByte(address) + 11)^;
  offset := PByte(target) - (PByte(address) + 10) - 5;

  WriteProcessMemory(GetCurrentProcess, address, @jmp, SizeOf(jmp), n);
  WriteProcessMemory(GetCurrentProcess, PByte(address) + SizeOf(jmp) - 1, @offset, 1, n);
  FlushInstructionCache(GetCurrentProcess, address, SizeOf(jmp));
end;

procedure PatchTryFinally2(address: Pointer);
const
  // add esp, 4 / jmp [esp - 4]
  Data: array [0 .. 6] of Byte = ($83, $C4, $04, $FF, $64, $24, $FC);
var
  n: NativeUInt;
begin
  WriteProcessMemory(GetCurrentProcess, address, @Data, SizeOf(Data), n);
end;

procedure PatchTryFinally(address: Pointer);
begin
  PatchTryFinally2(PByte(@Test) + $32);
  PatchTryFinally1(PByte(@Test) + 26);
end;

  i: Integer;
  sw: TStopwatch;

  sw := TStopwatch.StartNew;

  sw := TStopwatch.StartNew;
  for i := 1 to 100000000 do



Maybe this is OT for this thread. I'll look up where to start a new one, and add a link.



8 hours ago, Kas Ob. said:

@pmcgee Thank you for sharing.


I tried this



The result

What am i missing here ?


