lg17

Execution time difference between 32b and 64b, using static arrays


Hi all.
I observe a noticeable difference in execution time for the same code compiled for 32-bit and for 64-bit.
The following code is an extract from a more complex application.
If you run it, it always takes much more time in 64-bit than in 32-bit. The code can be tested using arrays of 1, 2 or 3 floats (options 1, 2 and 3).

If I look at the assembly code in debug mode, it is different in 32-bit and 64-bit.

Any idea how to make the 64-bit version run as fast as the 32-bit one? (Compilation options or code adjustments?)
Thanks.

 

Note: I'm using Delphi 12.1

 

Here is the code:

	Type
	  {$IF Defined(WIN64)}
	  t_reel = double;
	  {$ELSE}
	  t_reel = single;
	  {$ENDIF}

	  t_vect1 = Array[0..0] OF t_reel;
	  t_vect2 = Array[0..1] OF t_reel;
	  t_vect3 = Array[0..2] OF t_reel;

	const
	  k_vect1_nul  : t_vect1 = (0.0);
	  k_vect2_nul  : t_vect2 = (0.0, 0.0);
	  k_vect3_nul  : t_vect3 = (0.0, 0.0, 0.0);

	procedure test();
	var
	  iLoop: integer;
	  l_SW1: TStopwatch;  // requires System.Diagnostics in the uses clause
	  l_vec1: t_vect1;
	  l_vec2: t_vect2;
	  l_vec3: t_vect3;
	begin
	  l_SW1 := TStopwatch.StartNew;

	  iLoop := 0;
	  while (iLoop < 900000000) do begin
		//l_vec1 := k_vect1_nul;  //option 1
		//l_vec2 := k_vect2_nul;  //option 2
		l_vec3 := k_vect3_nul;    //option 3
		inc(iLoop);
	  end;

	  l_SW1.Stop;
	  ShowMessage(IntToStr(l_SW1.ElapsedTicks) + ' ticks / ' + IntToStr(l_SW1.ElapsedMilliseconds) + ' ms');
	end;

 

Edited by lg17
Add compiler/IDE version note


That's interesting. I can confirm it, and I tried some variations of the single vs. double types in both; my times in milliseconds are shown in the comments:

  {$IF Defined(WIN64)}
  t_reel = double;      // 2880-2930 ms
  //t_reel = single;    // 2410-2440 ms
  {$ELSE}
  //t_reel = double;    // 5030-5150 ms
  t_reel = single;      // 1170-1180 ms
  {$ENDIF}

 


Replace:

 

  	//l_vec3 := k_vect3_nul;    //option 3
    Move(k_vect3_nul, l_vec3, SizeOf(t_vect3)); //< edit: use SizeOf(type)
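In context, the Move-based variant replaces the assignment inside the timing loop. A minimal sketch, assuming the t_vect3 type and loop from the original post (Move is the untyped RTL routine: it copies raw bytes, so source and destination must have the same size):

```pascal
  iLoop := 0;
  while (iLoop < 900000000) do begin
    // block copy via the RTL instead of the compiler-generated MOVSQ;
    // SizeOf(t_vect3) is 3 * SizeOf(t_reel) bytes
    Move(k_vect3_nul, l_vec3, SizeOf(t_vect3));
    inc(iLoop);
  end;
```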

 

Edited by DelphiUdIT

The difference depends on the assembler decoding: for example, if you use "option 2" with single in both (x32 and x64), the x64 version is faster....

Maybe it could be optimized better, but at compile time.

If you want, you can write an assembler function that does it better.

EDIT: it seems that an indirect write through the [si] and [di] CPU registers is really much slower than other techniques; that is known, but I really didn't think it was this slow...

Edited by DelphiUdIT


The compiler is generating very bad instructions; even calling the RTL Move would be faster than using MOVSQ!

https://uops.info/html-instr/MOVSQ.html

 

4 cycles in theory, but in reality it is 7-8 cycles on my CPU!

 

Suggestion: find these and manually move/copy the fields, especially in tight loops.


Thanks to all for your answers.

In my mind, a 32-bit application should perform better with 32-bit floats and a 64-bit application with 64-bit floats. This is why I have that conditional compilation at the beginning.

I'm going to try using Move where possible and see if it improves the speed.

The problem is that this slowness also happens when I assign one vectX to another vectX... like

l_vec3A := l_vec3B;

so everywhere in the code.
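For the pure zero-initialisation case there is also FillChar, and plain vector-to-vector assignments can be routed through Move the same way. A sketch, assuming the t_vect3 type from the original post (both are untyped RTL routines operating on raw bytes; an all-zero bit pattern is 0.0 for IEEE floats, so FillChar with 0 is safe here):

```pascal
var
  l_vec3A, l_vec3B: t_vect3;
begin
  // zero-fill without going through the constant assignment
  FillChar(l_vec3A, SizeOf(l_vec3A), 0);

  // copy one vector to another without the compiler's block-move sequence
  Move(l_vec3B, l_vec3A, SizeOf(t_vect3));
end;
```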

I don't know if this is something that could be solved/improved by Embarcadero.

Thanks.

21 minutes ago, lg17 said:

I don't know if this is something that could be solved/improved by embarcadero.

As @Kas Ob. and I said, something is wrong with the assembler coding. In theory the compiler does it the right way (my personal opinion), but the timing is not right... this needs more study. Now I'm looking at my own code as well; I'm facing some problems that may originate from this.

I don't think Embarcadero will do something right away... surely they will improve it in the future.

 

You can open a bug report on the new Quality Portal.

Edited by DelphiUdIT


It should also be kept in mind that the number of iterations (almost a billion) has a significant impact on the times: even on modern processors, a "machine" instruction that takes 1 cycle more than another adds roughly 400 ms over the whole run (900,000,000 extra cycles / 2.5 GHz ≈ 360 ms, assuming a base frequency of 2.5 GHz on a single core and only the theoretical calculated time).


This should be another way to do it (as @Kas Ob. suggested):

 

//l_vec3 := k_vect3_nul;  //option 3
for var i := Low(l_vec3) to High(l_vec3) do
  l_vec3[i] := k_vect3_nul[i];    //option 3

It's faster than the original in x64, and in x32 it's nearly the same, slightly slower.


Thanks @DelphiUdIT

The most efficient solution we found was:

l_vec3[0] := k_vect3_nul[0];
l_vec3[1] := k_vect3_nul[1];
l_vec3[2] := k_vect3_nul[2];
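To avoid repeating the three assignments at every call site, they could be wrapped in a small helper marked inline. A sketch; CopyVect3 is a made-up name, and whether the compiler really inlines it in release builds would need to be verified:

```pascal
procedure CopyVect3(const aSrc: t_vect3; var aDst: t_vect3); inline;
begin
  // element-by-element copy compiles to plain scalar moves
  // instead of a MOVSQ/MOVSD block copy
  aDst[0] := aSrc[0];
  aDst[1] := aSrc[1];
  aDst[2] := aSrc[2];
end;

// usage inside the loop:
// CopyVect3(k_vect3_nul, l_vec3);
```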

 


x64: It's incredible, but writing to memory 3 times is faster than using MOVSQ... a really weird hardware implementation; maybe it works better with long data, like strings.

The x32 version in fact performs 3 single writes (if the data type is "single"), and with the "double" data type it uses the same weird approach as x64, with even higher timings because it uses 2x MOVSD.

 

Edited by DelphiUdIT

3 hours ago, DelphiUdIT said:

really weird hardware implementation,

If it could have been done better, it would have been.

 

Anyway, you need to look at the hardware implementation in terms of its basic building blocks: you can't easily invent a new instruction or algorithm; you either need huge hardware circuits, which cost more delay and bring other problems, or you utilize the uops that already exist!

So here is a more detailed look at MOVSQ:

https://uops.info/html-lat/AMT/MOVSQ-Measurements.html

 

Quote
  • UOPS_RETIRED.ALL: 17.0

It shows that 17 or 18 micro-ops are used for one MOVSQ, which is a huge latency.


I made a little FMX test app a few years back just to see how floating-point performance differed across platforms: https://github.com/darnocian/delphi-gausslegendre-pi-approximation-test

 

It was nothing special and just a test using Extended / Double. The approximation function was implemented in https://github.com/darnocian/delphi-gausslegendre-pi-approximation-test/blob/master/gauss.legendre.pi.pas

 

I haven't run the benchmark again recently, but at the time it did highlight that a cross-platform review was required of how the underlying floating-point work was being done in the various compilers, as some were way off...


I wrote "really weird" because if you manually use "mov ax, source" and "mov dest, ax", you do it much more quickly, and not by a little.

 

"MOVSD" and its siblings are complex instructions (as your link shows): https://uops.info/html-instr/MOVSQ.html

Quote

Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)

The use of these should be carefully considered and evaluated. In the Intel Reference Manual (Intel® 64 and IA-32 Architectures Software Developer's Manual, Ed. June 2024), they say about these instructions:

 

Quote

7.3.9.3 Fast-String Operation
To improve performance, more recent processors support modifications to the processor's operation during the string store operations initiated with the MOVS, MOVSB, STOS, and STOSB instructions. This optimized operation, called fast-string operation, is used when the execution of one of those instructions meets certain initial conditions (see below). Instructions using fast-string operation effectively operate on the string in groups that may include multiple elements of the native data size (byte, word, doubleword, or quadword). With fast-string operation, the processor recognizes interrupts and data breakpoints only on boundaries between these groups. Fast-string operation is used only if the source and destination addresses both use either the WB or WC memory types.

So, they should be used only in those cases (I didn't look deeper) to be "fast"; otherwise they are "poor" (I think).

I don't know if Embarcadero can (and of course wants to) look into this and resolve the situation, or whether there really is a need to fix it.

 

We tested with nearly a billion iterations, but is that really a normal situation?

 

P.S.: I tried the same thing in Lazarus / FPC and the timing is lower (near 400 ms) for double operations in x64; their compiler uses a normal "mov [RBP...], RAX" and company to do the same work.

 

Edited by DelphiUdIT


@DelphiUdIT On a side note (after seeing you mention "mov ax, source" and "mov dest, ax"):

If you are going to open a ticket, then also report the hidden performance killer. It strikes randomly, due to the simplicity of the compiler and its use of specific 16-bit operand instructions.

 

from "3.4.2.3 Length-Changing Prefixes (LCP) " https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html

Quote

3.4.2.3 Length-Changing Prefixes (LCP)
The length of an instruction can be up to 15 bytes in length. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughput of the machine pipeline generally cannot hide LCP penalties.
The prefixes that can dynamically change the length of an instruction include:
• Operand size prefix (0x66).
• Address size prefix (0x67).
The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls.
The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.
Assembly/Compiler Coding Rule 19. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values.
If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.

Double LCP Stalls

....

• An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
• An instruction starts at offset 13 of a fetch line and references a memory location using register and immediate byte offset addressing mode.
The first stall is for the 1st fetch line, and the 2nd stall is for the 2nd fetch line. A double LCP stall causes a decode penalty of 11 cycles.

...

False LCP Stalls

..

The following techniques can help avoid false LCP stalls:
• Upcast all short operations from the F7 group of instructions to long, using the full 32 bit version.
• Ensure that the F7 opcode never starts at offset 14 of a fetch line.
Assembly/Compiler Coding Rule 20. (M impact, ML generality) Ensure instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line; and avoid using these instructions to operate on 16-bit data; upcast short data to 32 bits.

(Look at the included example, "Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions"; it is disturbing, to say the least.)

Most Delphi string operations use 16-bit loads and stores. The LCP effect above occurs when one of these instructions lands at a specific alignment, so reproducing it is hard: it may stay hidden until some change in a completely different part of the project triggers a small shift. Also, very importantly, these instructions are generated by the Delphi compiler without any consideration for such alignment. So in real life, when you benchmark, the timing keeps shifting up and down and differs from version to version and from build to build. What we witness looks like a steady result, but in fact the bottlenecks have merely shifted from one place to another. Fixing this would ensure better and more consistent string-handling performance in any binary the Delphi compiler produces.

 

I believe this was the case not long ago, when I suggested replacing a 16-bit field fill with a 32-bit one and letting it overflow. The suggestion looked outlandish, unless that instruction hit a specific address and then caused a 6 or 11 cycle stall (a full pipeline stop), and it looks like that was the case, as Vincent observed.

1 hour ago, Kas Ob. said:

On side note: (after seeing you mention "mov ax, source" and "mov dest, ax")

Sorry, I mentioned 16-bit registers only for quick "writing". Of course the implementation was on 64-bit registers.
As for the cases you indicated, I have honestly never noticed them (I don't debug in assembler unless there is a specific reason). I don't work much with strings, except for visualizations in the UI, and therefore I have not noticed problems such as performance drops or anything else.

I sometimes look at the assembler. I produce exclusively 64-bit applications, no longer 32-bit, and I saw that the compiler typically uses REX prefixes (so pure 64-bit instructions) and rarely 32-bit instructions (with 32-bit arguments); it may also be that the latter are reminiscences of the past.
Of course, for perfect code as you indicated (and as I would also like it to be) there would be a lot of guidelines to follow... in reality I think that projects (like compilers) are carried forward by adding functionality without ever being rewritten from scratch, and therefore they necessarily carry some legacies of the past with them.
Putting in place all the precautions with which the code could improve its functionality, stability and perhaps even security is an effort that requires a lot of resources.

Edited by DelphiUdIT

5 hours ago, DelphiUdIT said:

I have opened an issue on QP: RSS-2192

You can't seriously expect them to change the compiler codegen based on the information in that issue (or this thread for that matter).

29 minutes ago, Anders Melander said:

You can't seriously expect them to change the compiler codegen based on the information in that issue (or this thread for that matter).

I don't expect them to change anything, and I expressed that explicitly. But I think reporting it is a chance to do better in the future.

 

Edited by DelphiUdIT


It looks like whoever implemented that in the Delphi compiler was overly obsessed with code size.

 

If you compare what gcc and clang do, you can see that they rather produce a ton of mov instructions (which execute nicely in parallel on modern processors) before resorting to memmove or rep: https://godbolt.org/z/Gjsqn7qas (change the size of the static array to see the assembly code change)

Edited by Stefan Glienke
