lg17

Execution time difference between 32b and 64b, using static arrays

Hi all.
I observe a noticeable difference in execution time for the same code when it is compiled as 32-bit and as 64-bit.
The following code is an extract from a more complex application.
If you run it, it always takes much more time in 64-bit than in 32-bit. The code can be tested using arrays of 1, 2 or 3 floats (options 1, 2 and 3).

If I look at the assembly code in debug mode, it is different in 32b and 64b.

Any idea how to make the 64-bit version run as fast as the 32-bit one? (Compiler options or code adjustments?)
Thanks.

 

Note: I'm using Delphi 12.1

 

Here is the code:

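	// Needs System.SysUtils (IntToStr), System.Diagnostics (TStopwatch) and Vcl.Dialogs or FMX.Dialogs (ShowMessage) in the uses clause
	Type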
	  {$IF Defined(WIN64)}
	  t_reel = double;
	  {$ELSE}
	  t_reel = single;
	  {$ENDIF}

	  t_vect1 = Array[0..0] OF t_reel;
	  t_vect2 = Array[0..1] OF t_reel;
	  t_vect3 = Array[0..2] OF t_reel;

	const
	  k_vect1_nul  : t_vect1 = (0.0);
	  k_vect2_nul  : t_vect2 = (0.0, 0.0);
	  k_vect3_nul  : t_vect3 = (0.0, 0.0, 0.0);

	procedure test();
	var iLoop:integer;
		l_SW1:TStopwatch;
		l_vec1: t_vect1;
		l_vec2: t_vect2;
		l_vec3: t_vect3;
	begin
	  l_SW1:=TStopwatch.StartNew;

	  iLoop:=0;
	  while (iLoop<900000000) do begin
		//l_vec1 := k_vect1_nul;  //option 1
		//l_vec2 := k_vect2_nul;  //option 2
		l_vec3 := k_vect3_nul;    //option 3
		inc(iLoop);
	  end;

	  l_SW1.Stop;
	  ShowMessage(IntToStr(l_SW1.ElapsedTicks)+' ticks / '+IntToStr(l_SW1.ElapsedMilliseconds)+' ms');
	end;

 

Edited by lg17
Add compiler/IDE version note


That's interesting. I can confirm this, and I tried some variations of Single vs. Double in both builds; my times in milliseconds are shown in the comments:

  {$IF Defined(WIN64)}
  t_reel = double;      // 2880-2930 ms
  //t_reel = single;    // 2410-2440 ms
  {$ELSE}
  //t_reel = double;    // 5030-5150 ms
  t_reel = single;      // 1170-1180 ms
  {$ENDIF}

 


Replace the assignment with a call to Move:

  	//l_vec3 := k_vect3_nul;    //option 3
    Move(k_vect3_nul, l_vec3, SizeOf(t_vect3)); // edited: use SizeOf(type)

 

Edited by DelphiUdIT

The difference depends on the assembly code the compiler generates: for example, with option 2 and Single in both builds (x32 and x64), the x64 build is faster.

Maybe it could be optimized better, but that would have to be done by the compiler.

If you want, you can write an assembler function that does this better.

EDIT: it seems that an indirect write through the [si] and [di] CPU registers is much slower than other techniques. That is known, but I really didn't think it was this slow...

Edited by DelphiUdIT


The compiler is generating a very bad instruction here; even calling the RTL Move would be faster than using MOVSQ!

https://uops.info/html-instr/MOVSQ.html

4 cycles in theory, but in reality it is 7-8 cycles on my CPU!

Suggestion: find these assignments and copy the fields manually, especially in tight loops.


Thanks to all for your answers.

My assumption was that a 32-bit application performs better with 32-bit floats and a 64-bit application with 64-bit floats. This is why I have the conditional compilation at the beginning.

I'm going to try using the Move procedure where possible and see if it improves the speed.

The problem is that this slowness also occurs when I assign one vectX to another vectX, like

l_vec3A := l_vec3B;

so it shows up everywhere in the code.

I don't know if this is something that could be solved or improved by Embarcadero.

Thanks.

21 minutes ago, lg17 said:

I don't know if this is something that could be solved/improved by embarcadero.

As @Kas Ob. and I said, something is wrong with the generated assembly. In theory the compiler does the right thing (my personal opinion), but the timing is not right... this needs more study. I am now looking at my own code as well, since I am facing some problems that may originate from this.

I don't think Embarcadero will do anything right away, but surely they will improve this in the future.

You can open a bug report on the new quality portal.

Edited by DelphiUdIT


It should also be kept in mind that the number of loop iterations (almost a billion) has a significant impact on the times: even on modern processors, a machine instruction that takes 1 cycle more than another adds roughly 400 ms to the total (assuming a single-core base frequency of 2.5 GHz and counting only the theoretical time).
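
As a rough check of that estimate, assuming exactly one extra clock cycle per iteration, the 900,000,000 iterations of the test loop, and a fixed 2.5 GHz clock (no turbo boost and no overlap between instructions, which real CPUs do not guarantee):

```latex
\Delta t \approx \frac{N \cdot \Delta c}{f}
         = \frac{9 \times 10^{8} \times 1}{2.5 \times 10^{9}\ \mathrm{s^{-1}}}
         = 0.36\ \mathrm{s} = 360\ \mathrm{ms}
```

Here $N$ is the iteration count, $\Delta c$ the extra cycles per iteration and $f$ the clock frequency. The result is of the same order as the figure of about 400 ms quoted above.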


This should be another way to do it (as @Kas Ob. suggested):

 

//l_vec3 := k_vect3_nul;  //option 3
for var i := Low(l_vec3) to High(l_vec3) do
  l_vec3[i] := k_vect3_nul[i];    //option 3

It is faster than the original in x64, and slower by about the same margin in x32.

Edited by DelphiUdIT


Thanks @DelphiUdIT

The most efficient solution we found was:

l_vec3[0] := k_vect3_nul[0];
l_vec3[1] := k_vect3_nul[1];
l_vec3[2] := k_vect3_nul[2];
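
For the `l_vec3A := l_vec3B;` case raised earlier in the thread, the same element-wise copy could be wrapped in a small helper so it does not have to be written out at every assignment. This is only a minimal sketch: `CopyVect3` is a hypothetical name (not from the thread or the RTL), `t_reel` is fixed to Double here as in the Win64 branch of the original code, and whether the inlined helper is really as fast as the hand-written stores is an assumption that should be measured with the stopwatch loop from the first post.

```pascal
program CopyVect3Demo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  t_reel  = Double;                    // assumption: the Win64 choice from the first post
  t_vect3 = array[0..2] of t_reel;

const
  k_vect3_nul: t_vect3 = (0.0, 0.0, 0.0);

// Hypothetical helper: copies the three elements one by one instead of
// using a whole-array assignment. "inline" is only a request to the compiler.
procedure CopyVect3(const ASrc: t_vect3; out ADst: t_vect3); inline;
begin
  ADst[0] := ASrc[0];
  ADst[1] := ASrc[1];
  ADst[2] := ASrc[2];
end;

var
  l_vec3A, l_vec3B: t_vect3;
begin
  l_vec3B[0] := 1.0;
  l_vec3B[1] := 2.0;
  l_vec3B[2] := 3.0;

  // Used in place of: l_vec3A := l_vec3B;
  CopyVect3(l_vec3B, l_vec3A);
  WriteLn(l_vec3A[0]:0:1, ' ', l_vec3A[1]:0:1, ' ', l_vec3A[2]:0:1);

  // Used in place of: l_vec3A := k_vect3_nul;
  CopyVect3(k_vect3_nul, l_vec3A);
  WriteLn(l_vec3A[0]:0:1, ' ', l_vec3A[1]:0:1, ' ', l_vec3A[2]:0:1);
end.
```

The same pattern would need one helper per vector type (`t_vect1`, `t_vect2`, `t_vect3`), which keeps each copy a fixed number of scalar assignments.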

 


x64: It's incredible, but writing to memory 3 times is faster than using MOVSQ... a really weird hardware implementation; maybe it works better with long data, like strings.

The x32 build in fact performs 3 single writes (if the data type is Single); with Double it uses the same weird approach as x64, with even higher timings because it uses 2x MOVSD.

 

Edited by DelphiUdIT
