Anders Melander

x87 vs SSE single truncation


Posted (edited)

So I have the following function which is supposed to truncate a Single using the SSE CVTTSS2SI instruction. Pretty simple except for all the MXCSR fluff.
Yes, I know I could just use the SSE4.1 ROUNDSS instruction, which does all of the below in a single instruction, but that's not relevant to this.

 

Anyway, the problem is that my function doesn't always agree with System.Trunc (which is implemented with the x87 instruction FISTP). I guess that is to be expected in some cases due to the difference in precision (80 vs 32 bits), but as far as I can tell that is not the problem I'm encountering here - and I would also only expect it to manifest as a problem in rounding and not truncation.

 

Specifically I have the value -2343.5

System.Trunc(-2343.5) = -2343

FastTrunc(-2343.5)=-2344

 

Given that truncation is supposed to round towards zero, I believe that System.Trunc is correct. But then why is CVTTSS2SI not doing that?

 

function FastTrunc_SSE2(Value: Single): Integer;
var
  SaveMXCSR: Cardinal;
  NewMXCSR: Cardinal;
const
  // SSE MXCSR rounding modes
  MXCSR_ROUND_MASK    = $FFFF9FFF;
  MXCSR_ROUND_NEAREST = $00000000;
  MXCSR_ROUND_DOWN    = $00002000;
  MXCSR_ROUND_UP      = $00004000;
  MXCSR_ROUND_TRUNC   = $00006000;
asm
        XOR     ECX, ECX

        // Save current rounding mode
        STMXCSR SaveMXCSR
        // Load rounding mode
        MOV     EAX, SaveMXCSR
        // Do we need to change anything?
        TEST    EAX, MXCSR_ROUND_DOWN
        JNZ     @SetMXCSR
        TEST    EAX, MXCSR_ROUND_UP
        JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
@SetMXCSR:
        // Save current rounding mode in ECX and flag that we need to restore it
        MOV     ECX, EAX
        // Set rounding mode to truncation
        AND     EAX, MXCSR_ROUND_MASK
        OR      EAX, MXCSR_ROUND_TRUNC
        // Set new rounding mode
        MOV     NewMXCSR, EAX
        LDMXCSR NewMXCSR
@SkipSetMXCSR:

{$if defined(TARGET_x86)}
        MOVSS   XMM0, Value
{$ifend}
        // Round/Trunc
        CVTSS2SI EAX, XMM0

        // Restore rounding mode
        // Did we modify it?
        TEST    ECX, ECX
        JZ      @SkipRestoreMXCSR // Skip expensive LDMXCSR
        // Restore old rounding mode
        LDMXCSR SaveMXCSR
@SkipRestoreMXCSR:
end;

 

Edited by Anders Melander


Hmm. It seems to be doing odd/even rounding:

FastTrunc(0.5) = 0
FastTrunc(1.5) = 2
FastTrunc(2.5) = 2
FastTrunc(3.5) = 4

Ah, it's the fluff. I got the logic mixed up:

        // Do we need to change anything?
        TEST    EAX, MXCSR_ROUND_DOWN
        JNZ     @SetMXCSR
        TEST    EAX, MXCSR_ROUND_UP
        JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
@SetMXCSR:
        [...]
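
A small model makes the mix-up visible. The following Python sketch is not from the post, and the function names are made up for illustration. It mirrors the TEST/JNZ/TEST/JZ sequence and compares it with the check that was presumably intended: reload MXCSR unless the rounding mode is already truncation. It assumes the usual power-on MXCSR value of $1F80 (round to nearest).

```python
# MXCSR rounding-control bits, same values as the constants in the post
MXCSR_ROUND_DOWN = 0x2000
MXCSR_ROUND_UP = 0x4000
MXCSR_ROUND_TRUNC = 0x6000  # both rounding-control bits set


def changes_mode_original(mxcsr):
    """Mirror of the posted TEST/JNZ/TEST/JZ sequence (hypothetical helper)."""
    if mxcsr & MXCSR_ROUND_DOWN:
        return True   # JNZ @SetMXCSR
    if (mxcsr & MXCSR_ROUND_UP) == 0:
        return False  # JZ @SkipSetMXCSR
    return True       # falls through to @SetMXCSR


def needs_truncation_mode(mxcsr):
    """Presumed intent: reload only when the mode is not already truncation."""
    return (mxcsr & MXCSR_ROUND_TRUNC) != MXCSR_ROUND_TRUNC


# Assumed sample values: exception masks set, rounding mode varies
for mxcsr in (0x1F80, 0x3F80, 0x5F80, 0x7F80):
    print(hex(mxcsr), changes_mode_original(mxcsr), needs_truncation_mode(mxcsr))
```

With $1F80 the original check skips the LDMXCSR, so CVTSS2SI runs in round-to-nearest mode, which matches the odd/even results listed above. With truncation already active ($7F80) it reloads MXCSR although nothing needs to change.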

 

Yet again, the duck provides the answer.

Posted (edited)

nothing ... already answered ...

 

Edited by DelphiUdIT


CVTSS2SI vs. CVTTSS2SI:

CVTTSS2SI (truncation (rounding toward 0)) does not use/need MXCSR.
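
To illustrate the distinction, here is a minimal Python model of the two instructions. It is only a sketch: the function names are invented, it covers ordinary in-range values only (no NaN or overflow handling), and it relies on Python's round() rounding ties to even, just like the default MXCSR mode.

```python
import math

# MXCSR rounding-control field, same values as the constants earlier in the thread
ROUND_NEAREST = 0x0000  # to nearest, ties to even
ROUND_DOWN = 0x2000     # toward negative infinity
ROUND_UP = 0x4000       # toward positive infinity
ROUND_TRUNC = 0x6000    # toward zero


def model_cvtss2si(value, rc=ROUND_NEAREST):
    """Model of CVTSS2SI: the result depends on the MXCSR rounding mode."""
    if rc == ROUND_NEAREST:
        return round(value)  # Python's round() also rounds ties to even
    if rc == ROUND_DOWN:
        return math.floor(value)
    if rc == ROUND_UP:
        return math.ceil(value)
    return math.trunc(value)


def model_cvttss2si(value):
    """Model of CVTTSS2SI: always truncates, regardless of the rounding mode."""
    return math.trunc(value)


print([model_cvtss2si(v) for v in (0.5, 1.5, 2.5, 3.5)])
print(model_cvtss2si(-2343.5), model_cvttss2si(-2343.5))
```

In the model, the default mode reproduces both the -2344 result and the 0, 2, 2, 4 pattern from the earlier posts, while the truncating variant gives -2343 in every mode.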


Isn't this all that is needed?

function FastTrunc(Value: Single): Integer;
asm
  {$IFDEF CPUX86}
  movss xmm0, Value
  {$ENDIF}
  cvttss2si eax, xmm0
end;

 


@pcoder, @Stefan Glienke

I think he wants a general function that also handles other rounding modes (he included the other masks in the function).
But you are right, there are also native CPU instructions that do that specific job (like the one you showed).

 

19 minutes ago, DelphiUdIT said:

@pcoder, @Stefan Glienke

I think he wants a general function that also handles other rounding modes (he included the other masks in the function).
But you are right, there are also native CPU instructions that do that specific job (like the one you showed).

 

Would be weird to want to truncate a single (a well defined task) and insist on doing so in an inefficient way.

Posted (edited)
2 hours ago, Stefan Glienke said:

Isn't this all that is needed?


function FastTrunc(Value: Single): Integer;
asm
  {$IFDEF CPUX86}
  movss xmm0, Value
  {$ENDIF}
  cvttss2si eax, xmm0
end;

 

Yes, it is, but for some reason CVTTSS2SI is not always faster than CVTSS2SI. I'm not sure that I can trust the benchmarks though. The results do seem to fluctuate a bit.

 

Here are the different versions (TFloat = Single):

function Trunc_Pas(Value: TFloat): Integer;
begin
  Result := Trunc(Value);
end;

function FastTrunc_SSE2(Value: TFloat): Integer;
asm
{$if defined(CPUX86)}
        MOVSS      XMM0, Value
{$ifend}
        CVTTSS2SI  EAX, XMM0
end;

function SlowTrunc_SSE2(Value: TFloat): Integer;
var
  SaveMXCSR: Cardinal;
  NewMXCSR: Cardinal;
const
  // SSE MXCSR rounding modes
  MXCSR_ROUND_MASK    = $FFFF9FFF;
  MXCSR_ROUND_NEAREST = $00000000;
  MXCSR_ROUND_DOWN    = $00002000;
  MXCSR_ROUND_UP      = $00004000;
  MXCSR_ROUND_TRUNC   = $00006000;
asm
        XOR     ECX, ECX

        // Save current rounding mode
        STMXCSR SaveMXCSR
        // Load rounding mode
        MOV     EAX, SaveMXCSR
        // Do we need to change anything?
        MOV     ECX, EAX
        NOT     ECX
        AND     ECX, MXCSR_ROUND_TRUNC
        JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
@SetMXCSR:
        // Save current rounding mode in ECX and flag that we need to restore it
        MOV     ECX, EAX
        // Set rounding mode to truncation
        AND     EAX, MXCSR_ROUND_MASK
        OR      EAX, MXCSR_ROUND_TRUNC
        // Set new rounding mode
        MOV     NewMXCSR, EAX
        LDMXCSR NewMXCSR
@SkipSetMXCSR:

{$if defined(CPUX86)}
        MOVSS   XMM0, Value
{$ifend}
        // Round/Trunc
        CVTSS2SI EAX, XMM0

        // Restore rounding mode
        // Did we modify it?
        TEST    ECX, ECX
        JZ      @SkipRestoreMXCSR // Skip expensive LDMXCSR
        // Restore old rounding mode
        LDMXCSR SaveMXCSR
@SkipRestoreMXCSR:
end;

function FastTrunc_SSE41(Value: TFloat): Integer;
const
  ROUND_MODE = $08 + $03; // $00=Round, $01=Floor, $02=Ceil, $03=Trunc
asm
{$if defined(CPUX86)}
        MOVSS   xmm0, Value
{$ifend}

        ROUNDSS xmm0, xmm0, ROUND_MODE
        CVTSS2SI eax, xmm0
end;
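
For readers unfamiliar with the ROUNDSS immediate, the following Python sketch decodes the bit fields of a value such as ROUND_MODE. The helper name and the returned dictionary layout are made up for illustration; the bit meanings follow the Intel instruction reference as far as I know it.

```python
def decode_roundss_imm8(imm8):
    """Split a ROUNDSS immediate into its bit fields (hypothetical helper)."""
    modes = {
        0: "nearest (even)",
        1: "down (floor)",
        2: "up (ceil)",
        3: "toward zero (trunc)",
    }
    return {
        "rounding": modes[imm8 & 0x03],         # bits 0-1: rounding mode
        "use_mxcsr": bool(imm8 & 0x04),         # bit 2: 1 = take the mode from MXCSR instead
        "suppress_inexact": bool(imm8 & 0x08),  # bit 3: 1 = do not signal the precision exception
    }


print(decode_roundss_imm8(0x08 + 0x03))
```

For $08 + $03 this gives truncation, independent of MXCSR, with the inexact exception suppressed. Since ROUNDSS then leaves an integral value in XMM0, the following CVTSS2SI is exact whatever the current MXCSR rounding mode is.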

And here are the benchmark results from my 10-year-old Core i5-2500K @ 3.3 GHz desktop system.

x86 results

[image: x86 benchmark results]

x64 results

[image: x64 benchmark results]

 

Meh... but at least they are all consistently faster than Trunc - unless I test on my laptop with a Core i7-8750H CPU @ 2.2 GHz

x86 results on battery

[image: x86 benchmark results on battery]

x86 results on mains

[image: x86 benchmark results on mains]

Yes, I know it's the result of my power-saving profile throttling the CPU, but it's interesting that it makes the x87 math so much faster than the SIMD math.

 

Here's the benchmark code for completeness:

procedure BM_FastTrunc(const state: TState);
begin
  var FastTruncProc: TFastRoundProc := TFastRoundProc(state[0]);

  for var _ in state do
  begin
    RandSeed := 0;

    for var i := 1 to 1000*1000*1000 do
    begin
      FastTruncProc(Random(i) / i);
    end;
  end;
end;


const
  FastTruncs: array[0..3] of record
    Name: string;
    Proc: TFastRoundProc;
  end = (
    (Name: 'Trunc'; Proc: Trunc_Pas),
    (Name: 'FastTrunc_SSE2'; Proc: FastTrunc_SSE2),
    (Name: 'FastTrunc_SSE41'; Proc: FastTrunc_SSE41),
    (Name: 'SlowTrunc_SSE2'; Proc: SlowTrunc_SSE2)
  );
begin
  for var i := 0 to High(FastTruncs) do
    Spring.Benchmark.Benchmark(BM_FastTrunc, 'FastTrunc').Arg(Int64(@FastTruncs[i].Proc)).ArgName(FastTruncs[i].Name).TimeUnit(kMillisecond);

  Spring.Benchmark.Benchmark_Main;
end.
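
One property of the benchmark inputs is worth keeping in mind when reading the numbers: Random(i) with an integer argument yields an integer X with 0 <= X < i, so Random(i) / i is a non-negative value below 1 before it is converted to Single. The Python sketch below only mimics the shape of those inputs; it uses Python's own generator rather than Delphi's, and the function name and sample count are arbitrary.

```python
import math
import random


def sample_benchmark_inputs(count, seed=0):
    """Mimic the shape of the benchmark's Random(i) / i inputs (not Delphi's generator)."""
    rng = random.Random(seed)
    # randrange(i) yields an integer X with 0 <= X < i, like Delphi's Random(i)
    return [rng.randrange(i) / i for i in range(1, count + 1)]


values = sample_benchmark_inputs(10000)
print(min(values), max(values))
print({math.trunc(v) for v in values})
```

Every sampled value truncates to 0, so the benchmark mostly measures call overhead and instruction throughput rather than behaviour across a spread of magnitudes or signs.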

 

Edited by Anders Melander


By the way, the reason why the RTL Trunc is slower is probably because it's only been implemented for Double; There is no overload for Single so it always incurs the overhead of Single->Double conversion.

The x64 version is implemented with a single CVTTSD2SI instruction while the x86 version uses x87.

 

Also, since the RTL Trunc is implemented in assembler it cannot be inlined, and on x86 Delphi always passes Single params on the stack even though they would fit in a general register. This levels the playing field and makes a faster alternative worthwhile.

 

It's beyond me why they haven't implemented basic numerical functions such as Trunc, Round, Abs, etc. as compiler intrinsics so we at least can get them inlined.

On 3/3/2024 at 7:05 PM, Anders Melander said:

 

Given that truncation is supposed to round towards zero, I believe that System.Trunc is correct. But then why is CVTTSS2SI not doing that?

Say, with periodic stuff like TDateTime, Trunc works well if it is understood that right now is yesterday's date + a 0.xxxx fraction, so Trunc is not supposed to round up to zero when negative.

 

Conversely

 

On the unit circle, a truncated 1/4 turn needs one turn added to get a "reading" of 1. Now, to undo that added turn, we remove it when looking at "negative values": (-1/4 - 1) = Trunc(-1.25) = -1.

 

 

Posted (edited)

Differences in microbenchmarks can have all kinds of reasons (*) - when talking about the performance of single instructions you never estimate them from some possibly flawed microbenchmark but consult the instruction timings table (search for CVT(T)SS2SI) - the fact that truncate and non-truncate are always listed together makes it obvious that they perform exactly the same.

 

(*) code alignment or address of the measured functions being one of the many reasons that can easily make some small or significant differences in the results

 

All these tiny gotchas are the reason why many people don't like microbenchmarks. They are one tool for measuring but don't tell the ultimate truth - especially when it comes down to only a few instructions.

 

That being said here are the results from an i5-13600K:

 

x86
-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
FastTrunc/Trunc:10910016                 5585 ms         3703 ms            1
FastTrunc/FastTrunc_SSE2:10910032        2081 ms         1234 ms            1
FastTrunc/FastTrunc_SSE41:10910120       2158 ms         1047 ms            1
FastTrunc/SlowTrunc_SSE2:10910048        4193 ms         2641 ms            1


x64

-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
FastTrunc/Trunc:12750304                 5793 ms         3750 ms            1
FastTrunc/FastTrunc_SSE2:12750336        4775 ms         3656 ms            1
FastTrunc/FastTrunc_SSE41:12750432       6364 ms         4703 ms            1
FastTrunc/SlowTrunc_SSE2:12750352        4808 ms         2703 ms            1

Take these results with a grain of salt and keep in mind two things:

- Spring.Benchmark still has some issues when running on Intel's hybrid CPUs (12th and 13th gen) - I can work around it a bit by setting thread affinity masks to run only on P-cores, but sometimes the times are a bit off

- on x64 we might experience the behavior of implicitly converting Single to Double and back - I did not inspect the assembly code.

Edited by Stefan Glienke

Posted (edited)
2 hours ago, Stefan Glienke said:

Unfortunately it isn't up to date. For example, your processor architecture (Raptor Lake/Raptor Cove) isn't in there.
And, unless you're Peter Cordes and have all this info in your head, it's often too time-consuming to compare the timings of each instruction for each of the relevant architectures. And then there's execution units, pipelines, fusing and stuff I don't even understand to consider. Somebody train an AI to figure this sh*t out for me.

 

I seem to remember that VTune had a static code analyzer with all this information built in, many, many versions ago, but I think that's gone now.

 

2 hours ago, Stefan Glienke said:

on x64 we might experience the behavior of implicitly converting Single to Double and back - I did not inspect the assembly code.

Random returns a Double so there is a conversion from that to Single, but that is the same for all the functions. There's no implicit conversion beyond that; If I'm passing a Single to a function that takes a Single argument then that value stays a Single. Passed on the stack for x86 and in XMM0 for x64.

 

2 hours ago, Stefan Glienke said:

(*) code alignment or address of the measured functions being one of the many reasons that can easily make some small or significant differences in the results

I have {$CODEALIGN 16} in an include file as I need it elsewhere for SIMD aligned loads.

 

2 hours ago, Stefan Glienke said:

Take these results with a grain of salt

Yes; Your x64 results are pretty wonky. ROUNDSS+CVTSS2SI should be faster than CVTSS2SD+CVTTSD2SI. Actually, ROUNDSS+CVTSS2SI has a slightly higher latency (8+6) than CVTSS2SD+CVTTSD2SI (5+6).

Edited by Anders Melander

10 hours ago, Anders Melander said:

It's beyond me why they haven't implemented basic numerical functions such as Trunc, Round, Abs, etc. as compiler intrinsics so we at least can get them inlined.

I mean, they've shown no interest in performance whatsoever, and even less interest in floating point code. It's as much as they can manage to vaguely support all the different compilers they have and keep them functioning. 

15 hours ago, Anders Melander said:

Unfortunately it isn't up to date. For example, your processor architecture (Raptor Lake/Raptor Cove) isn't in there.

I did not read all of it but here is a bit of information on that subject.

20 hours ago, Stefan Glienke said:

- Spring.Benchmark still has some issues when running on Intel's hybrid CPUs (12th and 13th gen) - I can work around it a bit by setting thread affinity masks to run only on P-cores, but sometimes the times are a bit off

I couldn't find a function for disabling the Efficiency-cores in your public source... so I wrote one (yes, I'm procrastinating again):

// Set process affinity to exclude efficiency cores
function SetPerformanceAffinityMask(Force: boolean = False): boolean;
procedure RestoreAffinityMask;

https://github.com/graphics32/graphics32/blob/3c239b58b063892b20063e8735de5360ef9fb5be/Source/GR32_System.pas#L102

 

Now I just need a CPU that can actually utilize it 😕

 

By the way, your previous post led me to this:  https://www.uops.info/table.html
Much easier to use than Agner Fog's tables and also appears to be more up to date. Now I'm thinking about how to get that info integrated into the Delphi debugger... and maybe throw in the data from Félix Cloutier's x86 reference. I guess that is also where godbolt gets its reference info from. Oh wait; There I go again. Better get back to work now.


I just looked at my unit test of FastTrunc and I wondered why I was running the tests with different values of MXCSR set - and then I remembered why I chose to use ROUNDSS instead of CVTTSS2SI...

The Intel documentation on CVTTSD2SI states:

Quote

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

So I assumed that CVTTSS2SI behaved the same way and opted against having to fiddle with MXCSR in order to guarantee truncation.

 

Well, it turns out that it does behave the same way; The documentation is wrong. How about that.


BTW, have you also measured the impact of type?

single (CVTTSS2SI) vs double (CVTTSD2SI)
No need for numbers, just wondering about the average difference :)

2 hours ago, pcoder said:

BTW, have you also measured the impact of type?

No. I'm working in Single precision so there's no type conversion going on.

That said, I have implemented overloads for both Single and Double and the single and double instructions perform exactly the same.


These microbenchmarks for floating point typically are of quite limited use. I remember making a bunch of changes based on such benchmarks and then finding absolutely no impact in the actual program, presumably because the bottleneck was memory.

40 minutes ago, David Heffernan said:

I remember making a bunch of changes based on such benchmarks and then finding absolutely no impact in the actual program, presumably because the bottleneck was memory.

Sounds like premature optimization 🙂

I'm doing graphics so memory bandwidth is always going to be a bottleneck. The first goal then is to use the correct algorithms and update as little as possible (thus minimizing the impact of that bottleneck) and then do everything else as fast as possible. Round and Trunc are used a lot for some operations and while replacing them with something faster might not yield much in most situations they are significant components in some performance scenarios. Also, my goal wasn't really to create a killer Round/Trunc function. I just wound up there because I needed to isolate the functionality when it didn't behave as I expected.

 
