
Anders Melander


Posts posted by Anders Melander


  1. 40 minutes ago, David Heffernan said:

    I remember making a bunch of changes based on such benchmarks and then finding absolutely no impact in the actual program, presumably because the bottleneck was memory.

    Sounds like premature optimization 🙂

    I'm doing graphics so memory bandwidth is always going to be a bottleneck. The first goal then is to use the correct algorithms and update as little as possible (thus minimizing the impact of that bottleneck) and then do everything else as fast as possible. Round and Trunc are used a lot for some operations and while replacing them with something faster might not yield much in most situations they are significant components in some performance scenarios. Also, my goal wasn't really to create a killer Round/Trunc function. I just wound up there because I needed to isolate the functionality when it didn't behave as I expected.

     


  2. 2 hours ago, pcoder said:

    BTW, have you also measured the impact of type?

    No. I'm working in Single precision so there's no type conversion going on.

    That said, I have implemented overloads for both Single and Double and the single and double instructions perform exactly the same.


  3. I just looked at my unit test of FastTrunc and I wondered why I was running the tests with different values of MXCSR set - and then I remembered why I chose to use ROUNDSS instead of CVTTSS2SI...

    The Intel documentation on CVTTSD2SI states:

    Quote

    When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

    So I assumed that CVTTSS2SI behaved the same way and opted against having to fiddle with MXCSR in order to guarantee truncation.

     

    Well, it turns out that it does behave the same way: both instructions always truncate, regardless of the MXCSR rounding mode. The documentation is wrong. How about that.
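
    A minimal sketch (not from the original post) of the kind of check described above: run a CVTTSS2SI-based truncation under each of the four MXCSR rounding modes and print the results. It assumes a Delphi version where System.GetMXCSR and System.SetMXCSR are available; the program name and the RoundModes array are made up for illustration.

```pascal
program TruncVsMxcsr;
{$APPTYPE CONSOLE}

// Same shape as the FastTrunc_SSE2 function quoted further down this page
function FastTrunc(Value: Single): Integer;
asm
{$if defined(CPUX86)}
        MOVSS     XMM0, Value
{$ifend}
        CVTTSS2SI EAX, XMM0
end;

const
  MXCSR_ROUND_MASK = $FFFF9FFF;
  // Rounding control values (bits 13-14): nearest, down, up, truncate
  RoundModes: array[0..3] of Cardinal = ($0000, $2000, $4000, $6000);
var
  SaveMXCSR: Cardinal;
  Mode: Cardinal;
begin
  SaveMXCSR := GetMXCSR;
  try
    for Mode in RoundModes do
    begin
      // Replace only the rounding control bits, keep everything else
      SetMXCSR((SaveMXCSR and MXCSR_ROUND_MASK) or Mode);
      Writeln(FastTrunc(-2343.5), ' ', FastTrunc(2.5));
    end;
  finally
    // Always put the original control/status word back
    SetMXCSR(SaveMXCSR);
  end;
end.
```

    If CVTTSS2SI ignores the rounding mode, as reported above, every line should show the same pair of values.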


  4. 20 hours ago, Stefan Glienke said:

    - Spring.Benchmark still has some issues when running on Intels hybrid CPUs (12th and 13th gen) - I can trick a bit with setting Thread Affinity masks to run only on P-Cores but sometimes the times are a bit off

    I couldn't find a function for disabling the Efficiency-cores in your public source... so I wrote one (yes, I'm procrastinating again):

    // Set process affinity to exclude efficiency cores
    function SetPerformanceAffinityMask(Force: boolean = False): boolean;
    procedure RestoreAffinityMask;

    https://github.com/graphics32/graphics32/blob/3c239b58b063892b20063e8735de5360ef9fb5be/Source/GR32_System.pas#L102

     

    Now I just need a CPU that can actually utilize it 😕

     

    By the way, your previous post led me to this:  https://www.uops.info/table.html
    Much easier to use than Agner Fog's tables and also appears to be more up to date. Now I'm thinking about how to get that info integrated into the Delphi debugger... and maybe throw in the data from Félix Cloutier's x86 reference. I guess that is also where godbolt gets its reference info from. Oh wait; There I go again. Better get back to work now.


  5. 2 hours ago, Stefan Glienke said:

    Unfortunately it isn't up to date. For example, your processor architecture (Raptor Lake/Raptor Cove) isn't in there.
    And, unless you're Peter Cordes and have all this info in your head, it's often too time consuming to compare the timings of each instruction for each of the relevant architectures. And then there's execution units, pipelines, fusing and stuff I don't even understand to consider. Somebody train an AI to figure this sh*t out for me.

     

    I seem to remember that VTune had a static code analyzer with all this information built in, many, many versions ago, but I think that's gone now.

     

    2 hours ago, Stefan Glienke said:

    on x64 we might experience the behavior of implicitly converting Single to Double and back - I did not inspect the assembly code.

    Random returns a Double so there is a conversion from that to Single but that is the same for all the functions. There's no implicit conversion beyond that; If I'm passing a Single to a function that takes a Single argument then that value stays a Single. Passed on the stack for x86 and in XMM0 for x64.

     

    2 hours ago, Stefan Glienke said:

    (*) code alignment or address of the measured functions being one of the many reasons that can easily make some small or significant differences in the results

    I have {$CODEALIGN 16} in an include file as I need it elsewhere for SIMD aligned loads.

     

    2 hours ago, Stefan Glienke said:

    Take these results with a grain of salt

    Yes; Your x64 results are pretty wonky. ROUNDSS+CVTSS2SI should be faster than CVTSS2SD+CVTTSD2SI. Actually, ROUNDSS+CVTSS2SI has a slightly higher latency (8+6) than CVTSS2SD+CVTTSD2SI (5+6).


  6. By the way, the reason why the RTL Trunc is slower is probably because it's only been implemented for Double; There is no overload for Single so it always incurs the overhead of Single->Double conversion.

    The x64 version is implemented with a single CVTTSD2SI instruction while the x86 version uses x87.

     

    Also, since the RTL Trunc is implemented as assembler it cannot be inlined and on x86 Delphi always passes Single params on the stack even though they would fit in a general register. This levels the playing field and makes a faster alternative worthwhile.

     

    It's beyond me why they haven't implemented basic numerical functions such as Trunc, Round, Abs, etc. as compiler intrinsics so we at least can get them inlined.

    • Like 1

  7. 2 hours ago, Stefan Glienke said:

    Isn't this all that is needed?

    
    function FastTrunc(Value: Single): Integer;
    asm
      {$IFDEF CPUX86}
      movss xmm0, Value
      {$ENDIF}
      cvttss2si eax, xmm0
    end;

     

    Yes it is but for some reason CVTTSS2SI is not always faster than CVTSS2SI. I'm not sure that I can trust the benchmarks though. The results do seem to fluctuate a bit.

     

    Here are the different versions (TFloat = Single):

    function Trunc_Pas(Value: TFloat): Integer;
    begin
      Result := Trunc(Value);
    end;
    
    function FastTrunc_SSE2(Value: TFloat): Integer;
    asm
    {$if defined(CPUX86)}
            MOVSS      XMM0, Value
    {$ifend}
            CVTTSS2SI  EAX, XMM0
    end;
    
    function SlowTrunc_SSE2(Value: TFloat): Integer;
    var
      SaveMXCSR: Cardinal;
      NewMXCSR: Cardinal;
    const
      // SSE MXCSR rounding modes
      MXCSR_ROUND_MASK    = $FFFF9FFF;
      MXCSR_ROUND_NEAREST = $00000000;
      MXCSR_ROUND_DOWN    = $00002000;
      MXCSR_ROUND_UP      = $00004000;
      MXCSR_ROUND_TRUNC   = $00006000;
    asm
            XOR     ECX, ECX
    
            // Save current rounding mode
            STMXCSR SaveMXCSR
            // Load rounding mode
            MOV     EAX, SaveMXCSR
            // Do we need to change anything?
            MOV     ECX, EAX
            NOT     ECX
            AND     ECX, MXCSR_ROUND_TRUNC
            JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
    @SetMXCSR:
            // Save current rounding mode in ECX and flag that we need to restore it
            MOV     ECX, EAX
            // Set rounding mode to truncation
            AND     EAX, MXCSR_ROUND_MASK
            OR      EAX, MXCSR_ROUND_TRUNC
            // Set new rounding mode
            MOV     NewMXCSR, EAX
            LDMXCSR NewMXCSR
    @SkipSetMXCSR:
    
    {$if defined(CPUX86)}
            MOVSS   XMM0, Value
    {$ifend}
            // Round/Trunc
            CVTSS2SI EAX, XMM0
    
            // Restore rounding mode
            // Did we modify it?
            TEST    ECX, ECX
            JZ      @SkipRestoreMXCSR // Skip expensive LDMXCSR
            // Restore old rounding mode
            LDMXCSR SaveMXCSR
    @SkipRestoreMXCSR:
    end;
    
    function FastTrunc_SSE41(Value: TFloat): Integer;
    const
      ROUND_MODE = $08 + $03; // $00=Round, $01=Floor, $02=Ceil, $03=Trunc
    asm
    {$if defined(CPUX86)}
            MOVSS   xmm0, Value
    {$ifend}
    
            ROUNDSS xmm0, xmm0, ROUND_MODE
            CVTSS2SI eax, xmm0
    end;

    And here are the benchmark results from my 10 year old Core i5-2500K @3.3 desktop system.

    x86 results

    [image: benchmark results chart]

    x64 results

    [image: benchmark results chart]

     

    Meh... but at least they are all consistently faster than Trunc - Unless I test on my laptop with a Core i7-8750H CPU @2.2

    x86 results on battery

    [image: benchmark results chart]

    x86 results on mains

    [image: benchmark results chart]

    Yes, I know it's the result of my power saving profile throttling the CPU but it's interesting that it makes the x87 math so much faster than the SIMD math.

     

    Here's the benchmark code for completeness:

    procedure BM_FastTrunc(const state: TState);
    begin
      var FastTruncProc: TFastRoundProc := TFastRoundProc(state[0]);
    
      for var _ in state do
      begin
        RandSeed := 0;
    
        for var i := 1 to 1000*1000*1000 do
        begin
          FastTruncProc(Random(i) / i);
        end;
      end;
    end;
    
    
    const
      FastTruncs: array[0..3] of record
        Name: string;
        Proc: TFastRoundProc;
      end = (
        (Name: 'Trunc'; Proc: Trunc_Pas),
        (Name: 'FastTrunc_SSE2'; Proc: FastTrunc_SSE2),
        (Name: 'FastTrunc_SSE41'; Proc: FastTrunc_SSE41),
        (Name: 'SlowTrunc_SSE2'; Proc: SlowTrunc_SSE2)
      );
    begin
      for var i := 0 to High(FastTruncs) do
        Spring.Benchmark.Benchmark(BM_FastTrunc, 'FastTrunc').Arg(Int64(@FastTruncs[i].Proc)).ArgName(FastTruncs[i].Name).TimeUnit(kMillisecond);
    
      Spring.Benchmark.Benchmark_Main;
    end.

     


  8. On 3/15/2024 at 7:14 AM, Der schöne Günther said:

    In case (for whatever reason), you really just like to add the current line number into a string, then have a look at:

    https://stackoverflow.com/q/7214213

     

     

    On 3/15/2024 at 8:31 AM, Ian Branch said:

    I had a look at that previously.  I may be missreading its use but it seems that JCLDebug relies on an Exception for something like this to work:

    The accepted answer to that question doesn't involve exceptions; It redirects the assertion handler to another function which then has access to the unit name and line number.
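
    A minimal sketch of that technique (not the code from the linked answer): temporarily point System.AssertErrorProc at a handler that records the file name and line number the compiler passes to Assert. CaptureLocation and CapturedLocation are hypothetical names, and the direct assignment to AssertErrorProc assumes a Delphi version where it is declared as TAssertErrorProc.

```pascal
program AssertLocationDemo;
{$APPTYPE CONSOLE}
{$ASSERTIONS ON} // Assert calls are compiled out when assertions are off

uses
  System.SysUtils;

var
  CapturedLocation: string; // hypothetical global; not thread-safe

// Same parameter list as System.TAssertErrorProc
procedure CaptureLocation(const AMessage, AFilename: string;
  ALineNumber: Integer; AErrorAddr: Pointer);
begin
  // Record the location instead of raising EAssertionFailed
  CapturedLocation := Format('%s(%d)', [ExtractFileName(AFilename), ALineNumber]);
end;

var
  SaveAssertProc: TAssertErrorProc;
begin
  SaveAssertProc := AssertErrorProc;
  AssertErrorProc := CaptureLocation;
  try
    Assert(False); // the compiler supplies this file name and line number
    Writeln('Assert was called at ', CapturedLocation);
  finally
    // Restore normal assertion behaviour
    AssertErrorProc := SaveAssertProc;
  end;
end.
```

    No exception is raised at any point because the replacement handler simply returns. The cost is that real assertions are silenced while the handler is swapped in, so the swap should be kept as short as possible.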


  9. 3 minutes ago, William23668 said:

    They only write to use jcl\install.bat in github 

    Yeah but have you looked into install.bat (warning: have the suicide prevention hotline on speed dial if you do):

    :: compile installer
    echo.
    echo ===================================================================
    echo Compiling JediInstaller...
    build\dcc32ex.exe %INSTALL_VERBOSE% --runtime-package-rtl --runtime-package-vcl -q -dJCLINSTALL -E..\bin -I..\source\include -U..\source\common;..\source\windows JediInstaller.dpr
    if ERRORLEVEL 1 goto FailedCompile
    :: New Delphi versions output "This product doesn't support command line compiling" and then exit with ERRORLEVEL 0
    if not exist ..\bin\JediInstaller.exe goto FailedCompile
    
    echo.
    echo ===================================================================
    echo Launching JCL installer...
    
    ::start ..\bin\JediInstaller.exe %*
    if not exist ..\bin\JCLCmdStarter.exe goto FailStart
    ..\bin\JCLCmdStarter.exe ..\bin\JediInstaller.exe %*
    if ERRORLEVEL 1 goto FailStart
    goto FINI

     

    • Haha 1

  10. 1 hour ago, William23668 said:

    But when I try to install using jcl\install.bat I got this error:

    Include file "source\include\jedi\jedi.inc" not found.

    I'm not sure but I think it needs to be installed using some sort of installer which then generates the include file based on something, something, whatever, at this point I gave up and deleted everything.


  11. 56 minutes ago, Vandrovnik said:

    Almost all my problems with JCL and JVCL installs were because of I had somewhere on the disk (on the path) another instance of JCL/JVCL, which was (partially) used instead of the new version.

    That's funny. Almost all my problems with JCL and JVCL were caused by the fact that I installed them in the first place. Easily solvable though 🙂

    • Haha 1

  12. Hmm. It seems to be doing odd/even rounding:

    FastTrunc(0.5) = 0
    FastTrunc(1.5) = 2
    FastTrunc(2.5) = 2
    FastTrunc(3.5) = 4

    Ah, it's the fluff. I got the logic mixed up:

            // Do we need to change anything?
            TEST    EAX, MXCSR_ROUND_DOWN
            JNZ     @SetMXCSR
            TEST    EAX, MXCSR_ROUND_UP
            JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
    @SetMXCSR:
            [...]

     

    Yet again, the duck provides the answer.
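
    For illustration, the mixed-up check can be restated in plain Pascal and compared with a check that only skips LDMXCSR when both rounding control bits are already set (the form used by the NOT/AND/JZ sequence in the later SlowTrunc_SSE2 listing on this page). This is a sketch; the function names are made up.

```pascal
program MxcsrCheckDemo;
{$APPTYPE CONSOLE}

const
  MXCSR_ROUND_DOWN  = $00002000;
  MXCSR_ROUND_UP    = $00004000;
  MXCSR_ROUND_TRUNC = $00006000;

// Mirrors the TEST/JNZ/TEST/JZ sequence quoted above
function NeedsChange_Buggy(MXCSR: Cardinal): Boolean;
begin
  // TEST EAX, MXCSR_ROUND_DOWN / JNZ @SetMXCSR
  if (MXCSR and MXCSR_ROUND_DOWN) <> 0 then
    Exit(True);
  // TEST EAX, MXCSR_ROUND_UP / JZ @SkipSetMXCSR
  Result := (MXCSR and MXCSR_ROUND_UP) <> 0;
end;

// Skip only when both rounding control bits are set (truncation)
function NeedsChange_Fixed(MXCSR: Cardinal): Boolean;
begin
  Result := (MXCSR and MXCSR_ROUND_TRUNC) <> MXCSR_ROUND_TRUNC;
end;

const
  Modes: array[0..3] of Cardinal = ($0000, $2000, $4000, $6000);
  Names: array[0..3] of string = ('nearest', 'down', 'up', 'trunc');
var
  i: Integer;
begin
  for i := 0 to High(Modes) do
    Writeln(Names[i]:8, '  buggy: ', NeedsChange_Buggy(Modes[i]),
      '  fixed: ', NeedsChange_Fixed(Modes[i]));
end.
```

    With round-to-nearest, which is the usual default, the buggy check reports that nothing needs to change, so CVTSS2SI runs with round-half-to-even. That matches the 0, 2, 2, 4 results above and the -2344 result for -2343.5.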


  13. So I have the following function which is supposed to truncate a Single using the SSE CVTSS2SI instruction. Pretty simple except for all the MXCSR fluff.
    Yes, I know I could just use the SSE4.1 ROUNDSS instruction, which does all of the below in a single instruction, but that's not relevant to this.

     

    Anyway, the problem is that my function doesn't always agree with System.Trunc (which is implemented with the x87 instruction FISTP). I guess that is to be expected in some cases due to the difference in precision (80 vs 32 bits) but as far as I can tell that is not the problem I'm encountering here - and I would also only expect it to manifest as a problem in rounding and not truncation.

     

    Specifically I have the value -2343.5

    System.Trunc(-2343.5) = -2343

    FastTrunc(-2343.5)=-2344

     

    Given that truncation is supposed to round towards zero, I believe that System.Trunc is correct. But then why is CVTSS2SI not doing that?

     

    function FastTrunc_SSE2(Value: Single): Integer;
    var
      SaveMXCSR: Cardinal;
      NewMXCSR: Cardinal;
    const
      // SSE MXCSR rounding modes
      MXCSR_ROUND_MASK    = $FFFF9FFF;
      MXCSR_ROUND_NEAREST = $00000000;
      MXCSR_ROUND_DOWN    = $00002000;
      MXCSR_ROUND_UP      = $00004000;
      MXCSR_ROUND_TRUNC   = $00006000;
    asm
            XOR     ECX, ECX
    
            // Save current rounding mode
            STMXCSR SaveMXCSR
            // Load rounding mode
            MOV     EAX, SaveMXCSR
            // Do we need to change anything?
            TEST    EAX, MXCSR_ROUND_DOWN
            JNZ     @SetMXCSR
            TEST    EAX, MXCSR_ROUND_UP
            JZ      @SkipSetMXCSR // Skip expensive LDMXCSR
    @SetMXCSR:
            // Save current rounding mode in ECX and flag that we need to restore it
            MOV     ECX, EAX
            // Set rounding mode to truncation
            AND     EAX, MXCSR_ROUND_MASK
            OR      EAX, MXCSR_ROUND_TRUNC
            // Set new rounding mode
            MOV     NewMXCSR, EAX
            LDMXCSR NewMXCSR
    @SkipSetMXCSR:
    
    {$if defined(TARGET_x86)}
            MOVSS   XMM0, Value
    {$ifend}
            // Round/Trunc
            CVTSS2SI EAX, XMM0
    
            // Restore rounding mode
            // Did we modify it?
            TEST    ECX, ECX
            JZ      @SkipRestoreMXCSR // Skip expensive LDMXCSR
            // Restore old rounding mode
            LDMXCSR SaveMXCSR
    @SkipRestoreMXCSR:
    end;

     


  14. 8 hours ago, PeterPanettone said:

    In practice, the lack of professional layout capabilities results in many bumbling-looking applications, with controls that sometimes overlap when run on a device with display settings different from those of the original application developer. This shortcoming has given Delphi the unjustified reputation of being an unprofessional amateur developer tool.

    Nonsense. Windows developers have been able to create professional-looking applications for decades without the aid of layout controls. The main reason for amateurish-looking applications is amateurish developers.

     

    9 hours ago, PeterPanettone said:

    Or even better, Embarcadero should buy the TdxLayoutControl component from DevExpress and integrate it into the Professional version. This would give Delphi the professionalism it deserves due to its other capabilities.

    The DevExpress layout control is tightly coupled to the rest of their library but even if it had been possible to separate it from the rest then it would be a terrible idea. Embarcadero does not have the resources or expertise to maintain and evolve something as complex as TdxLayoutControl. Just look at the state of the 3rd party libraries they already have incorporated into Delphi.

    I wouldn't mind a rudimentary layout control as a part of the VCL but if they can't even get something as simple as TGridPanel to work properly then I think it's better they not even try.

    • Like 9

  15. 4 minutes ago, Pat Heuvel said:

    Am I doing something wrong?

    Probably not but I have never tested with a map file produced by C++ Builder and it looks like the format differs slightly from that of Delphi.

    The segment/module list of a Delphi map file looks like this:

    Detailed map of segments
    
     0001:00000000 0000FED4 C=CODE     S=.text    G=(none)   M=System   ACBP=A9
     0001:0000FED4 00000C9C C=CODE     S=.text    G=(none)   M=SysInit  ACBP=A9
     0001:00010B70 0000373C C=CODE     S=.text    G=(none)   M=System.Types ACBP=A9
     0001:000142AC 000007E8 C=CODE     S=.text    G=(none)   M=System.UITypes ACBP=A9
     0001:00014A94 00001E04 C=CODE     S=.text    G=(none)   M=Winapi.Windows ACBP=A9
     0001:00016898 000003A8 C=CODE     S=.text    G=(none)   M=System.SysConst ACBP=A9
    [...]

    As you can see there's no path in the module names.
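
    To make the layout concrete, here is a rough sketch of pulling the module name out of one of these lines. It is not how map2pdb parses map files; ExtractModuleName is a hypothetical helper that assumes the name sits between "M=" and "ACBP=" as in the Delphi lines above.

```pascal
program MapModuleDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

// Hypothetical helper: returns the text between ' M=' and ' ACBP='
function ExtractModuleName(const Line: string): string;
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  StartPos := Pos(' M=', Line);
  if StartPos = 0 then
    Exit; // not a segment line
  Inc(StartPos, 3); // skip past ' M='
  EndPos := PosEx(' ACBP=', Line, StartPos);
  if EndPos = 0 then
    EndPos := Length(Line) + 1;
  // Trim removes the padding between the name and ACBP
  Result := Trim(Copy(Line, StartPos, EndPos - StartPos));
end;

begin
  Writeln(ExtractModuleName(
    ' 0001:00010B70 0000373C C=CODE     S=.text    G=(none)   M=System.Types ACBP=A9'));
end.
```

    Searching for the ACBP marker rather than the next space means a module name containing a path with spaces would still come out whole, but a parser that expects a bare unit name would then need to strip the path itself.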

     

    If you create a bug report at the map2pdb issue tracker and attach the map file (zipped) I will take a look at it.


  16. 1 minute ago, Wagner Landgraf said:

    where the heck to I find old VTune versions to install

    There used to be a link to download previous versions (which is how I managed to use it with Windows 7 at that time), but apparently they've removed that ability:

    https://community.intel.com/t5/Analyzers/where-can-I-download-an-older-version-vtune/m-p/1561574#M24281

     

    Quote

    Only Customer with access to priority support can download the older version of tools. Otherwise Intel provides the latest and greatest version for the public use.

    😞


  17. 1 hour ago, Wagner Landgraf said:

    Has anybody tried to use VTune inside a VM with M1 (ARM MAC)?

    I recall I was able to use it, but I rebuilt my VM and now I can't make it work.

    VTune only supports Intel hardware as it relies on certain CPU features that are only available on Intel CPUs. At least that's what they claim:

    https://www.intel.com/content/www/us/en/developer/articles/system-requirements/vtune-profiler-system-requirements.html

     

    Maybe you can get an older version of VTune to work. For example the current version of VTune doesn't support hardware assisted profiling on my (admittedly pretty old) processor.
