Jump to content
Joseph MItzen

Strange Benchmark Results; Am I Missing Something?

Recommended Posts

About a month ago I was doing some benchmarks of FreePascal under Linux and got an unexpected result. Thanks to some help here I was able to get Delphi Community Edition installed yesterday and found the same anomaly with Delphi targeting Windows. I was wondering if I'm missing something in the below benchmark or something has changed with the new compiler and there's something I'm doing I'm not aware of that significantly impacts performance.

 

The benchmark consists of summing the total of all the numbers between 1 and 1 billion which are evenly divisible by 3 or 5 and printing the total. It seems to me that it's a short, simple and straightforward test that should have one simple, obvious way to implement it in Pascal:

program numtest8;

Var total, num : Int64;
begin
  total := 0;
  for num := 1 to 1000000000 Do
      if (num mod 3 = 0) or (num mod 5 = 0) Then
            total := total + num;
WriteLn(total);

end.

Compiled with maximum performance under FreePascal (I tried other O settings; they didn't make any difference) these are the results I get on a Ryzen 2700 8 core CPU with stock cooler, best of 3:

 

64bit, Linux: 11.36 seconds, 99% CPU

64bit, Windows 11 VM, 11.43 seconds (PowerShell doesn't report CPU usage with the timing information)

 

This is the figure for Delphi:

 

64bit, Windows 11 VM: 11.38 seconds

 

Those figures seem rather consistent across two compilers, two operating systems, a VM and bare metal, so nothing weird. But here's the weird part (cue spooky music):

 

I was also performing the benchmark across other languages, single core vs. parallel, etc. One language tested was Python. I was able to create a single-threaded Python version that beat a multicore version, but that's a weird result for another forum. It's this that has me stumped. What follows is a test with PyPy, a JIT-enabled version of the Python runtime. Here's the source code:

 

total = 0

for x in range(1, 1_000_000_001):
    if x % 3 == 0 or x % 5 == 0:
        total = total + x
        
print(total)        

Note: if I were performing the benchmark this way in regular Python, it'd still be running; loops kill Python performance. The best way of writing this code in regular Python gives a result of 1 minute, 15 seconds (!!!). But PyPy was much better... too much better:

 

PyPy3.9 v7.3.8, 64bit, Linux: 3.34 seconds, 99% CPU

PyPy3.9 V7.3.8 64bit, Windows VM: 3.41 seconds

 

PyPy performed over 3X faster than Delphi / FreePascal, and that's including the time it takes to compile on the fly! The CPU data from the Linux run proves what I already knew about PyPy, that it doesn't do any automatic parallelization that might explain this result. Is there some fancy new CPU instruction it could be using that the Pascal compilers aren't? I seem to recall that at least the Delphi 32bit compiler didn't perform bit-shifting to do division; could that still be the case and the divisions are killing the performance?

 

I decided to test another JIT option for Python, Numba. Numba is intended to JIT mathematical operations only. It's not a Python runtime, but a regular library that takes Python functions and compiles them. It also has a cache function and a pre-compile option. I was lazy and since I didn't know how the pre-compile function worked I just used the cache option and ran it once to cache the compiled function then performed benchmarks. The code:

 

from numba import jit


@jit(nopython=True, cache=True, fastmath=True)
def test():
    total = 0
    for num in range(1, 1_000_000_001):
        if num % 3 == 0 or num % 5 == 0:
            total = total + num
    return total

answer = test()
print(answer)

Python3.10.2 + Numba, 64bit, Linux: 2.07 seconds, 158% CPU

Python3.10.2 + Numba, 64bit, Windows VM: 2.49 seconds

 

I can't really explain the 158% CPU figure, as I know that numba can do some parallelization but a debug run suggested it couldn't find anything to parallelize. Regardless, it knocked almost a second off the PyPy time, probably due to caching the compiled function.

 

I could tell Numba to explicitly parallelize:

from numba import jit, prange


@jit(nopython=True, cache=True, parallel=True, fastmath=True)
def test():
    total = 0
    for num in prange(1, 1_000_000_001):
        if num % 3 == 0 or num % 5 == 0:
            total = total + num
    return total

answer = test()
print(answer)

And now we get:

Python3.10.2 + Numba, 64bit, Linux: 0.84 seconds, 743% CPU

Python3.10.2 + Numba, 64bit, Windows VM:  1.20 seconds

 

Clearly here the CPU results under Linux are consistent with an 8-core CPU.

 

So the mystery is: With a statically typed, compiled language, we get circa 11.4 seconds across OSes and compilers. With a dynamically typed language with its own runtime VM, and one notoriously slow at that, we get 3.4 seconds at worst including JIT compile time, , 2.0 - 2.5 seconds pre-compiled with some possible automatic optimization, and 0.8 - 1.2 seconds with explicitly requesting the loop be parallelized. Those results are... unexpected. Now I honestly thought that JITted Python could match or ever-so-slightly beat FreePascal, because I've seen that happen before several years ago. But at least 3X faster to 10X faster? That was a shock.

 

I'm half hoping someone will tell me I'm an idiot and I should never write Pascal code like that and if I just do X it'll speed up 300%. If not, my best guess after days of thought is that vague memory about bit-shifting. I remember when I read that Delphi didn't do that because Tiny C, a C compiler so tiny it fits on a 3.5" floppy disk - with operating system! - only has three optimizations, but one of them is bit-shifting. That would seem to imply it's rather useful.

 

I'd be interested if anyone has any insight into why Delphi and FreePascal performed so poorly in this benchmark. I'd really like to rule out any mistakes on my end. If we can figure out what's killing the Delphi performance - and it's not me - these results should make for a very, very interesting QC feature request. :classic_biggrin:

 

Edited by Joseph MItzen
  • Like 1

Share this post


Link to post

Have you tried integer instead of Int64? Or are we talking the 64 bit compiler?

 

(1 billion is < 2^31 of I'm not totally mistaken.)

Share this post


Link to post
1 hour ago, dummzeuch said:

Have you tried integer instead of Int64?

That will work for num, but the total may actually exceed that. After all it is the sum of all 3d or 5th numbers.

 

20 minutes ago, David Heffernan said:

A really good compiler would remove the loop and just turn it into a simple assignment

Can you elaborate on how such a simple assignment would look like?

  • Like 2

Share this post


Link to post
3 hours ago, Joseph MItzen said:

I'd be interested if anyone has any insight into why Delphi and FreePascal performed so poorly in this benchmark. I'd really like to rule out any mistakes on my end. If we can figure out what's killing the Delphi performance - and it's not me - these results should make for a very, very interesting QC feature request. :classic_biggrin:

 

I don't know about FreePascal, but there are several things that have impact in Delphi compiler.

 

First, being single pass compiler, possible compiler optimizations are more limited than in multiple pass compilers. Next, 64bit compiler in general has less optimizations than original Win32 compiler. I am not going into the why's because I don't have that information and everything would be pure speculation on my side. Also Delphi 64bit compiler is not optimally using all available CPU registers, so there will be more unnecessary memory operations.

 

I know that having a compiler that can do more optimizations is beneficial, because you can focus on the code you write and not on how fast it will run, but whether we like it or not, when speed matters optimizing the algorithm has always been first step.

 

For instance, this would be faster way to achieve same results:

 

var
  total, num: Int64;
begin
  total := 0;
  num := 3;
  while num <= 1000000000 do
    begin
      total := total + num;
      inc(num, 3);
    end;

  num := 5;
  while num <= 1000000000 do
    begin
      total := total + num;
      inc(num, 5);
    end;
end;

I know that this is not exactly the answer you are looking for.

Share this post


Link to post

Interesting to know other compilers can parallelize a simple loop. I thought you need to split and parallelize manually, like Delphi.

Can you imagine Delphi had a flag:

parallel=True

and does everything for you. I guess Numba developers saw TTask, TThread, OmniThreadLibrary... and just decided to implement a simple flag. Nice!

  • Like 1

Share this post


Link to post
1 hour ago, Dalija Prasnikar said:

For instance, this would be faster way to achieve same results:

 


var
  total, num: Int64;
begin
  total := 0;
  num := 3;
  while num <= 1000000000 do
    begin
      total := total + num;
      inc(num, 3);
    end;

  num := 5;
  while num <= 1000000000 do
    begin
      total := total + num;
      inc(num, 5);
    end;
end;

I know that this is not exactly the answer you are looking for.

 

I think the result will not be the same - original code adds number 15 just once to the total, while your code will add it twice...

 

Edited by Vandrovnik
  • Like 3

Share this post


Link to post
48 minutes ago, Vandrovnik said:

 

I think the result will not be the same - original code adds number 15 just once to the total, while your code will add it twice...

 

LOL I should have known better than doing math first thing on Sunday morning.

  • Like 1
  • Haha 6

Share this post


Link to post

Speaking about algorithms:

const
  cnt = 1000000000;
var
  total: Int64;
  tot3: Int64;
  tot5: Int64;
  tot35: Int64;
  cnt3: Int64;
  cnt5: Int64;
  cnt35: Int64;
begin
  cnt3 := cnt div 3;
  cnt5 := cnt div 5;
  cnt35 := cnt5 div 3;
  tot3 := 3*(cnt3*(cnt3 + 1) div 2);
  tot5 := 5*(cnt5*(cnt5 + 1) div 2);
  tot35 := 15*(cnt35*(cnt35 + 1) div 2);
  total := tot3 + tot5 - tot35;
end;

Execution time is less than 1 Millisecond.

  • Like 6

Share this post


Link to post
5 hours ago, Uwe Raabe said:

Can you elaborate on how such a simple assignment would look like?

The value of the summation  an be evaluated statically. I don't think any compilers do this, but I could be wrong. 

Share this post


Link to post

a good compiler would do something similar:

 

function mod35s: int64;
label
  l1, l2;
asm
  push    rdi
  push    rbx
  mov     edi, 1000000000
  xor     rax, rax
  mov     rbx, 0
  mov     ecx, 0
@l1:
  imul    edx, edi, -858993459
  mov     ecx, edi
  cmp     edx, 858993459
  jbe     @L2
  imul    edx, edi, -1431655765
  cmp     edx, 1431655765
  cmova   ecx, ebx
@L2:
  add     rax, rcx
  dec     edi
  jnz     @l1
  pop     rbx
  pop     rdi
  ret
end;

 

Edited by Attila Kovacs

Share this post


Link to post
19 minutes ago, Attila Kovacs said:

Uwe, are you counting the occurrences? It's something different than OP's.

No, I am calculating the sums.

 

These are the numbers for Win32:

Loop: 233333334166666668 elapsed:   3554.07020 ms
Calc: 233333334166666668 elapsed:      0.00030 ms

Counting all multiples of 3 between 1 and N is N div 3. Let's name that N3. So one can write the Sum of those as

total3 := 0;
for I := 1 to N3 do 
  total3 := total3 + 3*I;

We can now get the 3* out of the sum and get

total3 := 0;
for I := 1 to N3 do 
  total3 := total3 + I;
total3 := 3*total3;

Using the well known sum(1..n) formula n*(n+1)/2, we end up with what I wrote above for tot3.

 

Do the same with 5 to get the second term of the result tot5.

 

Now we have to take care of the numbers that are multiples of 3 and 5, as up to now they are added twice. That's where the 15 comes into play with tot35.

  • Like 1

Share this post


Link to post
34 minutes ago, Uwe Raabe said:

No, I am calculating the sums.

 

Ah I see.

Also, I missed one 0 in the loop, so the asm above takes ~1 sec on my pc not 100msecs.

 

Share this post


Link to post
7 hours ago, Mike Torrettinni said:

Interesting to know other compilers can parallelize a simple loop. I thought you need to split and parallelize manually, like Delphi.

Can you imagine Delphi had a flag:


parallel=True

and does everything for you. I guess Numba developers saw TTask, TThread, OmniThreadLibrary... and just decided to implement a simple flag. Nice!

 

I have tried this (will probably not work fine for MaxValue not divisible by Workers), it is 3.6 times faster than single thread on an old i7 CPU. I guess it should be possible to use IFuture<Int64>, but I do not know how to pass parameters to it...

It is easy to use tParallel.For, but there I used tInterlocked.Add(total, num), which resulted in much worse time than single thread.


 

const MaxValue=1000000000;

type tCalcThread=class(tThread)
      public
       CalcFrom: integer;
       CalcTo: integer;
       Value: Int64;
       procedure Execute; override;
     end;

procedure tCalcThread.Execute;
 var PartialTotal: Int64;
     num: integer;
 begin
 PartialTotal:=0;
 for num:=CalcFrom to CalcTo do
  if (num mod 3 = 0) or (num mod 5 = 0) Then inc(PartialTotal, num);
 Value:=PartialTotal;
end;

function PLoop32: Int64;
 const Workers = 8;
 Var total : Int64;
     Calcs: array[0..Workers-1] of tCalcThread;
     a: integer;
 begin
 total := 0;
 fillchar(Calcs, sizeof(Calcs), 0);
 try
  for a:=0 to Workers-1 do begin
   Calcs[a]:=tCalcThread.Create(true);
   Calcs[a].FreeOnTerminate:=false;
   Calcs[a].CalcFrom:=Int64(MaxValue)*a div Workers;
   Calcs[a].CalcTo:=Int64(MaxValue)*(a+1) div Workers-1;
   Calcs[a].Start;
  end;
  for a:=0 to Workers-1 do begin
   Calcs[a].WaitFor;
   inc(total, Calcs[a].Value);
  end;
 finally
  for a:=0 to Workers-1 do FreeAndNil(Calcs[a]);
 end;
 result := total;
end;

 

Edited by Vandrovnik
  • Like 1

Share this post


Link to post

Interlocked is always going to be terrible here. Local accumulation, join, and then a final summation would be better. 

 

A magic parallel=true flag is fanciful. 

 

Why look at Python. Surely look at the various C++ compilers? 

Share this post


Link to post
8 minutes ago, David Heffernan said:

Interlocked is always going to be terrible here. Local accumulation, join, and then a final summation would be better. 

 

 

I know (I have used it in my thread example), but is it possible to do also with tParallel.For?

  • Like 1

Share this post


Link to post
3 hours ago, Vandrovnik said:

I have tried this

Nice.

Added the x64 asm, enhanced the chunk size calculation, reduced from 8 logical to 4 physical cores (this improves a bit in 64 bit mode / under a given runtime? / on my cpu?).

                                                                                                                                                                 edit: in the IDE 4 is faster, outside the ide 8 is much faster

 

It's now almost twice as fast as my single threaded asm was.

So there is a lot of overhead firing up the threads, but still a performance gain.

 

 

Project1.dpr

Edited by Attila Kovacs
  • Like 1

Share this post


Link to post
13 hours ago, dummzeuch said:

Have you tried integer instead of Int64? Or are we talking the 64 bit compiler?

 

(1 billion is < 2^31 of I'm not totally mistaken.)

The sum of the values is 233,333,334,166,666,668 which would overflow a 32bit Integer. I tried changing the loop counter to Integer, but that didn't produce any change in the time.

Share this post


Link to post
38 minutes ago, Joseph MItzen said:

I tried changing the loop counter to Integer, but that didn't produce any change in the time.

Strange. I can confirm what @Dalija Prasnikar reported about a 50% reduction in a comment above. 

 

For Win32 changing num from Int64 to Integer reduces the time from 5.8 seconds down to 3.5 seconds.

For Win64 the runtime for Int64 is way higher (10.4 seconds) than for Win32 (5.8 seconds), but reduces to a slightly better value for num being Integer: 3.4 seconds.

  • Like 1

Share this post


Link to post
9 hours ago, Uwe Raabe said:

For Win32 changing num from Int64 to Integer reduces the time from 5.8 seconds down to 3.5 seconds.

That's because 64-bit operations for x32 occupy 2 registers. More or less simple with addition, terrible with division. However, speedup with 32-bit num is unexpected.

Edited by Fr0sT.Brutal

Share this post


Link to post

Just for your information, the following snippet in Rust was around 1250 ms on my system.

Compilation about 1 second.

fn main() {
    let mut total: u64 = 0;
    for number in 1..=1000000000 {
        if ((number % 3) == 0) || ((number % 5) == 0) {
            total += number;
        };
    }

    println!("total is {}", total);
}

The Delphi implementation was around 7.5 seconds on this machine.

Edited by Der schöne Günther

Share this post


Link to post

program Benchmark;

{$APPTYPE CONSOLE}

uses
  SysUtils, DateUtils;

var
  StartTime, EndTime: TDateTime;
  Total: Int64;
  NumOfIterations3, NumOfIterations5, NumOfIterations15: Int64;
  Sum3, Sum5, Sum15: Int64;
  Duration: Double;
  Key: Char;

begin
  // Start measuring time
  StartTime := Now;

  NumOfIterations3 := 1000000000 div 3;
  NumOfIterations5 := 1000000000 div 5;
  NumOfIterations15 := 1000000000 div 15;

  Sum3 := NumOfIterations3 * (NumOfIterations3 + 1) div 2 * 3;
  Sum5 := NumOfIterations5 * (NumOfIterations5 + 1) div 2 * 5;
  Sum15 := NumOfIterations15 * (NumOfIterations15 + 1) div 2 * 15;

  Total := Sum3 + Sum5 - Sum15;

  // Stop measuring time
  EndTime := Now;
  Duration := SecondsBetween(EndTime, StartTime);
  Writeln('Total: ', Total);
  Writeln('Time taken: ', Duration:0:6, ' seconds');

  Write('Press any key to exit');
  Readln(Key);
end.

 

 

What about this code optimization?

 

Share this post


Link to post

Of course, we don't have the magic parallel=true flag but still, we can use parallelism.
Maybe this is not the perfect one but something like this will give you the maximum performance probably.
I'm adding this reply just for those who may need something like this, this is not the real answer probably.
Anyway, I was able to achieve a 25% improvement in performance by using the code below:

Remarks:
- NumTasks = 4; ==>  Specifies the number of parallel tasks to be used which is limited by the available resources, you can try with more or less chunks.

- sw: An instance of TStopwatch to measure the elapsed time.

- The code utilizes TParallel.For to parallelize the loop, dividing the range of numbers from 1 to 1000000000 into chunks based on the number of tasks (NumTasks).

- Each parallel task calculates the sum of multiples within its assigned chunk using the SumMultiples procedure.

- The partial sums from each task are added to the total using TInterlocked.Add for thread-safe operations.

 

 


 

uses
  System.SysUtils, System.Diagnostics, System.Threading, System.SyncObjs,
  System.Math;

const
  NumTasks = 4;

procedure SumMultiples(var partialTotal: Int64; start, finish: Int64);
var
  num: Integer;
begin
  partialTotal := 0;
  for num := start to finish do
    if (num mod 3 = 0) or (num mod 5 = 0) then
      partialTotal := partialTotal + num;
end;

Var
  total: Int64;
  sw: TStopwatch;
begin
  sw := TStopwatch.Create;
  sw.start;

  total := 0;

  TParallel.For(1, NumTasks,
    procedure(index: Int64)
    var
      start, finish, partialTotal: Int64;
      chunkSize: Int64;
    begin
      chunkSize := 1000000000 div NumTasks;
      start := (index - 1) * chunkSize + 1;
      finish := Min(index * chunkSize, 1000000000);

      SumMultiples(partialTotal, start, finish);

      TInterlocked.Add(total, partialTotal);
    end);

  sw.Stop;
  WriteLn(total);
  WriteLn('Total time ', sw.ElapsedMilliseconds, ' ms');
  Readln;
end.

 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×