Joseph MItzen 251 Posted March 20, 2022 (edited) About a month ago I was doing some benchmarks of FreePascal under Linux and got an unexpected result. Thanks to some help here I was able to get Delphi Community Edition installed yesterday and found the same anomaly with Delphi targeting Windows. I was wondering if I'm missing something in the below benchmark or something has changed with the new compiler and there's something I'm doing I'm not aware of that significantly impacts performance. The benchmark consists of summing the total of all the numbers between 1 and 1 billion which are evenly divisible by 3 or 5 and printing the total. It seems to me that it's a short, simple and straightforward test that should have one simple, obvious way to implement it in Pascal: program numtest8; Var total, num : Int64; begin total := 0; for num := 1 to 1000000000 Do if (num mod 3 = 0) or (num mod 5 = 0) Then total := total + num; WriteLn(total); end. Compiled with maximum performance under FreePascal (I tried other O settings; they didn't make any difference) these are the results I get on a Ryzen 2700 8 core CPU with stock cooler, best of 3: 64bit, Linux: 11.36 seconds, 99% CPU 64bit, Windows 11 VM, 11.43 seconds (PowerShell doesn't report CPU usage with the timing information) This is the figure for Delphi: 64bit, Windows 11 VM: 11.38 seconds Those figures seem rather consistent across two compilers, two operating systems, a VM and bare metal, so nothing weird. But here's the weird part (cue spooky music): I was also performing the benchmark across other languages, single core vs. parallel, etc. One language tested was Python. I was able to create a single-threaded Python version that beat a multicore version, but that's a weird result for another forum. It's this that has me stumped. What follows is a test with PyPy, a JIT-enabled version of the Python runtime. Here's the source code: total = 0 for x in range(1, 1_000_000_001): if x % 3 == 0 or x % 5 == 0: total = total + x print(total) Note: if I were performing the benchmark this way in regular Python, it'd still be running; loops kill Python performance. The best way of writing this code in regular Python gives a result of 1 minute, 15 seconds (!!!). But PyPy was much better... too much better: PyPy3.9 v7.3.8, 64bit, Linux: 3.34 seconds, 99% CPU PyPy3.9 V7.3.8 64bit, Windows VM: 3.41 seconds PyPy performed over 3X faster than Delphi / FreePascal, and that's including the time it takes to compile on the fly! The CPU data from the Linux run proves what I already knew about PyPy, that it doesn't do any automatic parallelization that might explain this result. Is there some fancy new CPU instruction it could be using that the Pascal compilers aren't? I seem to recall that at least the Delphi 32bit compiler didn't perform bit-shifting to do division; could that still be the case and the divisions are killing the performance? I decided to test another JIT option for Python, Numba. Numba is intended to JIT mathematical operations only. It's not a Python runtime, but a regular library that takes Python functions and compiles them. It also has a cache function and a pre-compile option. I was lazy and since I didn't know how the pre-compile function worked I just used the cache option and ran it once to cache the compiled function then performed benchmarks. The code: from numba import jit @jit(nopython=True, cache=True, fastmath=True) def test(): total = 0 for num in range(1, 1_000_000_001): if num % 3 == 0 or num % 5 == 0: total = total + num return total answer = test() print(answer) Python3.10.2 + Numba, 64bit, Linux: 2.07 seconds, 158% CPU Python3.10.2 + Numba, 64bit, Windows VM: 2.49 seconds I can't really explain the 158% CPU figure, as I know that numba can do some parallelization but a debug run suggested it couldn't find anything to parallelize. Regardless, it knocked almost a second off the PyPy time, probably due to caching the compiled function. I could tell Numba to explicitly parallelize: from numba import jit, prange @jit(nopython=True, cache=True, parallel=True, fastmath=True) def test(): total = 0 for num in prange(1, 1_000_000_001): if num % 3 == 0 or num % 5 == 0: total = total + num return total answer = test() print(answer) And now we get: Python3.10.2 + Numba, 64bit, Linux: 0.84 seconds, 743% CPU Python3.10.2 + Numba, 64bit, Windows VM: 1.20 seconds Clearly here the CPU results under Linux are consistent with an 8-core CPU. So the mystery is: With a statically typed, compiled language, we get circa 11.4 seconds across OSes and compilers. With a dynamically typed language with its own runtime VM, and one notoriously slow at that, we get 3.4 seconds at worst including JIT compile time, , 2.0 - 2.5 seconds pre-compiled with some possible automatic optimization, and 0.8 - 1.2 seconds with explicitly requesting the loop be parallelized. Those results are... unexpected. Now I honestly thought that JITted Python could match or ever-so-slightly beat FreePascal, because I've seen that happen before several years ago. But at least 3X faster to 10X faster? That was a shock. I'm half hoping someone will tell me I'm an idiot and I should never write Pascal code like that and if I just do X it'll speed up 300%. If not, my best guess after days of thought is that vague memory about bit-shifting. I remember when I read that Delphi didn't do that because Tiny C, a C compiler so tiny it fits on a 3.5" floppy disk - with operating system! - only has three optimizations, but one of them is bit-shifting. That would seem to imply it's rather useful. I'd be interested if anyone has any insight into why Delphi and FreePascal performed so poorly in this benchmark. I'd really like to rule out any mistakes on my end. If we can figure out what's killing the Delphi performance - and it's not me - these results should make for a very, very interesting QC feature request. Edited March 20, 2022 by Joseph MItzen 1 Share this post Link to post
dummzeuch 1505 Posted March 20, 2022 Have you tried integer instead of Int64? Or are we talking the 64 bit compiler? (1 billion is < 2^31 of I'm not totally mistaken.) Share this post Link to post
David Heffernan 2345 Posted March 20, 2022 A really good compiler would remove the loop and just turn it into a simple assignment 1 Share this post Link to post
Uwe Raabe 2057 Posted March 20, 2022 1 hour ago, dummzeuch said: Have you tried integer instead of Int64? That will work for num, but the total may actually exceed that. After all it is the sum of all 3d or 5th numbers. 20 minutes ago, David Heffernan said: A really good compiler would remove the loop and just turn it into a simple assignment Can you elaborate on how such a simple assignment would look like? 2 Share this post Link to post
Dalija Prasnikar 1396 Posted March 20, 2022 3 hours ago, Joseph MItzen said: I'd be interested if anyone has any insight into why Delphi and FreePascal performed so poorly in this benchmark. I'd really like to rule out any mistakes on my end. If we can figure out what's killing the Delphi performance - and it's not me - these results should make for a very, very interesting QC feature request. I don't know about FreePascal, but there are several things that have impact in Delphi compiler. First, being single pass compiler, possible compiler optimizations are more limited than in multiple pass compilers. Next, 64bit compiler in general has less optimizations than original Win32 compiler. I am not going into the why's because I don't have that information and everything would be pure speculation on my side. Also Delphi 64bit compiler is not optimally using all available CPU registers, so there will be more unnecessary memory operations. I know that having a compiler that can do more optimizations is beneficial, because you can focus on the code you write and not on how fast it will run, but whether we like it or not, when speed matters optimizing the algorithm has always been first step. For instance, this would be faster way to achieve same results: var total, num: Int64; begin total := 0; num := 3; while num <= 1000000000 do begin total := total + num; inc(num, 3); end; num := 5; while num <= 1000000000 do begin total := total + num; inc(num, 5); end; end; I know that this is not exactly the answer you are looking for. Share this post Link to post
Dalija Prasnikar 1396 Posted March 20, 2022 I forgot to mention, in original code, merely declaring num as Integer gives 50% increase in the performance. 2 Share this post Link to post
Mike Torrettinni 198 Posted March 20, 2022 Interesting to know other compilers can parallelize a simple loop. I thought you need to split and parallelize manually, like Delphi. Can you imagine Delphi had a flag: parallel=True and does everything for you. I guess Numba developers saw TTask, TThread, OmniThreadLibrary... and just decided to implement a simple flag. Nice! 1 Share this post Link to post
Vandrovnik 214 Posted March 20, 2022 (edited) 1 hour ago, Dalija Prasnikar said: For instance, this would be faster way to achieve same results: var total, num: Int64; begin total := 0; num := 3; while num <= 1000000000 do begin total := total + num; inc(num, 3); end; num := 5; while num <= 1000000000 do begin total := total + num; inc(num, 5); end; end; I know that this is not exactly the answer you are looking for. I think the result will not be the same - original code adds number 15 just once to the total, while your code will add it twice... Edited March 20, 2022 by Vandrovnik 3 Share this post Link to post
Dalija Prasnikar 1396 Posted March 20, 2022 48 minutes ago, Vandrovnik said: I think the result will not be the same - original code adds number 15 just once to the total, while your code will add it twice... LOL I should have known better than doing math first thing on Sunday morning. 1 6 Share this post Link to post
Uwe Raabe 2057 Posted March 20, 2022 Speaking about algorithms: const cnt = 1000000000; var total: Int64; tot3: Int64; tot5: Int64; tot35: Int64; cnt3: Int64; cnt5: Int64; cnt35: Int64; begin cnt3 := cnt div 3; cnt5 := cnt div 5; cnt35 := cnt5 div 3; tot3 := 3*(cnt3*(cnt3 + 1) div 2); tot5 := 5*(cnt5*(cnt5 + 1) div 2); tot35 := 15*(cnt35*(cnt35 + 1) div 2); total := tot3 + tot5 - tot35; end; Execution time is less than 1 Millisecond. 6 Share this post Link to post
David Heffernan 2345 Posted March 20, 2022 5 hours ago, Uwe Raabe said: Can you elaborate on how such a simple assignment would look like? The value of the summation an be evaluated statically. I don't think any compilers do this, but I could be wrong. Share this post Link to post
Attila Kovacs 629 Posted March 20, 2022 (edited) a good compiler would do something similar: function mod35s: int64; label l1, l2; asm push rdi push rbx mov edi, 1000000000 xor rax, rax mov rbx, 0 mov ecx, 0 @l1: imul edx, edi, -858993459 mov ecx, edi cmp edx, 858993459 jbe @L2 imul edx, edi, -1431655765 cmp edx, 1431655765 cmova ecx, ebx @L2: add rax, rcx dec edi jnz @l1 pop rbx pop rdi ret end; Edited March 20, 2022 by Attila Kovacs Share this post Link to post
Uwe Raabe 2057 Posted March 20, 2022 19 minutes ago, Attila Kovacs said: Uwe, are you counting the occurrences? It's something different than OP's. No, I am calculating the sums. These are the numbers for Win32: Loop: 233333334166666668 elapsed: 3554.07020 ms Calc: 233333334166666668 elapsed: 0.00030 ms Counting all multiples of 3 between 1 and N is N div 3. Let's name that N3. So one can write the Sum of those as total3 := 0; for I := 1 to N3 do total3 := total3 + 3*I; We can now get the 3* out of the sum and get total3 := 0; for I := 1 to N3 do total3 := total3 + I; total3 := 3*total3; Using the well known sum(1..n) formula n*(n+1)/2, we end up with what I wrote above for tot3. Do the same with 5 to get the second term of the result tot5. Now we have to take care of the numbers that are multiples of 3 and 5, as up to now they are added twice. That's where the 15 comes into play with tot35. 1 Share this post Link to post
Attila Kovacs 629 Posted March 20, 2022 34 minutes ago, Uwe Raabe said: No, I am calculating the sums. Ah I see. Also, I missed one 0 in the loop, so the asm above takes ~1 sec on my pc not 100msecs. Share this post Link to post
Vandrovnik 214 Posted March 20, 2022 (edited) 7 hours ago, Mike Torrettinni said: Interesting to know other compilers can parallelize a simple loop. I thought you need to split and parallelize manually, like Delphi. Can you imagine Delphi had a flag: parallel=True and does everything for you. I guess Numba developers saw TTask, TThread, OmniThreadLibrary... and just decided to implement a simple flag. Nice! I have tried this (will probably not work fine for MaxValue not divisible by Workers), it is 3.6 times faster than single thread on an old i7 CPU. I guess it should be possible to use IFuture<Int64>, but I do not know how to pass parameters to it... It is easy to use tParallel.For, but there I used tInterlocked.Add(total, num), which resulted in much worse time than single thread. const MaxValue=1000000000; type tCalcThread=class(tThread) public CalcFrom: integer; CalcTo: integer; Value: Int64; procedure Execute; override; end; procedure tCalcThread.Execute; var PartialTotal: Int64; num: integer; begin PartialTotal:=0; for num:=CalcFrom to CalcTo do if (num mod 3 = 0) or (num mod 5 = 0) Then inc(PartialTotal, num); Value:=PartialTotal; end; function PLoop32: Int64; const Workers = 8; Var total : Int64; Calcs: array[0..Workers-1] of tCalcThread; a: integer; begin total := 0; fillchar(Calcs, sizeof(Calcs), 0); try for a:=0 to Workers-1 do begin Calcs[a]:=tCalcThread.Create(true); Calcs[a].FreeOnTerminate:=false; Calcs[a].CalcFrom:=Int64(MaxValue)*a div Workers; Calcs[a].CalcTo:=Int64(MaxValue)*(a+1) div Workers-1; Calcs[a].Start; end; for a:=0 to Workers-1 do begin Calcs[a].WaitFor; inc(total, Calcs[a].Value); end; finally for a:=0 to Workers-1 do FreeAndNil(Calcs[a]); end; result := total; end; Edited March 20, 2022 by Vandrovnik 1 Share this post Link to post
David Heffernan 2345 Posted March 20, 2022 Interlocked is always going to be terrible here. Local accumulation, join, and then a final summation would be better. A magic parallel=true flag is fanciful. Why look at Python. Surely look at the various C++ compilers? Share this post Link to post
Vandrovnik 214 Posted March 20, 2022 8 minutes ago, David Heffernan said: Interlocked is always going to be terrible here. Local accumulation, join, and then a final summation would be better. I know (I have used it in my thread example), but is it possible to do also with tParallel.For? 1 Share this post Link to post
Attila Kovacs 629 Posted March 20, 2022 (edited) 3 hours ago, Vandrovnik said: I have tried this Nice. Added the x64 asm, enhanced the chunk size calculation, reduced from 8 logical to 4 physical cores (this improves a bit in 64 bit mode / under a given runtime? / on my cpu?). edit: in the IDE 4 is faster, outside the ide 8 is much faster It's now almost twice as fast as my single threaded asm was. So there is a lot of overhead firing up the threads, but still a performance gain. Project1.dpr Edited March 20, 2022 by Attila Kovacs 1 Share this post Link to post
Joseph MItzen 251 Posted March 20, 2022 13 hours ago, dummzeuch said: Have you tried integer instead of Int64? Or are we talking the 64 bit compiler? (1 billion is < 2^31 of I'm not totally mistaken.) The sum of the values is 233,333,334,166,666,668 which would overflow a 32bit Integer. I tried changing the loop counter to Integer, but that didn't produce any change in the time. Share this post Link to post
Uwe Raabe 2057 Posted March 20, 2022 38 minutes ago, Joseph MItzen said: I tried changing the loop counter to Integer, but that didn't produce any change in the time. Strange. I can confirm what @Dalija Prasnikar reported about a 50% reduction in a comment above. For Win32 changing num from Int64 to Integer reduces the time from 5.8 seconds down to 3.5 seconds. For Win64 the runtime for Int64 is way higher (10.4 seconds) than for Win32 (5.8 seconds), but reduces to a slightly better value for num being Integer: 3.4 seconds. 1 Share this post Link to post
Fr0sT.Brutal 900 Posted March 21, 2022 (edited) 9 hours ago, Uwe Raabe said: For Win32 changing num from Int64 to Integer reduces the time from 5.8 seconds down to 3.5 seconds. That's because 64-bit operations for x32 occupy 2 registers. More or less simple with addition, terrible with division. However, speedup with 32-bit num is unexpected. Edited March 21, 2022 by Fr0sT.Brutal Share this post Link to post
Der schöne Günther 316 Posted March 21, 2022 (edited) Just for your information, the following snippet in Rust was around 1250 ms on my system. Compilation about 1 second. fn main() { let mut total: u64 = 0; for number in 1..=1000000000 { if ((number % 3) == 0) || ((number % 5) == 0) { total += number; }; } println!("total is {}", total); } The Delphi implementation was around 7.5 seconds on this machine. Edited March 21, 2022 by Der schöne Günther Share this post Link to post
Dalibor31 0 Posted August 31, 2023 program Benchmark; {$APPTYPE CONSOLE} uses SysUtils, DateUtils; var StartTime, EndTime: TDateTime; Total: Int64; NumOfIterations3, NumOfIterations5, NumOfIterations15: Int64; Sum3, Sum5, Sum15: Int64; Duration: Double; Key: Char; begin // Start measuring time StartTime := Now; NumOfIterations3 := 1000000000 div 3; NumOfIterations5 := 1000000000 div 5; NumOfIterations15 := 1000000000 div 15; Sum3 := NumOfIterations3 * (NumOfIterations3 + 1) div 2 * 3; Sum5 := NumOfIterations5 * (NumOfIterations5 + 1) div 2 * 5; Sum15 := NumOfIterations15 * (NumOfIterations15 + 1) div 2 * 15; Total := Sum3 + Sum5 - Sum15; // Stop measuring time EndTime := Now; Duration := SecondsBetween(EndTime, StartTime); Writeln('Total: ', Total); Writeln('Time taken: ', Duration:0:6, ' seconds'); Write('Press any key to exit'); Readln(Key); end. What about this code optimization? Share this post Link to post
Ali Dehban 38 Posted January 25 Of course, we don't have the magic parallel=true flag but still, we can use parallelism. Maybe this is not the perfect one but something like this will give you the maximum performance probably. I'm adding this reply just for those who may need something like this, this is not the real answer probably. Anyway, I was able to achieve a 25% improvement in performance by using the code below: Remarks: - NumTasks = 4; ==> Specifies the number of parallel tasks to be used which is limited by the available resources, you can try with more or less chunks. - sw: An instance of TStopwatch to measure the elapsed time. - The code utilizes TParallel.For to parallelize the loop, dividing the range of numbers from 1 to 1000000000 into chunks based on the number of tasks (NumTasks). - Each parallel task calculates the sum of multiples within its assigned chunk using the SumMultiples procedure. - The partial sums from each task are added to the total using TInterlocked.Add for thread-safe operations. uses System.SysUtils, System.Diagnostics, System.Threading, System.SyncObjs, System.Math; const NumTasks = 4; procedure SumMultiples(var partialTotal: Int64; start, finish: Int64); var num: Integer; begin partialTotal := 0; for num := start to finish do if (num mod 3 = 0) or (num mod 5 = 0) then partialTotal := partialTotal + num; end; Var total: Int64; sw: TStopwatch; begin sw := TStopwatch.Create; sw.start; total := 0; TParallel.For(1, NumTasks, procedure(index: Int64) var start, finish, partialTotal: Int64; chunkSize: Int64; begin chunkSize := 1000000000 div NumTasks; start := (index - 1) * chunkSize + 1; finish := Min(index * chunkSize, 1000000000); SumMultiples(partialTotal, start, finish); TInterlocked.Add(total, partialTotal); end); sw.Stop; WriteLn(total); WriteLn('Total time ', sw.ElapsedMilliseconds, ' ms'); Readln; end. Share this post Link to post