Jump to content
julkas

Multiple two UInt64 modulo

Recommended Posts

Hiere is my ASM code for (a * b) mod m, where a, b, m - UInt64.

Can it be done better ? Thanks.

Online - https://ideone.com/LKNjMM.

program mulmod;
{$IF Defined(FPC)}{$MODE Delphi}{$ASMMODE Intel}{$ENDIF}

function MulModASM(a, b: UInt64; m: UInt64): UInt64;
asm
  MOV RAX, a
  MOV RCX, m
  MUL b
  DIV RCX
  MOV @Result, RDX
end;

var
  a, b, m: UInt64;

begin
  m := $FFFFFFFFFFFFFFFF;
  a := 3;
  b := m - 1;
  WriteLn(a, ' ', b, ' ', m);
  WriteLn('   ASM - ', MulModASM(a, b, m));
  WriteLn('Pascal - ', (a * b) mod m);  
end.

 

Share this post


Link to post
Guest

That is wrong, there are mistakes 

1) Delphi can't handle 128bit values in simple or basic types, so when you multiply 64bit by 64bit the result is 128 but any operation on it will be truncated, means the lower value (64bit) only will be used.

2) You are dividing and taking only the lower part, this can be done only when you are using modulo Mersenne number (m), only then you can truncated the values at the length of m+1, EDIT: don't confuse Mersenne number with Mersenne prime, like i did.

 

So i google and found this

https://stackoverflow.com/questions/12168348/ways-to-do-modulo-multiplication-with-primitive-types

this exactly what are you asking for, because you are not using BigInt library (BigNum), 

 

I took that code in the first answer 

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    int64_t res = 0;
    while (a != 0) {
        if (a & 1) res = (res + b) % m;
        a >>= 1;
        b = (b << 1) % m;
    }
    return res;
}

And then pasted it in https://godbolt.org/ you can switch the compiler to see for yourself different platform, compilers and machine code, if you choose X86-64 GCC 10.1, then the result will be this

mulmod(unsigned long, unsigned long, unsigned long):
        mov     rcx, rdx
        test    rdi, rdi
        je      .L5
        xor     r8d, r8d
.L4:
        mov     r9, r8
        test    dil, 1
        je      .L3
        lea     rax, [r8+rsi]
        xor     edx, edx
        div     rcx
        mov     r8, rdx
        mov     r9, rdx
.L3:
        lea     rax, [rsi+rsi]
        xor     edx, edx
        div     rcx
        shr     rdi
        mov     rsi, rdx
        jne     .L4
        mov     rax, r9
        ret
.L5:
        xor     r9d, r9d
        mov     rax, r9
        ret

And that is clean assembly code, only you have to worry about calling convention, google that, or someone here might explain it for the different platforms supported by Delphi. 

 

Now, what is this algorithm ?! you may ask.

Refer to https://en.wikipedia.org/wiki/Modular_arithmetic 

And don't forget to check the sources and links, 

 

To use BigInt, either implement your own or find already tested library, i found two that will work for you, if you need big integers.

https://github.com/Xor-el/CryptoLib4Pascal

https://github.com/delphi-pascal-archive/Pascal-RSA

Edited by Guest

Share this post


Link to post

Karatsuba's algorithm is a fast multiplication algorithm.

Karatsuba noted that in fact only three multiplications of n / 2 - digit numbers are enough, since
(ad + bc) = (a + b) * (c + d) - ac - bd.

 

That is, using 64/2 = 32 bit numbers, you can get a 64 bit result.

 

N mod m where m = 2 ^ k is a power of two, you can discard bit digits more than k.
N mod 2 ^ 32 is a 32 bit number.

Edited by Marat1961

Share this post


Link to post
1 hour ago, Kas Ob. said:

2) You are dividing and taking only the lower part, this can be done only when you are using modulo Mersenne number (m), only then you can truncated the values at the length of m+1, EDIT: don't confuse Mersenne number with Mersenne prime, like i did.

What counts is that in x86_64 asm, MUL and DIV are 128-bit.

https://www.felixcloutier.com/x86/mul states:

REX.W + F7 /4    MUL r/m64    M    Valid    N.E.    Unsigned multiply (RDX:RAX ← RAX ∗ r/m64).

and https://www.felixcloutier.com/x86/div reports:

REX.W + F7 /6    DIV r/m64    M    Valid    N.E.    Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder

 

So my guess is that this code is correct, and the fastest possible on the target.
Only if m is known in advance (as a constant), you may be able to make the division faster by using a reciprocal multiplication or a bits shift.

 

From https://www.agner.org/optimize/instruction_tables.pdf on a modern CPU, 64-bit MUL takes around 3 cycles, and 64-bit DIV 10 cycles.
Difficult to make it faster with any other algorithm.

 

As a small optimization, you can omit the stack frame:

 

function MulModASM(a, b, m: UInt64): UInt64;
{$ifdef FPC}nostackframe; assembler; asm {$else} asm .noframe {$endif FPC}
  MOV rax, a
  MOV rcx, m
  MUL b
  DIV rcx
  MOV rax, RDX
end;

 

https://ideone.com/KUsgMi

 

About register conventions, for windows/posix: rcx/rdi=a rdx/rsi=b r8/rdx=m so it should be fine.
Just to be tested with a lot of values and high 64-bit limits.

Edited by Arnaud Bouchez
  • Like 2

Share this post


Link to post
Guest

You are right, the code will work for UInt64, i thought i saw IDIV, my bad, running the code in Delphi gave different result and i wrongly assumed the asm is wrong but in fact it was mod in Delphi that failed.

Share this post


Link to post
2 hours ago, Marat1961 said:

Karatsuba's algorithm is a fast multiplication algorithm.

Karatsuba noted that in fact only three multiplications of n / 2 - digit numbers are enough, since
(ad + bc) = (a + b) * (c + d) - ac - bd.

 

That is, using 64/2 = 32 bit numbers, you can get a 64 bit result.

 

N mod m where m = 2 ^ k is a power of two, you can discard bit digits more than k.
N mod 2 ^ 32 is a 32 bit number.

Yes that's right but Karatsuba has its own applications and would be great for BigInteger. But for OP I think that's too overhead compared to native HW multiplication.

Share this post


Link to post

Why is an assembler needed for this operation?

If the code is used for example to generate a random number.

var seed class: Int64;
function of the TRandom64.Def class: Int64;
to begin
  Result: = seed * 6364136223846793005 + 1442695040888963407;
  seed: = Result;
end;

then we need a 2 ^ 64 remainder of the products of two Uint64 numbers.

To do this, it is enough to multiply two numbers and take the least significant number from the result.

 

Since the leading part of the result is not used, it seems there is no difference which multiplication works signed or unsigned.

I was tormented by vague doubts, I wrote several tests and found no difference.

type
  T2Uin64 = record
    hi, lo: UInt64;
  end;

procedure UMul64 (a, b: UInt64; var r: T2Uin64);
asm
  MOV RAX, a
  MUL b
  MOV r.lo, RAX
  MOV r.hi, RDX
end;

procedure TForm2.Button2Click (Sender: TObject);
const
  imax32 = Maxint;
  imax36 = UInt64 (Maxint) * 16;
  imax63 = 9223372036854775807;
  umax64 = 18446744073709551615;
var
  a, b, m, c, d, r1, r2: UInt64;
  r: T2Uin64;
begin
  a: = imax63;
  b: = 2;
  d: = umax64 - 1;
  c: = a * b;
  r1: = c div m;
  r2: = (a * b) div m;
  Assert (c = d);
  UMul64 (a, b, r);
  r2: = r.lo div m;
  d: = r.lo;
  Assert (d = c);
  a: = umax64;
  b: = 2;
  d: = umax64;
  c: = a * b;
  UMul64 (a, b, r);
  b: = umax64;
  c: = a * b;
  UMul64 (umax64, umax64, r);
end;

If you want to find the remainder again, there is no need to use an assembler.

// r2: = (a * b) div m;
mov rax, [rbp + 68 $]
imul rax, [rbp + $ 60]
xor edx, edx
div qword ptr [rbp + $ 58]
mov [rbp + $ 38], rax

It's better than calling an assembler subroutine anyway, because it will be faster.
The code becomes portable and will work in both 32-bit and 64-bit implementations and on any platform.

 

In my opinion, optimizing using assembler will only hurt here.

Edited by Marat1961
  • Like 1

Share this post


Link to post
32 minutes ago, julkas said:

My initial solution has bug - we must first reduce a and b modulo m.

Can you describe the same as an arithmetic expression? To make it clear and unambiguous.

 

Gigo. Garbage at the input garbage at the output.

Edited by Marat1961

Share this post


Link to post

Revised -

program mulmod;
{$IF Defined(FPC)}{$MODE Delphi}{$ASMMODE Intel}{$ENDIF}

function MulModASM(a, b: UInt64; m: UInt64): UInt64;
label
  CHECK_B, MUL_MOD;
asm
  MOV RCX, m
  MOV RAX, a
  CMP RAX, RCX
  JB CHECK_B
  XOR RDX, RDX
  DIV RCX
  MOV RAX, RDX

CHECK_B:
  MOV R8, RAX
  MOV RAX, b
  CMP RAX, RCX
  JB MUL_MOD
  XOR RDX, RDX
  DIV RCX
  MOV RAX, RDX

MUL_MOD:
  XOR RDX, RDX
  MUL R8
  DIV RCX
  MOV @Result, RDX

end;

var
  a, b, m: UInt64;

begin
  m := $FFFFFFFFFFFFFFFF;
  a := 3;
  b := m - 1;
  WriteLn(a, ' ', b, ' ', m);
  WriteLn('   ASM - ', MulModASM(a, b, m));
  WriteLn('   ASM - ', MulModASM(3, 12, 9));  
  WriteLn('   ASM - ', MulModASM(m, m, 9));    
  WriteLn('Pascal - ', (a * b) mod m);  
  WriteLn('Pascal - ', (m * m) mod 9);  
  
end.

 

Share this post


Link to post
On 10/29/2020 at 8:36 AM, Marat1961 said:

Why is an assembler needed for this operation?

 

The code becomes portable and will work in both 32-bit and 64-bit implementations and on any platform.

 

I like portable code. But sometimes I need fast ASM Intel 64-bit Object Pascal implementation.

Share this post


Link to post
On 10/29/2020 at 5:45 PM, Marat1961 said:

Gigo. Garbage at the input garbage at the output.

OK. Please provide example of a, b, m (m <> 0) where my last solution is wrong.

Share this post


Link to post
Guest
4 hours ago, julkas said:

OK. Please provide example of a, b, m (m <> 0) where my last solution is wrong.

OK, Please provide explanation of what on earth are you are trying to do ?

 

Your solution is impossible to be judge unless you elaborate, but looking at you the so called revised

1) you don't need to zero RDX before division.

2) the parameter "a" is in RCX and your first line "MOV RCX, m" is overwriting it a with m.

3) WHAY, i mean why your sample in two posts insist on using m=$FFFFFFFFFFFFFFFF, that value is exactly "m= UInt64(-1)", in other words it is the highest value for UInt64, means the modulus x in "x = a mod m" when "m= highestvalue" will always be "x=a" unless a=m then it will be "x=0", so no need for mod for that value to begin with, a simple compare will do if you need "mod highestvalue".

 

again explain what are you trying to achieve? as Marat1961 assumed you are trying to implement a random number generator, while i assumed you are trying to implement multiplication in a Galois field.

 

So before answering any more question, spare us the guessing and help yourself by format your questions in meaningful manner.

Share this post


Link to post
4 hours ago, Kas Ob. said:

1) you don't need to zero RDX before division.

Why not? I thought RDX:RAX are used:

Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder.

Share this post


Link to post
Guest
9 minutes ago, Vandrovnik said:

Why not? I thought RDX:RAX are used:

Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder.

Exactly, RDX:RAX is 128 bit, so if he is doing mod "value <= 2^64" then there is no point doing the division using 128 instead of 64bit.

 

On top of that the parameter "b" is passed on RDX and he is clearing it, the whole thing is a mess and meaningless.

Share this post


Link to post
38 minutes ago, Kas Ob. said:

Exactly, RDX:RAX is 128 bit, so if he is doing mod "value <= 2^64" then there is no point doing the division using 128 instead of 64bit.

But when divisor is 64bit, he has to use the "128 bits divide by 64 bits", because, as far as I know, there is no instruction for "64 bits divide by 64 bits", or is there?

Share this post


Link to post
Guest
15 minutes ago, Vandrovnik said:

But when divisor is 64bit, he has to use the "128 bits divide by 64 bits", because, as far as I know, there is no instruction for "64 bits divide by 64 bits", or is there?

Reading what i wrote, i think i could write it clearer, which make me just like OP, 🙂

as for 64bit division you can go with "div ecx" it does work on 64bit exactly like 32bit architecture, and will use EDX just like RDX, 

my point was unless we do understand what is needed, we can't help, there are tricks you can use to avoid mod or division, even you can replace mod with simple "AND" in some cases, how many i saw multiply by 4 which could be a simple shift, and all of that depend on the algorithm, and more importantly is depends on the usage of that ModMul to begin with.

If he do need that extreme speed then it is better to first, find the best algorithm, only after that comes assembly to help, even though it looks straight forward like translation, in some cases you need to tweak a little to avoid switching register by switching parameters.

 

Share this post


Link to post
18 hours ago, julkas said:

I like portable code. But sometimes I need fast ASM Intel 64-bit Object Pascal implementation.

Where will your code in the procedure be faster than the same code generated by the compiler.

As far as I understand, the functionality of the code is the same, but the overhead of calling the procedure is added.

As a result, if we calculate everything, we lose in performance.

Edited by Marat1961

Share this post


Link to post
6 hours ago, Marat1961 said:

Where will your code in the procedure be faster than the same code generated by the compiler.

As far as I understand, the functionality of the code is the same, but the overhead of calling the procedure is added.

As a result, if we calculate everything, we lose in performance.

Thanks all for replies.

Please see question - https://www.quora.com/What-are-the-10-biggest-prime-numbers-below-2-64

Here is my last solution  - https://ideone.com/oO83MV

 

Can you do it faster using only Miller Rabin test?

If - yes,, provide your solution.

 

BTW - My pure Object Pascal implementation is here - https://github.com/JulStrat/primesieve-pas/blob/mult/mult/modar.pas

 

Edited by julkas

Share this post


Link to post
Guest

@julkas Thank you for clearing this out.

 

These algorithm you are after are specific for rings and fields, so the arithmetic operation mentioned here https://en.wikipedia.org/wiki/Modular_arithmetic#Integers_modulo_n are specific to Z/nZ ring.

 

It looks good except for your assembly in your last solution, it is still wrong.

Quote

function MulModASM(a, b: UInt64; m: UInt64): UInt64;
//   RCX = a   RDX = b  R8 = m
label
  MUL_MOD; 
asm
  MOV RCX, m    //   We lost a 
  MOV RAX, a    //   a is m now !
  CMP RAX, RCX
  JB MUL_MOD

 

And that was for the assembly, now for the algorithm itself, we can apply the following 

image.png.593aaccaa7752106633108696216301b.png         (1)     EDIT: replaced the formula with image

But this is useful and exist to simplify the multiplication when the size/length of "a" and "b" can't fit simple type, or we have very big integers, because lets say "a" is an integer with "an" bit and "b" does have "ab" bits, then the multiplication will have an+ab bit, unlike the addition a+b which will have "max(an,ab)+1" bit, also worst case is when an+ab are longer than CPU register, but luckily this is not the case here, as you asking for 64bit, and our CPU's does support 128Bit division and multiplication !, right ?

 

So i want you to fix it yourself, it is simple and don't need to use (1), because while that equation is there to simplify the calculation, in your case with UInt64 it will does the opposite and complicate the calculation making it slower.

 

Hint : better than using local vars, you have plenty of registers to store a,b and m and manipulate the values, just remember for any register that you will use other than "RAX,RCX,RDX,R8,R9" you should save it then restore it (push and pop)

 

 

Edited by Guest

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×