julkas 12 Posted October 28, 2020 Hiere is my ASM code for (a * b) mod m, where a, b, m - UInt64. Can it be done better ? Thanks. Online - https://ideone.com/LKNjMM. program mulmod; {$IF Defined(FPC)}{$MODE Delphi}{$ASMMODE Intel}{$ENDIF} function MulModASM(a, b: UInt64; m: UInt64): UInt64; asm MOV RAX, a MOV RCX, m MUL b DIV RCX MOV @Result, RDX end; var a, b, m: UInt64; begin m := $FFFFFFFFFFFFFFFF; a := 3; b := m - 1; WriteLn(a, ' ', b, ' ', m); WriteLn(' ASM - ', MulModASM(a, b, m)); WriteLn('Pascal - ', (a * b) mod m); end. Share this post Link to post
Guest Posted October 28, 2020 (edited) That is wrong, there are mistakes 1) Delphi can't handle 128bit values in simple or basic types, so when you multiply 64bit by 64bit the result is 128 but any operation on it will be truncated, means the lower value (64bit) only will be used. 2) You are dividing and taking only the lower part, this can be done only when you are using modulo Mersenne number (m), only then you can truncated the values at the length of m+1, EDIT: don't confuse Mersenne number with Mersenne prime, like i did. So i google and found this https://stackoverflow.com/questions/12168348/ways-to-do-modulo-multiplication-with-primitive-types this exactly what are you asking for, because you are not using BigInt library (BigNum), I took that code in the first answer uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { int64_t res = 0; while (a != 0) { if (a & 1) res = (res + b) % m; a >>= 1; b = (b << 1) % m; } return res; } And then pasted it in https://godbolt.org/ you can switch the compiler to see for yourself different platform, compilers and machine code, if you choose X86-64 GCC 10.1, then the result will be this mulmod(unsigned long, unsigned long, unsigned long): mov rcx, rdx test rdi, rdi je .L5 xor r8d, r8d .L4: mov r9, r8 test dil, 1 je .L3 lea rax, [r8+rsi] xor edx, edx div rcx mov r8, rdx mov r9, rdx .L3: lea rax, [rsi+rsi] xor edx, edx div rcx shr rdi mov rsi, rdx jne .L4 mov rax, r9 ret .L5: xor r9d, r9d mov rax, r9 ret And that is clean assembly code, only you have to worry about calling convention, google that, or someone here might explain it for the different platforms supported by Delphi. Now, what is this algorithm ?! you may ask. Refer to https://en.wikipedia.org/wiki/Modular_arithmetic And don't forget to check the sources and links, To use BigInt, either implement your own or find already tested library, i found two that will work for you, if you need big integers. https://github.com/Xor-el/CryptoLib4Pascal https://github.com/delphi-pascal-archive/Pascal-RSA Edited October 28, 2020 by Guest Share this post Link to post
Marat1961 17 Posted October 28, 2020 (edited) Karatsuba's algorithm is a fast multiplication algorithm. Karatsuba noted that in fact only three multiplications of n / 2 - digit numbers are enough, since (ad + bc) = (a + b) * (c + d) - ac - bd. That is, using 64/2 = 32 bit numbers, you can get a 64 bit result. N mod m where m = 2 ^ k is a power of two, you can discard bit digits more than k. N mod 2 ^ 32 is a 32 bit number. Edited October 28, 2020 by Marat1961 Share this post Link to post
Arnaud Bouchez 407 Posted October 28, 2020 (edited) 1 hour ago, Kas Ob. said: 2) You are dividing and taking only the lower part, this can be done only when you are using modulo Mersenne number (m), only then you can truncated the values at the length of m+1, EDIT: don't confuse Mersenne number with Mersenne prime, like i did. What counts is that in x86_64 asm, MUL and DIV are 128-bit. https://www.felixcloutier.com/x86/mul states: REX.W + F7 /4 MUL r/m64 M Valid N.E. Unsigned multiply (RDX:RAX ← RAX ∗ r/m64). and https://www.felixcloutier.com/x86/div reports: REX.W + F7 /6 DIV r/m64 M Valid N.E. Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder So my guess is that this code is correct, and the fastest possible on the target. Only if m is known in advance (as a constant), you may be able to make the division faster by using a reciprocal multiplication or a bits shift. From https://www.agner.org/optimize/instruction_tables.pdf on a modern CPU, 64-bit MUL takes around 3 cycles, and 64-bit DIV 10 cycles. Difficult to make it faster with any other algorithm. As a small optimization, you can omit the stack frame: function MulModASM(a, b, m: UInt64): UInt64; {$ifdef FPC}nostackframe; assembler; asm {$else} asm .noframe {$endif FPC} MOV rax, a MOV rcx, m MUL b DIV rcx MOV rax, RDX end; https://ideone.com/KUsgMi About register conventions, for windows/posix: rcx/rdi=a rdx/rsi=b r8/rdx=m so it should be fine. Just to be tested with a lot of values and high 64-bit limits. Edited October 28, 2020 by Arnaud Bouchez 2 Share this post Link to post
Guest Posted October 28, 2020 You are right, the code will work for UInt64, i thought i saw IDIV, my bad, running the code in Delphi gave different result and i wrongly assumed the asm is wrong but in fact it was mod in Delphi that failed. Share this post Link to post
Mahdi Safsafi 225 Posted October 28, 2020 2 hours ago, Marat1961 said: Karatsuba's algorithm is a fast multiplication algorithm. Karatsuba noted that in fact only three multiplications of n / 2 - digit numbers are enough, since (ad + bc) = (a + b) * (c + d) - ac - bd. That is, using 64/2 = 32 bit numbers, you can get a 64 bit result. N mod m where m = 2 ^ k is a power of two, you can discard bit digits more than k. N mod 2 ^ 32 is a 32 bit number. Yes that's right but Karatsuba has its own applications and would be great for BigInteger. But for OP I think that's too overhead compared to native HW multiplication. Share this post Link to post
Marat1961 17 Posted October 29, 2020 (edited) Why is an assembler needed for this operation? If the code is used for example to generate a random number. var seed class: Int64; function of the TRandom64.Def class: Int64; to begin Result: = seed * 6364136223846793005 + 1442695040888963407; seed: = Result; end; then we need a 2 ^ 64 remainder of the products of two Uint64 numbers. To do this, it is enough to multiply two numbers and take the least significant number from the result. Since the leading part of the result is not used, it seems there is no difference which multiplication works signed or unsigned. I was tormented by vague doubts, I wrote several tests and found no difference. type T2Uin64 = record hi, lo: UInt64; end; procedure UMul64 (a, b: UInt64; var r: T2Uin64); asm MOV RAX, a MUL b MOV r.lo, RAX MOV r.hi, RDX end; procedure TForm2.Button2Click (Sender: TObject); const imax32 = Maxint; imax36 = UInt64 (Maxint) * 16; imax63 = 9223372036854775807; umax64 = 18446744073709551615; var a, b, m, c, d, r1, r2: UInt64; r: T2Uin64; begin a: = imax63; b: = 2; d: = umax64 - 1; c: = a * b; r1: = c div m; r2: = (a * b) div m; Assert (c = d); UMul64 (a, b, r); r2: = r.lo div m; d: = r.lo; Assert (d = c); a: = umax64; b: = 2; d: = umax64; c: = a * b; UMul64 (a, b, r); b: = umax64; c: = a * b; UMul64 (umax64, umax64, r); end; If you want to find the remainder again, there is no need to use an assembler. // r2: = (a * b) div m; mov rax, [rbp + 68 $] imul rax, [rbp + $ 60] xor edx, edx div qword ptr [rbp + $ 58] mov [rbp + $ 38], rax It's better than calling an assembler subroutine anyway, because it will be faster. The code becomes portable and will work in both 32-bit and 64-bit implementations and on any platform. In my opinion, optimizing using assembler will only hurt here. Edited October 29, 2020 by Marat1961 1 Share this post Link to post
julkas 12 Posted October 29, 2020 ZZZ! My initial solution has bug - we must first reduce a and b modulo m. Share this post Link to post
Marat1961 17 Posted October 29, 2020 (edited) 32 minutes ago, julkas said: My initial solution has bug - we must first reduce a and b modulo m. Can you describe the same as an arithmetic expression? To make it clear and unambiguous. Gigo. Garbage at the input garbage at the output. Edited October 29, 2020 by Marat1961 Share this post Link to post
julkas 12 Posted October 30, 2020 Revised - program mulmod; {$IF Defined(FPC)}{$MODE Delphi}{$ASMMODE Intel}{$ENDIF} function MulModASM(a, b: UInt64; m: UInt64): UInt64; label CHECK_B, MUL_MOD; asm MOV RCX, m MOV RAX, a CMP RAX, RCX JB CHECK_B XOR RDX, RDX DIV RCX MOV RAX, RDX CHECK_B: MOV R8, RAX MOV RAX, b CMP RAX, RCX JB MUL_MOD XOR RDX, RDX DIV RCX MOV RAX, RDX MUL_MOD: XOR RDX, RDX MUL R8 DIV RCX MOV @Result, RDX end; var a, b, m: UInt64; begin m := $FFFFFFFFFFFFFFFF; a := 3; b := m - 1; WriteLn(a, ' ', b, ' ', m); WriteLn(' ASM - ', MulModASM(a, b, m)); WriteLn(' ASM - ', MulModASM(3, 12, 9)); WriteLn(' ASM - ', MulModASM(m, m, 9)); WriteLn('Pascal - ', (a * b) mod m); WriteLn('Pascal - ', (m * m) mod 9); end. Share this post Link to post
julkas 12 Posted October 31, 2020 On 10/29/2020 at 8:36 AM, Marat1961 said: Why is an assembler needed for this operation? The code becomes portable and will work in both 32-bit and 64-bit implementations and on any platform. I like portable code. But sometimes I need fast ASM Intel 64-bit Object Pascal implementation. Share this post Link to post
julkas 12 Posted October 31, 2020 On 10/29/2020 at 5:45 PM, Marat1961 said: Gigo. Garbage at the input garbage at the output. OK. Please provide example of a, b, m (m <> 0) where my last solution is wrong. Share this post Link to post
Guest Posted October 31, 2020 4 hours ago, julkas said: OK. Please provide example of a, b, m (m <> 0) where my last solution is wrong. OK, Please provide explanation of what on earth are you are trying to do ? Your solution is impossible to be judge unless you elaborate, but looking at you the so called revised 1) you don't need to zero RDX before division. 2) the parameter "a" is in RCX and your first line "MOV RCX, m" is overwriting it a with m. 3) WHAY, i mean why your sample in two posts insist on using m=$FFFFFFFFFFFFFFFF, that value is exactly "m= UInt64(-1)", in other words it is the highest value for UInt64, means the modulus x in "x = a mod m" when "m= highestvalue" will always be "x=a" unless a=m then it will be "x=0", so no need for mod for that value to begin with, a simple compare will do if you need "mod highestvalue". again explain what are you trying to achieve? as Marat1961 assumed you are trying to implement a random number generator, while i assumed you are trying to implement multiplication in a Galois field. So before answering any more question, spare us the guessing and help yourself by format your questions in meaningful manner. Share this post Link to post
Vandrovnik 215 Posted October 31, 2020 4 hours ago, Kas Ob. said: 1) you don't need to zero RDX before division. Why not? I thought RDX:RAX are used: Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder. Share this post Link to post
Guest Posted October 31, 2020 9 minutes ago, Vandrovnik said: Why not? I thought RDX:RAX are used: Unsigned divide RDX:RAX by r/m64, with result stored in RAX ← Quotient, RDX ← Remainder. Exactly, RDX:RAX is 128 bit, so if he is doing mod "value <= 2^64" then there is no point doing the division using 128 instead of 64bit. On top of that the parameter "b" is passed on RDX and he is clearing it, the whole thing is a mess and meaningless. Share this post Link to post
Vandrovnik 215 Posted October 31, 2020 38 minutes ago, Kas Ob. said: Exactly, RDX:RAX is 128 bit, so if he is doing mod "value <= 2^64" then there is no point doing the division using 128 instead of 64bit. But when divisor is 64bit, he has to use the "128 bits divide by 64 bits", because, as far as I know, there is no instruction for "64 bits divide by 64 bits", or is there? Share this post Link to post
Guest Posted October 31, 2020 15 minutes ago, Vandrovnik said: But when divisor is 64bit, he has to use the "128 bits divide by 64 bits", because, as far as I know, there is no instruction for "64 bits divide by 64 bits", or is there? Reading what i wrote, i think i could write it clearer, which make me just like OP, 🙂 as for 64bit division you can go with "div ecx" it does work on 64bit exactly like 32bit architecture, and will use EDX just like RDX, my point was unless we do understand what is needed, we can't help, there are tricks you can use to avoid mod or division, even you can replace mod with simple "AND" in some cases, how many i saw multiply by 4 which could be a simple shift, and all of that depend on the algorithm, and more importantly is depends on the usage of that ModMul to begin with. If he do need that extreme speed then it is better to first, find the best algorithm, only after that comes assembly to help, even though it looks straight forward like translation, in some cases you need to tweak a little to avoid switching register by switching parameters. Share this post Link to post
Marat1961 17 Posted November 1, 2020 (edited) 18 hours ago, julkas said: I like portable code. But sometimes I need fast ASM Intel 64-bit Object Pascal implementation. Where will your code in the procedure be faster than the same code generated by the compiler. As far as I understand, the functionality of the code is the same, but the overhead of calling the procedure is added. As a result, if we calculate everything, we lose in performance. Edited November 1, 2020 by Marat1961 Share this post Link to post
julkas 12 Posted November 1, 2020 (edited) 6 hours ago, Marat1961 said: Where will your code in the procedure be faster than the same code generated by the compiler. As far as I understand, the functionality of the code is the same, but the overhead of calling the procedure is added. As a result, if we calculate everything, we lose in performance. Thanks all for replies. Please see question - https://www.quora.com/What-are-the-10-biggest-prime-numbers-below-2-64 Here is my last solution - https://ideone.com/oO83MV Can you do it faster using only Miller Rabin test? If - yes,, provide your solution. BTW - My pure Object Pascal implementation is here - https://github.com/JulStrat/primesieve-pas/blob/mult/mult/modar.pas Edited November 1, 2020 by julkas Share this post Link to post
Guest Posted November 1, 2020 (edited) @julkas Thank you for clearing this out. These algorithm you are after are specific for rings and fields, so the arithmetic operation mentioned here https://en.wikipedia.org/wiki/Modular_arithmetic#Integers_modulo_n are specific to Z/nZ ring. It looks good except for your assembly in your last solution, it is still wrong. Quote function MulModASM(a, b: UInt64; m: UInt64): UInt64; // RCX = a RDX = b R8 = m label MUL_MOD; asm MOV RCX, m // We lost a MOV RAX, a // a is m now ! CMP RAX, RCX JB MUL_MOD And that was for the assembly, now for the algorithm itself, we can apply the following (1) EDIT: replaced the formula with image But this is useful and exist to simplify the multiplication when the size/length of "a" and "b" can't fit simple type, or we have very big integers, because lets say "a" is an integer with "an" bit and "b" does have "ab" bits, then the multiplication will have an+ab bit, unlike the addition a+b which will have "max(an,ab)+1" bit, also worst case is when an+ab are longer than CPU register, but luckily this is not the case here, as you asking for 64bit, and our CPU's does support 128Bit division and multiplication !, right ? So i want you to fix it yourself, it is simple and don't need to use (1), because while that equation is there to simplify the calculation, in your case with UInt64 it will does the opposite and complicate the calculation making it slower. Hint : better than using local vars, you have plenty of registers to store a,b and m and manipulate the values, just remember for any register that you will use other than "RAX,RCX,RDX,R8,R9" you should save it then restore it (push and pop) Edited November 1, 2020 by Guest Share this post Link to post