Re: [Cfrg] Progress on curve recommendations for TLS WG

Here is a simple sketch how the prime affects the mop operations.

Comparing multiplication of 2 n-limb numbers with the reduction mod 
n-limb number p:

* in case of a random prime, we have n^2 multiplies. This gives 2n 
number, n higher limbs of which will need to be reduced. To accomplish 
this we multiply p by each of n high limbs, for the total n^2. We have 
2n^2 multiplies

* in case of p=2^k-C, we have the same n^2 multiplies, but the reduction 
results in only n multiplies of high limbs by C. We have n^2 + n

So this is (n^2 + n) / 2n^2 ~= 1/2 (pseudo-Mersenne primes have ~50% 
fewer limb multiplies for multiply+reduce)

This estimate underestimates  the reduction that takes place for the 
general p. One needs an additional n divisions (of a limp by the highest 
limb of p) in denominator, which gives the Mike Hamburg's quantity 
exactly (which uses the Montgomery reduction method). I didn't count 
additions/subtractions.

On 08/15/2014 02:28 PM, Mike Hamburg wrote:
> On 8/15/2014 4:26 AM, Johannes Merkle wrote:
>> Ilari Liusvaara wrote on 08.08.2014 16:15:
>>> Picking prime randomly has _devastating_ impact on performance.
>> I do not mean to doubt that statement, I'm just looking for hard numbers: Do you have any reference for a proper
>> analysis supporting that? I mean something beyond "oh, that's common sense".
> So, I don't have a proper reference, but in my experience the performance ratio is slightly more than 2, and probably creeps towards 2.5 or 3 as the number of bits increases towards 512.
>
> An n-digit multiply with no reduction algorithm costs n^2 muls, schoolbook.  A Montgomery reduction costs n^2+n additional multiplies, for a total of 2n^2 + n.  There may also be a conditional subtract, depending on the modulus.  A special prime still requires some reduction, and Montgomery ends up being slightly less than twice as expensive as schoolbook.
>
> Supporting numbers: a few years ago I implemented 252-bit special-Montgomery primes on Sandy Bridge (mod the special prime, reduction isn't free but costs n=4 mulaccs), and a multiply took about 55 cycles, vs about 100 cycles for a 252-bit general prime.  The algorithms were the same, and both implementations were written in assembly.  This is what you'd have predicted from the above guesses, since (n^2 + n) / (2n^2 + n) = 20 / 36 = 55.555...%.
>
> Squaring costed about 10 cycles less in either case, for a flat ratio of 90:45 = 2:1.  This is because Montgomery reduction does not go any faster when you square, which takes n(n+1)/2 multiplies schoolbook style, but 3x that with a Montgomery reduction.  These asymptotics are not even close to true with n=4, though.
>
> As p increases, the ratio will get worse.  The reduction mod the special prime is a smaller fraction of the total work, and squaring gets closer to asymptotic.  Montgomery reduction does not improve as much with Karatsuba multiplication.  Karatsuba may cut the cost of the multiplication by 25% or more at 384 or 512 bits, but it’s rather tricky to combine with a general Montgomery reduction efficiently.  Furthermore, if you try to do this, the state will be larger and may spill out of the register file.  For example, my 448-bit NEON code exactly uses the entire register file with no spilling; anything extra would add significant cost.
>
> Exactly 64n-bit aligned primes will be slightly worse when randomized, because numbers need to be reduced more often.
>
> Finally, powering ladders like inverse or square root are denser and therefore slower with a random prime  A powering ladder for a special prime usually has a multiply density of around 5%, but for a random prime it is likely to be 20%.  Inversion is about 10% of a variable-base and 25% of a fixed-base scalarmul, so this effect is unlikely to exceed 5% of the total time.
>
> In combination, these will push the ratio to ~2x at 256 bits, and higher for longer numbers, but probably not so high as 3x even at 500 bits (but maybe at 512 or 521 bits).  Not having implemented both a large random and a large special prime together on the same platform, I’m not sure exactly what the ratio will be.
>
> Cheers,
> -- Mike
>
> _______________________________________________
> Cfrg mailing list
> Cfrg@irtf.org
> http://www.irtf.org/mailman/listinfo/cfrg