Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1

Hi Watson,

I mostly agree with your conclusions, but a few adjustments:

Your numbers on Goldilocks don't reflect how I've implemented it. So far 
I've always used 1 level of Karatsuba, with 192 multiplies on 32-bit 
platforms, and 48 multiplies on 64-bit platforms.  Because Karatsuba 
results in extra additions and extra complexity, this may be more 
expensive than a 49-mul delayed-carry schoolbook algorithm, but not by 
much because the reduction is included in the Karatsuba algorithm for 
Goldilocks.  This isn't just a matter of limb count: the prime 2^448 - 
2^224 - 1 was picked specifically because it has fast reduction 
integrated with Karatsuba multiplication.  So maybe 10%, or maybe no 
difference.

My NEON 1-level Karatsuba (192 muls, i.e. 96 pairs) takes about as long 
as DJB et al's 2-level Karatsuba (144 muls).  My guess is that the gains 
from the second level are mostly eaten up by overhead, but it's just a 
guess since we used different strategies and different primes, and 
neither team compared 1-level vs 2-level directly AFAIK.  It may merit 
an experiment eventually, but NEON is a pain to code for because the 
compiler support isn't very good.

I would expect 2^389-21 to support some sort of fractional-radix 
technique with 16 words on 32-bit platforms.

If you want a quick and dirty performance test, just implement 
multiplication and squaring mod 2^389-21 and compare it to Goldilocks' 
`make bench`, which includes time in nanoseconds for square and 
multiply.  Squaring will be different because Karatsuba is less of a win 
for squaring algorithms.

Cheers,
-- Mike

On 12/15/2014 06:56 PM, Watson Ladd wrote:
> Dear all,
>
> I've come up with some back of the envelope sketches of the
> performance differences for these primes. Note that there aren't hard
> measurements yet: I'm only counting limbs for various choices of
> radix, and these radix choices might not be themselves optimal. These
> counts don't include the cost of reduction, which is at worst an
> additional multiplication of each limb by a constant plus an addition
> with carry. They don't include additions. In these calculations I've
> permitted that addition to overflow, but not the preceding
> multiplications. I've also not used signed arithmetic, which can make
> things slightly better.
>
> On a 32 bit machine with 32x32->64 multiplier, but no 64 bit adder:
>
> 2^389-21 can use 15 26 bit words. The product can be computed without
> carries, but the final reduction requires a carry chain. Schoolboy
> costs 225 multiplications. Karatsuba will cost 140 mults. The best I
> can do without using a hardware carry is a much worse 18 23 bit words,
> which will cost 198 multiplies with two uses of Karatsuba. There may
> be a better way, and I didn't investigate overflow/addition tradeoff.
>
> 2^414-17 as described in http://eprint.iacr.org/2014/526.pdf uses 16
> franctional radix words, and two stages of Karatsuba for 144 mults.
>
> 2^448-2^224-1 uses a special method for multiplication which I don't
> have multiplication counts for. I believe 16 28 bit words is used in
> the current implementation, and so 144 multiplications.
>
> On a 64 bit machine with 64x64->128 multiplier:
>
> 2^389-21 can use 7 56 bit limbs. Here there is plenty of headroom, and
> so no carrying in the reduction is required.
>
> 2^414-17 could use 8 56 bit limbs, with no carrying in the reduction
> required. It could use 7 60 bit limbs, if the hardware carry is used
> in the reduction.
>
> 2^448-2^224-1 is clearly using 8 56 bit limbs: it matches exactly.
>
> At these sizes schoolboy is the right choice: 49 vs 64 multiplications.
>
> Please double-check this. I'll try to verify this in more detail,
> including implementing a few strategies. I know that the results I've
> listed for some 2^414-17 and 2^448-2^224-1 are not optimal, and would
> appreciate correction.
>
> The tentative conclusion is that the performance advantages of
> 2^389-21 vs 2^448-2^224-1 are likely to be minimal on 32-bit machines,
> and about 10% on 64 bit machines. As always, the believable results
> require implementations, which I don't think I will make in time for
> our deadline.
>
> Sincerely,
> Watson Ladd
>
> _______________________________________________
> Cfrg mailing list
> Cfrg@irtf.org
> http://www.irtf.org/mailman/listinfo/cfrg