Re: [Cfrg] Progress on curve recommendations for TLS WG

Andrey Jivsov <crypto@brainhub.org> Sun, 17 August 2014 06:56 UTC

Message-ID: <53F05207.1050509@brainhub.org>
Date: Sat, 16 Aug 2014 23:56:07 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.7.0
MIME-Version: 1.0
To: Michael Hamburg <mike@shiftleft.org>
References: <CFFB1371.2916E%kenny.paterson@rhul.ac.uk> <20140808141506.GA24645@LK-Perkele-VII> <53EDEE67.6090308@secunet.com> <A1FCAAE8-4F20-4AC1-A635-AED20E8F7D5C@shiftleft.org> <53EE8F45.9070908@brainhub.org> <F3296E50-46AD-4653-BCCA-5435D6ECB78F@shiftleft.org>
In-Reply-To: <F3296E50-46AD-4653-BCCA-5435D6ECB78F@shiftleft.org>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/nQM-c3uq9TCYYIaGrMztlhT9IZY
Cc: cfrg@irtf.org
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
Precedence: list

I looked at an implementation of modp reduction that I wrote a few years 
ago for a fixed p. It uses Barrett reduction.

On closer look, I don't see why multiply+reduction mod p, for a random 
but fixed p, costs 2n^2 + n limb multiplies. My corrected count for 
Barrett reduction is below.

On 08/15/2014 05:55 PM, Michael Hamburg wrote:
> I have some more data.  Summary: “devastating” is probably between 1.8 and 3.1.
>
> Mod 2^521-1, you can do a multiply in 167 Haswell cycles using a variant of 3-way Chung-Hasan which is specific to that prime.  (Ordinary Karatsuba is no good because the prime has 9 limbs.)  Schoolbook takes 222 cycles with only minor benefits from the prime’s shape.  Maybe a better implementor can get better numbers than mine; this is a rough try in C++ with asm intrinsics for widening multiply.
>
> I don’t have an implementation on hand of a random prime at that level.  Suppose that for a random prime that size you can do an almost-as-good version of Chung-Hasan in 180 cycles, but then you need to fall back to a digit-serial Montgomery reduction which approximates the schoolbook method.  This includes some of the inter-digit reduction twice, so maybe 222 is 10% too high.  Then the ratio is (180 + 222 * 90%) / 167 ~ 2.3.  A random 512-bit prime also requires this amount of time if you use 9 limbs; with 8 limbs it’s again about the same or maybe slightly worse (see below).
>
> For comparison, based on the Microsoft data, their prime 2^512 - 569 takes about 255 Sandy Bridge cycles per mul.  This is written in assembly language using the schoolbook method.  This corresponds to about 212 Haswell cycles, since the going rate is about 20%.  So it’s 5% faster than my P-521 schoolbook, but in assembly vs C++/intrinsics.  The ratio with my above estimate for a random prime (using a completely different implementation) is 1.8.  Schoolbook Montgomery for a random prime using Microsoft’s implementation techniques should have a ratio of about (2n^2 + n) / (n^2 + n) ~ 1.9 with n=8 limbs.
>
> This assumes a random prime vs special one at the “>= 256-bit WF” level, but if tweaking the levels is allowed, the ratio is even more.  A 480-bit special prime, Ridinghood, takes 122 Haswell cycles for a multiply (same as 448-bit Goldilocks).  That’s a ratio of 3.1 vs the above estimate for a 512-bit random prime.  If you scale the numbers to take the shorter length (480 vs 512) into account, the “efficiency ratio” is about 2.8.
>
> The Ridinghood numbers reflect better tuning than the 521-bit tests, but probably not as much tuning as the Microsoft implementation.  My tests are in C with asm intrinsics.  The difference is maybe 10% vs the C++ above, so figure that a ratio of 2.8 (efficiency ratio 2.5) is probably more fair.
>
> This doesn’t take into account the ratio of multiplication, squaring, and addition / subtraction.
>
>> On Aug 15, 2014, at 3:52 PM, Andrey Jivsov <crypto@brainhub.org> wrote:
>>
>> Here is a simple sketch of how the prime affects the modp operations.
>>
>> Comparing multiplication of 2 n-limb numbers with the reduction mod n-limb number p:
>>
>> * in case of a random prime, we have n^2 multiplies. This gives a 2n-limb number, the n higher limbs of which need to be reduced. To accomplish this we multiply p by each of the n high limbs, for a total of n^2. We have 2n^2 multiplies
Ignoring the fact that some operands in Barrett reduction are one limb 
longer or shorter at the borders (+1, -1 limbs), Barrett reduction has 
two multiplies with the following costs:

* n^2 limb multiplies for the product of the n higher limbs (of the 
2n-limb number we are reducing) and the n-limb precomputed quantity 
floor(b^(2n)/p), where b is the max limb value + 1, e.g. b=2^64. This 
product yields the quotient estimate t.
* roughly n^2/2 limb multiplies for the second product, t*p mod b^(n+1), 
since only its low n+1 limbs are needed (you estimate this step as n).

The total is then ~2.5n^2 (n^2 for the multiply plus ~1.5n^2 for the 
reduction) vs. n^2+n for pseudo-Mersenne.
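
To make the two Barrett multiplies concrete, here is a minimal sketch of 
the textbook form of Barrett reduction. It is not the implementation 
mentioned above: the names barrett_setup, barrett_reduce and LIMB_BITS are 
made up for illustration, and Python big integers stand in for the limb 
arithmetic, so the sketch shows the structure of the algorithm rather 
than the limb-level cost.

```python
LIMB_BITS = 64  # assumed limb size, i.e. b = 2**64


def barrett_setup(p, n):
    """Precompute mu = floor(b^(2n) / p) for an n-limb modulus p."""
    return (1 << (2 * n * LIMB_BITS)) // p


def barrett_reduce(x, p, n, mu):
    """Return x mod p for 0 <= x < b^(2n), with p occupying exactly n limbs."""
    shift_lo = (n - 1) * LIMB_BITS
    shift_hi = (n + 1) * LIMB_BITS
    mask = (1 << shift_hi) - 1

    # First multiply: high limbs of x times mu gives the quotient estimate t.
    t = ((x >> shift_lo) * mu) >> shift_hi

    # Second multiply: only the low n+1 limbs of t*p are needed.
    r = (x & mask) - ((t * p) & mask)
    if r < 0:
        r += 1 << shift_hi

    # The estimate t can be slightly too small; fix up by subtraction.
    while r >= p:
        r -= p
    return r


# Usage: a 4-limb (256-bit) modulus without pseudo-Mersenne shape handling.
n = 4
p = 2**256 - 2**224 + 2**192 + 2**96 - 1
mu = barrett_setup(p, n)
a, b = p - 12345, p - 67890
assert barrett_reduce(a * b, p, n, mu) == (a * b) % p
```

In a limb-level implementation the first product is where the n^2 term 
comes from and the truncated second product is the ~n^2/2 term.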


>>
>> * in case of p=2^k-C, we have the same n^2 multiplies, but the reduction results in only n multiplies of high limbs by C. We have n^2 + n
>>
>> So this is (n^2 + n) / 2n^2 ~= 1/2 (pseudo-Mersenne primes have ~50% fewer limb multiplies for multiply+reduce)
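
For comparison, the reduction for p=2^k-C can be sketched as a fold of 
the high part. This is an illustrative sketch only: 
pseudo_mersenne_reduce is a made-up name, Python big integers hide the 
limbs, and it assumes 0 < C < 2^k. With C fitting in one limb, the 
multiplication by C in the first fold is the n single-limb multiplies 
counted above.

```python
def pseudo_mersenne_reduce(x, k, c):
    """Return x mod p for p = 2**k - c and x >= 0 (illustrative helper)."""
    p = (1 << k) - c
    mask = (1 << k) - 1

    # 2**k == c (mod p), so hi * 2**k + lo == hi * c + lo (mod p).
    # Each fold strictly shrinks x while it has bits above position k.
    while x >> k:
        x = (x & mask) + (x >> k) * c

    # Now x < 2**k; a final subtraction brings it below p.
    while x >= p:
        x -= p
    return x


# Usage with the prime 2^512 - 569 mentioned earlier in the thread.
k, c = 512, 569
p = (1 << k) - c
a, b = p - 3, p - 7
assert pseudo_mersenne_reduce(a * b, k, c) == (a * b) % p
```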

(n^2+n)/(2.5n^2) ~= 2/5 for large n,

i.e. pseudo-Mersenne primes use ~40% of the limb multiplies.

For 256-bit primes n is likely too small to observe this: with n=4, 
n^2/2 = 8, and I ignored the extra limbs.
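
The counts above can be tabulated for a few limb sizes. This is a rough 
sketch of the cost model only (limb multiplies, no additions, no border 
limbs); the function names are made up for illustration.

```python
def barrett_mults(n):
    """Multiply (n^2) plus Barrett reduction (~n^2 + n^2/2)."""
    return 2.5 * n * n


def pseudo_mersenne_mults(n):
    """Multiply (n^2) plus folding the n high limbs by a one-limb C."""
    return n * n + n


def ratio(n):
    """Pseudo-Mersenne cost as a fraction of the Barrett cost."""
    return pseudo_mersenne_mults(n) / barrett_mults(n)


for n in (4, 8, 9, 16, 64):
    print(f"n={n:3d}  ratio={ratio(n):.3f}")
```

The ratio is 0.5 at n=4 and 0.45 at n=8, and approaches 2/5 only as n 
grows, which is why the effect is hard to see at 256 bits.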


>>
>>
>> This estimate underestimates the reduction that takes place for the general p. One needs an additional n divisions (of a limb by the highest limb of p) in the denominator, which gives exactly Mike Hamburg's quantity (which uses the Montgomery reduction method). I didn't count additions/subtractions.
> These divisions are actually multiplies, because you use either Montgomery or Barrett reduction.  So it’s (n^2 + n) / (2n^2 + n).
>
> Cheers,
> — Mike
