Re: [Cfrg] Curve selection revisited

Andrey Jivsov <crypto@brainhub.org> Wed, 30 July 2014 07:24 UTC

Message-ID: <53D89DB0.9020505@brainhub.org>
Date: Wed, 30 Jul 2014 00:24:32 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Robert Ransom <rransom.8774@gmail.com>
References: <CA+Vbu7xroa68=HOZtbf=oz7kK2EeUv_z1okpnjxHPR0ZtHD5cA@mail.gmail.com> <CFF7E184.28E9F%kenny.paterson@rhul.ac.uk> <53D2781B.8030605@sbcglobal.net> <CACsn0ckqFigWoH2+OOEHSd2VWPp8y6=m8H5OsFRyjXmjK7+m4w@mail.gmail.com> <CABqy+srxMNuG0AaQd0SaegHvZWgbW762EQq+iAHL_fbu6sOJJQ@mail.gmail.com> <53D420B3.10707@brainhub.org> <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
In-Reply-To: <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------010301070108070304020403"
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/-yzHPyRwz7zKlHRJ63ZZ6MJwczM
Cc: cfrg@irtf.org
Subject: Re: [Cfrg] Curve selection revisited
Precedence: list

On 07/29/2014 04:14 AM, Robert Ransom wrote:
> On 7/26/14, Andrey Jivsov <crypto@brainhub.org> wrote:
>> On 07/25/2014 07:25 PM, Robert Ransom wrote:
>>> The exact value of c determines whether a portable or vectorized
>>> implementation can delay carries until after reduction.  For example,
>>> after curve25519-donna performs a multiply or squaring, each of 19
>>> 64-bit limbs has absolute value less than 14*2^54.  When
>>> freduce_degree reduces a field element to 10 limbs without carrying,
>>> the result has absolute value less than (c+1)*14*2^54  = 280*2^54, or a
>>> bit more than 2^62.  In the Microsoft field 2^255  - 765, or even 2^256
>>> - 189, this would not be possible.
>>>
>>> Carries between limbs can be performed more efficiently than in
>>> curve25519-donna -- curve25519-donna goes to extra effort to keep
>>> limbs signed.  In the Montgomery ladder, this optimization does not
>>> add any extra carry steps, because the Montgomery-ladder formulas
>>> never perform more than two add/subtract operations between a
>>> multiply/square operation.  With Edwards-curve operations, this
>>> optimization may require an additional carry step in the doubling
>>> formula.
>>>
>>> The exact value of s determines whether carries can be vectorized
>>> within a coordinate-field element.  For example, 2^251  - 9 can be
>>> represented using one 26-bit limb (in the least significant position)
>>> and 9 25-bit limbs, but this interferes with the vectorization
>>> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
>>> Since c=9 is so small, this can be worked around by representing
>>> coordinates modulo 2^252  - 18 instead, but this trick is rarely
>>> possible without running into the bound on c.
>> The main argument for p = 2^255-19 v.s. p = 1^256-189 is that the C=19
>> is smaller than C=189 and this helps to "manage" the carries.
> Actually, the main argument for p = 2^255 - 19 is that Curve25519 is
> the de-facto standard curve for high-performance ECC in a group of
> near-256-bit order.  Part of the reason for that is that implementers
> have, in practice, been able to rely on the small value of what you
> denote as “C” for that field in optimizing their software, so no one
> has felt a need to switch to a different field of that form for
> performance reasons.

Yes, I accept it.

>
> (Mike Hamburg introduced ‘special Montgomery primes’ and showed
> evidence that they can provide better performance in some
> implementations than the Curve25519 coordinate field in
> <https://eprint.iacr.org/2012/309.pdf>, but that's a different form of
> prime.)
>
>
>> Let's assume that an F(p) integer is represented as n limbs (words).
>>
>> 256 bit Montgomery curve will have on the order of (5 mul + 4 squares)
>> and about the same number F(p) additions per one scalar bit.
>>
>> Therefore, we can talk about one unreduced F(p) multiplication + F(p)
>> reduction (where C plays its role) + F(p) addition. With this in mind
>> the contribution of C can be estimated within this group.
>>
>> F(p) multiplications (counting squares here as well) dominate the
>> performance of EC operations:
>> * the result of the multiplication is 2*n limbs and the complexity of
>> this operation is n^2 limb multiplications and O(n^2) additions.
>> * the 2*n limbs must be reduced (mod p) to n limbs immediately after
>> multiplication
>> * there is on average ~1 addition/subtraction of n limbs per each
>> multiplication due to EC formula
>>
>> Each multiplication needs degree reduction and settling of accumulated
>> carries before the next multiplication. One can only postpone settling
>> of carries while he does additions (prior to the next multiplciation).
>>
>> We have about n^2 limb additions, plus n limb additions due to EC
>> formular, plus we must perform n additions of the form <some_limb>*C
>> (for modp reduction).
>>
>> Assuming n=10 limbs, the entire contribution of n <some_limb>*C is bound
>> by ~10% of all the additions.
>>
>> We are not talking about C=0, however, but rather about |C|=4 bits v.s.
>> |C| = 8 bits difference.
>>
>> A rough estimate here is 5% penalty.
> As Dr. Bernstein has pointed out, a larger value of C requires 3*n
> carries instead of n carries (see
> <https://moderncrypto.org/mail-archive/curves/2014/000237.html>), and
> carries have high latency (see section 4, subsection ‘Reduction’ of
> naclcrypto).
>
>
>> One reason why I think this is a good estimate is that there may be a
>> benefit of more fine-tuned control of carries. The highest accumulation
>> of the carries happens in the inner limbs of the 2*n integer (imagine
>> paper-and-pencil multiplication of two n-limb numbers: the stack is the
>> highest in the middle). Therefore, if we have the need to control
>> overflow within, say, multiplication, perhaps it's possible to
>> accomplish this within a subset of n limbs.
> So you expect greater implementation effort and complexity to reduce
> the added cost, but not eliminate it?
I think it's fair to say that lower C will be faster if you only 
consider F(p) ( multiplies and squares; the additions/subtractions don't 
matter).

>
>
>> ( BTW, I don't see a clear argument why can't any F(p) be vectorized
>> (some more efficiently than others). The vectorization using x86 SSE2 is
>> known to speeds up RSA operations for generic F(p)s. )
> As I said, the vectorization strategy used in the neoncrypto paper is
> incompatible with 251-bit moduli.  That strategy requires pairing
> together limbs of the same size, so that a vectorized shift operation
> can be used to carry out of both limbs at the same time.
>
> In RSA, where (in your notation) C is necessarily huge, there is no
> performance penalty from making C a few bits larger, so it's easy to
> lengthen the modulus and make every limb the same size, so it's easy
> to vectorize carries across any pair of limbs.  But that's not very
> relevant to the primes used for fast ECC.
That's correct. For RSA with SSE you do similar things you do here, but 
the reduction is generic and slow.

>
>
> Do you see any benefit of p = 2^256 - 189 over p = 2^255 - 19 that you
> believe offsets the costs of switching to a different coordinate
> field?

They are :
* aesthetics
* it happened that square root of  p = 2^255 - 19 requires 
multiplication by a sqrt(-1) in half the cases when decompression is 
involved.

Both are tolerable from aesthetics point of view IMO if we (effectively) 
standardize on p = 2^255 - 19. p = 2^255 - 19's slightly higher 
performance than 2^256 - 189 should balance *sqrt(-1) out.

( Compare this with 2^414-17, which won't be trivial to match with other 
algorithms. )

[Cfrg] Curve selection revisited Benjamin Black
Re: [Cfrg] Curve selection revisited Yoav Nir
Re: [Cfrg] Curve selection revisited Paterson, Kenny
Re: [Cfrg] Curve selection revisited David Jacobson
Re: [Cfrg] Curve selection revisited Watson Ladd
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Watson Ladd
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Watson Ladd
Re: [Cfrg] Curve selection revisited Andrey Jivsov
Re: [Cfrg] Curve selection revisited Ilari Liusvaara
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Michael Hamburg
Re: [Cfrg] Curve selection revisited Michael Jenkins
Re: [Cfrg] Curve selection revisited Michael Hamburg
Re: [Cfrg] Curve selection revisited Hannes Tschofenig
Re: [Cfrg] Curve selection revisited Hannes Tschofenig
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Stephen Farrell
Re: [Cfrg] Curve selection revisited Michael Hamburg
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Michael Hamburg
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Paul Lambert
Re: [Cfrg] Curve selection revisited Paul Lambert
Re: [Cfrg] Curve selection revisited Mike Hamburg
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Andrey Jivsov
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Robert Ransom
Re: [Cfrg] Curve selection revisited Phillip Hallam-Baker
Re: [Cfrg] Curve selection revisited Robert Moskowitz
Re: [Cfrg] Curve selection revisited Russ Housley
Re: [Cfrg] Curve selection revisited Salz, Rich
Re: [Cfrg] Curve selection revisited Phillip Hallam-Baker
Re: [Cfrg] Curve selection revisited Salz, Rich