Re: [Cfrg] Curve selection revisited

Robert Ransom <rransom.8774@gmail.com> Tue, 29 July 2014 11:14 UTC

Date: Tue, 29 Jul 2014 04:14:15 -0700
Message-ID: <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
From: Robert Ransom <rransom.8774@gmail.com>
To: Andrey Jivsov <crypto@brainhub.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/ineI42CtwjeMF8HbBdm0bz1iqno
Cc: cfrg@irtf.org

On 7/26/14, Andrey Jivsov <crypto@brainhub.org> wrote:
> On 07/25/2014 07:25 PM, Robert Ransom wrote:
>> The exact value of c determines whether a portable or vectorized
>> implementation can delay carries until after reduction.  For example,
>> after curve25519-donna performs a multiply or squaring, each of 19
>> 64-bit limbs has absolute value less than 14*2^54.  When
>> freduce_degree reduces a field element to 10 limbs without carrying,
>> the result has absolute value less than (c+1)*14*2^54 = 280*2^54, or a
>> bit more than 2^62.  In the Microsoft field 2^255 - 765, or even 2^256
>> - 189, this would not be possible.
>>
>> Carries between limbs can be performed more efficiently than in
>> curve25519-donna -- curve25519-donna goes to extra effort to keep
>> limbs signed.  In the Montgomery ladder, this optimization does not
>> add any extra carry steps, because the Montgomery-ladder formulas
>> never perform more than two add/subtract operations between a
>> multiply/square operation.  With Edwards-curve operations, this
>> optimization may require an additional carry step in the doubling
>> formula.
>>
>> The exact value of s determines whether carries can be vectorized
>> within a coordinate-field element.  For example, 2^251 - 9 can be
>> represented using one 26-bit limb (in the least significant position)
>> and 9 25-bit limbs, but this interferes with the vectorization
>> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
>> Since c=9 is so small, this can be worked around by representing
>> coordinates modulo 2^252 - 18 instead, but this trick is rarely
>> possible without running into the bound on c.
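
As a rough check of the bound quoted above, the following Python sketch
recomputes (c+1)*14*2^54 for the three values of c mentioned in this
thread and compares it with the signed 64-bit limit of 2^63.  It
assumes, purely for illustration, that the 14*2^54 post-multiplication
bound from curve25519-donna carries over unchanged to the other fields;
a real implementation of those fields would use its own limb layout, so
the larger figures are indicative only.  The function names are made up
for this example.

```python
import math

# Bound on each limb after a multiply or squaring, as quoted for
# curve25519-donna.  Reusing it for other primes is an assumption.
PRODUCT_LIMB_BOUND = 14 * 2**54
SIGNED_64_LIMIT = 2**63


def bound_after_degree_reduction(c):
    """Limb bound after folding the high limbs down without carrying."""
    return (c + 1) * PRODUCT_LIMB_BOUND


def carries_can_be_delayed(c):
    """True if the uncarried limbs still fit in a signed 64-bit word."""
    return bound_after_degree_reduction(c) < SIGNED_64_LIMIT


for c in (19, 189, 765):
    bound = bound_after_degree_reduction(c)
    print(f"c={c}: about 2^{math.log2(bound):.2f}, "
          f"fits in 64 bits: {carries_can_be_delayed(c)}")
```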

> The main argument for p = 2^255-19 vs. p = 2^256-189 is that C=19 is
> smaller than C=189 and this helps to "manage" the carries.

Actually, the main argument for p = 2^255 - 19 is that Curve25519 is
the de-facto standard curve for high-performance ECC in a group of
near-256-bit order.  Part of the reason for that is that implementers
have, in practice, been able to rely on the small value of what you
denote as “C” for that field in optimizing their software, so no one
has felt a need to switch to a different field of that form for
performance reasons.

(Mike Hamburg introduced ‘special Montgomery primes’ and showed
evidence that they can provide better performance in some
implementations than the Curve25519 coordinate field in
<https://eprint.iacr.org/2012/309.pdf>, but that's a different form of
prime.)


> Let's assume that an F(p) integer is represented as n limbs (words).
>
> A 256-bit Montgomery curve will need on the order of (5 mul + 4 squares)
> and about the same number of F(p) additions per scalar bit.
>
> Therefore, we can talk about one unreduced F(p) multiplication + F(p)
> reduction (where C plays its role) + F(p) addition. With this in mind
> the contribution of C can be estimated within this group.
>
> F(p) multiplications (counting squares here as well) dominate the
> performance of EC operations:
> * the result of the multiplication is 2*n limbs and the complexity of
> this operation is n^2 limb multiplications and O(n^2) additions.
> * the 2*n limbs must be reduced (mod p) to n limbs immediately after
> multiplication
> * there is on average ~1 addition/subtraction of n limbs per
> multiplication due to the EC formulas
>
> Each multiplication needs degree reduction and settling of accumulated
> carries before the next multiplication. One can only postpone settling
> of carries while doing additions (prior to the next multiplication).
>
> We have about n^2 limb additions, plus n limb additions due to the EC
> formulas, plus we must perform n additions of the form <some_limb>*C
> (for the mod p reduction).
>
> Assuming n=10 limbs, the entire contribution of the n <some_limb>*C
> additions is bounded by ~10% of all the additions.
>
> We are not talking about C=0, however, but rather about the difference
> between |C| = 4 bits and |C| = 8 bits.
>
> A rough estimate here is a 5% penalty.
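
For readers who want to see the steps being counted in the quoted
estimate, here is a minimal Python sketch of an unreduced limb
multiplication followed by degree reduction modulo 2^255 - 19.  It
assumes five 51-bit limbs (a uniform radix chosen for simplicity, not
the ten-limb layout discussed earlier), and all names are hypothetical.
It shows only where the <some_limb>*C terms come from; it makes no
attempt to model carries or timing.

```python
P = 2**255 - 19
C = 19
LIMB_BITS = 51   # assumed uniform radix: 5 * 51 = 255
N = 5


def to_limbs(x):
    """Split an integer below 2^255 into N limbs of LIMB_BITS bits."""
    mask = (1 << LIMB_BITS) - 1
    return [(x >> (LIMB_BITS * i)) & mask for i in range(N)]


def from_limbs(limbs):
    """Recombine limbs (of any size) into a single integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_unreduced(a, b):
    """Schoolbook product: N^2 limb multiplications, 2N-1 output limbs."""
    out = [0] * (2 * N - 1)
    for i in range(N):
        for j in range(N):
            out[i + j] += a[i] * b[j]
    return out


def reduce_degree(t):
    """Fold limbs N..2N-2 onto limbs 0..N-2, using 2^255 = 19 (mod p).

    No carries are performed, so the output limbs may be oversized.
    """
    out = t[:N]
    for i in range(N, 2 * N - 1):
        out[i - N] += C * t[i]   # one <some_limb>*C addition per high limb
    return out


x = 2**254 + 12345
y = 2**253 + 67890
folded = reduce_degree(mul_unreduced(to_limbs(x), to_limbs(y)))
assert from_limbs(folded) % P == (x * y) % P
```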

As Dr. Bernstein has pointed out, a larger value of C requires 3*n
carries instead of n carries (see
<https://moderncrypto.org/mail-archive/curves/2014/000237.html>), and
carries have high latency (see section 4, subsection ‘Reduction’ of
naclcrypto).
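
To illustrate why carries are latency-bound, the sketch below performs
one sequential carry pass over limbs modulo 2^255 - 19.  It again
assumes five 51-bit limbs and hypothetical names, and it does not try
to reproduce the 3*n versus n count from the linked message; it only
shows that each step depends on the previous one and that the carry out
of the top limb re-enters at the bottom multiplied by C.

```python
P = 2**255 - 19
C = 19
LIMB_BITS = 51   # assumed uniform radix: 5 * 51 = 255
N = 5
MASK = (1 << LIMB_BITS) - 1


def limbs_value(limbs):
    """Integer represented by a list of (possibly oversized) limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def carry_pass(limbs):
    """One serial carry chain: limb i must finish before limb i+1 starts."""
    out = list(limbs)
    for i in range(N - 1):
        carry = out[i] >> LIMB_BITS
        out[i] &= MASK
        out[i + 1] += carry
    # The top carry has weight 2^255, which is C modulo p, so it wraps
    # around to limb 0 scaled by C.  A larger C makes limb 0 overflow
    # its nominal size by more.
    carry = out[N - 1] >> LIMB_BITS
    out[N - 1] &= MASK
    out[0] += C * carry
    return out


oversized = [2**60 + 3, 2**58, 7, 2**55 + 1, 2**57 + 9]
carried = carry_pass(oversized)
assert limbs_value(carried) % P == limbs_value(oversized) % P
```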


> One reason why I think this is a good estimate is that there may be a
> benefit of more fine-tuned control of carries. The highest accumulation
> of the carries happens in the inner limbs of the 2*n integer (imagine
> paper-and-pencil multiplication of two n-limb numbers: the stack is the
> highest in the middle). Therefore, if we have the need to control
> overflow within, say, multiplication, perhaps it's possible to
> accomplish this within a subset of n limbs.

So you expect greater implementation effort and complexity to reduce
the added cost, but not eliminate it?


> ( BTW, I don't see a clear argument why any F(p) can't be vectorized
> (some more efficiently than others). Vectorization using x86 SSE2 is
> known to speed up RSA operations for generic F(p)s. )

As I said, the vectorization strategy used in the neoncrypto paper is
incompatible with 251-bit moduli.  That strategy requires pairing
together limbs of the same size, so that a vectorized shift operation
can be used to carry out of both limbs at the same time.
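
A toy model of that constraint, in Python: the function below stands in
for a two-lane vector shift that applies one shift amount to both
lanes.  This is a simplification made up for illustration, not the code
from the neoncrypto paper.  When both lanes hold limbs of the same
size, a single call extracts both carries; when the sizes differ (say
26 and 25 bits), no single shift amount splits both lanes correctly.

```python
def vector_carry(lanes, limb_bits):
    """Split every lane into (low bits, carry) with one shared shift."""
    mask = (1 << limb_bits) - 1
    lows = tuple(v & mask for v in lanes)
    carries = tuple(v >> limb_bits for v in lanes)
    return lows, carries


# Two 26-bit limbs: one shared shift handles both lanes.
lows, carries = vector_carry((2**26 + 5, 3 * 2**26 + 7), 26)
assert lows == (5, 7) and carries == (1, 3)

# A 26-bit limb paired with a 25-bit limb: shifting both by 26 leaves
# bit 25 of the second lane in its low part, so that lane is not fully
# carried.
lows, carries = vector_carry((2**26 + 5, 2**25 + 7), 26)
assert lows[1] >= 2**25 and carries[1] == 0
```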

In RSA, where (in your notation) C is necessarily huge, there is no
performance penalty from making C a few bits larger.  That makes it
easy to lengthen the modulus so that every limb is the same size, and
therefore easy to vectorize carries across any pair of limbs.  But
that's not very relevant to the primes used for fast ECC.


Do you see any benefit of p = 2^256 - 189 over p = 2^255 - 19 that you
believe offsets the costs of switching to a different coordinate
field?


Robert Ransom
