Re: [Cfrg] Curve selection revisited

Andrey Jivsov <> Wed, 30 July 2014 07:24 UTC

To: Robert Ransom <>

On 07/29/2014 04:14 AM, Robert Ransom wrote:
> On 7/26/14, Andrey Jivsov <> wrote:
>> On 07/25/2014 07:25 PM, Robert Ransom wrote:
>>> The exact value of c determines whether a portable or vectorized
>>> implementation can delay carries until after reduction.  For example,
>>> after curve25519-donna performs a multiply or squaring, each of 19
>>> 64-bit limbs has absolute value less than 14*2^54.  When
>>> freduce_degree reduces a field element to 10 limbs without carrying,
>>> the result has absolute value less than (c+1)*14*2^54 = 280*2^54, or a
>>> bit more than 2^62.  In the Microsoft field 2^255 - 765, or even
>>> 2^256 - 189, this would not be possible.
>>> Carries between limbs can be performed more efficiently than in
>>> curve25519-donna -- curve25519-donna goes to extra effort to keep
>>> limbs signed.  In the Montgomery ladder, this optimization does not
>>> add any extra carry steps, because the Montgomery-ladder formulas
>>> never perform more than two add/subtract operations between a
>>> multiply/square operation.  With Edwards-curve operations, this
>>> optimization may require an additional carry step in the doubling
>>> formula.
>>> The exact value of s determines whether carries can be vectorized
>>> within a coordinate-field element.  For example, 2^251 - 9 can be
>>> represented using one 26-bit limb (in the least significant position)
>>> and 9 25-bit limbs, but this interferes with the vectorization
>>> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
>>> Since c=9 is so small, this can be worked around by representing
>>> coordinates modulo 2^252 - 18 instead, but this trick is rarely
>>> possible without running into the bound on c.
>> The main argument for p = 2^255-19 vs. p = 2^256-189 is that C=19
>> is smaller than C=189 and this helps to "manage" the carries.
> Actually, the main argument for p = 2^255 - 19 is that Curve25519 is
> the de-facto standard curve for high-performance ECC in a group of
> near-256-bit order.  Part of the reason for that is that implementers
> have, in practice, been able to rely on the small value of what you
> denote as “C” for that field in optimizing their software, so no one
> has felt a need to switch to a different field of that form for
> performance reasons.

Yes, I accept it.
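To make the bound quoted above concrete, here is a quick sanity check (my own sketch, not code from curve25519-donna): each product limb is claimed to be below 14*2^54 in absolute value, and a carry-free degree reduction scales the folded-in limbs by roughly c, giving a worst case of (c+1)*14*2^54, which must stay under 2^63 for signed 64-bit limbs.

```python
# Sanity check of the delayed-carry bound (an illustration, not donna's code):
# after a multiply, limbs are bounded by 14*2^54; carry-free degree reduction
# gives a worst case of (c+1)*14*2^54, which must fit in a signed 64-bit limb.

def delayed_carry_bound(c):
    """Worst-case limb magnitude after a carry-free degree reduction."""
    return (c + 1) * 14 * 2**54

SIGNED_64_LIMIT = 2**63

for c, prime in [(19, "2^255 - 19"), (765, "2^255 - 765"), (189, "2^256 - 189")]:
    bound = delayed_carry_bound(c)
    print(f"{prime}: (c+1)*14*2^54 = {bound // 2**54}*2^54, "
          f"fits in int64: {bound < SIGNED_64_LIMIT}")
```

Only c=19 keeps the bound (280*2^54, a bit more than 2^62) under 2^63; c=189 and c=765 both overflow, matching Robert's point.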

> (Mike Hamburg introduced ‘special Montgomery primes’ and showed
> evidence that they can provide better performance in some
> implementations than the Curve25519 coordinate field in
> <>, but that's a different form of
> prime.)
>> Let's assume that an F(p) integer is represented as n limbs (words).
>> A 256-bit Montgomery curve will need on the order of (5 mul + 4 squares)
>> and about the same number of F(p) additions per scalar bit.
>> Therefore, we can talk about one unreduced F(p) multiplication + F(p)
>> reduction (where C plays its role) + F(p) addition. With this in mind
>> the contribution of C can be estimated within this group.
>> F(p) multiplications (counting squares here as well) dominate the
>> performance of EC operations:
>> * the result of the multiplication is 2*n limbs and the complexity of
>> this operation is n^2 limb multiplications and O(n^2) additions.
>> * the 2*n limbs must be reduced (mod p) to n limbs immediately after
>> multiplication
>> * there is on average ~1 addition/subtraction of n limbs per
>> multiplication due to the EC formulas
>> Each multiplication needs degree reduction and settling of accumulated
>> carries before the next multiplication. Settling of carries can only be
>> postponed across additions (prior to the next multiplication).
>> We have about n^2 limb additions, plus n limb additions due to the EC
>> formulas, plus we must perform n additions of the form <some_limb>*C
>> (for mod p reduction).
>> Assuming n=10 limbs, the entire contribution of the n <some_limb>*C terms
>> is bounded by ~10% of all the additions.
>> We are not talking about C=0, however, but rather about a |C|=4 bits vs.
>> |C|=8 bits difference.
>> A rough estimate here is 5% penalty.
> As Dr. Bernstein has pointed out, a larger value of C requires 3*n
> carries instead of n carries (see
> <>), and
> carries have high latency (see section 4, subsection ‘Reduction’ of
> naclcrypto).
>> One reason why I think this is a good estimate is that there may be a
>> benefit of more fine-tuned control of carries. The highest accumulation
>> of the carries happens in the inner limbs of the 2*n integer (imagine
>> paper-and-pencil multiplication of two n-limb numbers: the stack is the
>> highest in the middle). Therefore, if we have the need to control
>> overflow within, say, multiplication, perhaps it's possible to
>> accomplish this within a subset of n limbs.
> So you expect greater implementation effort and complexity to reduce
> the added cost, but not eliminate it?
I think it's fair to say that lower C will be faster if you only 
consider F(p) multiplies and squares (the additions/subtractions don't 
involve C).
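The reduction step being discussed, where the n <some_limb>*C additions arise, can be sketched in a few lines (my own illustration of the general pseudo-Mersenne idea, not any particular implementation):

```python
# Pseudo-Mersenne reduction mod p = 2^k - c (an illustration, not code from
# this thread): the high part of a wide product is folded back in multiplied
# by c, which is where the <some_limb>*C additions come from.  A larger c
# makes the folded-in values larger, so carries must be settled sooner.

def fold(x, k, c):
    """One folding step: x = hi*2^k + lo ≡ hi*c + lo (mod 2^k - c)."""
    hi, lo = x >> k, x & ((1 << k) - 1)
    return hi * c + lo

def reduce_mod(x, k, c):
    """Fully reduce a non-negative x modulo p = 2^k - c."""
    p = (1 << k) - c
    while x >= p:
        x = fold(x, k, c) if x >> k else x - p
    return x

# Check a full-width product against plain modular arithmetic for 2^255 - 19:
a, b = 2**254 + 12345, 2**253 + 67890
assert reduce_mod(a * b, 255, 19) == (a * b) % ((1 << 255) - 19)
```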

>> ( BTW, I don't see a clear argument for why any F(p) can't be vectorized
>> (some more efficiently than others). Vectorization using x86 SSE2 is
>> known to speed up RSA operations for generic F(p)s. )
> As I said, the vectorization strategy used in the neoncrypto paper is
> incompatible with 251-bit moduli.  That strategy requires pairing
> together limbs of the same size, so that a vectorized shift operation
> can be used to carry out of both limbs at the same time.
> In RSA, where (in your notation) C is necessarily huge, there is no
> performance penalty from making C a few bits larger, so it's easy to
> lengthen the modulus and make every limb the same size, so it's easy
> to vectorize carries across any pair of limbs.  But that's not very
> relevant to the primes used for fast ECC.
That's correct. For RSA with SSE you do things similar to what you do 
here, but the reduction is generic and slow.
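For contrast, a generic reduction of the kind RSA implementations use (Montgomery's REDC, sketched below; my own illustration, not code from any cited implementation) has to multiply by the full n-limb modulus at every step, since there is no small C to exploit:

```python
# Montgomery's REDC, the generic reduction used for RSA-style moduli where
# (in Andrey's notation) C is effectively huge: each reduction step costs a
# full multiply by the modulus, instead of a cheap multiply by a small C.

def montgomery_redc(t, n, r_bits):
    """Return t * R^{-1} mod n for R = 2^r_bits; requires n odd, t < n*R."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime ≡ -1 (mod r); Python 3.8+
    m = (t * n_prime) % r            # chosen so that t + m*n ≡ 0 (mod r)
    u = (t + m * n) >> r_bits        # exact division by R
    return u - n if u >= n else u

# Tiny demo with a toy RSA-like modulus (values are illustrative only):
n = 101 * 103
t = 1234 * 5678
assert montgomery_redc(t, n, 16) == (t * pow(1 << 16, -1, n)) % n
```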

> Do you see any benefit of p = 2^256 - 189 over p = 2^255 - 19 that you
> believe offsets the costs of switching to a different coordinate
> field?

They are:
* aesthetics
* it happened that the square root of p = 2^255 - 19 requires 
multiplication by a sqrt(-1) in half the cases when decompression is 
performed

Both are tolerable from an aesthetics point of view, IMO, if we 
(effectively) standardize on p = 2^255 - 19. The slightly higher 
performance of p = 2^255 - 19 over 2^256 - 189 should balance the 
*sqrt(-1) out.
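The sqrt(-1) point can be illustrated directly (my own sketch of the standard square-root method for p ≡ 5 (mod 8), not code from any cited implementation):

```python
# Square roots mod p = 2^255 - 19 (p ≡ 5 mod 8): the candidate a^((p+3)/8)
# squares to ±a, so in roughly half the cases an extra multiply by a fixed
# sqrt(-1) is needed -- the decompression cost Andrey refers to above.

p = 2**255 - 19
SQRT_M1 = pow(2, (p - 1) // 4, p)   # a square root of -1 mod p; 2 is a
                                    # non-residue since p ≡ 5 (mod 8)

def sqrt_mod_p(a):
    """Return x with x^2 ≡ a (mod p), assuming a is a square mod p."""
    x = pow(a, (p + 3) // 8, p)     # candidate root: x^2 ≡ ±a (mod p)
    if (x * x - a) % p != 0:        # happens in half the cases:
        x = (x * SQRT_M1) % p       # the extra sqrt(-1) multiplication
    if (x * x - a) % p != 0:
        raise ValueError("a is not a square mod p")
    return x

# Every square must round-trip:
for base in [2, 3, 12345]:
    a = (base * base) % p
    assert (sqrt_mod_p(a) ** 2 - a) % p == 0
```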

( Compare this with 2^414-17, which won't be trivial to match with other 
algorithms. )