Re: [Cfrg] Curve selection revisited

Andrey Jivsov <crypto@brainhub.org> Wed, 30 July 2014 07:24 UTC

Return-Path: <crypto@brainhub.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 04F151B2953 for <cfrg@ietfa.amsl.com>; Wed, 30 Jul 2014 00:24:40 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.9
X-Spam-Level:
X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id o2yIM-XJurFS for <cfrg@ietfa.amsl.com>; Wed, 30 Jul 2014 00:24:35 -0700 (PDT)
Received: from qmta06.emeryville.ca.mail.comcast.net (qmta06.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:43:76:96:30:56]) by ietfa.amsl.com (Postfix) with ESMTP id 7196A1B2954 for <cfrg@irtf.org>; Wed, 30 Jul 2014 00:24:34 -0700 (PDT)
Received: from omta01.emeryville.ca.mail.comcast.net ([76.96.30.11]) by qmta06.emeryville.ca.mail.comcast.net with comcast id YXMb1o0040EPchoA6XQZN5; Wed, 30 Jul 2014 07:24:33 +0000
Received: from [192.168.1.2] ([71.202.164.227]) by omta01.emeryville.ca.mail.comcast.net with comcast id YXQY1o00R4uhcbK8MXQZeJ; Wed, 30 Jul 2014 07:24:33 +0000
Message-ID: <53D89DB0.9020505@brainhub.org>
Date: Wed, 30 Jul 2014 00:24:32 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Robert Ransom <rransom.8774@gmail.com>
References: <CA+Vbu7xroa68=HOZtbf=oz7kK2EeUv_z1okpnjxHPR0ZtHD5cA@mail.gmail.com> <CFF7E184.28E9F%kenny.paterson@rhul.ac.uk> <53D2781B.8030605@sbcglobal.net> <CACsn0ckqFigWoH2+OOEHSd2VWPp8y6=m8H5OsFRyjXmjK7+m4w@mail.gmail.com> <CABqy+srxMNuG0AaQd0SaegHvZWgbW762EQq+iAHL_fbu6sOJJQ@mail.gmail.com> <53D420B3.10707@brainhub.org> <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
In-Reply-To: <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------010301070108070304020403"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1406705073; bh=tFJASxSTQhPBb5J2PQPtMkwa2IHcf7JLAdw9YKtJQJU=; h=Received:Received:Message-ID:Date:From:MIME-Version:To:Subject: Content-Type; b=P6wlcy/eu2xbPJ7UiJvsua6SSnKOoF/eQBw+m+zYpRxIR2Cmk952FPJiECZWIOJal OvClhQnfgf7K9qpiUYrhTjT/KYHf1FCizE2rRYQ46bIfqiQR5mfMDpsqWmYL1S4n8J a6YKD4/sNYJUbN2F+zWCUMYDX95L+chk932vUOxPZMTpQfiMqepN+d1BBAxKMiz2HI UqSyyKD/B+XO4U4vzyHRBWh7GvJAVWeOfs/qxrVExeuEq/eHumDCVl6a6i1or1KMj8 7Nb+oMCQ4YSsiLIhwosJK3PRJFMgDJEIEFRUYzAE80eyXhGibGyCO4XLpqpKakklbs hJsiKoYYkGIqg==
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/-yzHPyRwz7zKlHRJ63ZZ6MJwczM
Cc: cfrg@irtf.org
Subject: Re: [Cfrg] Curve selection revisited
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jul 2014 07:24:40 -0000

On 07/29/2014 04:14 AM, Robert Ransom wrote:
> On 7/26/14, Andrey Jivsov <crypto@brainhub.org> wrote:
>> On 07/25/2014 07:25 PM, Robert Ransom wrote:
>>> The exact value of c determines whether a portable or vectorized
>>> implementation can delay carries until after reduction.  For example,
>>> after curve25519-donna performs a multiply or squaring, each of 19
>>> 64-bit limbs has absolute value less than 14*2^54.  When
>>> freduce_degree reduces a field element to 10 limbs without carrying,
>>> the result has absolute value less than (c+1)*14*2^54  = 280*2^54, or a
>>> bit more than 2^62.  In the Microsoft field 2^255  - 765, or even 2^256
>>> - 189, this would not be possible.
>>>
>>> Carries between limbs can be performed more efficiently than in
>>> curve25519-donna -- curve25519-donna goes to extra effort to keep
>>> limbs signed.  In the Montgomery ladder, this optimization does not
>>> add any extra carry steps, because the Montgomery-ladder formulas
>>> never perform more than two add/subtract operations between a
>>> multiply/square operation.  With Edwards-curve operations, this
>>> optimization may require an additional carry step in the doubling
>>> formula.
>>>
>>> The exact value of s determines whether carries can be vectorized
>>> within a coordinate-field element.  For example, 2^251  - 9 can be
>>> represented using one 26-bit limb (in the least significant position)
>>> and 9 25-bit limbs, but this interferes with the vectorization
>>> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
>>> Since c=9 is so small, this can be worked around by representing
>>> coordinates modulo 2^252  - 18 instead, but this trick is rarely
>>> possible without running into the bound on c.
>> The main argument for p = 2^255-19 v.s. p = 1^256-189 is that the C=19
>> is smaller than C=189 and this helps to "manage" the carries.
> Actually, the main argument for p = 2^255 - 19 is that Curve25519 is
> the de-facto standard curve for high-performance ECC in a group of
> near-256-bit order.  Part of the reason for that is that implementers
> have, in practice, been able to rely on the small value of what you
> denote as “C” for that field in optimizing their software, so no one
> has felt a need to switch to a different field of that form for
> performance reasons.

Yes, I accept it.

>
> (Mike Hamburg introduced ‘special Montgomery primes’ and showed
> evidence that they can provide better performance in some
> implementations than the Curve25519 coordinate field in
> <https://eprint.iacr.org/2012/309.pdf>, but that's a different form of
> prime.)
>
>
>> Let's assume that an F(p) integer is represented as n limbs (words).
>>
>> 256 bit Montgomery curve will have on the order of (5 mul + 4 squares)
>> and about the same number F(p) additions per one scalar bit.
>>
>> Therefore, we can talk about one unreduced F(p) multiplication + F(p)
>> reduction (where C plays its role) + F(p) addition. With this in mind
>> the contribution of C can be estimated within this group.
>>
>> F(p) multiplications (counting squares here as well) dominate the
>> performance of EC operations:
>> * the result of the multiplication is 2*n limbs and the complexity of
>> this operation is n^2 limb multiplications and O(n^2) additions.
>> * the 2*n limbs must be reduced (mod p) to n limbs immediately after
>> multiplication
>> * there is on average ~1 addition/subtraction of n limbs per each
>> multiplication due to EC formula
>>
>> Each multiplication needs degree reduction and settling of accumulated
>> carries before the next multiplication. One can only postpone settling
>> of carries while he does additions (prior to the next multiplciation).
>>
>> We have about n^2 limb additions, plus n limb additions due to EC
>> formular, plus we must perform n additions of the form <some_limb>*C
>> (for modp reduction).
>>
>> Assuming n=10 limbs, the entire contribution of n <some_limb>*C is bound
>> by ~10% of all the additions.
>>
>> We are not talking about C=0, however, but rather about |C|=4 bits v.s.
>> |C| = 8 bits difference.
>>
>> A rough estimate here is 5% penalty.
> As Dr. Bernstein has pointed out, a larger value of C requires 3*n
> carries instead of n carries (see
> <https://moderncrypto.org/mail-archive/curves/2014/000237.html>), and
> carries have high latency (see section 4, subsection ‘Reduction’ of
> naclcrypto).
>
>
>> One reason why I think this is a good estimate is that there may be a
>> benefit of more fine-tuned control of carries. The highest accumulation
>> of the carries happens in the inner limbs of the 2*n integer (imagine
>> paper-and-pencil multiplication of two n-limb numbers: the stack is the
>> highest in the middle). Therefore, if we have the need to control
>> overflow within, say, multiplication, perhaps it's possible to
>> accomplish this within a subset of n limbs.
> So you expect greater implementation effort and complexity to reduce
> the added cost, but not eliminate it?
I think it's fair to say that lower C will be faster if you only 
consider F(p) ( multiplies and squares; the additions/subtractions don't 
matter).

>
>
>> ( BTW, I don't see a clear argument why can't any F(p) be vectorized
>> (some more efficiently than others). The vectorization using x86 SSE2 is
>> known to speeds up RSA operations for generic F(p)s. )
> As I said, the vectorization strategy used in the neoncrypto paper is
> incompatible with 251-bit moduli.  That strategy requires pairing
> together limbs of the same size, so that a vectorized shift operation
> can be used to carry out of both limbs at the same time.
>
> In RSA, where (in your notation) C is necessarily huge, there is no
> performance penalty from making C a few bits larger, so it's easy to
> lengthen the modulus and make every limb the same size, so it's easy
> to vectorize carries across any pair of limbs.  But that's not very
> relevant to the primes used for fast ECC.
That's correct. For RSA with SSE you do similar things you do here, but 
the reduction is generic and slow.

>
>
> Do you see any benefit of p = 2^256 - 189 over p = 2^255 - 19 that you
> believe offsets the costs of switching to a different coordinate
> field?

They are :
* aesthetics
* it happened that square root of  p = 2^255 - 19 requires 
multiplication by a sqrt(-1) in half the cases when decompression is 
involved.

Both are tolerable from aesthetics point of view IMO if we (effectively) 
standardize on p = 2^255 - 19. p = 2^255 - 19's slightly higher 
performance than 2^256 - 189 should balance *sqrt(-1) out.

( Compare this with 2^414-17, which won't be trivial to match with other 
algorithms. )