Re: [Cfrg] Curve selection revisited

Robert Ransom <rransom.8774@gmail.com> Tue, 29 July 2014 11:14 UTC

Return-Path: <rransom.8774@gmail.com>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CDA981B27BC for <cfrg@ietfa.amsl.com>; Tue, 29 Jul 2014 04:14:21 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.75
X-Spam-Level:
X-Spam-Status: No, score=-1.75 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, FREEMAIL_FROM=0.001, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id j6be-UwNVrR3 for <cfrg@ietfa.amsl.com>; Tue, 29 Jul 2014 04:14:17 -0700 (PDT)
Received: from mail-qa0-x230.google.com (mail-qa0-x230.google.com [IPv6:2607:f8b0:400d:c00::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id E88A11B27BB for <cfrg@irtf.org>; Tue, 29 Jul 2014 04:14:16 -0700 (PDT)
Received: by mail-qa0-f48.google.com with SMTP id m5so9255513qaj.35 for <cfrg@irtf.org>; Tue, 29 Jul 2014 04:14:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=ToAQxgq93EYAF2DBQuynz+VxfUHFQ+/Jn1ElbF4A13o=; b=KICrNCLeaeXROubf4KaaVLOSBqM2ASCRG/g83GTqtmxfrQePKtGNStCWMzklyYDH0h RoY13tswdMDswXexeR5LiIxIDBMw7gljYRfSojf8kEeYljwgM1FvduMinIRyalC2wpBV rSBxZKUrRC62suKy40muhJ6Q3ogDHSI3jRZHv+vkFgFdITDPGsyCXoJWqH9isBwtaySH EiCLpFi/+1jc2CU2BftkPsrK8xf7I5+LjJ4pfvO9ogTHY2RszroJ7nKElLOOTsXq8KiL cszQtKa+AtV6G42AzXkGhJUOKSeWC9rKBnGKFr5SUeI4biRw1lWNqWz1BYl8IG5gzmsA 9JaQ==
MIME-Version: 1.0
X-Received: by 10.229.231.74 with SMTP id jp10mr2286567qcb.29.1406632455890; Tue, 29 Jul 2014 04:14:15 -0700 (PDT)
Received: by 10.140.86.135 with HTTP; Tue, 29 Jul 2014 04:14:15 -0700 (PDT)
In-Reply-To: <53D420B3.10707@brainhub.org>
References: <CA+Vbu7xroa68=HOZtbf=oz7kK2EeUv_z1okpnjxHPR0ZtHD5cA@mail.gmail.com> <CFF7E184.28E9F%kenny.paterson@rhul.ac.uk> <53D2781B.8030605@sbcglobal.net> <CACsn0ckqFigWoH2+OOEHSd2VWPp8y6=m8H5OsFRyjXmjK7+m4w@mail.gmail.com> <CABqy+srxMNuG0AaQd0SaegHvZWgbW762EQq+iAHL_fbu6sOJJQ@mail.gmail.com> <53D420B3.10707@brainhub.org>
Date: Tue, 29 Jul 2014 04:14:15 -0700
Message-ID: <CABqy+so6JcL3drjXuiQfLhm-LPMOJuS9ES5Hyb1UQRhi-gV2jA@mail.gmail.com>
From: Robert Ransom <rransom.8774@gmail.com>
To: Andrey Jivsov <crypto@brainhub.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/ineI42CtwjeMF8HbBdm0bz1iqno
Cc: cfrg@irtf.org
Subject: Re: [Cfrg] Curve selection revisited
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jul 2014 11:14:22 -0000

On 7/26/14, Andrey Jivsov <crypto@brainhub.org> wrote:
> On 07/25/2014 07:25 PM, Robert Ransom wrote:
>> The exact value of c determines whether a portable or vectorized
>> implementation can delay carries until after reduction.  For example,
>> after curve25519-donna performs a multiply or squaring, each of 19
>> 64-bit limbs has absolute value less than 14*2^54.  When
>> freduce_degree reduces a field element to 10 limbs without carrying,
>> the result has absolute value less than (c+1)*14*2^54  = 280*2^54, or a
>> bit more than 2^62.  In the Microsoft field 2^255  - 765, or even 2^256
>> - 189, this would not be possible.
>>
>> Carries between limbs can be performed more efficiently than in
>> curve25519-donna -- curve25519-donna goes to extra effort to keep
>> limbs signed.  In the Montgomery ladder, this optimization does not
>> add any extra carry steps, because the Montgomery-ladder formulas
>> never perform more than two add/subtract operations between a
>> multiply/square operation.  With Edwards-curve operations, this
>> optimization may require an additional carry step in the doubling
>> formula.
>>
>> The exact value of s determines whether carries can be vectorized
>> within a coordinate-field element.  For example, 2^251  - 9 can be
>> represented using one 26-bit limb (in the least significant position)
>> and 9 25-bit limbs, but this interferes with the vectorization
>> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
>> Since c=9 is so small, this can be worked around by representing
>> coordinates modulo 2^252  - 18 instead, but this trick is rarely
>> possible without running into the bound on c.

> The main argument for p = 2^255-19 v.s. p = 1^256-189 is that the C=19
> is smaller than C=189 and this helps to "manage" the carries.

Actually, the main argument for p = 2^255 - 19 is that Curve25519 is
the de-facto standard curve for high-performance ECC in a group of
near-256-bit order.  Part of the reason for that is that implementers
have, in practice, been able to rely on the small value of what you
denote as “C” for that field in optimizing their software, so no one
has felt a need to switch to a different field of that form for
performance reasons.

(Mike Hamburg introduced ‘special Montgomery primes’ and showed
evidence that they can provide better performance in some
implementations than the Curve25519 coordinate field in
<https://eprint.iacr.org/2012/309.pdf>, but that's a different form of
prime.)


> Let's assume that an F(p) integer is represented as n limbs (words).
>
> 256 bit Montgomery curve will have on the order of (5 mul + 4 squares)
> and about the same number F(p) additions per one scalar bit.
>
> Therefore, we can talk about one unreduced F(p) multiplication + F(p)
> reduction (where C plays its role) + F(p) addition. With this in mind
> the contribution of C can be estimated within this group.
>
> F(p) multiplications (counting squares here as well) dominate the
> performance of EC operations:
> * the result of the multiplication is 2*n limbs and the complexity of
> this operation is n^2 limb multiplications and O(n^2) additions.
> * the 2*n limbs must be reduced (mod p) to n limbs immediately after
> multiplication
> * there is on average ~1 addition/subtraction of n limbs per each
> multiplication due to EC formula
>
> Each multiplication needs degree reduction and settling of accumulated
> carries before the next multiplication. One can only postpone settling
> of carries while he does additions (prior to the next multiplciation).
>
> We have about n^2 limb additions, plus n limb additions due to EC
> formular, plus we must perform n additions of the form <some_limb>*C
> (for modp reduction).
>
> Assuming n=10 limbs, the entire contribution of n <some_limb>*C is bound
> by ~10% of all the additions.
>
> We are not talking about C=0, however, but rather about |C|=4 bits v.s.
> |C| = 8 bits difference.
>
> A rough estimate here is 5% penalty.

As Dr. Bernstein has pointed out, a larger value of C requires 3*n
carries instead of n carries (see
<https://moderncrypto.org/mail-archive/curves/2014/000237.html>), and
carries have high latency (see section 4, subsection ‘Reduction’ of
naclcrypto).


> One reason why I think this is a good estimate is that there may be a
> benefit of more fine-tuned control of carries. The highest accumulation
> of the carries happens in the inner limbs of the 2*n integer (imagine
> paper-and-pencil multiplication of two n-limb numbers: the stack is the
> highest in the middle). Therefore, if we have the need to control
> overflow within, say, multiplication, perhaps it's possible to
> accomplish this within a subset of n limbs.

So you expect greater implementation effort and complexity to reduce
the added cost, but not eliminate it?


> ( BTW, I don't see a clear argument why can't any F(p) be vectorized
> (some more efficiently than others). The vectorization using x86 SSE2 is
> known to speeds up RSA operations for generic F(p)s. )

As I said, the vectorization strategy used in the neoncrypto paper is
incompatible with 251-bit moduli.  That strategy requires pairing
together limbs of the same size, so that a vectorized shift operation
can be used to carry out of both limbs at the same time.

In RSA, where (in your notation) C is necessarily huge, there is no
performance penalty from making C a few bits larger, so it's easy to
lengthen the modulus and make every limb the same size, so it's easy
to vectorize carries across any pair of limbs.  But that's not very
relevant to the primes used for fast ECC.


Do you see any benefit of p = 2^256 - 189 over p = 2^255 - 19 that you
believe offsets the costs of switching to a different coordinate
field?


Robert Ransom