Re: [Cfrg] Curve selection revisited

Andrey Jivsov <crypto@brainhub.org> Sat, 26 July 2014 21:42 UTC

Return-Path: <crypto@brainhub.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 929CE1A040C for <cfrg@ietfa.amsl.com>; Sat, 26 Jul 2014 14:42:14 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.901
X-Spam-Level:
X-Spam-Status: No, score=-1.901 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id WeGnxo4L6oTo for <cfrg@ietfa.amsl.com>; Sat, 26 Jul 2014 14:42:12 -0700 (PDT)
Received: from qmta11.emeryville.ca.mail.comcast.net (qmta11.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:44:76:96:27:211]) by ietfa.amsl.com (Postfix) with ESMTP id 91E6F1A03FF for <cfrg@irtf.org>; Sat, 26 Jul 2014 14:42:12 -0700 (PDT)
Received: from omta21.emeryville.ca.mail.comcast.net ([76.96.30.88]) by qmta11.emeryville.ca.mail.comcast.net with comcast id X9bX1o0011u4NiLAB9iCKQ; Sat, 26 Jul 2014 21:42:12 +0000
Received: from [192.168.1.2] ([71.202.164.227]) by omta21.emeryville.ca.mail.comcast.net with comcast id X9iB1o0044uhcbK8h9iBK3; Sat, 26 Jul 2014 21:42:12 +0000
Message-ID: <53D420B3.10707@brainhub.org>
Date: Sat, 26 Jul 2014 14:42:11 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: cfrg@irtf.org
References: <CA+Vbu7xroa68=HOZtbf=oz7kK2EeUv_z1okpnjxHPR0ZtHD5cA@mail.gmail.com> <CFF7E184.28E9F%kenny.paterson@rhul.ac.uk> <53D2781B.8030605@sbcglobal.net> <CACsn0ckqFigWoH2+OOEHSd2VWPp8y6=m8H5OsFRyjXmjK7+m4w@mail.gmail.com> <CABqy+srxMNuG0AaQd0SaegHvZWgbW762EQq+iAHL_fbu6sOJJQ@mail.gmail.com>
In-Reply-To: <CABqy+srxMNuG0AaQd0SaegHvZWgbW762EQq+iAHL_fbu6sOJJQ@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1406410932; bh=wp94k3uUfafgOTR/Pz0KziS8XfLFBJaWelWe4SKukJQ=; h=Received:Received:Message-ID:Date:From:MIME-Version:To:Subject: Content-Type; b=vAJMwCTZUQBayVmgwKmWswknNPpKlPBghgKyq2wmCog7KGxwujE5YOHvzQ7q4Gih7 qo8jmZaT8w0es7ClGchJozx5Uqb1o+JNGjNo/j3rhE8xb0sK54EudHLI6b41x5XKdH yV9DU885PdW8GesqBxw5xnk+Hez4Oq6XraN+9Ei5O1eIUZDBBOVs1eoUguFyzpmgBe IYey/TlbnEEb3qbj3WvvpZSJD3Je9D/SzxSv37QbxalPA4q28EMVVgdNF5Q5pep8cx TjsVoe31y3kexKcQ4f+O0cYEwIGQgxfw6E5/93ySAH9a1jvp0Ff2gFch3ddeTRfFK+ Q9C50eTPJzwLQ==
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/JYqu4L6Ksnh8Xs4AI_9yTfOwUOc
Subject: Re: [Cfrg] Curve selection revisited
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Sat, 26 Jul 2014 21:42:14 -0000

On 07/25/2014 07:25 PM, Robert Ransom wrote:
> The exact value of c determines whether a portable or vectorized
> implementation can delay carries until after reduction.  For example,
> after curve25519-donna performs a multiply or squaring, each of 19
> 64-bit limbs has absolute value less than 14*2^54.  When
> freduce_degree reduces a field element to 10 limbs without carrying,
> the result has absolute value less than (c+1)*14*2^54  = 280*2^54, or a
> bit more than 2^62.  In the Microsoft field 2^255  - 765, or even 2^256
> - 189, this would not be possible.
>
> Carries between limbs can be performed more efficiently than in
> curve25519-donna -- curve25519-donna goes to extra effort to keep
> limbs signed.  In the Montgomery ladder, this optimization does not
> add any extra carry steps, because the Montgomery-ladder formulas
> never perform more than two add/subtract operations between a
> multiply/square operation.  With Edwards-curve operations, this
> optimization may require an additional carry step in the doubling
> formula.
>
> The exact value of s determines whether carries can be vectorized
> within a coordinate-field element.  For example, 2^251  - 9 can be
> represented using one 26-bit limb (in the least significant position)
> and 9 25-bit limbs, but this interferes with the vectorization
> strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
> Since c=9 is so small, this can be worked around by representing
> coordinates modulo 2^252  - 18 instead, but this trick is rarely
> possible without running into the bound on c.

On 07/25/2014 07:25 PM, Robert Ransom wrote:
 > The exact value of c determines whether a portable or vectorized
 > implementation can delay carries until after reduction.  For example,
 > after curve25519-donna performs a multiply or squaring, each of 19
 > 64-bit limbs has absolute value less than 14*2^54.  When
 > freduce_degree reduces a field element to 10 limbs without carrying,
 > the result has absolute value less than (c+1)*14*2^54  = 280*2^54, or a
 > bit more than 2^62.  In the Microsoft field 2^255  - 765, or even 2^256
 > - 189, this would not be possible.
 >
 > Carries between limbs can be performed more efficiently than in
 > curve25519-donna -- curve25519-donna goes to extra effort to keep
 > limbs signed.  In the Montgomery ladder, this optimization does not
 > add any extra carry steps, because the Montgomery-ladder formulas
 > never perform more than two add/subtract operations between a
 > multiply/square operation.  With Edwards-curve operations, this
 > optimization may require an additional carry step in the doubling
 > formula.
 >
 > The exact value of s determines whether carries can be vectorized
 > within a coordinate-field element.  For example, 2^251  - 9 can be
 > represented using one 26-bit limb (in the least significant position)
 > and 9 25-bit limbs, but this interferes with the vectorization
 > strategy described in the neoncrypto paper (by Dr. Bernstein et al.).
 > Since c=9 is so small, this can be worked around by representing
 > coordinates modulo 2^252  - 18 instead, but this trick is rarely
 > possible without running into the bound on c.

The main argument for p = 2^255-19 v.s. p = 1^256-189 is that the C=19 
is smaller than C=189 and this helps to "manage" the carries.

Let's assume that an F(p) integer is represented as n limbs (words).

256 bit Montgomery curve will have on the order of (5 mul + 4 squares) 
and about the same number F(p) additions per one scalar bit.

Therefore, we can talk about one unreduced F(p) multiplication + F(p) 
reduction (where C plays its role) + F(p) addition. With this in mind 
the contribution of C can be estimated within this group.

F(p) multiplications (counting squares here as well) dominate the 
performance of EC operations:
* the result of the multiplication is 2*n limbs and the complexity of 
this operation is n^2 limb multiplications and O(n^2) additions.
* the 2*n limbs must be reduced (mod p) to n limbs immediately after 
multiplication
* there is on average ~1 addition/subtraction of n limbs per each 
multiplication due to EC formula

Each multiplication needs degree reduction and settling of accumulated 
carries before the next multiplication. One can only postpone settling 
of carries while he does additions (prior to the next multiplciation).

We have about n^2 limb additions, plus n limb additions due to EC 
formular, plus we must perform n additions of the form <some_limb>*C 
(for modp reduction).

Assuming n=10 limbs, the entire contribution of n <some_limb>*C is bound 
by ~10% of all the additions.

We are not talking about C=0, however, but rather about |C|=4 bits v.s. 
|C| = 8 bits difference.

A rough estimate here is 5% penalty.

One reason why I think this is a good estimate is that there may be a 
benefit of more fine-tuned control of carries. The highest accumulation 
of the carries happens in the inner limbs of the 2*n integer (imagine 
paper-and-pencil multiplication of two n-limb numbers: the stack is the 
highest in the middle). Therefore, if we have the need to control 
overflow within, say, multiplication, perhaps it's possible to 
accomplish this within a subset of n limbs.


( BTW, I don't see a clear argument why can't any F(p) be vectorized 
(some more efficiently than others). The vectorization using x86 SSE2 is 
known to speeds up RSA operations for generic F(p)s. )