Re: [Cfrg] On relative performance of Edwards v.s. Montgomery Curve25519, variable base

Andrey Jivsov <crypto@brainhub.org> Mon, 05 January 2015 23:08 UTC

Date: Mon, 05 Jan 2015 15:07:54 -0800
From: Andrey Jivsov <crypto@brainhub.org>
To: Michael Hamburg <mike@shiftleft.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/fMvBhi4OCEZ4No7GxZBMOAPCpfo
Cc: "cfrg@irtf.org" <cfrg@irtf.org>
Subject: Re: [Cfrg] On relative performance of Edwards v.s. Montgomery Curve25519, variable base

On 01/05/2015 01:45 PM, Michael Hamburg wrote:
>> On Jan 5, 2015, at 12:07 PM, Andrey Jivsov <crypto@brainhub.org> wrote:
>>
>> Thank you, Mike, this is helpful. It brings up a few thoughts, below.
> You’re welcome.
>
>> On 01/05/2015 01:35 AM, Mike Hamburg wrote:
>>> Microsoft's benchmarks on Ed-384-mers, and even some of my earlier Ed448 numbers, disagree with the Ed448 data above.  My earlier numbers were about 6% less favorable to Montgomery, and Microsoft's numbers were about 10-15% less favorable to Montgomery.  ...
>> My tests seem to confirm that Montgomery is 10-15% less favourable as well.
> But it looks to me like you compared a constant-time Montgomery implementation to a variable-time Edwards one.

The method uses windowed, exponentiation-style scalar multiplication 
with a doubling for each bit. The data-dependent decisions occur only 
in the additions, which add roughly 50% on top of the doublings. Making 
those additions constant-time may lower the performance gap, but I 
think it will stay within those 15%.
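
To make the structure concrete, here is a toy Python sketch of the
windowed method I have in mind (the group is just integers mod p so the
example is self-contained; the window width and all names here are
illustrative, not my actual code):

# Toy sketch of left-to-right windowed scalar multiplication.  The
# group is integers mod p under addition so the example runs as-is;
# point_add and point_double stand in for the curve's group law.

p = 2**255 - 19

def point_add(a, b):
    return (a + b) % p

def point_double(a):
    return (2 * a) % p

def windowed_scalar_mult(k, P, w=4, bits=256):
    # Precompute 0*P .. (2^w - 1)*P.
    table = [0]
    for _ in range((1 << w) - 1):
        table.append(point_add(table[-1], P))
    Q = 0
    for i in range(bits - w, -1, -w):   # most-significant window first
        for _ in range(w):
            Q = point_double(Q)         # one doubling per scalar bit
        window = (k >> i) & ((1 << w) - 1)
        if window:                      # the data-dependent decision:
            Q = point_add(Q, table[window])
    return Q

assert windowed_scalar_mult(123456789, 31337) == (123456789 * 31337) % p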

>
>>> So, for larger primes, the Montgomery ladder is probably slightly faster than Edwards when operating on compressed points, and slower or the same otherwise.
>> A common assumption is that we need point compression in ECDH, from which it follows that Montgomery representation is a wash for p ~ 2^256.
>>
>> However, for the sake of technical argument let's leave aside the weight assigned to "widespread existing practice" and:
>> * assume that the curve is specified in Edwards form, e.g. as in http://www.ietf.org/id/draft-black-rpgecc-01.txt
>> * assume that the point is sent uncompressed.
>>
>>
>> This is fine for authenticated keys, e.g. because these will be distributed in X.509 certificates, DNS records, etc. Point validation is cheaper than decompression in the other cases, and this is how it's presently done on the Internet. Optional point compression is also well-understood.
> Sure, this is an OK option.  I prefer always-compressed for reasons which have come up before, but optional compression is reasonable.
>
>> * The fixed-base scalar multiplication is very fast, e.g. 15 times faster with 5.24 MB of pre-computed static tables for a 13-bit window.
> Most designers don't do this, because it’s vulnerable to timing attacks and can have caching problems (important for anything other than benchmarks).  More common values are somewhere between 3kiB and 20kiB, with constant-time table lookups, yielding about a 3x performance improvement instead of 15x.

OpenSSL does this in its ECDSA P-256 code with w=7. My comments concern 
the generalization of that code to w=13, which means about 20 ECC 
additions for the entire P-256 fixed-base scalar multiplication. At 
that rate, memory access is what slows you down, along with things like 
the unneeded inverse in ECDSA vs. EdDSA.
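
For reference, the arithmetic behind those figures (the exact table
layout is my reconstruction of how 5.24 MB arises, so take it as
illustrative):

# Back-of-the-envelope check of the w=13 fixed-base figures above.
w = 13
windows = -(-256 // w)           # ceil(256/13) = 20 point additions
entries = 2 ** (w - 1)           # 4096 entries per window (signed digits)
bytes_per_point = 64             # affine x and y, 32 bytes each
print(windows, windows * entries * bytes_per_point)  # 20, 5242880 ~ 5.24 MB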

It may not be necessary to have side-channel-protected code when 
generating a one-time ECDH pair; otherwise, one can add cache-line 
side-channel protection.
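
By cache-line side-channel protection I mean the usual trick of
touching every table entry and selecting the wanted one with masks, so
the memory access pattern is independent of the secret index. A sketch
of the idea (Python itself gives no timing guarantees; a real
implementation would use branch-free word operations in C or assembly):

# Constant-time-style table selection: scan the whole table and keep
# the matching entry via an arithmetic mask instead of indexing by a
# secret value.

def ct_select(table, secret_index):
    result = 0
    for i, entry in enumerate(table):
        # mask is -1 (all ones) exactly when i == secret_index, else 0,
        # for tables shorter than 2^64 entries
        mask = ((i ^ secret_index) - 1) >> 64
        result |= entry & mask
    return result

assert ct_select([10, 20, 30], 2) == 30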

>
>> This is beneficial for protocols or implementations that cannot reuse an ephemeral EC key, e.g. a command-line application launched to encrypt data, a browser launched with open tabs, etc.
>>
>> Given that the calculation of a client ECDH share is so fast relative to the calculation of the ECDH shared-secret point, an implementation may as well always calculate it fresh. This reduces complexity and fits many APIs, since there is no need to maintain a "state" for a reused ephemeral share.
>>
>> * The same F(p) and ECC code can be used for signing and ECDH. Also, a small d in the Edwards equation is used.
> Implementations which compress points and/or use the ladder will share F(p) code.  The small d in the Edwards equation can be used in either the isomorphic curve (as Robert Ransom and I found) or the isogenous one (a la Microsoft).  If an implementor wants to use the Edwards code for Montgomery-x ECDH as well, it costs no more than with any other point compression scheme.

I put some of this here:
http://www.ietf.org/mail-archive/web/cfrg/current/msg05113.html

However, "free" assumes that you use compression. Removing compression 
gives you 10% performance gain, if you don't need to do conversions to 
isogenous curves.

Is it worth subjecting X.509 cert verifiers to the 10% penalty for a 
32-byte saving?
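
To be concrete about the cost asymmetry I am assuming: validating an
uncompressed point is a few field multiplications, while decompression
needs a full-size exponentiation for the square root plus one for the
inversion. A sketch over the Ed25519 curve, following the RFC 8032
formulas:

# Why decompression costs more than validation, on the Ed25519 curve
# -x^2 + y^2 = 1 + d*x^2*y^2 over p = 2^255 - 19 (RFC 8032 formulas).

p = 2**255 - 19
d = -121665 * pow(121666, p - 2, p) % p

def is_on_curve(x, y):
    # Uncompressed input: a handful of field multiplications, no
    # full-size exponentiation.
    return (-x * x + y * y - 1 - d * x * x * y * y) % p == 0

def decompress_x(y, sign):
    # Compressed input: an inversion plus a square root, i.e. two
    # full-size field exponentiations -- the expensive part.
    xx = (y * y - 1) * pow(d * y * y + 1, p - 2, p) % p
    x = pow(xx, (p + 3) // 8, p)               # sqrt candidate, p = 5 mod 8
    if (x * x - xx) % p != 0:
        x = x * pow(2, (p - 1) // 4, p) % p    # multiply by sqrt(-1)
    if (x * x - xx) % p != 0:
        raise ValueError("invalid point: x^2 has no square root")
    return p - x if (x & 1) != sign else x

y = 4 * pow(5, p - 2, p) % p                   # base point has y = 4/5
assert is_on_curve(decompress_x(y, 0), y)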

>> Down the road, something like EdDSA+ECDH with uncompressed points in Edwards form looks like the fastest alternative. Essentially this enables a system design with a single ECC primitive that does x*P for the same field size, which is very fast when P=G, the base point.
> Faster than compressed, yes.  Better, though… maybe?  Speaking of faster, most verifiers will use a double-scalar multiply because it’s faster, and they usually have different code between signing/ECDH and verification for side-channel reasons.
"Batching" is a good thing. When x * basepoint is just a few additions, 
the main benefit of doing double scalar multiply is probably due to 
"simultaneous inverse" (single conversion to affine v.s. twice). TLS 1.3 
enables additional "batching" between ECDH and signature. These cases 
can be handled by a single x*P primitive with a n-point export function.
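
By "simultaneous inverse" I mean Montgomery's trick: n field inversions
for the price of one inversion plus roughly 3*(n-1) multiplications,
which is what makes one joint conversion to affine cheaper than two
separate ones. A sketch:

# Montgomery's simultaneous-inversion trick: invert n field elements
# with a single modular inversion plus ~3*(n-1) multiplications.

p = 2**255 - 19

def batch_invert(zs):
    n = len(zs)
    prefix = [1] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] * zs[i] % p   # running products
    acc = pow(prefix[n], p - 2, p)              # the one inversion
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = acc * prefix[i] % p            # 1 / zs[i]
        acc = acc * zs[i] % p                   # strip zs[i] off acc
    return out

zs = [3, 7, 11]
assert all(z * iz % p == 1 for z, iz in zip(zs, batch_invert(zs)))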

>
>> * Traditional ECC assumed that one can add points. This may affect adoption of the new curve.
>>
>> This is the case for NIST P-256. It may be desirable to sign with an ECDH key, e.g. for proof of key possession, without a special protocol.
>>
>> * Curves that allow point addition offer space/speed tradeoff option through the window size, and other methods.
>>
>> Unfortunately, the conversion between representations costs ~10% due to the F(p) inverse. Decompression costs about the same. We also know that the performance difference between Montgomery and Edwards operations is in a similar range. Therefore, the selection of the format on the wire determines the implementation. If we are sending a Montgomery x, we are committing to a Montgomery ladder implementation. Is this justified?
>
> Exactly what features are possible or desirable in a point encoding — and at what cost in complexity and generality — is an interesting question.
>
> The point format in the experimental Goldilocks software (and in Ed252) allows adding points, because it transmits sign information.  The same format is used for laddered ECDH and for signatures.  It is suitable for both the Montgomery ladder and Edwards windowed implementations.  Furthermore, it reduces the curve’s cofactor to 2 (see also my post on reducing the cofactor to 1), and allows the Montgomery ladder to reject twist points.  It costs about the same as an ordinary point compression/decompression system, but it is slightly more complex and not batchable for either compression or decompression.
>
> Others have also advocated a simpler format with similar properties, such as (Montgomery x, low bit of Edwards x).  This doesn’t reduce the cofactor, but it’s easier to describe and it allows compression to be batched.

As a side note, I see difficulties in implementing batching in 
practice. Realistically, batching ECC operations across parallel TLS 
streams is an engineering challenge. A realistic pattern for batching 
is something like double scalar multiplications that are guaranteed to 
occur at the protocol level. Therefore, the 10% hit from the extra 
inverse, unfortunately, may not be easy to reduce to below 5% (and 
that's when you are lucky, e.g. when the cert and the DH group match).