Re: [Cfrg] Progress on curve recommendations for TLS WG

Mike Hamburg <mike@shiftleft.org> Fri, 15 August 2014 21:28 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1971.5\))
From: Mike Hamburg <mike@shiftleft.org>
In-Reply-To: <53EDEE67.6090308@secunet.com>
Date: Fri, 15 Aug 2014 14:28:46 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <A1FCAAE8-4F20-4AC1-A635-AED20E8F7D5C@shiftleft.org>
References: <CFFB1371.2916E%kenny.paterson@rhul.ac.uk> <20140808141506.GA24645@LK-Perkele-VII> <53EDEE67.6090308@secunet.com>
To: Johannes Merkle <johannes.merkle@secunet.com>, Ilari Liusvaara <ilari.liusvaara@elisanet.fi>, "Paterson, Kenny" <Kenny.Paterson@rhul.ac.uk>
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/7kJyWarsi4hHO8J4jpBFSSGpi8Q
Cc: "cfrg@irtf.org" <cfrg@irtf.org>
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
Precedence: list

On 8/15/2014 4:26 AM, Johannes Merkle wrote:
> Ilari Liusvaara wrote on 08.08.2014 16:15:
>> Picking prime randomly has _devastating_ impact on performance. 
> I do not mean to doubt that statement, I'm just looking for hard numbers: Do you have any reference for a proper
> analysis supporting that? I mean something beyond "oh, that's common sense".
So, I don't have a proper reference, but in my experience the performance ratio is slightly less than 2 at around 256 bits, and probably creeps towards 2.5 or 3 as the number of bits increases towards 512.

An n-digit multiply with no reduction algorithm costs n^2 muls, schoolbook.  A Montgomery reduction costs n^2+n additional multiplies, for a total of 2n^2 + n.  There may also be a conditional subtract, depending on the modulus.  A special prime still requires some reduction, and Montgomery ends up being slightly less than twice as expensive as schoolbook.
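To make the multiply counts above concrete, here is a minimal sketch that counts word multiplies in a schoolbook product followed by a word-by-word Montgomery reduction. The 64-bit limb width, the function names, and the use of Python integers in place of a real carry chain are all assumptions for illustration; this is not the assembly discussed in this message.

```python
import random

W = 64                 # limb width in bits (assumption: 64-bit words)
MASK = (1 << W) - 1

def limbs(x, n):
    """Split integer x into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def schoolbook_mul(a, b, n):
    """Unreduced n x n limb product. Returns (product, word multiplies)."""
    al, bl = limbs(a, n), limbs(b, n)
    t, muls = 0, 0
    for i in range(n):
        for j in range(n):
            # A Python int stands in for the carry chain of a real mulacc.
            t += (al[i] * bl[j]) << (W * (i + j))
            muls += 1
    return t, muls

def montgomery_reduce(t, p, n):
    """Word-by-word REDC. Returns (t * 2^(-W*n) mod p, word multiplies)."""
    pl = limbs(p, n)
    p_neg_inv = (-pow(p, -1, 1 << W)) & MASK   # -p^(-1) mod 2^W, precomputed
    muls = 0
    for i in range(n):
        t_i = (t >> (W * i)) & MASK
        m = (t_i * p_neg_inv) & MASK           # 1 multiply per outer step
        muls += 1
        for j in range(n):                     # n multiplies per outer step
            t += (m * pl[j]) << (W * (i + j))
            muls += 1
    r = t >> (W * n)                           # the low n limbs are now zero
    if r >= p:                                 # the conditional subtract
        r -= p
    return r, muls

def montgomery_mul(a, b, p, n):
    """Schoolbook product plus Montgomery reduction, with the total count."""
    t, m1 = schoolbook_mul(a, b, n)
    r, m2 = montgomery_reduce(t, p, n)
    return r, m1 + m2

n = 4
# Any odd modulus works for counting purposes; it need not be prime here.
p = random.Random(0).getrandbits(252) | (1 << 251) | 1
a, b = 12345, 67890
r, muls = montgomery_mul(a, b, p, n)
assert r == a * b * pow(1 << (W * n), -1, p) % p
print(muls, 2 * n * n + n)
```

The reduction loop does one multiply to form m and n multiplies to add m*p, repeated n times, which is where the n^2 + n figure comes from.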

Supporting numbers: a few years ago I implemented 252-bit special-Montgomery primes on Sandy Bridge (mod the special prime, reduction isn't free but costs n=4 mulaccs), and a multiply took about 55 cycles, vs about 100 cycles for a 252-bit general prime.  The algorithms were the same, and both implementations were written in assembly.  This is what you'd have predicted from the above guesses, since with n=4, (n^2 + n) / (2n^2 + n) = 20 / 36 = 55.555...%.

Squaring cost about 10 cycles less in either case, for a flat ratio of 90:45 = 2:1.  This is because Montgomery reduction does not go any faster when you square, which takes n(n+1)/2 multiplies schoolbook style, but 3x that with a Montgomery reduction.  These asymptotics are not even close to true with n=4, though.
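The multiply and squaring counts above can be put into a small cost model. This is only a sketch: it counts word multiplies and nothing else, and it assumes that reduction modulo a special prime costs n mulaccs at every size, which this message states only for the n=4 case.

```python
from fractions import Fraction

def reduction_cost(n, general):
    """Word multiplies for reduction: n^2 + n for Montgomery, n assumed for a special prime."""
    return n * n + n if general else n

def mul_cost(n, general):
    """Schoolbook multiply (n^2) plus reduction."""
    return n * n + reduction_cost(n, general)

def sqr_cost(n, general):
    """Schoolbook squaring (n(n+1)/2) plus reduction."""
    return n * (n + 1) // 2 + reduction_cost(n, general)

def special_to_general(n, cost):
    """Cost of the special prime as a fraction of the general prime's cost."""
    return Fraction(cost(n, False), cost(n, True))

for n in (4, 6, 8):  # 256, 384 and 512 bits with 64-bit limbs
    print(n,
          float(special_to_general(n, mul_cost)),
          float(special_to_general(n, sqr_cost)))
```

At n=4 the model gives 20/36 for a multiply and 14/30 for a squaring, against the measured 55:100 and 45:90, so the multiply prediction is close and the squaring prediction is off, as noted above.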

As p increases, the ratio will get worse.  The reduction mod the special prime is a smaller fraction of the total work, and squaring gets closer to asymptotic.  Montgomery reduction does not improve as much with Karatsuba multiplication.  Karatsuba may cut the cost of the multiplication by 25% or more at 384 or 512 bits, but it’s rather tricky to combine with a general Montgomery reduction efficiently.  Furthermore, if you try to do this, the state will be larger and may spill out of the register file.  For example, my 448-bit NEON code uses exactly the entire register file with no spilling; anything extra would add significant cost.
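As a rough illustration of the 25% figure, the sketch below applies one level of Karatsuba (the subtractive variant, so that the middle operands still fit in n/2 limbs) and counts word multiplies. The limb width and helper names are assumptions, additions and carries are not counted, and the reduction step is left out entirely.

```python
W = 64                 # limb width in bits (assumption)
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split integer x into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def to_int(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(v << (W * i) for i, v in enumerate(limbs))

def schoolbook(a, b):
    """Product of two limb lists. Returns (integer product, word multiplies)."""
    total, muls = 0, 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            total += (x * y) << (W * (i + j))
            muls += 1
    return total, muls

def karatsuba_one_level(a, b):
    """One Karatsuba split over an even number of limbs, schoolbook below that."""
    h = len(a) // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    lo, m0 = schoolbook(a0, b0)
    hi, m1 = schoolbook(a1, b1)
    # a0*b1 + a1*b0 = a0*b0 + a1*b1 + (a0 - a1)*(b1 - b0)
    da = to_int(a0) - to_int(a1)
    db = to_int(b1) - to_int(b0)
    mid, m2 = schoolbook(to_limbs(abs(da), h), to_limbs(abs(db), h))
    if (da < 0) != (db < 0):
        mid = -mid
    cross = lo + hi + mid
    product = lo + (cross << (W * h)) + (hi << (2 * W * h))
    return product, m0 + m1 + m2

for n in (6, 8):  # 384 and 512 bits
    x, y = (1 << (W * n)) - 12345, (1 << (W * n - 1)) + 6789
    prod, k_muls = karatsuba_one_level(to_limbs(x, n), to_limbs(y, n))
    assert prod == x * y
    print(n, k_muls, n * n)
```

One level replaces n^2 multiplies with 3(n/2)^2, which is the 25% saving; further levels would save more on the multiplication but, as said above, do nothing for the n^2 + n multiplies of a general Montgomery reduction.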

Exactly 64n-bit aligned primes will be slightly worse when randomized, because numbers need to be reduced more often.

Finally, powering ladders like inverse or square root are denser and therefore slower with a random prime.  A powering ladder for a special prime usually has a multiply density of around 5%, but for a random prime it is likely to be 20%.  Inversion is about 10% of a variable-base and 25% of a fixed-base scalarmul, so this effect is unlikely to exceed 5% of the total time.
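The density difference can be illustrated with a sketch. For the special prime it uses 2^255 - 19 and the well-known addition chain for x^(p-2); that prime is my choice of example and is not named in this message. For the random prime it uses a seeded random 255-bit exponent as a stand-in for p-2, since only the exponent's bit pattern affects the operation count, together with a sliding window of width 5 (also an assumption). "Density" is taken here to mean multiplies per squaring.

```python
import random

P25519 = 2**255 - 19   # assumed example of a special prime

class Ops:
    """Modular arithmetic that counts squarings and multiplies."""
    def __init__(self, p):
        self.p, self.sqr_count, self.mul_count = p, 0, 0
    def mul(self, a, b):
        self.mul_count += 1
        return a * b % self.p
    def sqr(self, a, k=1):
        for _ in range(k):
            a = a * a % self.p
        self.sqr_count += k
        return a
    def density(self):
        return self.mul_count / self.sqr_count

def invert_25519(f, z):
    """z^(p-2) mod 2^255 - 19 via the usual addition chain."""
    z2 = f.sqr(z)
    z9 = f.mul(z, f.sqr(z2, 2))
    z11 = f.mul(z2, z9)
    x5 = f.mul(z9, f.sqr(z11))             # z^(2^5 - 1)
    x10 = f.mul(f.sqr(x5, 5), x5)          # z^(2^10 - 1)
    x20 = f.mul(f.sqr(x10, 10), x10)
    x40 = f.mul(f.sqr(x20, 20), x20)
    x50 = f.mul(f.sqr(x40, 10), x10)
    x100 = f.mul(f.sqr(x50, 50), x50)
    x200 = f.mul(f.sqr(x100, 100), x100)
    x250 = f.mul(f.sqr(x200, 50), x50)     # z^(2^250 - 1)
    return f.mul(f.sqr(x250, 5), z11)      # z^(2^255 - 21)

def pow_sliding_window(f, base, exp, w=5):
    """Left-to-right sliding-window exponentiation with a table of odd powers."""
    b2 = f.sqr(base)
    table = {1: base}
    for k in range(3, 1 << w, 2):
        table[k] = f.mul(table[k - 2], b2)
    bits = bin(exp)[2:]
    acc, i = None, 0
    while i < len(bits):
        if bits[i] == "0":
            acc = f.sqr(acc)
            i += 1
            continue
        j = min(i + w, len(bits))
        while bits[j - 1] == "0":           # a window must end in a 1 bit
            j -= 1
        window = table[int(bits[i:j], 2)]
        acc = window if acc is None else f.mul(f.sqr(acc, j - i), window)
        i = j
    return acc

special = Ops(P25519)
assert invert_25519(special, 9) == pow(9, P25519 - 2, P25519)

generic = Ops(P25519)
exponent = random.Random(1).getrandbits(255) | (1 << 254) | 1
assert pow_sliding_window(generic, 9, exponent) == pow(9, exponent, P25519)

print(special.mul_count, special.sqr_count, special.density())
print(generic.mul_count, generic.sqr_count, generic.density())
```

The special-prime chain uses 11 multiplies for 254 squarings, a little over 4%. The sliding-window count for the random exponent depends on the window width and the exponent itself, so treat its output as indicative only.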

In combination, these will push the ratio to ~2x at 256 bits, and higher for longer numbers, but probably not so high as 3x even at 500 bits (but maybe at 512 or 521 bits).  Not having implemented both a large random and a large special prime together on the same platform, I’m not sure exactly what the ratio will be.

Cheers,
-- Mike
