Re: [Cfrg] ECC reboot

Michael Hamburg <> Fri, 24 October 2014 03:27 UTC

From: Michael Hamburg <>
Date: Thu, 23 Oct 2014 20:27:03 -0700
To: Phillip Hallam-Baker <>
Subject: Re: [Cfrg] ECC reboot

> On Oct 23, 2014, at 1:52 PM, Phillip Hallam-Baker <> wrote:
> Now if you are arguing that 512 will also break stride on a 512 bit architecture then that is an important data point that by the same logic argues for looking at a curve that does fit and maybe Goldilocks is actually the one. 
> But before conceding that point I would like to see more of an explanation of the optimization being used and how much it delivers. 

I really ought to write this up in a proper paper.  But here’s a sketch.

First of all, many modern implementations use reduced-radix arithmetic.  For example, curve25519-donna uses 5x51-bit words, and ed448-goldilocks uses 8x56-bit words (or 16x28-bit words on 32-bit platforms).  Let’s consider Goldilocks on 64-bit.
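To make "reduced radix" concrete, here is the 8x56-bit layout sketched in Python (the helper names are mine, not from the actual library):

```python
# Reduced-radix layout for the Goldilocks field on 64-bit: 8 limbs of
# 56 bits each, leaving 8 bits of headroom per 64-bit word for carries.
RADIX = 56
LIMBS = 8
MASK = (1 << RADIX) - 1

def to_limbs(x):
    """Split x into 8 limbs of 56 bits, least-significant limb first."""
    return [(x >> (RADIX * i)) & MASK for i in range(LIMBS)]

def from_limbs(limbs):
    """Recombine limbs; limbs may exceed 56 bits before a carry pass."""
    return sum(limb << (RADIX * i) for i, limb in enumerate(limbs))

p = 2**448 - 2**224 - 1          # the Goldilocks prime
x = p - 12345
assert from_limbs(to_limbs(x)) == x
assert all(limb < 2**64 for limb in to_limbs(x))   # fits 64-bit words
```

The point of the unused 8 bits per word is exactly the carry headroom discussed below.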

The multiply routine can accumulate several 112-bit products in a 128-bit register pair without carrying.  This is important because carry handling slows you down even on x86-64 and arm-scalar.  It’s even more important with vector units such as NEON, because carry handling is much slower when you have to exchange elements between vectors.  The 448-bit size is chosen so that the whole multiply routine can run on a 32-bit platform (like NEON) without propagating any carries until the end.  For a 64-bit platform, there’s more headroom and fewer multiplies in that routine, so the next prime up in the family (2^480 - 2^240 - 1, “Ed480-Ridinghood”) could be used instead.  But I wanted to ensure that NEON would be fast, so I went with 448.  Conveniently, at this size the working state for a multiply fits exactly in the NEON register file.
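The no-carry accumulation is easy to check with a back-of-the-envelope bound: under an 8x56-bit limb layout, a schoolbook multiply sums at most eight 56x56-bit products into one column, and even the worst case fits a 128-bit register pair with room to spare.

```python
# Headroom check for accumulating partial products without carrying:
# an 8-limb schoolbook multiply adds at most 8 products of two 56-bit
# limbs into one column accumulator.
LIMB_MAX = 2**56 - 1
worst_column = 8 * LIMB_MAX * LIMB_MAX
assert worst_column < 2**128             # fits a 128-bit register pair
assert worst_column.bit_length() == 115  # 13 bits of headroom left over
```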

Reduced radix also speeds up addition and subtraction, which can now be vectorized.  With too many additions or subtractions, you’ll need to reduce to avoid overflow (conservatively, of course, since you can’t usually check in constant-time software).  But in the “hot” routines like point additions, most numbers go through at most one addition or subtraction before being multiplied.  So they don’t need to be reduced if the multiply routine has a little extra slack.  For Goldilocks, this is true on 64-bit.  On 32-bit, there’s enough slack for an addition in each multiplicand, but not for two subtractions unless the implementation uses signed numbers (which it currently doesn’t).  So it’s on the edge, with a small performance penalty for reductions, and could probably be tuned a bit more.

OK, so that’s one reason that the prime is less than 512 bits: it’s designed to be represented internally in 512 bits, but some of those bits are headroom for carries.  It’s probably the biggest prime that can do this on a 32-bit machine.  (The biggest such “Crandall” prime is DJB's 2^414-17.)

The other main design point is the “golden Solinas” form p = t^2 - t - 1, where t = 2^224.  I’m calling it “golden” (and Goldilocks) because t is the golden ratio mod p.  This is a Solinas trinomial, the sparsest kind of prime except for Mersenne primes like 2^521-1.  The NIST curves taught us a lesson about Solinas primes: they can be fast, but the coefficients are constraining.  Since most of the NIST primes have lots of 32-bit-aligned coefficients, they suck on 64-bit machines and prevent you from reducing the radix to, e.g., 28 bits.
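The “golden” claim is a one-liner to verify with plain integers: t = 2^224 satisfies x^2 = x + 1, the defining equation of the golden ratio, modulo p.

```python
# t = 2^224 is the golden ratio mod p: t^2 ≡ t + 1 (mod p), and as a
# consequence t - 1 is the multiplicative inverse of t.
p = 2**448 - 2**224 - 1
t = 2**224
assert pow(t, 2, p) == (t + 1) % p   # golden-ratio equation mod p
assert (t * (t - 1)) % p == 1        # t * (t - 1) = t^2 - t = 1 mod p
```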

The Goldilocks prime mitigates that problem by having only one coefficient, right in the middle.  So if you are representing your numbers in an even number of words, reduction is really simple.  This makes any word size favorable if it is equal to (full radix) or slightly less than (reduced radix) a divisor of 448/2: [1, 2, 4, 7, 8, 14, 16, 28, 32, 56, 112, 224].  So while it doesn’t work perfectly on every word size, it should work well on full-radix 16, full- or reduced-radix 32, or full-radix 64-bit machines.  Many of the options involve 2^n limbs in reduced-radix mode, which vectorizes really well.
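Here is a sketch of why that single middle coefficient keeps reduction simple: since 2^448 ≡ 2^224 + 1 (mod p), folding the high half of a product back down is just a shift and two adds.  (Plain Python integers stand in for the word-level code; the function name is mine.)

```python
p = 2**448 - 2**224 - 1

def reduce_goldilocks(x):
    """Reduce 0 <= x < p**2 modulo p, using 2^448 ≡ 2^224 + 1 (mod p)."""
    while x >> 448:                    # fold the high part back down
        lo = x & ((1 << 448) - 1)
        hi = x >> 448
        x = lo + (hi << 224) + hi      # hi * 2^448 ≡ hi * (2^224 + 1)
    return x - p if x >= p else x      # final conditional subtract

a = pow(3, 1000, p)
b = pow(5, 1000, p)
assert reduce_goldilocks(a * b) == (a * b) % p
```

In the real code the folds are limb-wise adds with the usual headroom tricks, but the shape is the same: no multi-word multiplications by awkward constants, unlike most of the NIST primes.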

In short, because it’s a Solinas trinomial, p has the second most efficient reduction algorithm after Mersenne primes.  You don’t get quite as many choices of efficient word layout as with a Crandall prime 2^big - small (which has no internal coefficient), but the golden-ratio structure is pretty much the best you can get on a Solinas prime.

There are only a few options for bit lengths of primes of this shape.  The crypto-sized ones are 216, 322, 416, 448, 450 and 480.  Note also that 2^448 - p ~ sqrt(p).  So security issues such as biases from not being really close to a power of 2 are always less than 1/sqrt(p), which is your security target anyway.  Likewise, the curve’s order will look basically the same as for a Crandall prime.
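The size of the gap below 2^448 is easy to quantify: 2^448 - p = 2^224 + 1, which is within a factor of two of sqrt(p), consistent with the bias claim above.

```python
import math

# The Goldilocks prime sits just under 2^448, and the gap is ~ sqrt(p),
# so rejection/bias effects are below the ~2^-224 security target.
p = 2**448 - 2**224 - 1
gap = 2**448 - p
assert gap == 2**224 + 1
assert math.isqrt(p) < gap < 2 * math.isqrt(p)   # gap is ~ sqrt(p)
```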

There’s one more advantage to the golden-ratio structure.  The Goldilocks prime works very well with Karatsuba multiplication.  Recall that Karatsuba computes (a + bt) (c + dt) = ac + bdt^2 + ((a+b)(c+d)-ac-bd)t, where t is either a polynomial term or a multiple of the word size.  So it saves one multiplication at the cost of several additions.  This is usually combined with reduced-radix math so that you don’t have to handle carries in (a+b) and (c+d).

Let t = 2^224, so that t^2 = t+1 mod the Goldilocks prime.  Reducing the above, (a+bt)(c+dt) == ac + bd + ((a+b)(c+d)-ac)t.  That is, reduction mod p makes the Karatsuba formula slightly simpler instead of more complex.
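That simplification is a pure algebraic identity mod p, so it’s easy to sanity-check with plain integers:

```python
# With t^2 ≡ t + 1 (mod p), the bd term migrates into the constant
# coefficient: (a+bt)(c+dt) ≡ ac + bd + ((a+b)(c+d) - ac) t  (mod p).
p = 2**448 - 2**224 - 1
t = 2**224
a, b, c, d = 0xA5A5, 0x5A5A, 0x1234, 0xFEDC    # arbitrary test values
lhs = ((a + b * t) * (c + d * t)) % p
rhs = (a * c + b * d + ((a + b) * (c + d) - a * c) * t) % p
assert lhs == rhs
```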

The products above — ac etc. — will be wider than t, and so will also need to be reduced.  So it’s also important that this formula multiplied by t is also simple:

(a+bt)(c+dt)t = (a+b)(c+d)(1+t) + bdt - ac.

The actual math code uses variants of these formulas depending on the platform, for better locality or pipelining or whatever, but the basic idea is the same.  Since a and b are 4x56-bit numbers, they can be added with a single 4-lane vectorized add; likewise for c and d.  All in all, this Karatsuba technique may or may not be profitable on AVX-512, but it is definitely profitable on x86-64 and NEON.
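Putting the pieces together, here is a toy limb-level version of that multiply: three 4x4 schoolbook blocks (48 limb products instead of 64) combined with the reduced Karatsuba formula.  Python integers stand in for vector registers, the final `% p` stands in for the word-level reduction and carry handling, and all names are illustrative.

```python
# Toy golden Karatsuba for the Goldilocks field, 8x56-bit limbs.
p = 2**448 - 2**224 - 1
R = 56                                 # radix: 8 limbs of 56 bits

def split8(x):
    return [(x >> (R * i)) & ((1 << R) - 1) for i in range(8)]

def mul_4x4(u, v):
    """Schoolbook product of two 4-limb halves: 16 limb multiplies."""
    acc = [0] * 8                      # 128-bit accumulators in real code
    for i in range(4):
        for j in range(4):
            acc[i + j] += u[i] * v[j]
    return sum(w << (R * k) for k, w in enumerate(acc))

def golden_mul(x, y):
    """Compute x*y mod p via (a+bt)(c+dt) ≡ ac + bd + ((a+b)(c+d)-ac)t."""
    lx, ly = split8(x), split8(y)
    a, b = lx[:4], lx[4:]              # x = a + b * 2^224
    c, d = ly[:4], ly[4:]
    ab = [u + v for u, v in zip(a, b)]   # one 4-lane vector add each
    cd = [u + v for u, v in zip(c, d)]
    ac, bd, m = mul_4x4(a, c), mul_4x4(b, d), mul_4x4(ab, cd)
    return (ac + bd + ((m - ac) << 224)) % p

x, y = pow(3, 777, p), pow(5, 777, p)
assert golden_mul(x, y) == (x * y) % p
```

The three `mul_4x4` calls do 48 limb multiplies total, matching the tally that follows.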

So for example, on a 64-bit machine, the Goldilocks field multiply uses approximately:
* 2x4x64-bit vector adds
* 5 widening multiplies and 43 multiply-accumulates
* 5 128-bit additions and 4 128-bit subtractions
* Two final 64-bit adds, for a total of about 54 ADD/SUB and 52 ADC/SBB.  (I’m not looking at the asm, though, so I probably missed some.)
* 10 radix adjustments (mask and shift a 128-bit variable).

The multiplies are easy to interleave at different place values, because carries don’t propagate.  This improves performance noticeably on some platforms.

Without Karatsuba, you’d need at least 64 multiply-accumulates.  Or with packed math at this size, you’d need 49 multiplies (48 accumulates), but the accumulates would be wider, 192 bits, and thus much slower.
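The product counts behind those numbers are simple to reproduce: 8x8 schoolbook in reduced radix, three 4x4 blocks for golden Karatsuba, and 7x7 for packed 64-bit words.

```python
# Limb-product counts for a 448-bit multiply under different layouts.
schoolbook_8x8 = 8 * 8            # reduced radix, no Karatsuba
golden_karatsuba = 3 * (4 * 4)    # ac, bd, (a+b)(c+d): three 4x4 blocks
packed_7x7 = 7 * 7                # 448 / 64 = 7 full 64-bit words
assert schoolbook_8x8 == 64
assert golden_karatsuba == 48     # = 5 widening + 43 multiply-accumulates
assert packed_7x7 == 49
```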

A Goldilocks add takes:
* 2x4x64-bit vector adds
It should be done in a few cycles, maybe less if it’s inlined.

By comparison, a 512-bit packed multiply (Comba, as in MS NUMS) uses:
* 1 widening multiply
* 3 64x64->128-bit multiply-accumulates
* 60 64x64->192-bit multiply-accumulates, if I’ve calculated right 
* 8 small multiply-accumulates (64x64->128 bits) for reduction
* at least one, maybe two rounds of carry propagation after the reduction

This comes out to order of 80 ADD and 140 ADC instructions.  Remember that MUL isn’t that much more expensive than ADC on recent Intel cores, and even on AMD it’s only twice as expensive.  It’s still possible to interleave the multiplications, but it’s trickier because of carry propagation.

Some platforms have faster carry propagation, including ARM scalar (UMAAL) and the upcoming Intel Broadwell (MULX/ADCX/ADOX).  But usually vector units don’t have this, and ARM's NEON vector unit is a huge win — like triple the performance — if you can use it efficiently.

A packed NUMSp512 add takes:
* one add
* 7 adds with carry in a chain
* a multiplication for the reduction
* probably two rounds of carry propagation

This sort of carry propagation is most of why NUMSp512 takes about twice as long as Goldilocks.

Overall, these optimizations give a curve which is 64 bits longer than NUMS p384 with <10% worse performance, and 64 bits shorter than NUMS p512 with about double the performance.  So if you mock up an “efficiency” metric which trades off bit length for speed, Goldilocks scores very well.  Its main competitors in this region would be its 480-bit cousin (but not on 32-bit), or 2^521-1, which I still think is pretty good despite your criticism.

There are other families of fast primes, as outlined in my email on prime performance, but at a size of “between about 384 bits and about 512 bits”, Goldilocks is one of only a few really good ones.

— Mike