Re: [Cfrg] Progress on curve recommendations for TLS WG

Michael Hamburg <> Sat, 16 August 2014 00:55 UTC

Return-Path: <>
Received: from localhost ( []) by (Postfix) with ESMTP id 480011A0926 for <>; Fri, 15 Aug 2014 17:55:32 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: 1.555
X-Spam-Level: *
X-Spam-Status: No, score=1.555 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611, HOST_MISMATCH_NET=0.311, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=no
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id veISdJthHlIN for <>; Fri, 15 Aug 2014 17:55:31 -0700 (PDT)
Received: from ( []) (using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by (Postfix) with ESMTPS id 0A31C1A0924 for <>; Fri, 15 Aug 2014 17:55:30 -0700 (PDT)
Received: from [] (unknown []) by (Postfix) with ESMTPSA id 6154A3AA12; Fri, 15 Aug 2014 17:54:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple;; s=sldo; t=1408150449; bh=iHjRyGpGBTkOqPKFQazEnqC6WtzXliH4vMb53ms6cUo=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From; b=CSQDJ38RjhqGUHwhrY0cyl21VL2sQSAaRMs6sgGDbicol7zGOWbUQGH1trd7mhW30 Gi8VJLPMQtZUa1CERhG4CFTqq7TcOIq7e/uv6iI1GQl6qFsFsOQOxD0Ql5tImS/SuR zUddWk1e90doDHEHg3j7FxuW09lhlHzWRG9R2WO8=
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1971.5\))
From: Michael Hamburg <>
In-Reply-To: <>
Date: Fri, 15 Aug 2014 17:55:27 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <>
References: <> <20140808141506.GA24645@LK-Perkele-VII> <> <> <>
To: Andrey Jivsov <>
X-Mailer: Apple Mail (2.1971.5)
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Sat, 16 Aug 2014 00:55:32 -0000

I have some more data.  Summary: “devastating” is probably between 1.8 and 3.1.

Mod 2^521-1, you can do a multiply in 167 Haswell cycles using a variant of 3-way Chung-Hasan which is specific to that prime.  (Ordinary Karatsuba is no good because the prime has 9 limbs.)  Schoolbook takes 222 cycles with only minor benefits from the prime’s shape.  Maybe a better implementor can get better numbers than mine; this is a rough try in C++ with asm intrinsics for widening multiply.

I don’t have an implementation on hand of a random prime at that level.  Suppose that for a random prime that size you can do an almost-as-good version of Chung-Hasan in 180 cycles, but then you need to fall back to a digit-serial Montgomery reduction which approximates the schoolbook method.  This includes some of the inter-digit reduction twice, so maybe 222 is 10% too high.  Then the ratio is (180 + 222 * 90%) / 167 ~ 2.3.  A random 512-bit prime also requires this amount of time if you use 9 limbs; with 8 limbs it’s again about the same or maybe slightly worse (see below).

For comparison, based on the Microsoft data, their prime 2^512 - 569 takes about 255 Sandy Bridge cycles per mul.  This is written in assembly language using the schoolbook method.  This corresponds to about 212 Haswell cycles, since the going rate is about 20%.  So it’s 5% faster than my P-521 schoolbook, but in assembly vs C++/intrinsics.  The ratio with my above estimate for a random prime (using a completely different implementation) is 1.8.  Schoolbook Montgomery for a random prime using Microsoft’s implementation techniques should have a ratio of about (2n^2 + n) / (n^2 + n) ~ 1.9 with n=8 limbs.

This assumes a random prime vs special one at the “>= 256-bit WF” level, but if tweaking the levels is allowed, the ratio is even more.  A 480-bit special prime, Ridinghood, takes 122 Haswell cycles for a multiply (same as 448-bit Goldilocks).  That’s a ratio of 3.1 vs the above estimate for a 512-bit random prime.  If you scale the numbers to take the shorter length (480 vs 512) into account, the “efficiency ratio” is about 2.8.

The Ridinghood numbers reflect better tuning than the 521-bit tests, but probably not as much tuning as the Microsoft implementation.  My tests are in C with asm intrinsics.  The difference is maybe 10% vs the C++ above, so figure that a ratio of 2.8 (efficiency ratio 2.5) is probably more fair.

This doesn’t take into account the ratio of multiplication, squaring, and addition / subtraction.

> On Aug 15, 2014, at 3:52 PM, Andrey Jivsov <>; wrote:
> Here is a simple sketch how the prime affects the mop operations.
> Comparing multiplication of 2 n-limb numbers with the reduction mod n-limb number p:
> * in case of a random prime, we have n^2 multiplies. This gives 2n number, n higher limbs of which will need to be reduced. To accomplish this we multiply p by each of n high limbs, for the total n^2. We have 2n^2 multiplies
> * in case of p=2^k-C, we have the same n^2 multiplies, but the reduction results in only n multiplies of high limbs by C. We have n^2 + n
> So this is (n^2 + n) / 2n^2 ~= 1/2 (pseudo-Mersenne primes have ~50% fewer limb multiplies for multiply+reduce)
> This estimate underestimates  the reduction that takes place for the general p. One needs an additional n divisions (of a limp by the highest limb of p) in denominator, which gives the Mike Hamburg's quantity exactly (which uses the Montgomery reduction method). I didn't count additions/subtractions.

These divisions are actually multiplies, because you use either Montgomery or Barrett reduction.  So it’s (n^2 + n) / (2n^2 + n).

— Mike