Re: [Cfrg] Progress on curve recommendations for TLS WG

Andrey Jivsov <> Sun, 17 August 2014 06:56 UTC

Return-Path: <>
Received: from localhost ( []) by (Postfix) with ESMTP id 289361A0742 for <>; Sat, 16 Aug 2014 23:56:10 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -1.901
X-Spam-Status: No, score=-1.901 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, SPF_PASS=-0.001] autolearn=ham
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id WLPRQW6PfUGL for <>; Sat, 16 Aug 2014 23:56:08 -0700 (PDT)
Received: from ( [IPv6:2001:558:fe2d:44:76:96:27:228]) by (Postfix) with ESMTP id A72461A0740 for <>; Sat, 16 Aug 2014 23:56:08 -0700 (PDT)
Received: from ([]) by with comcast id fivQ1o0011bwxycAFiw8bk; Sun, 17 Aug 2014 06:56:08 +0000
Received: from [] ([]) by with comcast id fiw71o0034uhcbK8eiw7Dy; Sun, 17 Aug 2014 06:56:08 +0000
Message-ID: <>
Date: Sat, 16 Aug 2014 23:56:07 -0700
From: Andrey Jivsov <>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.7.0
MIME-Version: 1.0
To: Michael Hamburg <>
References: <> <20140808141506.GA24645@LK-Perkele-VII> <> <> <> <>
In-Reply-To: <>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;; s=q20140121; t=1408258568; bh=x/k6GkRpuqOxEP5hXP1kIrJVzn0vRzyR3xytsNL31c8=; h=Received:Received:Message-ID:Date:From:MIME-Version:To:Subject: Content-Type; b=L25JNSTzWVrFaRCfq0tPPFXzj1Yeiahh6htLFqgzeI9lzE2i3CKPFmsH5wqIqkIA5 ne7PbVpANwk5UI0MUORhvakrFzgMLkf4WzTu4cszxzcSxnS1L6PIEH1ExSLP9GJHfW WesaSk743/KNdNmJkYx7biJlhTD8wD2LYxHBeAGdC8ZjJ7JvT49e2lhokY8FlyVfe0 Y34x/1+eAD2AJ+EyksVT79HCfDmswQKpF1V3dmPyIYxQQiIzeqjB7ml4g5rTSm0mLa tXjJFafNr9RZ+YYUSe5W3bD9eMI0fyJpl/unq5agTEBh108NI4mHRVz3E0lknCVHCu vvJ//+rS60stg==
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Sun, 17 Aug 2014 06:56:10 -0000

I looked at my implementation of modp reduction I did a few years ago 
for a fixed p. I used Barrett reduction.

At closer look I don't see why multiply+reduction mod p, for a random 
but fixed p, is 2n^2 + n. I got a corrected value for Barrett reduction 

On 08/15/2014 05:55 PM, Michael Hamburg wrote:
> I have some more data.  Summary: “devastating” is probably between 1.8 and 3.1.
> Mod 2^521-1, you can do a multiply in 167 Haswell cycles using a variant of 3-way Chung-Hasan which is specific to that prime.  (Ordinary Karatsuba is no good because the prime has 9 limbs.)  Schoolbook takes 222 cycles with only minor benefits from the prime’s shape.  Maybe a better implementor can get better numbers than mine; this is a rough try in C++ with asm intrinsics for widening multiply.
> I don’t have an implementation on hand of a random prime at that level.  Suppose that for a random prime that size you can do an almost-as-good version of Chung-Hasan in 180 cycles, but then you need to fall back to a digit-serial Montgomery reduction which approximates the schoolbook method.  This includes some of the inter-digit reduction twice, so maybe 222 is 10% too high.  Then the ratio is (180 + 222 * 90%) / 167 ~ 2.3.  A random 512-bit prime also requires this amount of time if you use 9 limbs; with 8 limbs it’s again about the same or maybe slightly worse (see below).
> For comparison, based on the Microsoft data, their prime 2^512 - 569 takes about 255 Sandy Bridge cycles per mul.  This is written in assembly language using the schoolbook method.  This corresponds to about 212 Haswell cycles, since the going rate is about 20%.  So it’s 5% faster than my P-521 schoolbook, but in assembly vs C++/intrinsics.  The ratio with my above estimate for a random prime (using a completely different implementation) is 1.8.  Schoolbook Montgomery for a random prime using Microsoft’s implementation techniques should have a ratio of about (2n^2 + n) / (n^2 + n) ~ 1.9 with n=8 limbs.
> This assumes a random prime vs special one at the “>= 256-bit WF” level, but if tweaking the levels is allowed, the ratio is even more.  A 480-bit special prime, Ridinghood, takes 122 Haswell cycles for a multiply (same as 448-bit Goldilocks).  That’s a ratio of 3.1 vs the above estimate for a 512-bit random prime.  If you scale the numbers to take the shorter length (480 vs 512) into account, the “efficiency ratio” is about 2.8.
> The Ridinghood numbers reflect better tuning than the 521-bit tests, but probably not as much tuning as the Microsoft implementation.  My tests are in C with asm intrinsics.  The difference is maybe 10% vs the C++ above, so figure that a ratio of 2.8 (efficiency ratio 2.5) is probably more fair.
> This doesn’t take into account the ratio of multiplication, squaring, and addition / subtraction.
>> On Aug 15, 2014, at 3:52 PM, Andrey Jivsov <>; wrote:
>> Here is a simple sketch how the prime affects the mop operations.
>> Comparing multiplication of 2 n-limb numbers with the reduction mod n-limb number p:
>> * in case of a random prime, we have n^2 multiplies. This gives 2n number, n higher limbs of which will need to be reduced. To accomplish this we multiply p by each of n high limbs, for the total n^2. We have 2n^2 multiplies
Ignoring the fact that some operations in Barrett reduction involve +1, 
-1 of border limbs, Barrett has two multiplies with the following 

* n^2 of n higher limbs (of the 2n-limb number we are reducing) by the 
n-limb precomputed quantity (b^2n)/p, where b is max limb value+1, e.g. 
* it computes another quantity t*p mod b^(n+1), which is roughly n^2/2 
(while you estimate it as n).

The total then is ~ 2.5n^2 v.s. n^2+n for pseudo-Mersenne.

>> * in case of p=2^k-C, we have the same n^2 multiplies, but the reduction results in only n multiplies of high limbs by C. We have n^2 + n
>> So this is (n^2 + n) / 2n^2 ~= 1/2 (pseudo-Mersenne primes have ~50% fewer limb multiplies for multiply+reduce)

(n^2+n)/2.5n^2 ~= 2/5,

pseudo-Mersenne primes use ~40% multiplies.

For 256-bit primes n is likely too low to observe this. n=4, n^2/2 = 8 and I ignored extra limbs.

>> This estimate underestimates  the reduction that takes place for the general p. One needs an additional n divisions (of a limp by the highest limb of p) in denominator, which gives the Mike Hamburg's quantity exactly (which uses the Montgomery reduction method). I didn't count additions/subtractions.
> These divisions are actually multiplies, because you use either Montgomery or Barrett reduction.  So it’s (n^2 + n) / (2n^2 + n).
> Cheers,
> — Mike