Return-Path:
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1])
by ietfa.amsl.com (Postfix) with ESMTP id 480011A0926
for ; Fri, 15 Aug 2014 17:55:32 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 1.555
X-Spam-Level: *
X-Spam-Status: No, score=1.555 tagged_above=-999 required=5
tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765,
FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611,
HOST_MISMATCH_NET=0.311, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001,
SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44])
by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id veISdJthHlIN for ;
Fri, 15 Aug 2014 17:55:31 -0700 (PDT)
Received: from aspartame.shiftleft.org
(199-116-74-168-v301.PUBLIC.monkeybrains.net [199.116.74.168])
(using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits))
(No client certificate requested)
by ietfa.amsl.com (Postfix) with ESMTPS id 0A31C1A0924
for ; Fri, 15 Aug 2014 17:55:30 -0700 (PDT)
Received: from [10.184.148.249] (unknown [209.36.6.242])
by aspartame.shiftleft.org (Postfix) with ESMTPSA id 6154A3AA12;
Fri, 15 Aug 2014 17:54:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shiftleft.org; s=sldo;
t=1408150449; bh=iHjRyGpGBTkOqPKFQazEnqC6WtzXliH4vMb53ms6cUo=;
h=Subject:From:In-Reply-To:Date:Cc:References:To:From;
b=CSQDJ38RjhqGUHwhrY0cyl21VL2sQSAaRMs6sgGDbicol7zGOWbUQGH1trd7mhW30
Gi8VJLPMQtZUa1CERhG4CFTqq7TcOIq7e/uv6iI1GQl6qFsFsOQOxD0Ql5tImS/SuR
zUddWk1e90doDHEHg3j7FxuW09lhlHzWRG9R2WO8=
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1971.5\))
From: Michael Hamburg
In-Reply-To: <53EE8F45.9070908@brainhub.org>
Date: Fri, 15 Aug 2014 17:55:27 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id:
References:
<20140808141506.GA24645@LK-Perkele-VII> <53EDEE67.6090308@secunet.com>
<53EE8F45.9070908@brainhub.org>
To: Andrey Jivsov
X-Mailer: Apple Mail (2.1971.5)
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/uknR0WIRcgffzkJ7AUODlZamCqg
Cc: cfrg@irtf.org
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 16 Aug 2014 00:55:32 -0000
I have some more data. Summary: =E2=80=9Cdevastating=E2=80=9D is =
probably between 1.8 and 3.1.
Mod 2^521-1, you can do a multiply in 167 Haswell cycles using a variant =
of 3-way Chung-Hasan which is specific to that prime. (Ordinary =
Karatsuba is no good because the prime has 9 limbs.) Schoolbook takes =
222 cycles with only minor benefits from the prime=E2=80=99s shape. =
Maybe a better implementor can get better numbers than mine; this is a =
rough try in C++ with asm intrinsics for widening multiply.
I don=E2=80=99t have an implementation on hand of a random prime at that =
level. Suppose that for a random prime that size you can do an =
almost-as-good version of Chung-Hasan in 180 cycles, but then you need =
to fall back to a digit-serial Montgomery reduction which approximates =
the schoolbook method. This includes some of the inter-digit reduction =
twice, so maybe 222 is 10% too high. Then the ratio is (180 + 222 * =
90%) / 167 ~ 2.3. A random 512-bit prime also requires this amount of =
time if you use 9 limbs; with 8 limbs it=E2=80=99s again about the same =
or maybe slightly worse (see below).
For comparison, based on the Microsoft data, their prime 2^512 - 569 =
takes about 255 Sandy Bridge cycles per mul. This is written in =
assembly language using the schoolbook method. This corresponds to =
about 212 Haswell cycles, since the going rate is about 20%. So it=E2=80=99=
s 5% faster than my P-521 schoolbook, but in assembly vs C++/intrinsics. =
The ratio with my above estimate for a random prime (using a completely =
different implementation) is 1.8. Schoolbook Montgomery for a random =
prime using Microsoft=E2=80=99s implementation techniques should have a =
ratio of about (2n^2 + n) / (n^2 + n) ~ 1.9 with n=3D8 limbs.
This assumes a random prime vs special one at the =E2=80=9C>=3D 256-bit =
WF=E2=80=9D level, but if tweaking the levels is allowed, the ratio is =
even more. A 480-bit special prime, Ridinghood, takes 122 Haswell =
cycles for a multiply (same as 448-bit Goldilocks). That=E2=80=99s a =
ratio of 3.1 vs the above estimate for a 512-bit random prime. If you =
scale the numbers to take the shorter length (480 vs 512) into account, =
the =E2=80=9Cefficiency ratio=E2=80=9D is about 2.8.
The Ridinghood numbers reflect better tuning than the 521-bit tests, but =
probably not as much tuning as the Microsoft implementation. My tests =
are in C with asm intrinsics. The difference is maybe 10% vs the C++ =
above, so figure that a ratio of 2.8 (efficiency ratio 2.5) is probably =
more fair.
This doesn=E2=80=99t take into account the ratio of multiplication, =
squaring, and addition / subtraction.
> On Aug 15, 2014, at 3:52 PM, Andrey Jivsov =
wrote:
>=20
> Here is a simple sketch how the prime affects the mop operations.
>=20
> Comparing multiplication of 2 n-limb numbers with the reduction mod =
n-limb number p:
>=20
> * in case of a random prime, we have n^2 multiplies. This gives 2n =
number, n higher limbs of which will need to be reduced. To accomplish =
this we multiply p by each of n high limbs, for the total n^2. We have =
2n^2 multiplies
>=20
> * in case of p=3D2^k-C, we have the same n^2 multiplies, but the =
reduction results in only n multiplies of high limbs by C. We have n^2 + =
n
>=20
> So this is (n^2 + n) / 2n^2 ~=3D 1/2 (pseudo-Mersenne primes have ~50% =
fewer limb multiplies for multiply+reduce)
>=20
>=20
> This estimate underestimates the reduction that takes place for the =
general p. One needs an additional n divisions (of a limp by the highest =
limb of p) in denominator, which gives the Mike Hamburg's quantity =
exactly (which uses the Montgomery reduction method). I didn't count =
additions/subtractions.
These divisions are actually multiplies, because you use either =
Montgomery or Barrett reduction. So it=E2=80=99s (n^2 + n) / (2n^2 + =
n).
Cheers,
=E2=80=94 Mike=