Re: [Cfrg] Progress on curve recommendations for TLS WG

Mike Hamburg <> Fri, 15 August 2014 21:28 UTC

Return-Path: <>
Received: from localhost ( []) by (Postfix) with ESMTP id 98C871A06CD for <>; Fri, 15 Aug 2014 14:28:55 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: 3.454
X-Spam-Level: ***
X-Spam-Status: No, score=3.454 tagged_above=-999 required=5 tests=[BAYES_20=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611, HOST_MISMATCH_NET=0.311, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=no
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id hovVM2BABXeu for <>; Fri, 15 Aug 2014 14:28:54 -0700 (PDT)
Received: from ( []) (using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by (Postfix) with ESMTPS id 598051A065F for <>; Fri, 15 Aug 2014 14:28:54 -0700 (PDT)
Received: from [] (unknown []) by (Postfix) with ESMTPSA id 4DE5F3AA12; Fri, 15 Aug 2014 14:27:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple;; s=sldo; t=1408138051; bh=zQcsZP1lapI64oVz21XLNIsDRNXVRtOVi5AI7ZSfkf0=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From; b=SWKeHPcSUh6i0O+3PF78RWFoAt89QdZpJaFPZeJLRtNJbMNhHMTc7Te2esaP2ROZp XEOMNQM7+ztm4iRFwArmyMxHOioLhtVGu3JcOzfoaS91lE8LA723gr2mFjneMJbErl TjtsPoBONLXmHVO/6bWtD1WS5v0T8/V+waT2ePSo=
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1971.5\))
From: Mike Hamburg <>
In-Reply-To: <>
Date: Fri, 15 Aug 2014 14:28:46 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <>
References: <> <20140808141506.GA24645@LK-Perkele-VII> <>
To: Johannes Merkle <>, Ilari Liusvaara <>, "Paterson, Kenny" <>
X-Mailer: Apple Mail (2.1971.5)
Cc: "" <>
Subject: Re: [Cfrg] Progress on curve recommendations for TLS WG
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Fri, 15 Aug 2014 21:28:55 -0000

On 8/15/2014 4:26 AM, Johannes Merkle wrote:
> Ilari Liusvaara wrote on 08.08.2014 16:15:
>> Picking prime randomly has _devastating_ impact on performance. 
> I do not mean to doubt that statement, I'm just looking for hard numbers: Do you have any reference for a proper
> analysis supporting that? I mean something beyond "oh, that's common sense".
So, I don't have a proper reference, but in my experience the performance ratio is slightly more than 2, and probably creeps towards 2.5 or 3 as the number of bits increases towards 512.

An n-digit multiply with no reduction algorithm costs n^2 muls, schoolbook.  A Montgomery reduction costs n^2+n additional multiplies, for a total of 2n^2 + n.  There may also be a conditional subtract, depending on the modulus.  A special prime still requires some reduction, and Montgomery ends up being slightly less than twice as expensive as schoolbook.

Supporting numbers: a few years ago I implemented 252-bit special-Montgomery primes on Sandy Bridge (mod the special prime, reduction isn't free but costs n=4 mulaccs), and a multiply took about 55 cycles, vs about 100 cycles for a 252-bit general prime.  The algorithms were the same, and both implementations were written in assembly.  This is what you'd have predicted from the above guesses, since (n^2 + n) / (2n^2 + n) = 20 / 36 = 55.555...%.

Squaring costed about 10 cycles less in either case, for a flat ratio of 90:45 = 2:1.  This is because Montgomery reduction does not go any faster when you square, which takes n(n+1)/2 multiplies schoolbook style, but 3x that with a Montgomery reduction.  These asymptotics are not even close to true with n=4, though.

As p increases, the ratio will get worse.  The reduction mod the special prime is a smaller fraction of the total work, and squaring gets closer to asymptotic.  Montgomery reduction does not improve as much with Karatsuba multiplication.  Karatsuba may cut the cost of the multiplication by 25% or more at 384 or 512 bits, but it’s rather tricky to combine with a general Montgomery reduction efficiently.  Furthermore, if you try to do this, the state will be larger and may spill out of the register file.  For example, my 448-bit NEON code exactly uses the entire register file with no spilling; anything extra would add significant cost.

Exactly 64n-bit aligned primes will be slightly worse when randomized, because numbers need to be reduced more often.

Finally, powering ladders like inverse or square root are denser and therefore slower with a random prime  A powering ladder for a special prime usually has a multiply density of around 5%, but for a random prime it is likely to be 20%.  Inversion is about 10% of a variable-base and 25% of a fixed-base scalarmul, so this effect is unlikely to exceed 5% of the total time.

In combination, these will push the ratio to ~2x at 256 bits, and higher for longer numbers, but probably not so high as 3x even at 500 bits (but maybe at 512 or 521 bits).  Not having implemented both a large random and a large special prime together on the same platform, I’m not sure exactly what the ratio will be.

-- Mike