[Cfrg] Benchmarks: 384 vs 389 vs Goldilocks vs ... on Haswell

Mike Hamburg <mike@shiftleft.org> Mon, 29 December 2014 23:14 UTC

Return-Path: <mike@shiftleft.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com []) by ietfa.amsl.com (Postfix) with ESMTP id C327D1A90FB for <cfrg@ietfa.amsl.com>; Mon, 29 Dec 2014 15:14:19 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 4.256
X-Spam-Level: ****
X-Spam-Status: No, score=4.256 tagged_above=-999 required=5 tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611, HOST_MISMATCH_NET=0.311, HTML_MESSAGE=0.001, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([]) by localhost (ietfa.amsl.com []) (amavisd-new, port 10024) with ESMTP id Z_J6liUcXYGu for <cfrg@ietfa.amsl.com>; Mon, 29 Dec 2014 15:14:18 -0800 (PST)
Received: from aspartame.shiftleft.org (199-116-74-168-v301.PUBLIC.monkeybrains.net []) (using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id DF7851A9030 for <cfrg@irtf.org>; Mon, 29 Dec 2014 15:14:17 -0800 (PST)
Received: from [] (unknown []) by aspartame.shiftleft.org (Postfix) with ESMTPSA id 37C033AA12 for <cfrg@irtf.org>; Mon, 29 Dec 2014 15:12:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shiftleft.org; s=sldo; t=1419894734; bh=QGb7nhxPrlCSFZ0BzZGGxZe17Vvs3RTcEJw99F0j7wk=; h=Date:From:To:Subject:From; b=cO79EvTrestu+xtgIRhyDPfAMN0ayrcbMsqqgAAmSs2KmAriE1Cu3CCVM4LMLin2g FCWH5ssCl0iUCJQBl8HgC+zLUh8ncpf3klXpbj9EjxBRSk6x1ZZI+jBn3mh2WqM60f pSxUA5kuOPaiLMVLCDrn6PMLjN4nQXN3NaJvO3Kk=
Message-ID: <54A1E049.9000404@shiftleft.org>
Date: Mon, 29 Dec 2014 15:14:17 -0800
From: Mike Hamburg <mike@shiftleft.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: "cfrg@irtf.org" <cfrg@irtf.org>
Content-Type: multipart/alternative; boundary="------------090807080305020704050406"
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/AhaM4XitN8bwh_a2vxklBsqNkXo
Subject: [Cfrg] Benchmarks: 384 vs 389 vs Goldilocks vs ... on Haswell
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Dec 2014 23:14:20 -0000

Hi all,

The chairs requested that I implement and benchmark 2^389 - 21, 
suggested by Ilari.

I've benchmarked several primes that have been discussed on this list.  
I used a Haswell machine running Linux test everything, including MSR 
ECCLib 1.1.  Cycle counts for mul/sqr:

MSR ECCLib (all asm, but no BMI2):
P256-mers: M = 59.2, S = 47.8.
P384-mers: M = 117.5, S = 89.8.
P512-mers: M = 199.9, S = 140.6.

My code (C with BMI2 mul+acc intrinsics):
P389: M = 112.5, S = 75.5.
Goldilocks P448: M = 118.0, S = 88.9.
Ridinghood P480: M = 118.2, S = 89.1.
NIST P521: M = 145.5, S = 110.7.

With this machine/compiler combination, 2^389 - 21 is showing at about 
8-9% faster than MSR 384-mers and Goldilocks, which are within 0.1% of 
each other.

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz .

Ubuntu clang version 3.5-1ubuntu1 (trunk) (based on LLVM 3.5)
Target: x86_64-pc-linux-gnu
Thread model: posix


* All benchmarks computed by timing and multiplying by 3.5932GHz, since 
that's what calib() spat out.

* The P389 code is mul-after-add clean.  I did this by multiplying the 
arguments to a multiply separately by 8 and 21.  The code without this 
feature is marginally (<1%) faster.

* I haven't tested the correctness of the P389 code, nor have I built it 
into a curve.  So sue me, I'm on vacation.

* ECCLib, P389*, P448: 10^8 trials, noinline'd functions, 
32-byte-aligned stack buffers, no dependencies in the multiplications.  
TurboBoost and HyperThreading disabled. Benchmarks are stable to about 3 
significant figures.

* P480, P521, because I'm lazy: Goldilocks `make bench`, 10^8/2 trials, 
noinlined functions, 32-byte-aligned stack buffers, no dependencies.  
Almost exactly consistent when benchmarking P448.

* For MSR ECCLib I called directly into ecc384.o's functions, using the 
same benchmarking framework code as for P389.

* Earlier, offlist testing on a Haswell MacBook Air showed about a 2% 
advantage to 2^389 - 21 instead of 8%.  The code improvements to P389 
since then are not sufficient to explain the better results: they are 
apparently machine/compiler dependent.  I thought that Clang 3.5 might 
be responsible, since I've seen performance differences in arithmetic 
code with it.  But clang-3.3, 3.4 and 3.5 all give similar results for 
those primes.  GCC 4.9 gives slightly worse results for everything but 
the MSR asm code.

* The P389 code is probably less tuned than its competitors. The MS code 
is written in pure asm, but it doesn't take advantage of BMI2. The 
Goldilocks and P389 code are C with an asm multiply/accumulate 
intrinsic, but I spent more time tuning the Goldilocks code.

* The Ridinghood code is the same as the Goldilocks code with different 
shift constants.  The P-521 code is the arch_x86_64_r12 version, using 9 
limbs vector-aligned to a 12-word structure.

* I have no data about 2^389 - 21 on other platforms, but offlist 
discussions suggest that it will be OK, if perhaps annoying to 
implement, on 32-bit platforms.

-- Mike