[Cfrg] Publicly verifiable benchmarks

"D. J. Bernstein" <djb@cr.yp.to> Fri, 10 October 2014 07:30 UTC

Date: Fri, 10 Oct 2014 07:18:47 -0000
Message-ID: <20141010071847.9478.qmail@cr.yp.to>
From: "D. J. Bernstein" <djb@cr.yp.to>
To: cfrg@irtf.org
Mail-Followup-To: cfrg@irtf.org
In-Reply-To: <2FBC676C3BBFBB4AA82945763B361DE608F1D021@MX17A.corp.emc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/1XIdOEmR5GezJkytrKcysLIj4Ks
Subject: [Cfrg] Publicly verifiable benchmarks
Precedence: list

Parkinson, Sean writes:
> Making a decision on new elliptic curves based on data that hasn't
> been corroborated by a 3rd party is bad practice.

More than 1500 implementations of various cryptographic functions have
been contributed to, and are publicly available as part of, the
state-of-the-art SUPERCOP benchmarking framework:

   http://bench.cr.yp.to/supercop.html

Practically all of the implementations are free to use, and many of them
are in fact widely used. Most of the researchers producing speed records
for cryptography are contributing their software to the same system.

The eBACS benchmarking project systematically and reproducibly measures
these implementations on more than 20 different microarchitectures:

   http://bench.cr.yp.to/computers.html

It's easy for other people to download and run the benchmarks on their
favorite CPU, and to contribute the results to the central system, where
detailed speed reports are posted publicly for other people to see and
verify. eBACS has practically eliminated the measurement errors and
disputes that plagued previous approaches to cryptographic benchmarking.
eBASH, the hash component of eBACS, was mentioned 30 times in NIST's
final report on the SHA-3 competition.

As a concrete example, now that Mike has sent crypto_dh/ed448goldilocks
in for benchmarking, eBACS is automatically filling lines into

   http://bench.cr.yp.to/results-dh.html
   http://bench.cr.yp.to/impl-dh/ed448goldilocks.html

whenever machines finish benchmark runs: 529066 cycles on titan0, 689020
cycles on hydra8, 757676 cycles on h6sandy, etc. These don't exactly
confirm Mike's comparisons to the Sandy Bridge numbers that Microsoft
claimed in http://eprint.iacr.org/2014/130.pdf---although they do seem
adequate to support his point about ed448goldilocks hitting a sweet spot
on the security/speed curve while Microsoft's design strategy
compromises the security/speed tradeoff:

   * ed448goldilocks isn't quite twice as fast as numsp512t1
     (ed-512-mers): 757676 cycles vs. 1293000 cycles.

   * ed448goldilocks is about 23% slower than numsp384t1 (ed-384-mers):
     757676 cycles vs. 617000 cycles.

Of course, if Mike or anyone else thinks that ed448goldilocks can be
computed more efficiently, he's welcome to prove it by contributing a
better implementation of that function to SUPERCOP, and then the
benchmarks will be updated appropriately. He can also raise reasonable
questions about the accuracy of Microsoft's claims; if Microsoft's
numbers are actually correct then Microsoft can dispel the skepticism
by contributing their own code to SUPERCOP.

As a more detailed example of reproducibility, let's look at what the
benchmarks say about X25519 on Haswell. Checking

   http://bench.cr.yp.to/results-dh.html

we see a median of 145907 cycles (quartiles: 144894 and 147191 cycles)
for the crypto_dh/curve25519 software on an Intel Xeon E3-1275 V3.
Clicking on "titan0" shows more information: the best speeds found for
crypto_scalarmult/curve25519 on this machine used

   gcc-4.8.1 -m64 -O -march=native -mtune=native -fomit-frame-pointer

to compile the "amd64-51" implementation. Anyone can use the same free
implementation with the same free compiler and will obtain the same
compiled code running in the same number of Haswell cycles:

   wget https://hyperelliptic.org/ebats/supercop-20140924.tar.bz2
   tar -xf supercop-20140924.tar.bz2
   cd supercop-20140924
   # compile and measure everything: nohup sh data-do &
   # alternatively, extract X25519 as follows:
   mkdir x25519
   cp measure-anything.c x25519
   cp crypto_scalarmult/measure.c x25519
   cp crypto_scalarmult/curve25519/amd64-51/* x25519
   cp include/randombytes.h x25519
   cp cpucycles/amd64cpuinfo.h x25519/cpucycles.h
   cp cpucycles/amd64cpuinfo.c x25519/cpucycles.c
   cp cpucycles/osfreq.c x25519/osfreq.c
   cd x25519
   ( sed s/CRYPTO_/crypto_scalarmult_/ < api.h
     echo '#define crypto_scalarmult_IMPLEMENTATION "amd64-51"'
     echo '#define crypto_scalarmult_VERSION "-"'
   ) > crypto_scalarmult.h
   echo 'static const char cpuid[] = {0};' > cpuid.h
   gcc -m64 -O -march=native -mtune=native -fomit-frame-pointer \
   -D COMPILER='"gcc"' \
   -D LOOPS=1 \
   -o measure measure-anything.c measure.c cpucycles.c \
   mont* fe*.c *.s
   ./measure

For example, on one core of Andrey's 3.4GHz i7-4770 (Haswell), this
X25519 code will take the same ~146000 cycles: i.e., more than 23000
operations/second, whereas the latest Haswell-optimized OpenSSL NIST
P-256 ECDH code that he measured was only 15000 operations/second.

This is, by the way, rather old Curve25519 code optimized for Nehalem,
the microarchitecture of the first Core i7 CPUs in 2008---but on Intel's
latest Haswell CPUs it's still solidly beating NIST P-256 code that's
optimized for Haswell. There's ample literature explaining that

   * reductions mod 2^255-19 are faster than reductions mod
     2^256-2^224+2^192+2^96-1 on a broad range of platforms, and that

   * Montgomery scalarmult is faster than Weierstrass scalarmult,

so the performance gap is unsurprising.

Why did Andrey report only 17289 operations/second for X25519 on
Haswell? The answer, in a nutshell, is that there's an active ecosystem
of Curve25519/X25519/Ed25519 implementations, and it's easy to find
implementations that prioritize simplicity over speed---including the
one implementation included in Andrey's manual benchmarks. Of course,
any application developer who needs more speed will look for, and find,
the faster X25519 implementations.

---Dan

[Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Yoav Nir
Re: [Cfrg] When's the decision? Stephen Farrell
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? David Jacobson
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Michael Hamburg
Re: [Cfrg] When's the decision? David Jacobson
Re: [Cfrg] When's the decision? D. J. Bernstein
[Cfrg] Publicly verifiable benchmarks D. J. Bernstein
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Mike Hamburg
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Phillip Hallam-Baker
Re: [Cfrg] When's the decision? Mike Hamburg
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] Publicly verifiable benchmarks David Jacobson
Re: [Cfrg] Publicly verifiable benchmarks Michael Hamburg
Re: [Cfrg] Publicly verifiable benchmarks Andrey Jivsov
Re: [Cfrg] Publicly verifiable benchmarks Watson Ladd
Re: [Cfrg] Publicly verifiable benchmarks Parkinson, Sean
Re: [Cfrg] Publicly verifiable benchmarks D. J. Bernstein
Re: [Cfrg] Publicly verifiable benchmarks Michael Hamburg
[Cfrg] Constant-time implementations D. J. Bernstein
Re: [Cfrg] Constant-time implementations David Jacobson
Re: [Cfrg] Constant-time implementations Adam Langley
Re: [Cfrg] Constant-time implementations Yoav Nir
Re: [Cfrg] Constant-time implementations Watson Ladd
Re: [Cfrg] Constant-time implementations Mike Hamburg
Re: [Cfrg] When's the decision? Paterson, Kenny
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Ilari Liusvaara
Re: [Cfrg] When's the decision? Yoav Nir
[Cfrg] ed448goldilocks vs. numsp384t1 and numsp51… D. J. Bernstein
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Ilari Liusvaara
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Michael Hamburg
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Ilari Liusvaara
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Michael Hamburg