Re: [Cfrg] Publicly verifiable benchmarks

Andrey Jivsov <crypto@brainhub.org> Sun, 12 October 2014 02:54 UTC

Message-ID: <5439ED77.3010701@brainhub.org>
Date: Sat, 11 Oct 2014 19:54:47 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: cfrg@irtf.org
References: <20141010071847.9478.qmail@cr.yp.to>
In-Reply-To: <20141010071847.9478.qmail@cr.yp.to>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/wTV6zTfn22q-rD78EkJqYUlbbr4
Subject: Re: [Cfrg] Publicly verifiable benchmarks
Precedence: list

On 10/10/2014 12:18 AM, D. J. Bernstein wrote:
> Parkinson, Sean writes:
>> Making a decision on new elliptic curves based on data that hasn't
>> been corroborated by a 3rd party is bad practice.
>
> More than 1500 implementations of various cryptographic functions have
> been contributed to, and are publicly available as part of, the
> state-of-the-art SUPERCOP benchmarking framework:
>
>     http://bench.cr.yp.to/supercop.html
>
> Practically all of the implementations are free to use, and many of them
> are in fact widely used. Most of the researchers producing speed records
> for cryptography are contributing their software to the same system.
>
> The eBACS benchmarking project systematically and reproducibly measures
> these implementations on more than 20 different microarchitectures:
>
>     http://bench.cr.yp.to/computers.html
>
> It's easy for other people to download and run the benchmarks on their
> favorite CPU, and to contribute the results to the central system, where
> detailed speed reports are posted publicly for other people to see and
> verify. eBACS has practically eliminated the measurement errors and
> disputes that plagued previous approaches to cryptographic benchmarking.
> eBASH, the hash component of eBACS, was mentioned 30 times in NIST's
> final report on the SHA-3 competition.
>
> As a concrete example, now that Mike has sent crypto_dh/ed448goldilocks
> in for benchmarking, eBACS is automatically filling lines into
>
>     http://bench.cr.yp.to/results-dh.html
>     http://bench.cr.yp.to/impl-dh/ed448goldilocks.html
>
> whenever machines finish benchmark runs: 529066 cycles on titan0, 689020
> cycles on hydra8, 757676 cycles on h6sandy, etc. These don't exactly
> confirm Mike's comparisons to the Sandy Bridge numbers that Microsoft
> claimed in http://eprint.iacr.org/2014/130.pdf---although they do seem
> adequate to support his point about ed448goldilocks hitting a sweet spot
> on the security/speed curve while Microsoft's design strategy
> compromises the security/speed tradeoff:
>
>     * ed448goldilocks isn't quite twice as fast as numsp512t1
>       (ed-512-mers): 757676 cycles vs. 1293000 cycles.
>
>     * ed448goldilocks is about 23% slower than numsp384t1 (ed-384-mers):
>       757676 cycles vs. 617000 cycles.
>
> Of course, if Mike or anyone else thinks that ed448goldilocks can be
> computed more efficiently, he's welcome to prove it by contributing a
> better implementation of that function to SUPERCOP, and then the
> benchmarks will be updated appropriately. He can also raise reasonable
> questions about the accuracy of Microsoft's claims; if Microsoft's
> numbers are actually correct then Microsoft can dispel the skepticism
> by contributing their own code to SUPERCOP.
>
> As a more detailed example of reproducibility, let's look at what the
> benchmarks say about X25519 on Haswell. Checking
>
>     http://bench.cr.yp.to/results-dh.html
>
> we see a median of 145907 cycles (quartiles: 144894 and 147191 cycles)
> for the crypto_dh/curve25519 software on an Intel Xeon E3-1275 V3.
> Clicking on "titan0" shows more information: the best speeds found for
> crypto_scalarmult/curve25519 on this machine used
>
>     gcc-4.8.1 -m64 -O -march=native -mtune=native -fomit-frame-pointer
>
> to compile the "amd64-51" implementation. Anyone can use the same free
> implementation with the same free compiler and will obtain the same
> compiled code running in the same number of Haswell cycles:
>
>     wget https://hyperelliptic.org/ebats/supercop-20140924.tar.bz2
>     tar -xf supercop-20140924.tar.bz2
>     cd supercop-20140924
>     # compile and measure everything: nohup sh data-do &
>     # alternatively, extract X25519 as follows:
>     mkdir x25519
>     cp measure-anything.c x25519
>     cp crypto_scalarmult/measure.c x25519
>     cp crypto_scalarmult/curve25519/amd64-51/* x25519
>     cp include/randombytes.h x25519
>     cp cpucycles/amd64cpuinfo.h x25519/cpucycles.h
>     cp cpucycles/amd64cpuinfo.c x25519/cpucycles.c
>     cp cpucycles/osfreq.c x25519/osfreq.c
>     cd x25519
>     ( sed s/CRYPTO_/crypto_scalarmult_/ < api.h
>       echo '#define crypto_scalarmult_IMPLEMENTATION "amd64-51"'
>       echo '#define crypto_scalarmult_VERSION "-"'
>     ) > crypto_scalarmult.h
>     echo 'static const char cpuid[] = {0};' > cpuid.h
>     gcc -m64 -O -march=native -mtune=native -fomit-frame-pointer \
>     -D COMPILER='"gcc"' \
>     -D LOOPS=1 \
>     -o measure measure-anything.c measure.c cpucycles.c \
>     mont* fe*.c *.s
>     ./measure


Thank you for this information. I added timing code and built the 
"measure" program in the x25519 directory (see above) on

    Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz, Ivy Bridge (not Haswell).

3 results are in op/sec:

above x25519 variable base : openssl : curve25519-donna:

18167.7 : 11030.5 : 14098.2

18167.7/11030.5 = 65%

The ./measure reported 182112 cycles for variable base before and after 
my changes (which looks correct: 3300000000 Hz/182112=18120 sec). 
Specifically, I timed the crypto_scalarmult().

I won't have access to my Haswell machine for a few days.

>
> For example, on one core of Andrey's 3.4GHz i7-4770 (Haswell), this
> X25519 code will take the same ~146000 cycles: i.e., more than 23000
> operations/second, whereas the latest Haswell-optimized OpenSSL NIST
> P-256 ECDH code that he measured was only 15000 operations/second.
>
> This is, by the way, rather old Curve25519 code optimized for Nehalem,
> the microarchitecture of the first Core i7 CPUs in 2008---but on Intel's
> latest Haswell CPUs it's still solidly beating NIST P-256 code that's
> optimized for Haswell. There's ample literature explaining that
>
>     * reductions mod 2^255-19 are faster than reductions mod
>       2^256-2^224+2^192+2^96-1 on a broad range of platforms, and that
>
>     * Montgomery scalarmult is faster than Weierstrass scalarmult,
>
> so the performance gap is unsurprising.
>
> Why did Andrey report only 17289 operations/second for X25519 on
> Haswell? The answer, in a nutshell, is that there's an active ecosystem
> of Curve25519/X25519/Ed25519 implementations, and it's easy to find
> implementations that prioritize simplicity over speed---including the
> one implementation included in Andrey's manual benchmarks. Of course,
> any application developer who needs more speed will look for, and find,
> the faster X25519 implementations.
>
> ---Dan

It's natural for a developer, the consumer of these open source 
libraries, to test-drive the speed tests the way I did. If the sequence 
of steps needed to run the tests is clear and short, the results are 
easily verifiable.

In case of openssl the results depends on:
* the exact version of the source code
* the exact version of the compiler and assembler tools
* the options given to the config, configure
* CPU

OpenSSL will autodetect the gcc and build the binary by excluding or 
including certain highly-optimized pieces without any relevant reports. 
Some beneficial configuration options are not always the defaults. Not 
surprisingly, the openssl that comes with the latest OS I test on, 
Fedora Core 20, is 3+ times slower than what I report above.

I find that the quickest way for me to confirm that I am timing the 
right code is to use the debugger. For example, I confirmed that 
./measure is spending time in 
crypto_scalarmult_curve25519_amd64_51_ladderstep in a hand-written 
assembler file.

http://bench.cr.yp.to looks like an excellent way to try code on 
platforms one doesn't own.

[Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Yoav Nir
Re: [Cfrg] When's the decision? Stephen Farrell
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? David Jacobson
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Michael Hamburg
Re: [Cfrg] When's the decision? David Jacobson
Re: [Cfrg] When's the decision? D. J. Bernstein
[Cfrg] Publicly verifiable benchmarks D. J. Bernstein
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Watson Ladd
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Mike Hamburg
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Phillip Hallam-Baker
Re: [Cfrg] When's the decision? Mike Hamburg
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] Publicly verifiable benchmarks David Jacobson
Re: [Cfrg] Publicly verifiable benchmarks Michael Hamburg
Re: [Cfrg] Publicly verifiable benchmarks Andrey Jivsov
Re: [Cfrg] Publicly verifiable benchmarks Watson Ladd
Re: [Cfrg] Publicly verifiable benchmarks Parkinson, Sean
Re: [Cfrg] Publicly verifiable benchmarks D. J. Bernstein
Re: [Cfrg] Publicly verifiable benchmarks Michael Hamburg
[Cfrg] Constant-time implementations D. J. Bernstein
Re: [Cfrg] Constant-time implementations David Jacobson
Re: [Cfrg] Constant-time implementations Adam Langley
Re: [Cfrg] Constant-time implementations Yoav Nir
Re: [Cfrg] Constant-time implementations Watson Ladd
Re: [Cfrg] Constant-time implementations Mike Hamburg
Re: [Cfrg] When's the decision? Paterson, Kenny
Re: [Cfrg] When's the decision? Parkinson, Sean
Re: [Cfrg] When's the decision? Ilari Liusvaara
Re: [Cfrg] When's the decision? Yoav Nir
[Cfrg] ed448goldilocks vs. numsp384t1 and numsp51… D. J. Bernstein
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Ilari Liusvaara
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Michael Hamburg
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Ilari Liusvaara
Re: [Cfrg] ed448goldilocks vs. numsp384t1 and num… Michael Hamburg