Re: [Cfrg] Publicly verifiable benchmarks

Andrey Jivsov <crypto@brainhub.org> Sun, 12 October 2014 02:54 UTC

Return-Path: <crypto@brainhub.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CA46F1A0A6B for <cfrg@ietfa.amsl.com>; Sat, 11 Oct 2014 19:54:51 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.301
X-Spam-Level:
X-Spam-Status: No, score=-1.301 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, J_CHICKENPOX_15=0.6, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id maKLxgZYnlZU for <cfrg@ietfa.amsl.com>; Sat, 11 Oct 2014 19:54:49 -0700 (PDT)
Received: from resqmta-po-11v.sys.comcast.net (resqmta-po-11v.sys.comcast.net [IPv6:2001:558:fe16:19:96:114:154:170]) (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 07D751A19E4 for <cfrg@irtf.org>; Sat, 11 Oct 2014 19:54:48 -0700 (PDT)
Received: from resomta-po-14v.sys.comcast.net ([96.114.154.238]) by resqmta-po-11v.sys.comcast.net with comcast id 22uo1p00158ss0Y012uoZt; Sun, 12 Oct 2014 02:54:48 +0000
Received: from [192.168.1.2] ([71.202.164.227]) by resomta-po-14v.sys.comcast.net with comcast id 22un1p00E4uhcbK012unqH; Sun, 12 Oct 2014 02:54:48 +0000
Message-ID: <5439ED77.3010701@brainhub.org>
Date: Sat, 11 Oct 2014 19:54:47 -0700
From: Andrey Jivsov <crypto@brainhub.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: cfrg@irtf.org
References: <20141010071847.9478.qmail@cr.yp.to>
In-Reply-To: <20141010071847.9478.qmail@cr.yp.to>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: 7bit
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1413082488; bh=4RXdlie6oXfQjdDs56DIU0xPyGRYdJe6OSAU2dGk6yg=; h=Received:Received:Message-ID:Date:From:MIME-Version:To:Subject: Content-Type; b=FLglQdaTtdlFqqJ8aI/T7drHwPCbw+S1VghHrZz6jWq/8jNv9KL1mH1vcyVBwj+Sg PBMGILA7FfgpnzNM9wQF6ftxWQmRckx13444gBSfA1vxK6gF82Gc1wNOLnNFMzx/0K 4wKtPa9BmdZMoU/892G6aBRB/c+lnpVQPw+IC+qHqIQ4/vMwesAiC5RPwfyOBAQ4ew 1ib/Ugt9eoYbf4ULL+dZg3vp6WxtPIHsXCRHWT1SR1HWvW2khGgWlDZmpu0Wbv/kw3 IaFUsVlLtVp/iipvJRn5ImV4Qpa1zkvTI/56PemNrFed2GdKFk3Q+wVYhgzp/U17EY H9wU+H7fQm02A==
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/wTV6zTfn22q-rD78EkJqYUlbbr4
Subject: Re: [Cfrg] Publicly verifiable benchmarks
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Sun, 12 Oct 2014 02:54:52 -0000

On 10/10/2014 12:18 AM, D. J. Bernstein wrote:
> Parkinson, Sean writes:
>> Making a decision on new elliptic curves based on data that hasn't
>> been corroborated by a 3rd party is bad practice.
>
> More than 1500 implementations of various cryptographic functions have
> been contributed to, and are publicly available as part of, the
> state-of-the-art SUPERCOP benchmarking framework:
>
>     http://bench.cr.yp.to/supercop.html
>
> Practically all of the implementations are free to use, and many of them
> are in fact widely used. Most of the researchers producing speed records
> for cryptography are contributing their software to the same system.
>
> The eBACS benchmarking project systematically and reproducibly measures
> these implementations on more than 20 different microarchitectures:
>
>     http://bench.cr.yp.to/computers.html
>
> It's easy for other people to download and run the benchmarks on their
> favorite CPU, and to contribute the results to the central system, where
> detailed speed reports are posted publicly for other people to see and
> verify. eBACS has practically eliminated the measurement errors and
> disputes that plagued previous approaches to cryptographic benchmarking.
> eBASH, the hash component of eBACS, was mentioned 30 times in NIST's
> final report on the SHA-3 competition.
>
> As a concrete example, now that Mike has sent crypto_dh/ed448goldilocks
> in for benchmarking, eBACS is automatically filling lines into
>
>     http://bench.cr.yp.to/results-dh.html
>     http://bench.cr.yp.to/impl-dh/ed448goldilocks.html
>
> whenever machines finish benchmark runs: 529066 cycles on titan0, 689020
> cycles on hydra8, 757676 cycles on h6sandy, etc. These don't exactly
> confirm Mike's comparisons to the Sandy Bridge numbers that Microsoft
> claimed in http://eprint.iacr.org/2014/130.pdf---although they do seem
> adequate to support his point about ed448goldilocks hitting a sweet spot
> on the security/speed curve while Microsoft's design strategy
> compromises the security/speed tradeoff:
>
>     * ed448goldilocks isn't quite twice as fast as numsp512t1
>       (ed-512-mers): 757676 cycles vs. 1293000 cycles.
>
>     * ed448goldilocks is about 23% slower than numsp384t1 (ed-384-mers):
>       757676 cycles vs. 617000 cycles.
>
> Of course, if Mike or anyone else thinks that ed448goldilocks can be
> computed more efficiently, he's welcome to prove it by contributing a
> better implementation of that function to SUPERCOP, and then the
> benchmarks will be updated appropriately. He can also raise reasonable
> questions about the accuracy of Microsoft's claims; if Microsoft's
> numbers are actually correct then Microsoft can dispel the skepticism
> by contributing their own code to SUPERCOP.
>
> As a more detailed example of reproducibility, let's look at what the
> benchmarks say about X25519 on Haswell. Checking
>
>     http://bench.cr.yp.to/results-dh.html
>
> we see a median of 145907 cycles (quartiles: 144894 and 147191 cycles)
> for the crypto_dh/curve25519 software on an Intel Xeon E3-1275 V3.
> Clicking on "titan0" shows more information: the best speeds found for
> crypto_scalarmult/curve25519 on this machine used
>
>     gcc-4.8.1 -m64 -O -march=native -mtune=native -fomit-frame-pointer
>
> to compile the "amd64-51" implementation. Anyone can use the same free
> implementation with the same free compiler and will obtain the same
> compiled code running in the same number of Haswell cycles:
>
>     wget https://hyperelliptic.org/ebats/supercop-20140924.tar.bz2
>     tar -xf supercop-20140924.tar.bz2
>     cd supercop-20140924
>     # compile and measure everything: nohup sh data-do &
>     # alternatively, extract X25519 as follows:
>     mkdir x25519
>     cp measure-anything.c x25519
>     cp crypto_scalarmult/measure.c x25519
>     cp crypto_scalarmult/curve25519/amd64-51/* x25519
>     cp include/randombytes.h x25519
>     cp cpucycles/amd64cpuinfo.h x25519/cpucycles.h
>     cp cpucycles/amd64cpuinfo.c x25519/cpucycles.c
>     cp cpucycles/osfreq.c x25519/osfreq.c
>     cd x25519
>     ( sed s/CRYPTO_/crypto_scalarmult_/ < api.h
>       echo '#define crypto_scalarmult_IMPLEMENTATION "amd64-51"'
>       echo '#define crypto_scalarmult_VERSION "-"'
>     ) > crypto_scalarmult.h
>     echo 'static const char cpuid[] = {0};' > cpuid.h
>     gcc -m64 -O -march=native -mtune=native -fomit-frame-pointer \
>     -D COMPILER='"gcc"' \
>     -D LOOPS=1 \
>     -o measure measure-anything.c measure.c cpucycles.c \
>     mont* fe*.c *.s
>     ./measure


Thank you for this information. I added timing code and built the 
"measure" program in the x25519 directory (see above) on

    Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz, Ivy Bridge (not Haswell).

3 results are in op/sec:

above x25519 variable base : openssl : curve25519-donna:

18167.7 : 11030.5 : 14098.2

18167.7/11030.5 = 65%

The ./measure reported 182112 cycles for variable base before and after 
my changes (which looks correct: 3300000000 Hz/182112=18120 sec). 
Specifically, I timed the crypto_scalarmult().

I won't have access to my Haswell machine for a few days.

>
> For example, on one core of Andrey's 3.4GHz i7-4770 (Haswell), this
> X25519 code will take the same ~146000 cycles: i.e., more than 23000
> operations/second, whereas the latest Haswell-optimized OpenSSL NIST
> P-256 ECDH code that he measured was only 15000 operations/second.
>
> This is, by the way, rather old Curve25519 code optimized for Nehalem,
> the microarchitecture of the first Core i7 CPUs in 2008---but on Intel's
> latest Haswell CPUs it's still solidly beating NIST P-256 code that's
> optimized for Haswell. There's ample literature explaining that
>
>     * reductions mod 2^255-19 are faster than reductions mod
>       2^256-2^224+2^192+2^96-1 on a broad range of platforms, and that
>
>     * Montgomery scalarmult is faster than Weierstrass scalarmult,
>
> so the performance gap is unsurprising.
>
> Why did Andrey report only 17289 operations/second for X25519 on
> Haswell? The answer, in a nutshell, is that there's an active ecosystem
> of Curve25519/X25519/Ed25519 implementations, and it's easy to find
> implementations that prioritize simplicity over speed---including the
> one implementation included in Andrey's manual benchmarks. Of course,
> any application developer who needs more speed will look for, and find,
> the faster X25519 implementations.
>
> ---Dan

It's natural for a developer, the consumer of these open source 
libraries, to test-drive the speed tests the way I did. If the sequence 
of steps needed to run the tests is clear and short, the results are 
easily verifiable.

In case of openssl the results depends on:
* the exact version of the source code
* the exact version of the compiler and assembler tools
* the options given to the config, configure
* CPU

OpenSSL will autodetect the gcc and build the binary by excluding or 
including certain highly-optimized pieces without any relevant reports. 
Some beneficial configuration options are not always the defaults. Not 
surprisingly, the openssl that comes with the latest OS I test on, 
Fedora Core 20, is 3+ times slower than what I report above.

I find that the quickest way for me to confirm that I am timing the 
right code is to use the debugger. For example, I confirmed that 
./measure is spending time in 
crypto_scalarmult_curve25519_amd64_51_ladderstep in a hand-written 
assembler file.

http://bench.cr.yp.to looks like an excellent way to try code on 
platforms one doesn't own.