Re: [Cfrg] Benchmarks: 384 vs 389 vs Goldilocks vs ... on Haswell

"Paterson, Kenny" <Kenny.Paterson@rhul.ac.uk> Wed, 31 December 2014 20:14 UTC

From: "Paterson, Kenny" <Kenny.Paterson@rhul.ac.uk>
To: Mike Hamburg <mike@shiftleft.org>, "cfrg@irtf.org" <cfrg@irtf.org>
Date: Wed, 31 Dec 2014 20:01:43 +0000
Message-ID: <D0CA0568.3B27A%kenny.paterson@rhul.ac.uk>
References: <54A1E049.9000404@shiftleft.org>
In-Reply-To: <54A1E049.9000404@shiftleft.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/qtHfxMcKscIXv1l_B-_4KXBaI28

Hi Mike,

Thanks a lot for doing this work, especially over the holiday season -
it's much appreciated.

My take-away from your work is that there's no strong reason (in
performance terms) to prefer P389 over P384-mers or vice-versa - the 8-9%
difference is there, yes, but could easily disappear or increase with
further optimisations on either side.

Does that seem fair to you? Please feel free to give an alternative
interpretation if you like.

Regards,

Kenny 

On 29/12/2014 23:14, "Mike Hamburg" <mike@shiftleft.org> wrote:

>
>Hi all, 
>
>The chairs requested that I implement and benchmark 2^389 - 21, suggested
>by Ilari.
>
>I've benchmarked several primes that have been discussed on this list.  I
>used a Haswell machine running Linux to test everything, including MSR
>ECCLib 1.1.  Cycle counts for mul/sqr:
>
>MSR ECCLib (all asm, but no BMI2):
>P256-mers: M = 59.2, S = 47.8.
>P384-mers: M = 117.5, S = 89.8.
>P512-mers: M = 199.9, S = 140.6.
>
>My code (C with BMI2 mul+acc intrinsics):
>P389: M = 112.5, S = 75.5.
>Goldilocks P448: M = 118.0, S = 88.9.
>Ridinghood P480: M = 118.2, S = 89.1.
>NIST P521: M = 145.5, S = 110.7.
>
>With this machine/compiler combination, 2^389 - 21 is showing at about
>8-9% faster than MSR 384-mers and Goldilocks, which are within 0.1% of
>each other.
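
As a rough cross-check of the percentages quoted above, here is a minimal Python sketch that recomputes the relative gaps from the listed cycle counts. The message does not say how M and S were combined into a single figure, so the M:S weights below are purely an assumption for illustration, and the function names are hypothetical.

```python
# Cycle counts copied from the quoted benchmark: (mul M, sqr S).
CYCLES = {
    "P384-mers": (117.5, 89.8),
    "P389": (112.5, 75.5),
    "Goldilocks P448": (118.0, 88.9),
}


def cost(name, muls=1.0, sqrs=1.0):
    """Weighted cost in cycles. The M:S weights are an assumption,
    not something stated in the message."""
    m, s = CYCLES[name]
    return muls * m + sqrs * s


def speedup_percent(fast, slow, muls=1.0, sqrs=1.0):
    """How much cheaper `fast` is than `slow`, as a percentage of `slow`."""
    return 100.0 * (1.0 - cost(fast, muls, sqrs) / cost(slow, muls, sqrs))


if __name__ == "__main__":
    # Mul only, sqr only, and two hypothetical mixes.
    for muls, sqrs in [(1, 0), (0, 1), (1, 1), (2, 1)]:
        gap = speedup_percent("P389", "P384-mers", muls, sqrs)
        print(f"M:S = {muls}:{sqrs}  P389 vs P384-mers: {gap:.1f}%")
```

Taken separately, the gap is much smaller on multiplies than on squarings, so any single headline percentage depends on the mix of operations a curve implementation actually performs.
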
>
>CPU:
>Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>
>Compiler:
>Ubuntu clang version 3.5-1ubuntu1 (trunk) (based on LLVM 3.5)
>Target: x86_64-pc-linux-gnu
>Thread model: posix
>
>Notes: 
>
>* All benchmarks computed by timing and multiplying by 3.5932GHz, since
>that's what calib() spat out.
>
>
>* The P389 code is mul-after-add clean.  I did this by multiplying the
>arguments to a multiply separately by 8 and 21.  The code without this
>feature is marginally (<1%) faster.
>
>* I haven't tested the correctness of the P389 code, nor have I built it
>into a curve.  So sue me, I'm on vacation.
>
>* ECCLib, P389*, P448: 10^8 trials, noinline'd functions, 32-byte-aligned
>stack buffers, no dependencies in the multiplications.  TurboBoost and
>HyperThreading disabled.  Benchmarks are stable to about 3 significant
>figures.
>
>
>* P480, P521, because I'm lazy: Goldilocks `make bench`, 10^8/2 trials,
>noinline'd functions, 32-byte-aligned stack buffers, no dependencies.
>Results are almost exactly consistent with the above when benchmarking
>P448.
>
>* For MSR ECCLib I called directly into ecc384.o's functions, using the
>same benchmarking framework code as for P389.
>
>* Earlier, offlist testing on a Haswell MacBook Air showed about a 2%
>advantage to 2^389 - 21 instead of 8%.  The code improvements to P389
>since then are not sufficient to explain the better results: they are
>apparently machine/compiler dependent.  I thought that Clang 3.5 might be
>responsible, since I've seen performance differences in arithmetic code
>with it.  But clang-3.3, 3.4 and 3.5 all give similar results for those
>primes.  GCC 4.9 gives slightly worse results for everything but the MSR
>asm code.
>
>* The P389 code is probably less tuned than its competitors. The MS code
>is written in pure asm, but it doesn't take advantage of BMI2.  The
>Goldilocks and P389 code are C with an asm multiply/accumulate intrinsic,
>but I spent more time tuning the Goldilocks code.
>
>* The Ridinghood code is the same as the Goldilocks code with different
>shift constants.  The P-521 code is the arch_x86_64_r12 version, using 9
>limbs vector-aligned to a 12-word structure.
>
>* I have no data about 2^389 - 21 on other platforms, but offlist
>discussions suggest that it will be OK, if perhaps annoying to implement,
>on 32-bit platforms.
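
For readers who want to see the two mechanics the notes rely on, here is a minimal Python sketch of (1) reduction modulo 2^389 - 21 by folding the high bits back in, using 2^389 = 21 (mod p), and (2) converting a wall-clock timing into a cycle count by multiplying by the calibrated clock rate. This is not the benchmarked C/BMI2 code: it uses Python big integers rather than limbs, it does not model the mul-after-add handling, and all function names are hypothetical. Interpreter overhead dominates any timing taken this way, so the cycle figure it yields is not comparable to the numbers in the message.

```python
import time

P389 = 2**389 - 21          # the prime discussed in the message
CALIBRATED_HZ = 3.5932e9    # clock rate reported in the notes


def reduce_p389(x):
    """Reduce a non-negative integer mod 2^389 - 21.

    Since 2^389 = 21 (mod p), the bits above position 389 can be
    multiplied by 21 and added back to the low 389 bits.
    """
    mask = (1 << 389) - 1
    while x >> 389:
        x = (x & mask) + 21 * (x >> 389)
    # x is now below 2^389; at most one subtraction remains.
    return x - P389 if x >= P389 else x


def mul_p389(a, b):
    """Field multiplication: full product, then folding reduction."""
    return reduce_p389(a * b)


def cycles_per_op(fn, trials, hz=CALIBRATED_HZ):
    """Time `trials` calls of `fn` and scale seconds per call by `hz`."""
    start = time.perf_counter()
    for _ in range(trials):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / trials * hz


if __name__ == "__main__":
    a, b = P389 - 3, P389 - 5
    assert mul_p389(a, b) == (a * b) % P389
    print(cycles_per_op(lambda: mul_p389(a, b), 100_000))
```
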
>
>Cheers, 
>-- Mike
>
>