Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1
Mike Hamburg <mike@shiftleft.org> Tue, 16 December 2014 04:12 UTC
Return-Path: <mike@shiftleft.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A688D1A0379 for <cfrg@ietfa.amsl.com>; Mon, 15 Dec 2014 20:12:13 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 2.955
X-Spam-Level: **
X-Spam-Status: No, score=2.955 tagged_above=-999 required=5 tests=[BAYES_05=-0.5, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611, HOST_MISMATCH_NET=0.311, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id wafxpr0Qq9TC for <cfrg@ietfa.amsl.com>; Mon, 15 Dec 2014 20:12:12 -0800 (PST)
Received: from aspartame.shiftleft.org (199-116-74-168-v301.PUBLIC.monkeybrains.net [199.116.74.168]) (using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id DE4881A0378 for <cfrg@irtf.org>; Mon, 15 Dec 2014 20:12:11 -0800 (PST)
Received: from [192.168.1.102] (unknown [192.168.1.1]) by aspartame.shiftleft.org (Postfix) with ESMTPSA id B64143AA12; Mon, 15 Dec 2014 20:11:00 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shiftleft.org; s=sldo; t=1418703060; bh=Z/hhz9fo5LWFT0tppsON2FR0euMO6d84UhnC9KOuMd0=; h=Date:From:To:Subject:References:In-Reply-To:From; b=jPhe/dZ1X1B0R/qjTCB1DRUWqziSM/Kqu1c7EQx5LydXnWdl+sZXKWj2Q6Gp9xI4y nkXeSuz8NygQ+4ji1vS0fpdAlZul2DCQ5xV/ow9dDXZumkYFhgqBRs0rXJyIHgt1hi IQ0ZZ2yL9Ug/f6iKaYqeD/sss+GbLlhE03mCfhnU=
Message-ID: <548FB118.3090402@shiftleft.org>
Date: Mon, 15 Dec 2014 20:12:08 -0800
From: Mike Hamburg <mike@shiftleft.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Watson Ladd <watsonbladd@gmail.com>, "cfrg@irtf.org" <cfrg@irtf.org>
References: <CACsn0cm=f3pJQXybLsMVT5cozrZkJv9N4GFT1m2uychzPvsAeg@mail.gmail.com>
In-Reply-To: <CACsn0cm=f3pJQXybLsMVT5cozrZkJv9N4GFT1m2uychzPvsAeg@mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/4m5VGX7-Az72RgMUFgiK4xA3ZMI
Subject: Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Dec 2014 04:12:13 -0000
Hi Watson, I mostly agree with your conclusions, but a few adjustments: Your numbers on Goldilocks don't reflect how I've implemented it. So far I've always used 1 level of Karatsuba, with 192 multiplies on 32-bit platforms, and 48 multiplies on 64-bit platforms. Because Karatsuba results in extra additions and extra complexity, this may be more expensive than a 49-mul delayed-carry schoolbook algorithm, but not by much because the reduction is included in the Karatsuba algorithm for Goldilocks. This isn't just a matter of limb count: the prime 2^448 - 2^224 - 1 was picked specifically because it has fast reduction integrated with Karatsuba multiplication. So maybe 10%, or maybe no difference. My NEON 1-level Karatsuba (192 muls, i.e. 96 pairs) takes about as long as DJB et al's 2-level Karatsuba (144 muls). My guess is that the gains from the second level are mostly eaten up by overhead, but it's just a guess since we used different strategies and different primes, and neither team compared 1-level vs 2-level directly AFAIK. It may merit an experiment eventually, but NEON is a pain to code for because the compiler support isn't very good. I would expect 2^389-21 to support some sort of fractional-radix technique with 16 words on 32-bit platforms. If you want a quick and dirty performance test, just implement multiplication and squaring mod 2^389-21 and compare it to Goldilocks' `make bench`, which includes time in nanoseconds for square and multiply. Squaring will be different because Karatsuba is less of a win for squaring algorithms. Cheers, -- Mike On 12/15/2014 06:56 PM, Watson Ladd wrote: > Dear all, > > I've come up with some back of the envelope sketches of the > performance differences for these primes. Note that there aren't hard > measurements yet: I'm only counting limbs for various choices of > radix, and these radix choices might not be themselves optimal. These > counts don't include the cost of reduction, which is at worst an > additional multiplication of each limb by a constant plus an addition > with carry. They don't include additions. In these calculations I've > permitted that addition to overflow, but not the preceding > multiplications. I've also not used signed arithmetic, which can make > things slightly better. > > On a 32 bit machine with 32x32->64 multiplier, but no 64 bit adder: > > 2^389-21 can use 15 26 bit words. The product can be computed without > carries, but the final reduction requires a carry chain. Schoolboy > costs 225 multiplications. Karatsuba will cost 140 mults. The best I > can do without using a hardware carry is a much worse 18 23 bit words, > which will cost 198 multiplies with two uses of Karatsuba. There may > be a better way, and I didn't investigate overflow/addition tradeoff. > > 2^414-17 as described in http://eprint.iacr.org/2014/526.pdf uses 16 > franctional radix words, and two stages of Karatsuba for 144 mults. > > 2^448-2^224-1 uses a special method for multiplication which I don't > have multiplication counts for. I believe 16 28 bit words is used in > the current implementation, and so 144 multiplications. > > On a 64 bit machine with 64x64->128 multiplier: > > 2^389-21 can use 7 56 bit limbs. Here there is plenty of headroom, and > so no carrying in the reduction is required. > > 2^414-17 could use 8 56 bit limbs, with no carrying in the reduction > required. It could use 7 60 bit limbs, if the hardware carry is used > in the reduction. > > 2^448-2^224-1 is clearly using 8 56 bit limbs: it matches exactly. > > At these sizes schoolboy is the right choice: 49 vs 64 multiplications. > > Please double-check this. I'll try to verify this in more detail, > including implementing a few strategies. I know that the results I've > listed for some 2^414-17 and 2^448-2^224-1 are not optimal, and would > appreciate correction. > > The tentative conclusion is that the performance advantages of > 2^389-21 vs 2^448-2^224-1 are likely to be minimal on 32-bit machines, > and about 10% on 64 bit machines. As always, the believable results > require implementations, which I don't think I will make in time for > our deadline. > > Sincerely, > Watson Ladd > > _______________________________________________ > Cfrg mailing list > Cfrg@irtf.org > http://www.irtf.org/mailman/listinfo/cfrg
- [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1 Watson Ladd
- Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1 Mike Hamburg