Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1

Mike Hamburg <mike@shiftleft.org> Tue, 16 December 2014 04:12 UTC

Return-Path: <mike@shiftleft.org>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A688D1A0379 for <cfrg@ietfa.amsl.com>; Mon, 15 Dec 2014 20:12:13 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 2.955
X-Spam-Level: **
X-Spam-Status: No, score=2.955 tagged_above=-999 required=5 tests=[BAYES_05=-0.5, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_MISMATCH_ORG=0.611, HOST_MISMATCH_NET=0.311, RDNS_DYNAMIC=0.982, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id wafxpr0Qq9TC for <cfrg@ietfa.amsl.com>; Mon, 15 Dec 2014 20:12:12 -0800 (PST)
Received: from aspartame.shiftleft.org (199-116-74-168-v301.PUBLIC.monkeybrains.net [199.116.74.168]) (using TLSv1.1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id DE4881A0378 for <cfrg@irtf.org>; Mon, 15 Dec 2014 20:12:11 -0800 (PST)
Received: from [192.168.1.102] (unknown [192.168.1.1]) by aspartame.shiftleft.org (Postfix) with ESMTPSA id B64143AA12; Mon, 15 Dec 2014 20:11:00 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shiftleft.org; s=sldo; t=1418703060; bh=Z/hhz9fo5LWFT0tppsON2FR0euMO6d84UhnC9KOuMd0=; h=Date:From:To:Subject:References:In-Reply-To:From; b=jPhe/dZ1X1B0R/qjTCB1DRUWqziSM/Kqu1c7EQx5LydXnWdl+sZXKWj2Q6Gp9xI4y nkXeSuz8NygQ+4ji1vS0fpdAlZul2DCQ5xV/ow9dDXZumkYFhgqBRs0rXJyIHgt1hi IQ0ZZ2yL9Ug/f6iKaYqeD/sss+GbLlhE03mCfhnU=
Message-ID: <548FB118.3090402@shiftleft.org>
Date: Mon, 15 Dec 2014 20:12:08 -0800
From: Mike Hamburg <mike@shiftleft.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Watson Ladd <watsonbladd@gmail.com>, "cfrg@irtf.org" <cfrg@irtf.org>
References: <CACsn0cm=f3pJQXybLsMVT5cozrZkJv9N4GFT1m2uychzPvsAeg@mail.gmail.com>
In-Reply-To: <CACsn0cm=f3pJQXybLsMVT5cozrZkJv9N4GFT1m2uychzPvsAeg@mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/cfrg/4m5VGX7-Az72RgMUFgiK4xA3ZMI
Subject: Re: [Cfrg] 2^389-21 vs 2^414-17 vs 2^448-2^224-1
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Dec 2014 04:12:13 -0000

Hi Watson,

I mostly agree with your conclusions, but a few adjustments:

Your numbers on Goldilocks don't reflect how I've implemented it. So far 
I've always used 1 level of Karatsuba, with 192 multiplies on 32-bit 
platforms, and 48 multiplies on 64-bit platforms.  Because Karatsuba 
results in extra additions and extra complexity, this may be more 
expensive than a 49-mul delayed-carry schoolbook algorithm, but not by 
much because the reduction is included in the Karatsuba algorithm for 
Goldilocks.  This isn't just a matter of limb count: the prime 2^448 - 
2^224 - 1 was picked specifically because it has fast reduction 
integrated with Karatsuba multiplication.  So maybe 10%, or maybe no 
difference.

My NEON 1-level Karatsuba (192 muls, i.e. 96 pairs) takes about as long 
as DJB et al's 2-level Karatsuba (144 muls).  My guess is that the gains 
from the second level are mostly eaten up by overhead, but it's just a 
guess since we used different strategies and different primes, and 
neither team compared 1-level vs 2-level directly AFAIK.  It may merit 
an experiment eventually, but NEON is a pain to code for because the 
compiler support isn't very good.

I would expect 2^389-21 to support some sort of fractional-radix 
technique with 16 words on 32-bit platforms.

If you want a quick and dirty performance test, just implement 
multiplication and squaring mod 2^389-21 and compare it to Goldilocks' 
`make bench`, which includes time in nanoseconds for square and 
multiply.  Squaring will be different because Karatsuba is less of a win 
for squaring algorithms.

Cheers,
-- Mike

On 12/15/2014 06:56 PM, Watson Ladd wrote:
> Dear all,
>
> I've come up with some back of the envelope sketches of the
> performance differences for these primes. Note that there aren't hard
> measurements yet: I'm only counting limbs for various choices of
> radix, and these radix choices might not be themselves optimal. These
> counts don't include the cost of reduction, which is at worst an
> additional multiplication of each limb by a constant plus an addition
> with carry. They don't include additions. In these calculations I've
> permitted that addition to overflow, but not the preceding
> multiplications. I've also not used signed arithmetic, which can make
> things slightly better.
>
> On a 32 bit machine with 32x32->64 multiplier, but no 64 bit adder:
>
> 2^389-21 can use 15 26 bit words. The product can be computed without
> carries, but the final reduction requires a carry chain. Schoolboy
> costs 225 multiplications. Karatsuba will cost 140 mults. The best I
> can do without using a hardware carry is a much worse 18 23 bit words,
> which will cost 198 multiplies with two uses of Karatsuba. There may
> be a better way, and I didn't investigate overflow/addition tradeoff.
>
> 2^414-17 as described in http://eprint.iacr.org/2014/526.pdf uses 16
> franctional radix words, and two stages of Karatsuba for 144 mults.
>
> 2^448-2^224-1 uses a special method for multiplication which I don't
> have multiplication counts for. I believe 16 28 bit words is used in
> the current implementation, and so 144 multiplications.
>
> On a 64 bit machine with 64x64->128 multiplier:
>
> 2^389-21 can use 7 56 bit limbs. Here there is plenty of headroom, and
> so no carrying in the reduction is required.
>
> 2^414-17 could use 8 56 bit limbs, with no carrying in the reduction
> required. It could use 7 60 bit limbs, if the hardware carry is used
> in the reduction.
>
> 2^448-2^224-1 is clearly using 8 56 bit limbs: it matches exactly.
>
> At these sizes schoolboy is the right choice: 49 vs 64 multiplications.
>
> Please double-check this. I'll try to verify this in more detail,
> including implementing a few strategies. I know that the results I've
> listed for some 2^414-17 and 2^448-2^224-1 are not optimal, and would
> appreciate correction.
>
> The tentative conclusion is that the performance advantages of
> 2^389-21 vs 2^448-2^224-1 are likely to be minimal on 32-bit machines,
> and about 10% on 64 bit machines. As always, the believable results
> require implementations, which I don't think I will make in time for
> our deadline.
>
> Sincerely,
> Watson Ladd
>
> _______________________________________________
> Cfrg mailing list
> Cfrg@irtf.org
> http://www.irtf.org/mailman/listinfo/cfrg