Re: [Cfrg] Point format endian (was: Adoption of draft-ladd-spake2 as a RG document)

Michael Clark <michael@metaparadigm.com> Tue, 27 January 2015 06:24 UTC

Return-Path: <michael@metaparadigm.com>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6018A1B2BA9 for <cfrg@ietfa.amsl.com>; Mon, 26 Jan 2015 22:24:26 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.458
X-Spam-Level:
X-Spam-Status: No, score=-1.458 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, IP_NOT_FRIENDLY=0.334, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_DKIM_INVALID=0.01] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3Ylf0YnEJPQt for <cfrg@ietfa.amsl.com>; Mon, 26 Jan 2015 22:24:24 -0800 (PST)
Received: from tlsx.org (tlsx.org [67.207.128.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0667D1B2BA7 for <cfrg@irtf.org>; Mon, 26 Jan 2015 22:24:23 -0800 (PST)
Received: from monty.local (unknown.maxonline.com.sg [58.182.168.20] (may be forged)) (authenticated bits=0) by tlsx.org (8.14.4/8.14.4/Debian-4) with ESMTP id t0R6kBNJ002205 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Tue, 27 Jan 2015 06:46:15 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=metaparadigm.com; s=klaatu; t=1422341180; bh=i/D6V1IvO9tIfLVpkNT+uof4jaDv9iICKRdBYBW0fUE=; h=Date:From:To:CC:Subject:References:In-Reply-To:From; b=gD4TS/xElwt0C4iAkbWFSTyko05bIPPSZx1cCViy+HzMbmnPLsRvDCdful+ca5QYJ TdhJjNZmNc/X3NgM/bPRHez6Ow5us6AfNZrWmScG4MFluSRFRQKRo/0eyEG56MSSYs 4JW0aeZnszovqBnPrG1gm2QvtAlfrObkbSy5zI0o=
Message-ID: <54C72F01.4000203@metaparadigm.com>
Date: Tue, 27 Jan 2015 14:24:01 +0800
From: Michael Clark <michael@metaparadigm.com>
Organization: Metaparadigm
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Watson Ladd <watsonbladd@gmail.com>
References: <9A043F3CF02CD34C8E74AC1594475C73AAF65EBB@uxcn10-tdc05.UoA.auckland.ac.nz> <54C639D5.5050104@metaparadigm.com> <CACsn0ckdwGsm_YM1+uJ9a-_ZFAfCJOqeLTUE+aazYGpbYuB8fQ@mail.gmail.com>
In-Reply-To: <CACsn0ckdwGsm_YM1+uJ9a-_ZFAfCJOqeLTUE+aazYGpbYuB8fQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
X-Virus-Scanned: clamav-milter 0.98.4 at klaatu.tlsx.org
X-Virus-Status: Clean
Archived-At: <http://mailarchive.ietf.org/arch/msg/cfrg/xghaYKJuZIWpM1Z6_DxEyi2GJME>
Cc: "cfrg@irtf.org" <cfrg@irtf.org>, Peter Gutmann <pgut001@cs.auckland.ac.nz>
Subject: Re: [Cfrg] Point format endian (was: Adoption of draft-ladd-spake2 as a RG document)
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jan 2015 06:24:26 -0000

On 27/1/15 2:00 pm, Watson Ladd wrote:
> On Mon, Jan 26, 2015 at 4:57 AM, Michael Clark <michael@metaparadigm.com> wrote:
>> On 26/1/15 6:20 pm, Peter Gutmann wrote:
>>> Mike Hamburg <mike@shiftleft.org> writes:
>>>
>>>> But will the endian complicate the implementation of the new curves?
>>>
>>> Given that every single crypto bignum representation that I'm aware of (X.509,
>>> S/MIME, PGP, SSH, TLS, OCSP, CMP, TSP, and many more) are all big-endian,
>>> there are probably lots of crypto libraries that simply don't have little-
>>> endian import/export routines.  Looking at BN_bn2bin()/BN_bin2bn() from one
>>> widely-used library that powers half the Internet (and mobile phone market),
>>> it only supports the big-endian format.  So that's automatic non-support for
>>> the world's most widely-used crypto library outside of MS CryptoAPI (which
>>> doesn't expose its bignum library).
>>>
>>> I can't believe we're even having this endless discussion.  The universal
>>> standard for external representations of bignums in IETF protocols is big-
>>> endian.  Even the PGP and S/MIME, and SSH and TLS, folks have managed to agree
>>> on that one.
>>
>> Yes. Legacy. "network byte order" is big-endian.
>>
>> It's tradition versus efficiency. big-endian is tradition. little-endian
>> is how most CPUs work these days. The reality is Sparc9, ARMv3, and
>> Power8 are now all bi-endian and are adopting little endian ABIs. Power8
>> being the most recent to change (ppc64le).
> 
> No it's not because of efficiency. Nor is C code necessarily endian
> dependent. See http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
> for details, and explanations of why code that depends on endianness
> is usually wrong. Rather, it has to do with the format of existing
> implementations: everyone who grabs the NaCl implementation is using
> an little-endian encoding. Changing this would mean that there are no
> existing implementations. It would cause significant confusion between
> old and new versions.
> 
>>
>> Given the algo is new, one could /consider/ a BN_bn2binle and
>> BN_binle2bn (or an endian argument variant). It's a small change.
> 
> Why do people feel the need to write their own buggy implementations
> of Curve25519, when there are several public domain ones for the
> taking, especially when this means giving up the optimizations related
> to reduced radix format?

I hope this is not implied by my email.

>> I see the current code is working byte by byte due to endianness. It
>> takes 2nd order little endian (with host endian limbs due to the shifts)
>> and converts to big endian wire format. It reads from the distant big
>> end (little-endian) and writes out big endian. It bswaps.
>>
>>         n=i=BN_num_bytes(a);
>>         while (i--)
>>                 {
>>                 l=a->d[i/BN_BYTES];
>>                 *(to++)=(unsigned char)(l>>(8*(i%BN_BYTES)))&0xff;
>>                 }
>>         return(n);
>>
>> This code could do with some optimization, bswap(32|64) intrinsic (uses
>> BSWAP machine instruction on x86_64, ARM, etc), working with a machine
>> word size loop with a small tail loop to handle the modulus of the
>> difference in length of the bignum from the machine word size. It is
>> working byte by byte. It may not even register on a profile (we can use
>> gperftools to check). This code is *absolutely fine*. That said we could
>> make BN_bn2bin faster, use BSWAP intrinsic on x86_64, sparc, arm, ppc
>> and use 64-bit loads and stores (including test cases and corner cases).
>> Let me know if anyone wants to sponsor?
> 
> And you have benchmarks showing that this code is the bottleneck, and
> not the bignum multiply inner loop?

Not yet. However I doubt the bswap in and out has a large bearing (as I
hint). I was rather responding to a mention of the existing bignum code;
and went and had a look at the code to see for myself.

It makes sense to use the Curve25519 format and calculations if they are
already optimized. i.e. consider it an intrinsic with its own wire
format. Given there is no such thing as a standard wire format, given
all the different encodings within IETF and ones adopted from outside.

>> Also BN_num_bits_word could use __builtin_clz or __builtin_clzll
>> (_BitscanForward, _BitscanReverse on MSC). No cache surface for a table
>> lookup. Modern compilers have these intrinsics. On Pentium III there is
>> BSR, and in SSE4 there is LZCNT instruction, and there is even a cache
>> friendly alternative that uses Henry S. Warren's Hacker's Delight tricks:
>>
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
>>
>> Let me know if you can find me a sponsor. I'm currently (as a passtime)
>> doing cycle counts on intrinsic and C fallbacks for these kinds of
>> routines.. (to fill in for compilers without them). Testing on
>> MSC/GCC/Clang on Darwin/Android/Windows/Linux.
>>
>> With little endian wire format and use of 32-bit or 64-bits limbs on all
>> modern systems (sparc9, power8, x86, arm) BN_bn2binle routine would
>> essentially be memcpy. That's the sad part.
>>
>> So with big-endian wire-formats, practically all machines are going to
>> bswap as the internal rep will be little endian. That cost is probably
>> low compared to the cost of the computation, so there is a reasonable
>> rationale to stick to big-endian.
>>
>>   BSWAP is one 1 cycle per 32-bit word on i386
>>   BSWAP is one 1 cycle per 64-bit word on x86_64
>>   BSWAP has been around since i486
>>
>> Someone starting today would indicate endianness statically or
>> dyanamically in their number format and use the hot potato principle.
>> Get it off your hands at the lowest cost to yourself, which nowadays is
>> little-endian. Remember GHASH is Bit Reflected (base 2 little endian)
>> and it made it into a spec.
> 
>  Re GHASH. It doesn't matter in hardware either way, as reflection is
> just relabeling, but in software you have to do some weird twists to
> make it work out. Even Intel's carryless multiplies do things the
> natural way. By contrast the Curve25519 format we are deciding is
> between s[0..31] is s[0]+256*s[1]+..256^(31)*s[31], or
> 256^(31)*s[0]+256^(30)*s[1]+..s[31]: all systems agree on the value of
> each byte, and the question is which bignum it represents. It's easy
> to turn this into whatever internal representation you need, unlike
> swapping bits in a byte.
> 
> Let me see if I get your second line correct. Instead of picking one
> format, and encoding to and from it with a trivial amount of portable
> C, we should use architecture specific micro-optimizations, and force
> the implementation and testing of two distinct output and input
> methods, for a nearly unmeasurable speed gain. Of course, there are
> many architectures where you cannot serialize in the way you suggest:
> unaligned loads and stores are not supported on many MIPS processors,
> and there is no guarantee that alignment of bignum fields will be
> maintained.

I'm not suggesting that. I'm merely illustrating the cost of swaps.
1st and 2nd order little endian: 32-bit or 64-bit limbs, containing
8-bit big-endian limbs ends up as a memcpy on little-endian.

You can then argue either way for little-endian or big-endian based on
this logic, however even small additional costs are redundant for
optimal performance, if one wants to optimize for platforms (as
typically ends up being done). serialization/deserialization is probably
low on the profile as I hint. it still is neither a good case for either
change or tradition. eliminated cost over a large scale of current CPUs
is eliminated cost. it all adds up.

>> Peter makes a pretty good case to stick with big-endian, from a
>> consistency point of view, but not from an efficiency point of view.
> 
> He argued that this would mean automatic OpenSSL nonsupport. Given
> that Rich Salz is on the development team of OpenSSL, I consider this
> unlikely. Furthermore, you can take curve25519-donna and use it today:
> it doesn't entail modifications (which are themselves trivial) to the
> bignum library.
> 
> I am amazed that we are repeating discussions that were had in TLS. Or
> maybe I shouldn't be. At what point can we declare that Curve25519 can
> be adopted by IETF protocols? It would be nice if we could report
> something by Austin, and maybe have a plan to finish up higher
> security levels and signatures by the summer.

Yes. It's not a good argument for non-support. The code is not
difficult. It's all of the established test cases, and if they are
already established within the domain of this curve in its format then
it is clearly harder to change it to another format.

GCM was the other case where the primitive has its own encoding.

The issue really is the inertia from a). how Curve25519 represents and
b). how current implementations have a tradition of representing.

given Curve25519 is new, a *prudent* risk mitigation strategy is to
avoid change, and accept its representation. it is not making any
changes to any existing representation so introduces no risk of
regression when added to * implementations. this is always my stance on
change management when looking at additive changes.