Re: [Cfrg] revised requirements for new curves

The chart looks consistent with my x86-64 measurements:
http://www.ietf.org/mail-archive/web/cfrg/current/msg05053.html

Relying on approximate numbers, visual extrapolation, and my eyesight:

1. Fully optimized with AVX2:
    365/220=1.66

2. Really apples-to-apples OpenSSL vs. NUMS, in particular:
* 64 bit C code with uint128
* same author
* code that takes advantage of special structure of the primes
    570/220=2.59
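
For anyone who wants to redo the arithmetic with their own readings,
here is a minimal sketch. It assumes that 220 is the Curve25519/NUMS
reading and that 365 and 570 are the two P-256 readings; the dictionary
keys and the function name are mine, chosen only for illustration.

```python
# Approximate readings taken by eye from the chart.
# Units cancel in the ratio, so none are assumed here.
readings = {
    "curve25519_or_nums": 220,  # assumed: the common denominator above
    "p256_avx2": 365,           # scenario 1: fully optimized with AVX2
    "p256_c_uint128": 570,      # scenario 2: 64-bit C code with uint128
}


def slowdown(p256_cost, reference_cost):
    """Return how many times slower P-256 is than the reference curve."""
    return p256_cost / reference_cost


avx2_ratio = slowdown(readings["p256_avx2"], readings["curve25519_or_nums"])
c_ratio = slowdown(readings["p256_c_uint128"], readings["curve25519_or_nums"])

print(f"AVX2-optimized: {avx2_ratio:.2f}")
print(f"C with uint128: {c_ratio:.2f}")
```

Since the inputs are eyeballed, only the first digit or two of each
ratio should be taken seriously.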

In addition, I timed scenario #2 on the discontinued Xeon E5540 CPU
@ 2.53GHz (LGA 1366, Nehalem architecture) and got a similar factor of
2.79 (vs. my reference 2.70 on Ivy Bridge). I was timing
curve25519-donna-master, but the choice between NUMS and Curve25519
should not matter when comparing with P-256.

Architecture-neutral C code should show P-256 running about 2.7 times
slower than Curve25519. In the case of OpenSSL, it must be properly
configured. I don't see why ARM code would be different here.

The factor of ~5 corresponds to OpenSSL doing Montgomery reduction,
treating the NIST prime as a random prime. That factor therefore
reflects code doing Brainpool-style random-prime Fp operations. For
fairness, prime-specific optimization should be allowed in both cases.
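
To make the distinction concrete, here is a rough Python sketch of the
two reduction styles. It is not taken from OpenSSL or from any of the
benchmarked code, and all function names are hypothetical. The generic
path is textbook Montgomery reduction (REDC) applied to the P-256 prime
as if it had no structure; the prime-specific path is the usual fold
for 2^255 - 19, standing in for any special-form prime. A P-256-specific
reduction would instead exploit the Solinas form of that prime, which I
have left out to keep the sketch short.

```python
# Requires Python 3.8+ for pow(p, -1, R).
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
P25519 = 2**255 - 19


def montgomery_setup(p, k):
    """Return R = 2**k and -p^-1 mod R for an odd modulus p < R."""
    R = 1 << k
    p_inv_neg = (-pow(p, -1, R)) % R
    return R, p_inv_neg


def redc(t, p, k, p_inv_neg):
    """Montgomery reduction: return t * R^-1 mod p for 0 <= t < p * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * p_inv_neg) & mask  # first extra full-width multiply
    u = (t + m * p) >> k                 # second extra full-width multiply
    return u - p if u >= p else u


def mul_mod_montgomery(a, b, p, k=256):
    """Multiply a * b mod p, treating p as a structureless prime."""
    R, p_inv_neg = montgomery_setup(p, k)
    a_m = (a * R) % p                          # into Montgomery form
    b_m = (b * R) % p
    prod_m = redc(a_m * b_m, p, k, p_inv_neg)  # a * b * R mod p
    return redc(prod_m, p, k, p_inv_neg)       # back out of Montgomery form


def reduce_pseudo_mersenne(x):
    """Reduce x >= 0 modulo 2**255 - 19 using 2**255 = 19 (mod p)."""
    mask = (1 << 255) - 1
    while x >> 255:
        # Fold the high part down: one multiply by the small constant 19.
        x = (x & mask) + 19 * (x >> 255)
    return x - P25519 if x >= P25519 else x


a = 0x123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
b = 0x0FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA98765432
generic = mul_mod_montgomery(a, b, P256)
special = reduce_pseudo_mersenne(a * b)
```

The generic path pays two additional full-width multiplications per
reduction, while the fold needs only a multiplication by a small
constant and an addition. Python big integers hide the limb-level
costs, so this shows the shape of the work, not the measured factor.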

I've mentioned a couple of times that Curve25519 (and NUMS) are only 50%
faster than P-256 for AVX2-optimized code. That claim carries a similar
unfairness, but it is a bit less unfair because:
* the comparison uses the best code easily available today
* there is a chance that Curve25519 optimized for AVX2 will see a lower
payoff, because the parallelism of AVX2 may reduce the benefit of the
special structure of the pseudo-Mersenne prime.

The ratio that should be used by default is ~2.7.

On 09/14/2014 05:46 PM, Michael Hamburg wrote:
> While this is partly true, it’s also an example of the NIST curves not getting an entirely fair shake because I just ran `openssl speed` with whatever was on my Mac, whereas most of the other curves are optimized.
>
> I’ve added teal dots from the Gueron-Krasnov NIST-P256 paper (citing also the Käsper-Langley paper) to show this effect: at least for NIST-P256, the Mavericks OpenSSL release is about 3.4x worse than the state of the art.
>
> — Mike
>
>> On Sep 14, 2014, at 5:22 PM, Phillip Hallam-Baker <phill@hallambaker.com> wrote:
>>
>> Well one thing that leaps out is that I can get 512 bits of the NUMS
>> curves for the price of 256 bits of the NIST curves. And that is a BIG
>> difference as far as I am concerned.
>>
>> The performance difference between the NUMS curves and the various
>> 'cherry picked' primes is visible but much less significant.
>>
>> On Sun, Sep 14, 2014 at 6:59 PM, Michael Hamburg <mike@shiftleft.org> wrote:
>>>
>>> On Sep 14, 2014, at 7:22 AM, Phillip Hallam-Baker <phill@hallambaker.com>
>>> wrote:
>>>
>>> But if we plot the points on graphs of defender effort vs attacker
>>> work factor and look at the curves we can probably see quite easily
>>> what we are buying with the different approaches.
>>>
>>>
>>> I hacked up some plots for ECDH (protected variable-base scalarmul) on Intel
>>> Sandy Bridge and Cortex A8, A9 platforms at
>>>
>>> http://ed448goldilocks.sourceforge.net/comparison/
>>>
>>> Lots of caveats follow!
>>>
>>> It’s important to note that the curves aren’t all on equal footing here.
>>> The measurements are from the NUMS paper, the Curve41417 paper, `openssl
>>> speed`, the Goldilocks bench utility (on SBR and Tegra 2); curve25519-donna
>>> (Curve25519 on Tegra2), SUPERCOP (Curve25519 on other platforms, Goldilocks
>>> on A8+NEON).  Of these, SUPERCOP and `openssl speed` almost certainly give
>>> less favorable numbers.
>>>
>>> I haven’t fully implemented Ed480-Ridinghood.  The point in the graph is an
>>> estimate based on the field arithmetic being identical to Ed448-Goldilocks
>>> on 64-bit except for the shift amounts.  This is also why it doesn’t appear
>>> in the 32-bit numbers: it cannot use the same arithmetic on 32-bit
>>> platforms, but instead would take ~50% longer, which is why I didn’t propose
>>> it.
>>>
>>> Curve25519, Curve41417, Goldilocks, and the MS “mont” curves are performing
>>> point (de)compression, but OpenSSL and the NUMS “ed” curves are not.  Point
>>> decompression would be about a 10% performance hit, but this hit may be
>>> mitigated (for stronger curves) or negated (for weaker ones) by the use of
>>> the Montgomery ladder.  Some of the software is also hashing the results and
>>> some are not.  (Goldilocks hashes; NUMS do not; I don’t know about the
>>> others.)  Goldilocks tests whether points are on the curve or twist while
>>> Curve25519 does not, but this test is cheap and adds little security.
>>>
>>> The software benchmarked here has different amounts of optimization and is
>>> designed for different platforms.  Curve41417 has only been benchmarked on
>>> Cortex A8+NEON, and the NUMS curves have only been benchmarked on SBR.
>>>
>>> The OpenSSL curves don’t include the latest optimizations, since they’re
>>> from 1.0.1 or 1.0.1f, compiled however the OS maintainers decided.
>>>
>>> Curve25519-donna is not at all optimized for ARM, but I don’t have any other
>>> vectorless ARM numbers to go by, since I couldn’t get Curve25519 working in
>>> SUPERCOP on that platform.
>>>
>>> I expect that the NUMS curves would perform passably well in scalar code on
>>> the Cortex-A9, because UMAAL can help with the carry handling.  But they
>>> would be at a disadvantage in NEON, where carry handling is much more
>>> expensive.  So I would expect the situation on Tegra 2 (with its fast scalar
>>> core but no NEON) to be comparable to that on Sandy Bridge, but on A8+NEON I
>>> would expect it to be considerably worse for the NUMS curves.
>>>
>>> In the other direction, I expect that the numbers for Curve41417 would be
>>> similar on Sandy Bridge to how they are on NEON.  Field arithmetic is
>>> probably comparable to Ed448-Goldilocks, just as on NEON, so curve ops
>>> should be about 8% faster due to the 8% smaller curve.
>>>
>>> I know that variable-base scalar multiplication is not the only benchmark
>>> which matters, but it’s consistently available across many curves.
>>>
>>> I believe that all of the software contains comparable levels of
>>> side-channel protection.
>>>
>>> Cheers,
>>> — Mike