Re: [Cfrg] Hardware requirements for elliptic curves

First of  all, I would like to thank Joppe for bringing up the topic of hardware requirements. I fully agree with his
opinion. If we assume that the current discussion should not only be concerned with requirements
for elliptic curves for usage in the TLS protocol but also for general usage, new questions arise:
On what kind of plattforms will the elliptic curves be implemented?
Are they running in a, what one could call, secure environment? If not,
how about side channel attacks (SCA)?
Of course, one could say: People know SCA for more than a decade. They will be able to secure implementations of new curves as well. ---
This might be true, but one has to look at the details to see the problems:
Specially shaped curves (and implementations) may completely loose their (performance)
advantage if they have to be secured against SCA. Joppe already mentioned the problems  concerning randomizing the (secret) scalar for a scalar multiplication.
One could say: Ok, then you have one randomization method less, but people will find other.
Well, this could be true. But personally, I believe this will be not so easy.
If one asks a guy from evaluation labs, they will probably  tell you that handling a constant secret value again and again in an unrandomized manner, this is a very dangerous thing to do. --- At least, if SCAs are applicable.
This often holds, even if other randomization techniques are used, e.g., if you have
implemented constant timing formulae, or maybe even if you have a carefully designed
hardware (dual rail etc.), or everything together. (By the way, I would like to add that
constant timing may make other attacks easier...)
If one accepts the consequence that a scalar has to be randomized, then Joppe already
mentioned that special shape primes (as the characteristic) may immediately force you to
use a randomization that prolongs your scalar by say 50% - and the performance goes down
by the same factor. This holds in the case where the scalar k is randomized in the form
k' := k + rand * n. If the prime p (and therefore the order n) is rather random,
a random value rand of moderate length (eg. 32 bits) may be sufficient.
Of course, one can use the primitive randomization k = k1 + k2, with a random split,
then one needs even two scalar multiplications, and the computation time goes up by 100%.
I only can think of one other generic randomization technique, but there are patent
issues --- besides some additional minor technical issues.
If one hopes other randomization techniques are sufficient, let's think about what else one can actually do:
All computations break down to operations like (a * b mod p). Here, one could compute
instead something like
( (a + rand1*p) * (b + rand2*p) mod (rand3*p) ) mod p.
But in this case, you loose your special shape of your module. - If it has one.
Specially  taylored hardware or software will pobably not work with (rand*p) instead of p.
For a special purpose hardware realizing a modular multiplication with general p
this is not an additional hardship.
So maybe one has to look for other SCA countermeasures, maybe on a higher level.
I hope people will finde some. (And then, I am concerned, they probably will file the ideas
for a patent and the technique will be blocked for the next 20 years. So we have again a
situation which we already slowly overcome regarding the many old patents.)
This all holds for implementations in software as well as in hardware!
I don't want to say that this scenario holds for all security applications.
Maybe in products with moderate security requirements the situation is not as bad as
described... Maybe...
One word on hardware implementations (away from the SCA topic):
Usually one can choose one of two directions to design a co-processor for pkc applications.
For a very application-specific hardware with high volume, it may be an advantage to
build a very specific co-processor. One decides for one elliptic curve (and protocol) and
tries to make the hardware as small/fast as possible. One classic example are the GF(2^n)
modular multipliers which are similar to an LFSR. But they only work on one distinct curve.
And if anything "happens with the curve" that was used, one is not able to change the
protocol (parameters) if the product is out in the field.
Alternatively, one can design a general purpose co-processor. These are designed to realize
in hardware operations like a*b mod p, for generic a,b,p. These co-processors usually don't
have an advantage if the input values have a special shape. But on the other hand, they are
able to handle whole classes of elliptic ccurves. They even might be able to handle RSA
additionally. For hardware manufacturers it is very hard to support many different
co-processors, so their choice is often a general purpose co-processor (maybe with their
individual implementation of a modular multiplication)  
This is the reason why Joppe mentioned that a special taylored elliptic curve (or p)
will not increase performance.
In all, I therfore think it might be a good idea to define two (sets of) elliptic curves:
One that are easy to implement in software running on hardware without special security
needs (except for constant timing or so). And one with random parameters for the rest.
At the moment I see the danger that only curves suitable for TLS are choosen and then the
rest of the (security) world will have to live with it.

Wieland

Am 05.09.2014 um 20:34 schrieb "Johannes Merkle" <johannes.merkle@secunet.com>:

Michael Hamburg wrote on 04.09.2014 17:21:
> 
>> On Sep 4, 2014, at 5:55 AM, Johannes Merkle <johannes.merkle@secunet.com> wrote:
>> 
>> Michael Hamburg wrote on 02.09.2014 18:31:
>>> I agree with Alyssa that hardware performance isn’t our concern here.
>> 
>> I disagree with this oversimplification. Currently, the fraction of TLS implementations based on HW is relatively
>> small but not negligible. And in the future it will increase:
>> 
>> 1. Heartbleed has shown that it is dangerous to keep private keys in software. Hopefully, this lesson has been learned.
>> 
>> 2. There are security critical infrastructures emerging, where TLS will be used with hardware crypto
>> implementations. Examples are car2car and car2X, health care infrastructures, smart meter,
>> e-government communications services.
>> 
>> 3. In the foreseeable future, we will see a huge number of constrained devices in the IoT potentially deploying TLS
>> (e.g. for home automation, sensor networks).
>> 
>> Furthermore, other IETF protocols are well within the scope of our effort. (As Kenny wrote in his announcement of the
>> current effort "We regard this as a major work item for CFRG and one where CFRG can provide real value to the TLS WG
>> and the IETF more widely.") For IPSec, there is indeed a significant number of implementations based on smart cards or
>> small HW crypto modules (for instance from my employer).
>> 
>> 
>> -- 
>> Johannes
> 
> You know what?  I spoke too soon on this, and you changed my mind.  Hardware does matter, even though it isn’t the main focus of this research group.
> 
> But I still don’t understand why hardware favors short Weierstrass curves or random primes.  Short Weierstrass really doesn’t seem simpler than Edwards to me.  It’s slower, less regular, and doesn’t parallelize as well (if you have high perf hardware).  It wants a halve() operation for best performance in addition to add/sub/mul, and you can’t just use one unified formula for everything if you’re complexity constrained.  The cofactor is potentially an issue, but you can just wipe out the cofactor first and then check against the identity.

I am not at all a hardware expert. But we just had a meeting of the ECC Brainpool, where several hardware vendors are
represented. What I understood is that they do not want to specialize their hardware to minimize their investment while
being fully flexible (algorithm agile). Remember that multipliers are also needed for RSA where the numbers do not have
a special shape. Furthermore, I guess that new implementations can be built easier based on existing ones.

> 
> And random primes aren’t better either: they’re still much slower than special primes in hardware, and it’s more complex to invert mod them (more complex Fermat’s Little Theorem, or have to implement EGCD), and you can’t switch between Barrett and Montgomery reduction as easily.  From Joppe Bos' email, it sounds like primes make one blinding technique more effective.  But if you suddenly have double the performance budget (from a faster prime) can you not do even better?
The point is that flexible (non-specialized) hardware multipliers do not take advantage of specialized primes.
Specialized multipliers are certainly possible but are less agile w.r.t algorithms / curves. Flexibility is particularly
important due to the long development / certification cycles and the tremendous costs involved.

-- 
Johannes

_______________________________________________
Cfrg mailing list
Cfrg@irtf.org
http://www.irtf.org/mailman/listinfo/cfrg