Re: [CFRG] Comment on draft-irtf-cfrg-hash-to-curve-10

Hi Daira,

Thanks for the feedback, and apologies for the delayed reply! Given the maturity of this document and its state in the RG, I've two very high-level responses:

1. Changing the sign bit breaks backwards compatibility. Absent more support for this change, especially from those familiar with the challenges in the current design, perhaps now is not the time. I note that future ciphersuites could change the sign computation, should that be desired for certain applications.
2. I think the optimized, curve-specific algorithms in the appendix are constant time as described, so it doesn't *seem* necessary to make any substantial editorial changes to lift those to the main text. The mappings in Section 6 are not intended to be translated directly into code. Instead, each mapping points to a corresponding straight-line implementation in Appendix F. 

I would love to hear from others who may have opinions on either of these points, or on the particular suggestions below.

Best,
Chris

On Mon, Jan 4, 2021, at 5:40 AM, Daira Hopwood wrote:
> Hello,
> 
> This is a response to the RG Last Call for draft-irtf-cfrg-hash-to-curve-10.
> 
> The current draft is based in large part on 
> [WB2019](https://eprint.iacr.org/2019/403).
> However, it does not cover most of the optimizations described in that
> paper. This is understandable given that the paper is focussed on
> hashing to BLS12-381 G1 and G2, and does not explain in much detail how
> to apply these optimizations to other curves.
> 
> In fact, *all* of the optimizations applied for BLS12-381 G1 in [WB2019]
> can be adapted to general short Weierstrass curves over an arbitrary
> prime field:
> 
> * merging an inversion and square root in map_to_curve;
> * eliminating all remaining inversions by use of Jacobian coordinates
> 
>    * in map_to_curve;
>    * in the addition, for the random oracle variant; and
>    * in the isogeny map, for curves with AB = 0.
> 
> These are all compatible with a constant-time implementation; actually,
> they result in an algorithm that is naturally constant time. The
> unoptimized algorithm, on the other hand, entails a trade-off between
> constant-timeness and speed. That could tempt implementors to rely on
> arguments about inapplicability of timing attacks in their context,
> that may be incorrect, or may fail when code is reused in a different
> context.
> 
> (Yes I have read section 10 of the ID. From experience, implementors
> are very likely to ignore even strong recommendations if they have a
> performance issue to solve and they believe, correctly or incorrectly,
> that they can get away with it for their application.)
> 
> Probably all of the above is also true for curves over extension fields,
> but I haven't verified that.
> 
> Recent developments in proof systems have made it more important to be
> able to efficiently hash to curves over F_p where p = 1 + m * 2^n for
> reasonably large n, say n ~= 32 (informally, "highly 2-adic" fields).
> This is effectively a requirement for fast polynomial arithmetic using
> FFTs. Some proof systems, such as
> [PLONK](https://eprint.iacr.org/2019/953) and its variants, depend
> fundamentally on the resulting multiplicative subgroup structure.
> 
> My main concern is a lack of advice in the draft on how to apply the
> [WB2019] optimizations to such fields, which is not at all obvious.
> The paper only discusses the cases needed for BLS12-381 G1 (where
> p = 3 (mod 4)), and G2 (a quadratic extension field where
> p^2 = 9 (mod 16)).
> The [reference 
> code](https://github.com/cfrg/draft-irtf-cfrg-hash-to-curve/tree/master/poc)
> only has some of the optimizations for simplified SWU on fields where
> p = 3 (mod 4), p = 5 (mod 8), and p^2 = 9 (mod 16).
> 
> It took me a couple of days to work out the detail of other cases.
> Therefore I think it is likely that the draft as it stands would lead
> to many sub-optimal implementations. Also, in my opinion optimization
> attempts will carry a great risk of introducing errors or
> incompatibilities in corner cases (including some that are
> exceptionally unlikely to be picked up in tests not focussed
> specifically on those cases), with the specification as it stands.
> 
> The description in the current draft also obscures the extent of the
> performance advantage for the simplified SWU mapping, potentially
> leading practitioners to make sub-optimal choices in protocol design.
> By my estimates, an optimized constant-time simplified SWU
> implementation can be at least 3 times faster than a constant-time
> SW implementation. Some of the same optimization ideas could in
> principle be applied to SW, but some cannot.
> 
> ----
> 
> All of the following assumes that F is a prime field. Again, it is
> likely to be easily adaptable to extension fields but I haven't
> checked it for that case.
> 
> In order to obtain the speed-ups discussed above for arbitrary prime
> fields, it's first necessary to change the interface to the square
> root algorithm. We need an algorithm, that I'll call divsqrt, with
> the following specification.
> 
> Let h be a fixed nonsquare in F.
> 
>    divsqrt(N: F, D: F \ {0}) -> (F, Bool):
>        if N/D is square in F, return (sqrt(N/D), true)
>        else return (sqrt(h * N/D), false).
> 
> where sqrt returns an arbitrary choice of square root.
> 
> It's possible to implement this specification in constant time
> without computing two square roots or any explicit inversion. For
> example this can be done using an adaptation of the algorithm in
> [Sarkar2020](https://eprint.iacr.org/2020/1407), combined with
> [BDL+12](https://cr.yp.to/papers.html#ed25519) section 5, and some
> hints in [WB2019] section 4.2.
> 
> A prototype specialized for n = 32 and the
> [Pallas/Vesta](https://electriccoin.co/blog/the-pasta-curves-for-halo-2-and-beyond/)
> curves is at https://github.com/zcash/pasta/squareroottab.sage .
> I can provide an implementation that is not dependent on n if needed.
> Note that this is also more efficient than the commonly used
> Tonelli–Shanks, especially its constant-time variant.
> 
> (That code does data-dependent look-ups, but could be modified to also
> be resistant to cache-based side channel attacks by using smaller
> tables and accessing all elements. [Sarkar2020] also has a variant
> that doesn't use tables, and is still significantly faster than the
> constant-time Tonelli–Shanks in Appendix I of the ID.)
> 
> The overhead of handling the division in the square root, relative
> to a plain sqrt using a similar algorithm, is just over n squarings
> and log2(n) multiplications for F_p where p = 1 + m * 2^n. This is a
> substantial saving over a constant-time inversion, which would take
> around |p| squarings.
> 
> Now given divsqrt, we can apply another optimization in section 4.2
> of [WB2019] that avoids needing to test whether gx1 (using the notation
> of the Internet Draft) is square. In brief: we first compute the
> numerators and denominators for x1 = N/D and gx1 = U/V as in the
> paper. If gx1 is not square, then we have sqrt(h * gx1); using a
> precomputed square root of Z/h, we can calculate the numerators and
> denominators of x2 and y2, without ever needing to calculate gx2.
> The detail is explained in comments at
> https://github.com/zcash/pasta/hashtocurve.sage .
> 
> Note that the same procedure works for *all* short Weierstrass curves
> to which simplified SWU is applicable. Therefore IMHO (and this is
> the main point of this email), it may as well be included in the
> description of the basic algorithm rather than presented as an
> optimization — especially given that it is naturally constant time,
> and the currently presented algorithm isn't!
> 
> As far as I can see, there are only three potential arguments for not
> making this change:
> 
> * That divsqrt is more complicated than sqrt. This doesn't really
>    hold up, because it is just as easy to implement divsqrt naively
>    as it would be to implement the current map_to_curve naively.
> 
> * That it's less obvious how this is derived from the original
>    algorithm of [BCI+10](https://eprint.iacr.org/2009/340). But the
>    current algorithm already diverges from that paper a little. What
>    matters IMHO is that the algorithm is correct and its derivation
>    is explained so that anyone with a mathematical background can
>    check it.
> 
> * That it would make the map_to_curve dependent on Jacobian coordinates,
>    which might not be what an implementor wants to use in the rest of
>    their code. But it is straightforward to also say how to obtain the
>    output in homogeneous or affine coordinates.
> 
> A naive constant-time implementation of a full hash_to_curve using the
> simplified SWU algorithm from the draft takes 1001S + 168M + 4I for
> the Pallas curve (a 255-bit non-pairing friendly curve with A = 0 and
> a degree-3 isogeny from the curve on which hashing is performed).
> With optimizations, it takes 577S + 150M and no inversions, if output
> is in Jacobian coordinates.
> 
> Aside: the current draft prohibits Z = -1. It is unclear why, since
> that is exactly what [WB2019] uses for BLS12-381 G1.
> 
> ----
> 
> Another possible change I would suggest is in the method of choosing
> the sign of the y-coordinate. (Unlike the optimizations discussed
> above, this would actually change the map_to_curve function, and so
> would probably need to be optional.) The current method involves
> ensuring that the low bit of u (as named in the ID) is consistent
> with the low bit of y.
> 
> This is fine if we're on a conventional processor where taking the
> low bit of a field element is cheap. However, it is unnecessarily
> inefficient in the setting of an arithmetic circuit. To take the low
> bit of a field element in that setting requires a full decomposition
> of the value, with cost O(|p|). In an R1CS-based proof system that
> requires one constraint per bit of p. In PLONK it requires one range
> check (e.g. implemented using
> [plookup](https://eprint.iacr.org/2020/315)) for each chunk of k bits
> where k is limited by the circuit size.
> 
> Since it is very cheap to check a square root in an arithmetic circuit
> using nondeterminism, these bit decompositions become the major cost
> bottleneck of hash-to-curve in a circuit.
> 
> The following modification halves this cost for for arithmetic circuits
> while not imposing significant overhead in other settings: simply
> require that sgn0(u - y) = 0. This is not the same as requiring that
> u and y have the same low bit, but it has a similar effect on the
> distribution of y.
> 
> Either for this modification or for the existing algorithm, it would
> be desirable to have a proof that it suffices to use the low bit of u
> (rather than using an extra random bit independent of u), since this
> is not entirely obvious.
> 
> ----
> 
> I work for Electric Coin Company, but I'm not speaking for them here.
> 
> None of the optimizations or changes discussed here are to my knowledge
> affected by, or likely to be affected by, patents or other IP issues.
> 
> -- 
> Daira Hopwood
> 
> _______________________________________________
> CFRG mailing list
> CFRG@irtf.org
> https://www.irtf.org/mailman/listinfo/cfrg
>