Re: [Cfrg] RG Last Call on draft-irtf-cfrg-gcmsiv-06

Andy Polyakov <> Mon, 18 September 2017 17:51 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id 3E4931243F6 for <>; Mon, 18 Sep 2017 10:51:44 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -6.9
X-Spam-Status: No, score=-6.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5] autolearn=ham autolearn_force=no
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id MmIyDxwK7ZMr for <>; Mon, 18 Sep 2017 10:51:41 -0700 (PDT)
Received: from ( []) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by (Postfix) with ESMTPS id BEC05124B18 for <>; Mon, 18 Sep 2017 10:51:40 -0700 (PDT)
Received: from [] (localhost [IPv6:::1]) by (Postfix) with ESMTP id 08E54E035C; Mon, 18 Sep 2017 17:51:37 +0000 (UTC)
References: <>
From: Andy Polyakov <>
Message-ID: <>
Date: Mon, 18 Sep 2017 19:51:36 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
MIME-Version: 1.0
In-Reply-To: <>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Archived-At: <>
Subject: Re: [Cfrg] RG Last Call on draft-irtf-cfrg-gcmsiv-06
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Crypto Forum Research Group <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Mon, 18 Sep 2017 17:51:44 -0000


I apologize for chiming in so late, but I wasn't aware of this endeavour
till very recently. The question I find myself struggling with is that I
can't find introduction of new primitives justified, at least not in
this context. I did my best sieving through to locate relevant
prior comments, so even with those in mind. One of arguments was that
POLYVAL gets bit order right in comparison to GHASH. Well, comparing the
two is actually boils down to one question: what is it you choose to
bit-flip, input or polynomial? [Do note "bit-flip", not "byte-flip".]
What I'm trying to say is that we have to recognize that POLYVAL is
simply formulated in terms of GHASH polynomial in reverse bit order. On
the other hand on more practical level, i.e. from software
implementation viewpoint, answer to the question is self-obvious, it's
more efficient to bit-flip single polynomial than each fragment of the
input. This last thing effectively means that sensible GHASH
implementation would actually use reverse-bit polynomial (and at least
OpenSSL implementations all do). This in turn means that the only
essential difference between GHASH and POLYVAL is byte order. Now, it
was argued that little-endian byte order gives performance edge [on
platform of popular choice]. But relevant question in the context is if
the edge actually justifies introduction of new algorithm. It was
asserted that it provides 20% improvement and it was attributed to the
fact that PCLMULQDQ-based GHASH implementations use vector byte swap
instructions, *one* per block. But we have to recognize that said
improvement coefficient is actually specific to Skylake. For example on
Haswell we'll observe ... 0% improvement, on Broadwell - 6%, on Ryzen -
3%. Let's even have a look at absolute cycles per processed byte values
for GHASH:

Haswell   0.41
Broadwell 0.29
Skylake   0.36

Do note that effectively we observe regression on Skylake, also note
that it's ~20%. I mean if Skylake inherited Broadwell qualities, GHASH
would have performed ~20% better. With all the numbers in mind it's only
natural to conclude that vector byte swap "costs" unproportionally much
specifically on Skylake [and in GHASH], and it's rather anomaly than
rule. And it's not unlikely to be addressed in next processor
generation. At least other alike cases were, just look at improvement
from Haswell to Broadwell above. And if not, maybe it would still be
more appropriate to demand that Intel fixes this in next-next
generation, instead of introducing new algorithm. Because latter
approach is not really sustainable. I mean how many algorithms would we
have to implement and support if they were introduced on per
specific-processor-pipeline-anomaly basis? Besides, even 20% improvement
has to be considered in overall context, not by itself. I mean if the
only difference between two otherwise equivalent proposals was POLYVAL
vs. GHASH, then improvement for *whole* operation would be ~6% [on
Skylake, and none to negligible on others]. Or in other words this
proposal *could* just as well be formulated with GHASH instead of
POLYVAL, there is no *compelling* reason to introduce another algorithm.

And same can be said about suggested CTR tweak. I.e. suggested mode
*could* just as well be formulated with standard CTR. Even more so, as
there is no apparent difference in performance [on platform of popular

Well, it's not like I actually reject the sentiment to favour
little-endian platform[s], question rather is if *mode* specification is
right place for it. Protocols are about *communication* and there are
more factors in play, most notably using already available primitives
would facilitate adoption. And if little-endian-centric primitives are
deemed desirable, wouldn't it be more appropriate to define them
separately to promote modularity and re-usability?