Re: AEAD and header encryption overhead in QUIC

Kazuho Oku <kazuhooku@gmail.com> Thu, 18 June 2020 21:40 UTC

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Fri, 19 Jun 2020 06:39:52 +0900
Message-ID: <CANatvzw2pj-CaXbAJ_-kUrvmoC3_oyvnSo7Yn+mX-kBuqVkr3w@mail.gmail.com>
Subject: Re: AEAD and header encryption overhead in QUIC
To: =?UTF-8?Q?Mikkel_Fahn=C3=B8e_J=C3=B8rgensen?= <mikkelfj@gmail.com>, Nick Banks <nibanks@microsoft.com>
Cc: Matt Joras <matt.joras@gmail.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/0u-hWb32SlqzEgAvu7-2Tuatl_w>

Nick, Matt, Mikkel, thank you for your comments.

On Thu, Jun 18, 2020 at 23:52 Nick Banks <nibanks@microsoft.com> wrote:

> We’ve found that an easy way to minimize the CPU cost of the header
> protection (on send and receive) is batching. If you copy 8 packet headers
> into a single contiguous block of memory, you can do a single crypto
> operation on it. The cost of doing one batch of 8 is essentially the same
> as doing just a single header. This effectively cuts the cost of header
> protection to 1/8th the original cost. Obviously this only works if you
> have enough packets to batch process, but for high throughput tests, you
> usually do.
>
>
>
> Feel free to take a look at the MsQuic code (receive path
> <https://github.com/microsoft/msquic/blob/master/src/core/connection.c#L4784>
> , send path
> <https://github.com/microsoft/msquic/blob/master/src/core/packet_builder.c#L728>).
> The nice thing about this design (for those of us who don’t have as much
> special crypto expertise) is that it doesn’t require any special crypto
> changes. I’d be interested to see if our two approaches could be combined
> for even better performance!
>

Thank you for sharing your experience. This is indeed a sensible approach
to minimizing the cost of header protection. On the receive side,
processing multiple packets in a batch is the only way of minimizing the
cost of header protection, because AEAD unprotection cannot start until
the result of header unprotection is obtained. Considering that modern
x86-64 CPUs can run 4 to 8 AES block operations in parallel, doing header
protection for one packet at a time wastes CPU resources.

To paraphrase, the two approaches are orthogonal on the receive side, and
combining them would give better performance. My napkin math suggests I can
expect at most a 3% reduction in the CPU cycles spent in crypto.
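To illustrate the batching structure Nick describes, here is a minimal sketch. In QUIC the per-packet mask is AES-ECB of the header-protection key over a 16-byte ciphertext sample (RFC 9001); the sketch below substitutes a SHA-256-based stand-in PRF so it runs without a crypto library, and all function names are hypothetical. The point is the shape: copy the samples into one contiguous buffer and derive all the masks in one pass, rather than one crypto call per packet.

```python
import hashlib

def mask_for_sample(hp_key: bytes, sample: bytes) -> bytes:
    # Stand-in PRF for AES-ECB(hp_key, sample); real QUIC uses AES here,
    # which is where batching pays off (parallel AES-NI pipelines).
    return hashlib.sha256(hp_key + sample).digest()[:5]

def batched_masks(hp_key: bytes, samples: list) -> list:
    # Copy all 16-byte samples into one contiguous buffer and walk it in
    # a single pass, mirroring MsQuic's batch-of-8 header processing.
    buf = b"".join(samples)
    return [mask_for_sample(hp_key, buf[off:off + 16])
            for off in range(0, len(buf), 16)]

def unprotect_first_byte(first_byte: int, mask: bytes) -> int:
    # Long header (high bit set): low 4 bits are protected;
    # short header: low 5 bits (RFC 9001, Section 5.4.1).
    bits = 0x0F if first_byte & 0x80 else 0x1F
    return first_byte ^ (mask[0] & bits)
```

With real AES the single contiguous pass lets the implementation issue the 8 block encryptions back to back, which is what keeps the AES units saturated.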

On the send side, however, I do not expect a performance increase from
combining the two approaches. This "game" is about keeping the CPU
pipelines that execute AES-NI running at full speed. The cost does not
change as long as the AES operations of header protection run in parallel
with other AES operations, regardless of whether those are AEAD operations
or header protection of other packets.

If there is a chance of further improving performance, I tend to think it
would come from overlapping the AEAD operations of multiple packets, again
to keep the AES-NI pipeline busy. With Fusion, we have reached 90% of the
theoretical maximum speed (9th-generation Intel Core CPUs can do 64 bytes
of AES encryption in 40 clocks). The question for us is whether we want to
change our API for a single-digit performance gain.
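As a sanity check on that "theoretical maximum" figure, the arithmetic works out as follows. The 3 GHz clock is an assumed frequency for illustration, not a number from this thread:

```python
BYTES_PER_WINDOW = 64   # bytes of AES encryption per 40-clock window (9th-gen Core)
CLOCKS = 40
CLOCK_HZ = 3.0e9        # assumed 3 GHz core clock (hypothetical)

# Peak AES throughput per core at that clock: ~4.8 GB/s.
peak_bytes_per_sec = BYTES_PER_WINDOW / CLOCKS * CLOCK_HZ

# At 90% of peak, Fusion's achieved throughput would be ~4.3 GB/s.
achieved_bytes_per_sec = 0.90 * peak_bytes_per_sec
```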


On Fri, Jun 19, 2020 at 2:21 Matt Joras <matt.joras@gmail.com> wrote:

> I was curious, since it's not mentioned on the PRs that I saw, were these
> tests done with any adjustments to the ACK frequency on the client? Or is
> this with the default ACK policy of ACKing every other?
>

This is an important point, thank you for asking. We did reduce the ACK
frequency to once per 1/8 CWND, as we did in our previous report [1]. In
fact, you cannot turn that off with quicly.


On Fri, Jun 19, 2020 at 2:35 Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> wrote:

> I am a bit surprised that it is possible to use bulk encrypt headers as
> Nick suggests as I would have thought the key / nonce was associated with
> the packet, but I haven’t looked closely recently.
>

That's a keen observation. The nonce (sample) used for header protection
does depend on the output of the AEAD. However, because the sample is taken
from the very first few bytes of the AEAD output, it is possible to start
calculating the header protection mask before the AEAD operation finishes,
unless the packet is tiny. The position of the sample was deliberately
chosen to leave room for this type of optimization; I recall at least one
person (Martin Thomson) arguing for preserving this possibility.
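A sketch of why the overlap works. Per RFC 9001, the sample starts 4 bytes past the packet-number offset, i.e. within the first bytes of the AEAD ciphertext; the packet layout here is simplified and the function name is hypothetical:

```python
SAMPLE_LEN = 16

def hp_sample(packet: bytes, pn_offset: int) -> bytes:
    # RFC 9001, Section 5.4.2: the sample begins 4 bytes after the start
    # of the packet number field, which places it in the first bytes of
    # the AEAD ciphertext. An encryptor can therefore compute the header
    # protection mask as soon as the first block or two of AEAD output
    # exist, overlapping header protection with the rest of the AEAD.
    start = pn_offset + 4
    return packet[start:start + SAMPLE_LEN]
```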

[1]
https://www.fastly.com/blog/measuring-quic-vs-tcp-computational-efficiency

-- 
Kazuho Oku