Re: AEAD and header encryption overhead in QUIC

Kazuho Oku <kazuhooku@gmail.com> Thu, 18 June 2020 21:40 UTC

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Fri, 19 Jun 2020 06:39:52 +0900
Message-ID: <CANatvzw2pj-CaXbAJ_-kUrvmoC3_oyvnSo7Yn+mX-kBuqVkr3w@mail.gmail.com>
Subject: Re: AEAD and header encryption overhead in QUIC
To: =?UTF-8?Q?Mikkel_Fahn=C3=B8e_J=C3=B8rgensen?= <mikkelfj@gmail.com>, Nick Banks <nibanks@microsoft.com>
Cc: Matt Joras <matt.joras@gmail.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/0u-hWb32SlqzEgAvu7-2Tuatl_w>

Nick, Matt, Mikkel, thank you for your comments.

On Thu, Jun 18, 2020 at 23:52 Nick Banks <nibanks@microsoft.com> wrote:

> We’ve found that an easy way to minimize the CPU cost of the header
> protection (on send and receive) is batching. If you copy 8 packet headers
> into a single contiguous block of memory, you can do a single crypto
> operation on it. The cost of doing one batch of 8 is essentially the same
> as doing just a single header. This effectively cuts the cost of header
> protection to 1/8th the original cost. Obviously this only works if you
> have enough packets to batch process, but for high throughput tests, you
> usually do.
>
>
>
> Feel free to take a look at the MsQuic code (receive path
> <https://github.com/microsoft/msquic/blob/master/src/core/connection.c#L4784>
> , send path
> <https://github.com/microsoft/msquic/blob/master/src/core/packet_builder.c#L728>).
> The nice thing about this design (for those of us who don’t have as much
> special crypto expertise) is that it doesn’t require any special crypto
> changes. I’d be interested to see if our two approaches could be combined
> for even better performance!
>

Thank you for sharing your experience. This is indeed a sensible approach
to minimizing the cost of header protection. On the receive side,
processing multiple packets in a batch is the only way of minimizing the
cost of header protection, because AEAD unprotection cannot start until
the result of header unprotection is obtained. Considering that modern
x86-64 CPUs can run 4 to 8 AES block operations in parallel, doing header
protection for one packet at a time wastes CPU resources.

To paraphrase, the two approaches are orthogonal on the receive side, and
combining them would give better performance. My napkin math suggests I can
expect at most a 3% reduction in the CPU cycles spent in crypto.
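To illustrate the batching structure Nick describes, here is a minimal sketch. In QUIC the per-packet mask is AES-ECB of the header-protection key over a 16-byte ciphertext sample (RFC 9001); the sketch below substitutes a SHA-256-based stand-in PRF so it runs without a crypto library, and all function names are hypothetical. The point is the shape: copy the samples into one contiguous buffer and derive all the masks in one pass, rather than one crypto call per packet.

```python
import hashlib

def mask_for_sample(hp_key: bytes, sample: bytes) -> bytes:
    # Stand-in PRF for AES-ECB(hp_key, sample); real QUIC uses AES here,
    # which is where batching pays off (parallel AES-NI pipelines).
    return hashlib.sha256(hp_key + sample).digest()[:5]

def batched_masks(hp_key: bytes, samples: list) -> list:
    # Copy all 16-byte samples into one contiguous buffer and walk it in
    # a single pass, mirroring MsQuic's batch-of-8 header processing.
    buf = b"".join(samples)
    return [mask_for_sample(hp_key, buf[off:off + 16])
            for off in range(0, len(buf), 16)]

def unprotect_first_byte(first_byte: int, mask: bytes) -> int:
    # Long header (high bit set): low 4 bits are protected;
    # short header: low 5 bits (RFC 9001, Section 5.4.1).
    bits = 0x0F if first_byte & 0x80 else 0x1F
    return first_byte ^ (mask[0] & bits)
```

With real AES the single contiguous pass lets the implementation issue the 8 block encryptions back to back, which is what keeps the AES units saturated.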

On the send side, however, I do not expect a performance increase from
combining the two approaches. This "game" is about keeping the CPU
pipelines that execute AES-NI running at full speed. The cost does not
change as long as the AES operations of header protection run in parallel
with other AES operations, regardless of whether those are AEAD operations
or header protection of other packets.

If there is a chance of further improving performance, I tend to think it
would come from overlapping the AEAD operations of multiple packets, again
to keep the AES-NI pipeline busy. With Fusion, we have reached 90% of the
theoretical maximum speed (9th-generation Intel Core CPUs can do 64 bytes
of AES encryption in 40 clocks). The question for us is whether we want to
change our API for a single-digit performance gain.
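As a sanity check on that "theoretical maximum" figure, the arithmetic works out as follows. The 3 GHz clock is an assumed frequency for illustration, not a number from this thread:

```python
BYTES_PER_WINDOW = 64   # bytes of AES encryption per 40-clock window (9th-gen Core)
CLOCKS = 40
CLOCK_HZ = 3.0e9        # assumed 3 GHz core clock (hypothetical)

# Peak AES throughput per core at that clock: ~4.8 GB/s.
peak_bytes_per_sec = BYTES_PER_WINDOW / CLOCKS * CLOCK_HZ

# At 90% of peak, Fusion's achieved throughput would be ~4.3 GB/s.
achieved_bytes_per_sec = 0.90 * peak_bytes_per_sec
```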


On Fri, Jun 19, 2020 at 2:21 Matt Joras <matt.joras@gmail.com> wrote:

> I was curious, since it's not mentioned on the PRs that I saw, were these
> tests done with any adjustments to the ACK frequency on the client? Or is
> this with the default ACK policy of ACKing every other?
>

This is an important point, thank you for asking. We did reduce the ACK
frequency to once per 1/8 CWND, as we did in our previous report [1]. In
fact, you cannot turn that off with quicly.


On Fri, Jun 19, 2020 at 2:35 Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> wrote:

> I am a bit surprised that it is possible to use bulk encrypt headers as
> Nick suggests as I would have thought the key / nonce was associated with
> the packet, but I haven’t looked closely recently.
>

That's a keen observation. The nonce (sample) used for header protection
does depend on the output of the AEAD. However, because the sample is taken
from the very first few bytes of the AEAD output, it is possible to start
calculating the header protection mask before the AEAD operation finishes,
unless the packet is tiny. The position of the sample was deliberately
chosen to leave room for this type of optimization; I recall at least one
person (Martin Thomson) arguing for preserving this possibility.
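A sketch of why the overlap works. Per RFC 9001, the sample starts 4 bytes past the packet-number offset, i.e. within the first bytes of the AEAD ciphertext; the packet layout here is simplified and the function name is hypothetical:

```python
SAMPLE_LEN = 16

def hp_sample(packet: bytes, pn_offset: int) -> bytes:
    # RFC 9001, Section 5.4.2: the sample begins 4 bytes after the start
    # of the packet number field, which places it in the first bytes of
    # the AEAD ciphertext. An encryptor can therefore compute the header
    # protection mask as soon as the first block or two of AEAD output
    # exist, overlapping header protection with the rest of the AEAD.
    start = pn_offset + 4
    return packet[start:start + SAMPLE_LEN]
```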

[1]
https://www.fastly.com/blog/measuring-quic-vs-tcp-computational-efficiency

-- 
Kazuho Oku