Re: AEAD and header encryption overhead in QUIC

Kazuho Oku <kazuhooku@gmail.com> Fri, 19 June 2020 04:42 UTC

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Fri, 19 Jun 2020 13:42:05 +0900
Message-ID: <CANatvzxsPhZvyLtk80aPCnr+g7rdpDngvSLxp6eWRYppEHjTrQ@mail.gmail.com>
Subject: Re: AEAD and header encryption overhead in QUIC
To: Matt Joras <matt.joras@gmail.com>
Cc: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>, Nick Banks <nibanks@microsoft.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/VBgI7A6sxzU8UhVmQ9jE7YrpkPI>

On Fri, Jun 19, 2020 at 8:24 Matt Joras <matt.joras@gmail.com> wrote:

> Thanks for the explanations, Kazuho! Would it be possible to share the
> commands you used to produce the given results? It would be interesting so
> others can do direct comparisons with their own performance benchmarking.
> It is always hard to do comparisons across implementations with varying
> hardware and OS versions.
>

That's a reasonable thing to ask.

While I'm not fully certain that direct comparisons between stacks are
helpful (everybody has different requirements and deployments), I am
certain that sharing the setup is a good idea. It also helps me reproduce
the results in the future.

I've written down the setup that we used for the benchmark shown in
https://github.com/h2o/quicly/pull/359 at
https://github.com/h2o/quicly/wiki/Benchmarking-CPU-usage. Regarding the
AES-GCM benchmark, I think you can find sufficient information in the last
paragraph of https://github.com/h2o/picotls/pull/310.


>
> Matt Joras
>
> On Thu, Jun 18, 2020 at 2:40 PM Kazuho Oku <kazuhooku@gmail.com> wrote:
>
>> Nick, Matt, Mikkel, thank you for your comments.
>>
> >> On Thu, Jun 18, 2020 at 23:52 Nick Banks <nibanks@microsoft.com> wrote:
>>
>>> We’ve found that an easy way to minimize the CPU cost of the header
>>> protection (on send and receive) is batching. If you copy 8 packet headers
>>> into a single contiguous block of memory, you can do a single crypto
>>> operation on it. The cost of doing one batch of 8 is essentially the same
>>> as doing just a single header. This effectively cuts the cost of header
>>> protection to 1/8th the original cost. Obviously this only works if you
>>> have enough packets to batch process, but for high throughput tests, you
>>> usually do.
>>>
>>>
>>>
>>> Feel free to take a look at the MsQuic code (receive path
>>> <https://github.com/microsoft/msquic/blob/master/src/core/connection.c#L4784>
>>> , send path
>>> <https://github.com/microsoft/msquic/blob/master/src/core/packet_builder.c#L728>).
>>> The nice thing about this design (for those of us who don’t have as much
>>> special crypto expertise) is that it doesn’t require any special crypto
>>> changes. I’d be interested to see if our two approaches could be combined
>>> for even better performance!
>>>
>>
> >> Thank you for sharing your experience. This is indeed a sensible approach
> >> for minimizing the cost of header protection. On the receive side,
> >> processing multiple packets in a batch is the only way of minimizing the
> >> cost of header protection, because AEAD unprotection cannot start until the
> >> result of header unprotection has been obtained. Considering that modern
> >> x86-64 CPUs can run 4 to 8 AES block operations in parallel, doing header
> >> protection for one packet at a time is a waste of CPU resources.
>>
> >> To paraphrase, the two approaches are orthogonal on the receive side, and
> >> combining them would give better performance. My napkin tells me that I can
> >> expect at most a 3% reduction in CPU cycles spent in crypto.
>>
> >> On the send side, however, I do not expect to see a performance increase
> >> from combining the two approaches. This "game" is about keeping the CPU
> >> pipelines that do AES-NI running at full speed. The cost does not
> >> change as long as the AES operations of header protection run in
> >> parallel with other AES operations, regardless of whether those are AEAD
> >> or header protection of other packets.
>>
> >> If there is a chance of improving performance further, I tend to think
> >> that it would come from overlapping the AEAD operations of multiple
> >> packets. Again, the goal is to keep the AES-NI pipeline busy. With Fusion,
> >> we've reached 90% of the theoretical maximum speed (9th-gen Intel Core CPUs
> >> can do 64 bytes of AES encryption in 40 clocks). The question for us is
> >> whether we want to change our API for a single-digit percentage gain.
>>
>>
> >> On Fri, Jun 19, 2020 at 2:21 Matt Joras <matt.joras@gmail.com> wrote:
>>
>>> I was curious, since it's not mentioned on the PRs that I saw, were
>>> these tests done with any adjustments to the ACK frequency on the client?
>>> Or is this with the default ACK policy of ACKing every other?
>>>
>>
>> This is an important point, thank you for asking. We did reduce ACK
>> frequency to 1/8 CWND, as we did in our previous report [1]. In fact, you
>> cannot turn that off with quicly.
>>
>>
> >> On Fri, Jun 19, 2020 at 2:35 Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> wrote:
>>
>>> I am a bit surprised that it is possible to use bulk encrypt headers as
>>> Nick suggests as I would have thought the key / nonce was associated with
>>> the packet, but I haven’t looked closely recently.
>>>
>>
> >> That's a keen observation. The nonce used for header protection does
> >> depend on the output of AEAD. However, because the nonce is taken from the
> >> very first few bytes of the AEAD output, it is possible to start
> >> calculating the header protection vector before the AEAD operation
> >> finishes, unless the packet is tiny. The position of the nonce was
> >> deliberately chosen to provide room for this type of optimization. I
> >> recall at least one person (Martin Thomson) arguing for having this
> >> possibility.
>>
>> [1]
>> https://www.fastly.com/blog/measuring-quic-vs-tcp-computational-efficiency
>>
>> --
>> Kazuho Oku
>>
>

-- 
Kazuho Oku
