Re: Impact of hardware offloads on network stack performance

Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> Fri, 06 April 2018 21:20 UTC

From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
Date: Fri, 06 Apr 2018 17:20:11 -0400
Message-ID: <CAN1APdeQSHXB7dmwdJd_ndrCCL9y7YgNCnyaux1krcn9Mceb7A@mail.gmail.com>
Subject: Re: Impact of hardware offloads on network stack performance
To: Mike Bishop <mbishop@evequefou.be>, Ian Swett <ianswett@google.com>
Cc: IETF QUIC WG <quic@ietf.org>, Praveen Balasubramanian <pravb@microsoft.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/aa34lb_puM5avklBcMtYlm0QiM0>

Yes, it is touching the memory that is the big problem.

24 ns is a measurable amount of time in my world, where a 500-byte structured
message is easily handled, but in reality it is the same latency as roughly 7
meters of extra Ethernet cable, due to c (the speed of light).
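
For reference, a back-of-the-envelope check of that equivalence, assuming
signal propagation at the vacuum speed of light (real cable is closer to
2/3 c, so the equivalent length would be somewhat shorter):

    # Rough equivalence between 24 ns of added latency and cable length.
    # Assumes propagation at c = 3e8 m/s (vacuum); real cable is ~0.66 c.
    latency_s = 24e-9
    c = 3.0e8                  # metres per second
    print(latency_s * c)       # -> 7.2 (metres)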

CPU cores rely on cache coherency and prefetching. As long as data is written
only once, it can funnel into the different CPU cache systems, but if the data
is touched again later, the coherency traffic between CPUs must invalidate a
lot of assumptions and start over.

In the current form of PR 1079, the packet number is encrypted by writing
over data in the buffer. So in order to decrypt it, or just to verify it, you
must either copy the buffer - which means that all other interested CPUs must
fetch new data - or modify the data in place, which means cache invalidation.

Fortunately this part of the problem is not so hard to solve: remove the
packet number from the AEAD-protected data - there is no benefit in having it
there - and store and encrypt it separately. The difficulty is that an AES
block is 16 bytes, and smaller encryption schemes are hard to make secure. By
overwriting packet data, PR 1079 places part of that block over existing
non-packet-number data, which saves space but hurts the cache.

My unverified assumption is that this can be improved: the packet number can
be encrypted indirectly by first encrypting some arbitrary content (say, the
tag) of the existing, already-encrypted packet. That encrypted block is NOT
stored but xor'ed with the packet number. The masked packet number is then
appended after the tag, or anywhere else, just not inside the AEAD data.
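
A minimal sketch of the send side to make this concrete (my own illustration,
not the PR 1079 design; the key, the sample location and the function name are
all made up, and a separate packet number key is assumed):

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def mask_packet_number(pn_key: bytes, sample: bytes, pn: bytes) -> bytes:
        # sample: a 16-byte slice of already-encrypted packet data, e.g. the tag.
        # pn: the truncated packet number as encoded on the wire (1-4 bytes).
        # The 16-byte mask only ever lives in a temporary; nothing already
        # written to the packet buffer is overwritten.
        enc = Cipher(algorithms.AES(pn_key), modes.ECB()).encryptor()
        mask = enc.update(sample) + enc.finalize()
        # The masked result is appended outside the AEAD-protected data.
        return bytes(p ^ m for p, m in zip(pn, mask))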

So how to decrypt? When the packet is received, you encrypt the same data as
before and get 16 bytes out. You locate the encrypted packet number and xor it
with that block to get the cleartext packet number.

Now you can start to decrypt and/or verify the AEAD packet content, because
the decrypted packet number is the IV used in that crypto process. Notice that
in this process we only operate on a 16-byte temporary block that we do NOT
store in the packet buffer. We get a decoded packet number that we also do NOT
store in the packet buffer; we can store it anywhere else, as long as it is in
a memory area with its own cache-line boundary.
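
The receive side of the same sketch (again illustrative, with the nonce
construction deliberately simplified compared to the real QUIC AEAD rules):
regenerate the identical mask, xor to recover the packet number, then feed it
into the AEAD nonce, without writing anything back into the packet buffer:

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def unmask_packet_number(pn_key: bytes, sample: bytes, masked_pn: bytes) -> bytes:
        # Same AES-ECB mask as on the send side; xor is its own inverse.
        enc = Cipher(algorithms.AES(pn_key), modes.ECB()).encryptor()
        mask = enc.update(sample) + enc.finalize()
        return bytes(p ^ m for p, m in zip(masked_pn, mask))

    def aead_nonce(static_iv: bytes, packet_number: int) -> bytes:
        # Simplified: xor the (reconstructed) packet number into the static IV.
        pn = packet_number.to_bytes(len(static_iv), "big")
        return bytes(a ^ b for a, b in zip(static_iv, pn))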

This is still not ideal, because the latency is not only the 24 ns of
decryption but also writing the decrypted packet number somewhere and
communicating it to a process that operates on the packet data. For example,
if we have two cores - one for verification and one for decryption - then one
must necessarily send data to the other, or they must both do the same packet
number decryption (which is probably the better idea, but now it costs 48 ns
in total).



On 6 April 2018 at 22.55.33, Ian Swett (ianswett@google.com) wrote:

PNE doesn't preclude UDP LRO or pacing offload.

However, in a highly optimized stack (i.e. the Disk|Crypt|Net paper from
SIGCOMM 2017), not touching (i.e. bringing into memory or cache) the content
to be sent until the last possible moment is critical.  PNE likely means
touching that content much earlier.

But personally, I consider the actual crypto we're talking about adding (2
AES instructions?) basically free.

On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:

> Thanks for the clarification.  I guess what I’m really getting down to is
> how do we quantify the actual *cost* of PNE?  Are we substantially
> increasing the crypto costs?  If Ian is saying that crypto is comparatively
> cheap and the cost is that it’s harder to offload something that’s
> comparatively cheap, what have we lost?  I’d think we want to offload the
> most intensive piece we can, and it seems like we’re talking about crypto
> offloads....  Or are we saying instead that the crypto makes it harder to
> offload other things in the future, like a QUIC equivalent to LRO/LSO?
>
>
>
> *From:* Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
> *Sent:* Thursday, April 5, 2018 9:45 PM
> *To:* Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
> *Cc:* Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <
> quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> To be clear - I don’t think crypto is overshadowing other issues, as Mike
> read my post to suggest. It certainly comes at a cost, but either multiple
> cores or co-processors will deal with this. 1000 ns is 10 Gbps crypto speed
> on one core, and it is highly parallelisable and cache friendly.
>
>
>
> But if you have to drip your packets through a traditional send interface
> that is copying, buffering or blocking, and certainly sync’ing with the
> kernel - it is going to be tough.
>
>
>
> For receive you risk high latency or too much scheduling in receive
> buffers.
>
>
> ------------------------------
>
> *From:* Ian Swett <ianswett@google.com>
> *Sent:* Friday, April 6, 2018 4:06:01 AM
> *To:* Mike Bishop
> *Cc:* Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
>
>
> On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:
>
> That’s interesting data, Praveen – thanks.  There is one caveat there,
> which is that (IIUC, on current hardware) you can’t do packet pacing with
> LSO, and so my understanding is that even for TCP many providers disable
> this in pursuit of better egress management.  So what we’re really saying
> is that:
>
>    - On current hardware and OS setups, UDP achieves less than half the
>    throughput of TCP; that needs some optimization, as we’ve already discussed
>    - TCP will dramatically gain performance once hardware offload catches
>    up and allows *paced* LSO 😊
>
>
>
> However, as Mikkel points out, the crypto costs are likely to overwhelm
> the OS throughput / scheduling issues in real deployments.  So I think the
> other relevant piece to understanding the cost here is this:
>
>
>
> I have a few cases (both client and server) where my UDP send costs are
> more than 30% (in some cases 50%) of CPU consumption, and crypto is less
> than 10%.  So currently, I assume crypto is cheap (especially AES-GCM when
> hardware acceleration is available) and egress is expensive.  UDP ingress is
> not that expensive, but could use a bit of optimization as well.
>
>
>
> The only 'fancy' thing our current code is doing for crypto is encrypting
> in place, which was a ~2% win.  Nice, but not transformative.  See
> EncryptInPlace
> <https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893>
>
>
>
> I haven't done much work benchmarking Windows, so possibly the Windows UDP
> stack is really fast, and so crypto seems really slow in comparison?
>
>
>
>
>    - What is the throughput and relative cost of TLS/TCP versus QUIC
>    (i.e. how much are the smaller units of encryption hurting us versus, say,
>    16KB TLS records)?
>
>
>    - TLS implementations already vary here:  Some implementations choose
>    large record sizes, some vary record sizes to reduce delay / HoLB, so this
>    probably isn’t a single number.
>
>
>    - How much additional load does PNE add to this difference?
>    - To what extent would PNE make future crypto offloads *impossible*
>    versus *requires more R&D to develop*?
>
>
>
> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of* Mikkel Fahnøe
> Jørgensen
> *Sent:* Wednesday, April 4, 2018 1:34 PM
> *To:* Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>;
> IETF QUIC WG <quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> Thanks for sharing these numbers.
>
>
>
> My guess is that these offloads deal with kernel scheduling, memory cache
> issues, and interrupt scheduling.
>
>
>
> It probably has very little to do with crypto, TCP headers, and any other
> cpu sensitive processing.
>
>
>
> This is where netmap enters: you can transparently feed the data as it
> becomes available with very little sync work, and you can also efficiently
> pipeline so you pass data to the decryptor, then to the app, with as little
> bus traffic as possible. No need to copy data or synchronize with the kernel.
>
>
>
> For 1K packets you would use about 1000ns (3 cycles/byte) on crypto and
> this would happen in L1 cache. It would of course consume a core which a
> crypto offload would not, but that can be debated because with 18 core
> systems, your problem is memory and network more than CPU.
>
>
>
> My concern with PNE is for small packets and low latency where
>
> 1) an estimated 24ns for PN encryption and decryption becomes measurable
> if your network is fast enough.
>
> 2) any damage to the packet buffer causes all sorts of memory and CPU bus
> traffic issues.
>
>
>
> 1) is annoying. 2) is bad, but completely avoidable - just not as PR 1079 is
> currently formulated.
>
>
>
> 2) is also likely bad for hardware offload units as well.
>
>
>
> As to SR-IOV - I’m not familiar with it, but obviously there is some IO
> abstraction layer - the question is how you make it accessible to apps, as
> opposed to device drivers that do not work with your custom QUIC stack, and
> netmap is one option here.
>
>
>
> Kind Regards,
>
> Mikkel Fahnøe Jørgensen
>
>
>
> On 4 April 2018 at 22.15.52, Praveen Balasubramanian (
> pravb=40microsoft.com@dmarc.ietf.org) wrote:
>
> Some comparative numbers from an out-of-the-box, default-settings Windows
> Server 2016 (released version) for a single connection with a
> microbenchmark tool:
>
>
>
> Offloads enabled        TCP Gbps   UDP Gbps
> LSO + LRO + checksum    24         3.6
> Checksum only           7.6        3.6
> None                    5.6        2.3
>
>
>
> This is for a fully bottlenecked CPU core -- *if you run lower data rates
> there is still a significant difference in cycles/byte cost*. The same
> increased CPU cost applies for client systems going over high-data-rate
> Wi-Fi and cellular.
>
>
>
> This is without any crypto. Once you add crypto the numbers become much
> worse *with crypto cost becoming dominant*. Adding another crypto step
> further exacerbates the problem. Hence crypto offload gains in importance
> followed by these batch offloads.
>
>
>
> If folks need any more numbers I’d be happy to provide them.
>
>
>
> Thanks
>
>
>
>