RE: Impact of hardware offloads on network stack performance

Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> Mon, 23 April 2018 18:25 UTC

From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
Date: Mon, 23 Apr 2018 11:25:00 -0700
Message-ID: <CAN1APdc3y0EwFqeYVvZs7MtBHhS_9CzwGmcwRqi_6GHWzF3_2Q@mail.gmail.com>
Subject: RE: Impact of hardware offloads on network stack performance
To: Ian Swett <ianswett=40google.com@dmarc.ietf.org>, "Deval, Manasi" <manasi.deval@intel.com>, Mike Bishop <mbishop@evequefou.be>
Cc: IETF QUIC WG <quic@ietf.org>, Praveen Balasubramanian <pravb@microsoft.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/pkxZgwPAsCeuZ8yPcwThWRTCow8>

I’m not really familiar with this dump, but what does partial segmentation
mean here?

Are you splitting a single UDP datagram into multiple fragments, or are you
processing multiple separate QUIC packets, one PN each?

If you fragment UDP datagrams, that can be a big issue because different
fragments may race along different routes, causing unexpected delays and
blocking at the other end.



On 23 April 2018 at 20.18.54, Deval, Manasi (manasi.deval@intel.com) wrote:

If we do TSO / LSO with QUIC, the initial segment provided can look like –

MAC – IP – UDP – QUIC Clear – QUIC encrypted header (PN=x) – QUIC Encrypted
payload.



If we just use the stack’s UDP segmentation (using the patch at
https://marc.info/?l=linux-netdev&m=152399530603992&w=2), the QUIC packets
on the wire would look like:

MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x, partial segment 1)

MAC – IP – UDP – QUIC encrypted part (partial segment 2)

MAC – IP – UDP – QUIC encrypted part (partial segment 3)

…



Putting these back together in a stack / hardware looks like a painful and
potentially non-deterministic activity.
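
For concreteness, here is a minimal sketch of how a sender might use the kernel UDP segmentation interface from the patch referenced above (the Linux UDP_SEGMENT socket option; the exact semantics here are my reading of that work, not taken from it verbatim). One large send() becomes many wire datagrams, each with its own MAC/IP/UDP headers, but only the first slice carries the QUIC header bytes, which is what produces the layout above:

```c
/* Sketch: one large send() sliced into gso_size-byte UDP datagrams by the
 * stack/NIC via Linux UDP GSO (UDP_SEGMENT).  Each slice gets its own
 * MAC/IP/UDP headers, but only the first slice contains the QUIC clear
 * header and PN, as in the layout above.  Error handling omitted. */
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SOL_UDP
#define SOL_UDP IPPROTO_UDP    /* both are 17 on Linux */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103        /* from linux/udp.h, if the libc header lacks it */
#endif

ssize_t send_gso(int fd, const struct sockaddr_in *dst,
                 const void *buf, size_t len, int gso_size)
{
    /* Every subsequent send on this socket is sliced into gso_size chunks. */
    if (setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size)) < 0)
        return -1;
    /* One syscall, many wire datagrams. */
    return sendto(fd, buf, len, 0,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```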





It would be easier to do a native QUIC offload, where the packet on the wire
can be identified by its QUIC CID and the appropriate packet number. The
original packet again looks like: MAC – IP – UDP – QUIC Clear – QUIC encrypted
header (PN=x) – QUIC Encrypted payload.

If we instead create another kind of segmentation, call it QUIC segmentation,
the packets on the wire would be:

MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x, partial segment 1)

MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x+1, partial segment
2)

MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x+2, partial segment
3)

…

If this set of segments were to be re-assembled, the ability to identify the
CID and PN would simplify the task. I think this is a simpler
implementation. Do others see an issue with this?
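
Purely as an illustration of what such a native QUIC segmentation offload would need to know (a hypothetical interface, not an existing driver or NIC feature), the per-send metadata is small: the clear header template, the CID, the first packet number, and the segment size, with the device writing PN = x + i into segment i:

```c
/* Hypothetical descriptor for the "QUIC segmentation" offload described
 * above -- illustrative only; no such driver/NIC interface exists today.
 * The device would emit one packet per segment, copying the clear header
 * template and writing PN = first_pn + i into segment i before (or while)
 * encrypting it. */
#include <stddef.h>
#include <stdint.h>

struct quic_seg_desc {
    const uint8_t *hdr_template;   /* MAC + IP + UDP + QUIC clear header    */
    size_t         hdr_len;
    const uint8_t *payload;        /* data to slice into segments           */
    size_t         payload_len;
    uint8_t        cid[18];        /* connection ID used on the wire        */
    uint8_t        cid_len;
    uint64_t       first_pn;       /* PN of segment 0; segment i gets +i    */
    uint16_t       seg_size;       /* payload bytes per emitted packet      */
};
```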



Thanks,

Manasi





*From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Ian Swett
*Sent:* Friday, April 06, 2018 1:55 PM
*To:* Mike Bishop <mbishop@evequefou.be>
*Cc:* Praveen Balasubramanian <pravb@microsoft.com>; Mikkel Fahnøe
Jørgensen <mikkelfj@gmail.com>; IETF QUIC WG <quic@ietf.org>
*Subject:* Re: Impact of hardware offloads on network stack performance



PNE doesn't preclude UDP LRO or pacing offload.



However, in a highly optimized stack (e.g., the Disk|Crypt|Net paper from
SIGCOMM 2017), not touching (i.e., bringing into memory or cache) the content
to be sent until the last possible moment is critical.  PNE likely means
touching that content much earlier.



But personally, I consider the actual crypto we're talking about adding
(2 AES instructions?) to be basically free.
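
For a sense of what is being priced here: the per-packet work PNE adds is roughly one AES block operation over a sample of the packet ciphertext, whose output is XORed over the packet number bytes. The sketch below (using OpenSSL's low-level AES API) is illustrative of that shape only; it is not the exact PR 1079 construction:

```c
/* Illustrative sketch of packet-number encryption cost: one AES block over
 * a sample of the packet ciphertext produces a mask that is XORed over the
 * packet number bytes.  This mirrors the general shape of PNE, not the
 * exact PR 1079 construction. */
#include <openssl/aes.h>
#include <stdint.h>

void pn_encrypt(const uint8_t pn_key[16],   /* separate PN-protection key      */
                const uint8_t *sample,      /* 16 bytes of packet ciphertext   */
                uint8_t pn_bytes[4])        /* packet number field, in place   */
{
    AES_KEY key;
    uint8_t mask[16];

    AES_set_encrypt_key(pn_key, 128, &key);
    /* One AES block operation over the ciphertext sample. */
    AES_encrypt(sample, mask, &key);
    /* XOR the first mask bytes over the packet number. */
    for (int i = 0; i < 4; i++)
        pn_bytes[i] ^= mask[i];
}
```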



On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:

Thanks for the clarification.  I guess what I’m really getting down to is
how do we quantify the actual *cost* of PNE?  Are we substantially
increasing the crypto costs?  If Ian is saying that crypto is comparatively
cheap and the cost is that it’s harder to offload something that’s
comparatively cheap, what have we lost?  I’d think we want to offload the
most intensive piece we can, and it seems like we’re talking about crypto
offloads....  Or are we saying instead that the crypto makes it harder to
offload other things in the future, like a QUIC equivalent to LRO/LSO?



*From:* Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
*Sent:* Thursday, April 5, 2018 9:45 PM
*To:* Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
*Cc:* Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <
quic@ietf.org>
*Subject:* Re: Impact of hardware offloads on network stack performance



To be clear, I don’t think crypto is overshadowing other issues, as Mike
read my post. It certainly comes at a cost, but either multiple cores or
co-processors will deal with it. 1000 ns per ~1 KB packet is roughly 10 Gbps
of crypto throughput on one core, and it is highly parallelisable and cache
friendly.



But if you have to drip your packets through a traditional send interface
that is copying, buffering, or blocking, and certainly syncing with the
kernel, it is going to be tough.



For receive, you risk high latency or too much scheduling in receive buffers.


------------------------------

*From:* Ian Swett <ianswett@google.com>
*Sent:* Friday, April 6, 2018 4:06:01 AM
*To:* Mike Bishop
*Cc:* Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
*Subject:* Re: Impact of hardware offloads on network stack performance





On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:

That’s interesting data, Praveen – thanks.  There is one caveat there,
which is that (IIUC, on current hardware) you can’t do packet pacing with
LSO, and so my understanding is that even for TCP many providers disable
this in pursuit of better egress management.  So what we’re really saying
is that:

   - On current hardware and OS setups, UDP achieves less than half the
     throughput of TCP; that needs some optimization, as we’ve already discussed
   - TCP will gain performance dramatically once hardware offload catches
     up and allows *paced* LSO 😊



However, as Mikkel points out, the crypto costs are likely to overwhelm the
OS throughput / scheduling issues in real deployments.  So I think the
other relevant piece to understanding the cost here is this:



I have a few cases (both client and server) where my UDP send costs are more
than 30% (in some cases 50%) of CPU consumption, and crypto is less than
10%.  So currently, I assume crypto is cheap (especially AES-GCM when
hardware acceleration is available) and egress is expensive.  UDP ingress is
not that expensive, but could use a bit of optimization as well.



The only 'fancy' thing our current code is doing for crypto is encrypting
in place, which was a ~2% win.  Nice, but not transformative.  See
EncryptInPlace
<https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893>
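
For readers who have not seen it, “encrypting in place” simply means passing the same buffer as AEAD input and output so the payload is never copied. A minimal sketch with OpenSSL EVP (my own illustration, not the Chromium code behind the link):

```c
/* Minimal in-place AES-128-GCM encryption sketch with OpenSSL EVP.
 * Passing the same pointer as input and output avoids copying the
 * payload, which is the small win described above. */
#include <openssl/evp.h>

int encrypt_in_place(const unsigned char key[16], const unsigned char iv[12],
                     unsigned char *buf, int len, unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outl = 0, ok = 0;

    if (ctx &&
        EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv) &&
        EVP_EncryptUpdate(ctx, buf, &outl, buf, len) &&   /* out == in */
        EVP_EncryptFinal_ex(ctx, buf + outl, &outl) &&
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}
```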



I haven't done much work benchmarking Windows, so possibly the Windows UDP
stack is really fast, and so crypto seems really slow in comparison?




   - What is the throughput and relative cost of TLS/TCP versus QUIC (i.e.
     how much are the smaller units of encryption hurting us versus, say,
     16KB TLS records)?
      - TLS implementations already vary here: some implementations choose
        large record sizes, some vary record sizes to reduce delay / HoLB,
        so this probably isn’t a single number.
   - How much additional load does PNE add to this difference?
   - To what extent would PNE make future crypto offloads *impossible*
     versus *require more R&D to develop*?



*From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Mikkel Fahnøe
Jørgensen
*Sent:* Wednesday, April 4, 2018 1:34 PM
*To:* Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>; IETF
QUIC WG <quic@ietf.org>
*Subject:* Re: Impact of hardware offloads on network stack performance



Thanks for sharing these numbers.



My guess is that these offloads deal with kernel scheduling, memory cache
issues, and interrupt scheduling.



It probably has very little to do with crypto, TCP headers, and other
CPU-sensitive processing.



This is where netmap enters: you can transparently feed data as it becomes
available with very little sync work, and you can also pipeline efficiently,
passing data to the decryptor and then to the app with as little bus traffic
as possible. No need to copy data or synchronize with the kernel.
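
A rough sketch of what that receive path can look like with netmap's user-space helpers (my illustration, assuming the standard nm_open()/nm_nextpkt() API; decrypt_and_deliver() is a hypothetical application hook):

```c
/* Sketch of a netmap receive loop: frames are read straight out of the
 * memory-mapped rings and handed to the decryptor/app without per-packet
 * syscalls or copies into socket buffers. */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

void decrypt_and_deliver(unsigned char *frame, unsigned int len); /* app hook */

void rx_loop(const char *ifname)            /* e.g. "netmap:eth0" */
{
    struct nm_desc *d = nm_open(ifname, NULL, 0, NULL);
    struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
    struct nm_pkthdr h;
    unsigned char *frame;

    for (;;) {
        poll(&pfd, 1, -1);                  /* one wakeup, many packets */
        while ((frame = nm_nextpkt(d, &h)) != NULL) {
            /* frame points into the shared ring: parse UDP/QUIC and decrypt
             * while the data is still hot in cache, then hand it to the app. */
            decrypt_and_deliver(frame, h.len);
        }
    }
}
```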



For 1 KB packets you would spend about 1000 ns (3 cycles/byte) on crypto, and
this would happen in L1 cache. It would of course consume a core, which a
crypto offload would not, but that can be debated: with 18-core systems, your
problem is memory and network more than CPU.
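
Spelling out that back-of-envelope figure (my arithmetic, assuming a ~3 GHz core and the 3 cycles/byte estimate above):

```c
/* Back-of-envelope check of the figures above (assumes a ~3 GHz core). */
#include <stdio.h>

int main(void)
{
    const double cycles_per_byte = 3.0;     /* crypto cost estimate above */
    const double clock_hz        = 3e9;     /* assumed core clock         */
    const double pkt_bytes       = 1024.0;  /* "1K packets"               */

    double ns_per_pkt = pkt_bytes * cycles_per_byte / clock_hz * 1e9;
    double gbps       = pkt_bytes * 8.0 / (ns_per_pkt / 1e9) / 1e9;

    /* Prints roughly "1024 ns per packet, ~8 Gbit/s per core", i.e. the
     * ~1000 ns / ~10 Gbps ballpark mentioned above. */
    printf("%.0f ns per packet, ~%.0f Gbit/s per core\n", ns_per_pkt, gbps);
    return 0;
}
```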



My concern with PNE is for small packets and low latency, where

1) an estimated 24 ns for PN encryption and decryption becomes measurable if
your network is fast enough, and

2) any damage to the packet buffer causes all sorts of memory and CPU bus
traffic issues.



1) is annoying. 2) is bad, and completely avoidable, but not as PR 1079 is
currently formulated.



2) is likely bad for hardware offload units as well.



As to SR-IOV, I’m not familiar with it, but obviously there is some IO
abstraction layer. The question is how you make it accessible to apps, as
opposed to device drivers that do not work with your custom QUIC stack;
netmap is one option here.



Kind Regards,

Mikkel Fahnøe Jørgensen



On 4 April 2018 at 22.15.52, Praveen Balasubramanian (
pravb=40microsoft.com@dmarc.ietf.org) wrote:

Some comparative numbers from an out-of-the-box, default-settings Windows
Server 2016 (released version), for a single connection with a microbenchmark
tool:



Offloads enabled        TCP Gbps   UDP Gbps
LSO + LRO + checksum    24         3.6
Checksum only           7.6        3.6
None                    5.6        2.3



This is for a fully bottlenecked CPU core -- *if you run at lower data rates
there is still a significant difference in cycles/byte cost*. The same
increased CPU cost applies to client systems going over high-data-rate Wi-Fi
and cellular.



This is without any crypto. Once you add crypto, the numbers become much
worse, *with crypto cost becoming dominant*. Adding another crypto step
further exacerbates the problem. Hence crypto offload gains in importance,
followed by these batch offloads.



If folks need any more numbers I’d be happy to provide them.



Thanks