RE: Impact of hardware offloads on network stack performance

"Lubashev, Igor" <ilubashe@akamai.com> Mon, 23 April 2018 19:36 UTC

From: "Lubashev, Igor" <ilubashe@akamai.com>
To: Ian Swett <ianswett=40google.com@dmarc.ietf.org>, Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
CC: Praveen Balasubramanian <pravb@microsoft.com>, IETF QUIC WG <quic@ietf.org>, Mike Bishop <mbishop@evequefou.be>, "Deval, Manasi" <manasi.deval@intel.com>
Subject: RE: Impact of hardware offloads on network stack performance
Date: Mon, 23 Apr 2018 19:36:15 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/pmutk8ZvXNwIp-pKplVd02hS6s0>

The point of HW offloads is to reduce the number of times the kernel (“transport”) stack is invoked to locate the flow, validate headers, enqueue data for delivery to sockets, etc.

If HW can associate multiple QUIC packets with a single flow, remove the encryption, and pass them on to the stack, it does not really need to simulate the data arriving as a single jumbo QUIC packet.
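For concreteness, a rough sketch of the kind of per-batch result such a receive offload could hand up (all names here are hypothetical, nothing below is an existing driver or kernel API) - the stack then does the flow lookup and socket demux once per batch rather than once per packet:

#include <stdint.h>

/* Hypothetical sketch only: what a flow-associating, decrypting NIC
 * could deliver per batch.  Not an existing API. */
struct quic_rx_pkt {
    void     *payload;        /* plaintext, already decrypted by the NIC */
    uint32_t  len;
    uint64_t  packet_number;  /* recovered by the NIC during decryption */
};

struct quic_rx_batch {
    uint32_t            flow_id;  /* NIC handle for (4-tuple, DCID) */
    uint32_t            count;    /* number of packets in this batch */
    struct quic_rx_pkt *pkts;     /* individual packets, not one jumbo packet */
};

/* The stack validates the flow once, then appends 'count' packets to the
 * owning socket's receive queue. */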


  *   Igor


From: Ian Swett [mailto:ianswett=40google.com@dmarc.ietf.org]
Sent: Monday, April 23, 2018 3:20 PM
To: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
Cc: Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <quic@ietf.org>; Mike Bishop <mbishop@evequefou.be>; Deval, Manasi <manasi.deval@intel.com>; ianswett=40google.com@dmarc.ietf.org
Subject: Re: Impact of hardware offloads on network stack performance

To clarify my understanding, you'd have to receive all 3 segments to decrypt the payload?  If that's what you're proposing, then I think it's not a good design for QUIC, because it amplifies the effective loss rate and it doesn't save that much CPU.

It's not that hard for userspace to provide 3 fully formed encrypted packets and use an API like Willem's UDP GSO patch, and I believe that approach is more agile as well.
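For reference, this is roughly what that looks like with the UDP_SEGMENT cmsg interface from that patch (sketch only; the constants are from the uapi as it later landed in Linux and may need fallback defines on older headers): the sender passes one buffer holding several fully formed, already-encrypted packets, and the kernel splits it at the given size.

#include <netinet/in.h>    /* IPPROTO_UDP */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef SOL_UDP
#define SOL_UDP IPPROTO_UDP
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103    /* value from the patch / later Linux uapi */
#endif

/* Send a buffer containing several fully formed QUIC packets in one
 * sendmsg(); the kernel segments it into datagrams of gso_size bytes
 * (the last one may be shorter).  The socket is assumed connected. */
ssize_t send_quic_batch(int fd, const void *buf, size_t len, uint16_t gso_size)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    char ctrl[CMSG_SPACE(sizeof(uint16_t))] = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type  = UDP_SEGMENT;
    cm->cmsg_len   = CMSG_LEN(sizeof(uint16_t));
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

    return sendmsg(fd, &msg, 0);
}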

If you wanted to do crypto offload, then you could provide the ability to offload payload and PN encryption instead of doing it in userspace.  As a bonus, support iovec style scatter-gather, so bulk payloads never need to be brought into CPU cache before they're encrypted and copied to the NIC.

On Mon, Apr 23, 2018 at 2:46 PM Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com> wrote:
Ah, I see. Thanks for explaining.

QUIC has a much richer interface, so there could be, say, 100 active streams waiting for transmission. The QUIC implementation would use some sort of scheduler to move the most important parts first without losing progress on other parts, for example using

https://en.wikipedia.org/wiki/Deficit_round_robin
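(For illustration only, a minimal deficit-round-robin pass over per-stream queues - the helper and field names are made up; this is just the shape of such a scheduler:)

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stream queue.  send_from_stream() is assumed to emit
 * at most 'budget' bytes of the stream's pending data into packets and
 * return how much it actually sent. */
struct stream_q {
    uint64_t id;
    size_t   pending;   /* bytes waiting to be sent */
    size_t   deficit;   /* DRR deficit counter */
};

size_t send_from_stream(struct stream_q *s, size_t budget);

/* One DRR round over n active streams with the given per-round quantum. */
void drr_round(struct stream_q *streams, size_t n, size_t quantum)
{
    for (size_t i = 0; i < n; i++) {
        struct stream_q *s = &streams[i];
        if (s->pending == 0)
            continue;
        s->deficit += quantum;
        size_t sent = send_from_stream(s, s->deficit);
        s->pending -= sent;
        s->deficit -= sent;
        if (s->pending == 0)
            s->deficit = 0;   /* idle streams do not accumulate credit */
    }
}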

Simple segmentation of a huge buffer would probably not work directly, but a means to forward data pre-wrapped in QUIC frames, in reasonably sized packet units but without the AEAD envelope, would make sense. These could be queued in volume, leaving it to an offloader to proceed.

I’ve been looking into the netmap interface where you can feed content in units directly to hardware, and I’m sure you have detailed knowledge of DPDK that does something similar.


Kind Regards,
Mikkel Fahnøe Jørgensen


On 23 April 2018 at 20.38.06, Deval, Manasi (manasi.deval@intel.com) wrote:
I’m trying to lay out the options for TSO/LSO with QUIC. In my previous e-mail I compared the outcome of vanilla UDP segmentation as defined by the latest patch (https://marc.info/?l=linux-netdev&m=152399530603992&w=2) to an alternative native QUIC segmentation.

A segment is a large buffer produced by an application. The segment is often much too large to go out on the wire with just pre-pended headers. Therefore the segment needs to be chopped up into appropriately sized pieces, and the pre-pended headers need to be replicated, to make MTU-size packets for the wire.

I’m suggesting that we should consider processing a large data buffer as multiple QUIC packets each with a unique PN.
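Roughly (the helper names below are made up, short-header packets are assumed, and header protection is ignored for brevity), that segmentation step would look like this:

#include <stddef.h>
#include <stdint.h>

/* Assumed helpers, not a real API:
 *   write_short_header(dst, cid, pn)    -> bytes of header written
 *   aead_seal(dst, src, len, pn, keys)  -> bytes of ciphertext + tag
 *   emit(pkt, len)                      -> hands one wire packet to the NIC
 */
size_t write_short_header(uint8_t *dst, const uint8_t *cid, uint64_t pn);
size_t aead_seal(uint8_t *dst, const uint8_t *src, size_t len,
                 uint64_t pn, const void *keys);

/* Chop one large buffer into MTU-sized QUIC packets, replicating the
 * header and giving each packet its own packet number (PN = x, x+1, ...). */
uint64_t quic_segment(const uint8_t *buf, size_t len, size_t mtu_payload,
                      const uint8_t *cid, uint64_t pn, const void *keys,
                      uint8_t *scratch,
                      void (*emit)(const uint8_t *pkt, size_t len))
{
    while (len > 0) {
        size_t chunk = len < mtu_payload ? len : mtu_payload;
        size_t hlen  = write_short_header(scratch, cid, pn);
        size_t clen  = aead_seal(scratch + hlen, buf, chunk, pn, keys);
        emit(scratch, hlen + clen);
        buf += chunk;
        len -= chunk;
        pn++;
    }
    return pn;   /* next unused packet number */
}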

Thanks,
Manasi


From: Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
Sent: Monday, April 23, 2018 11:25 AM
To: Ian Swett <ianswett=40google.com@dmarc.ietf.org>; Deval, Manasi <manasi.deval@intel.com>; Mike Bishop <mbishop@evequefou.be>
Cc: IETF QUIC WG <quic@ietf.org>; Praveen Balasubramanian <pravb@microsoft.com>
Subject: RE: Impact of hardware offloads on network stack performance

I’m not really familiar with this dump but what does partial segmentation mean here?

Are you splitting a single UDP datagram into multiple fragments, or are you processing multiple separate QUIC packets, one PN each?

If you fragment UDP datagrams, that can be a big issue because different fragments may race along different routes, causing unexpected delays and blocking at the other end.



On 23 April 2018 at 20.18.54, Deval, Manasi (manasi.deval@intel.com) wrote:
If we do TSO / LSO with QUIC, the initial segment provided can look like –
MAC – IP – UDP – QUIC Clear – QUIC encrypted header (PN=x) – QUIC Encrypted payload.


If we just use the stack’s UDP segmentation (using the patch: https://marc.info/?l=linux-netdev&m=152399530603992&w=2), the outcome of QUIC packets on the wire would be:
MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x, partial segment 1/3)
MAC – IP – UDP – QUIC encrypted part (partial segment 2/3)
MAC – IP – UDP – QUIC encrypted part (partial segment 3/3)
…

Putting these back together in a stack / hardware looks like a painful and potentially non-deterministic activity.


It would be easier to do a native QUIC offload, where the packet on the wire can be identified by the QUIC CID and the appropriate packet number. Therefore, the original packet looks like: MAC – IP – UDP – QUIC Clear – QUIC encrypted header (PN=x) – QUIC Encrypted payload.
If we just create another version of segmentation called QUIC segmentation, the outcome of packets on the wire would be:
MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x, partial segment 1/3)
MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x+1, partial segment 2/3)
MAC – IP – UDP – QUIC Clear – QUIC encrypted part (PN=x+2, partial segment 3/3)
…
If this set of segments were to be re-assembled, the ability to identify the CID and PN would simplify this task. I think this is a simpler implementation. Do others see an issue with this?
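(To sketch why that helps the receiver - names again hypothetical - the check for whether a packet extends an existing run collapses to comparing the connection ID and looking for the next consecutive packet number:)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reassembly context for one run of QUIC segments. */
struct quic_lro_run {
    uint8_t  dcid[20];
    size_t   dcid_len;
    uint64_t next_pn;    /* PN expected to extend this run */
};

/* A packet can be appended to the run if it belongs to the same
 * connection and carries the next consecutive packet number. */
static bool extends_run(const struct quic_lro_run *run,
                        const uint8_t *dcid, size_t dcid_len, uint64_t pn)
{
    return dcid_len == run->dcid_len &&
           memcmp(dcid, run->dcid, dcid_len) == 0 &&
           pn == run->next_pn;
}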

Thanks,
Manasi


From: QUIC [mailto:quic-bounces@ietf.org] On Behalf Of Ian Swett
Sent: Friday, April 06, 2018 1:55 PM
To: Mike Bishop <mbishop@evequefou.be>
Cc: Praveen Balasubramanian <pravb@microsoft.com>; Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>; IETF QUIC WG <quic@ietf.org>
Subject: Re: Impact of hardware offloads on network stack performance

PNE doesn't preclude UDP LRO or pacing offload.

However, in a highly optimized stack (e.g. the Disk|Crypt|Net paper from SIGCOMM 2017), not touching (i.e. bringing into memory or cache) the content to be sent until the last possible moment is critical.  PNE likely means touching that content much earlier.

But personally, I consider the actual crypto we're talking about adding (2 AES instructions?) to be basically free.

On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:
Thanks for the clarification.  I guess what I’m really getting down to is how do we quantify the actual cost of PNE?  Are we substantially increasing the crypto costs?  If Ian is saying that crypto is comparatively cheap and the cost is that it’s harder to offload something that’s comparatively cheap, what have we lost?  I’d think we want to offload the most intensive piece we can, and it seems like we’re talking about crypto offloads....  Or are we saying instead that the crypto makes it harder to offload other things in the future, like a QUIC equivalent to LRO/LSO?

From: Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
Sent: Thursday, April 5, 2018 9:45 PM
To: Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
Cc: Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <quic@ietf.org>
Subject: Re: Impact of hardware offloads on network stack performance

To be clear - I don’t think crypto is overshadowing other issues, as Mike read my post. It certainly comes at a cost, but either multiple cores or co-processors will deal with this. 1000ns is 10Gbps crypto speed on one core, and it is highly parallelisable and cache friendly.

But if you have to drip your packets through a traditional send interface that is copying, buffering or blocking, and certainly sync’ing with the kernel - it is going to be tough.

For receive, you risk high latency or too much scheduling in receive buffers.

________________________________
From: Ian Swett <ianswett@google.com>
Sent: Friday, April 6, 2018 4:06:01 AM
To: Mike Bishop
Cc: Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
Subject: Re: Impact of hardware offloads on network stack performance


On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:
That’s interesting data, Praveen – thanks.  There is one caveat there, which is that (IIUC, on current hardware) you can’t do packet pacing with LSO, and so my understanding is that even for TCP many providers disable this in pursuit of better egress management.  So what we’re really saying is that:

  *   On current hardware and OS setups, UDP achieves less than half the throughput of TCP; that needs some optimization, as we’ve already discussed
  *   TCP will dramatically gain performance once hardware offload catches up and allows paced LSO 😊

However, as Mikkel points out, the crypto costs are likely to overwhelm the OS throughput / scheduling issues in real deployments.  So I think the other relevant piece to understanding the cost here is this:

I have a few cases (both client and server) where my UDP send costs are more than 30% (in some cases 50%) of CPU consumption, and crypto is less than 10%.  So currently, I assume crypto is cheap (especially AES-GCM when hardware acceleration is available) and egress is expensive.  UDP ingress is not that expensive, but could use a bit of optimization as well.

The only 'fancy' thing our current code is doing for crypto is encrypting in place, which was a ~2% win.  Nice, but not transformative.  See EncryptInPlace (https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893)
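The general idea (this is not the Chromium code, and the helpers below are hypothetical) is to serialize frames straight into the outgoing datagram buffer and seal them where they sit, so no separate plaintext copy of the payload ever exists:

#include <stddef.h>
#include <stdint.h>

/* Assumed helpers, not a real API:
 *   write_header(dgram, pn)            -> header length written
 *   write_frames(payload, max_len)     -> plaintext frame bytes written
 *   aead_seal_in_place(buf, len, ...)  -> ciphertext + tag length; output
 *                                         overwrites the input buffer     */
size_t write_header(uint8_t *dgram, uint64_t pn);
size_t write_frames(uint8_t *payload, size_t max_len);
size_t aead_seal_in_place(uint8_t *buf, size_t len, uint64_t pn,
                          const void *keys);

/* Build one packet directly in the outgoing datagram and encrypt in place;
 * the only remaining copy is the final one to the NIC. */
size_t build_packet_in_place(uint8_t *dgram, size_t cap, uint64_t pn,
                             const void *keys)
{
    size_t hlen = write_header(dgram, pn);
    size_t plen = write_frames(dgram + hlen, cap - hlen - 16 /* AEAD tag */);
    size_t clen = aead_seal_in_place(dgram + hlen, plen, pn, keys);
    return hlen + clen;
}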

I haven't done much work benchmarking Windows, so possibly the Windows UDP stack is really fast, and so crypto seems really slow in comparison?


  *   What is the throughput and relative cost of TLS/TCP versus QUIC (i.e. how much are the smaller units of encryption hurting us versus, say, 16KB TLS records)?

     *   TLS implementations already vary here:  Some implementations choose large record sizes, some vary record sizes to reduce delay / HoLB, so this probably isn’t a single number.

  *   How much additional load does PNE add to this difference?
  *   To what extent would PNE make future crypto offloads impossible, versus requiring more R&D to develop?

From: QUIC [mailto:quic-bounces@ietf.org] On Behalf Of Mikkel Fahnøe Jørgensen
Sent: Wednesday, April 4, 2018 1:34 PM
To: Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>; IETF QUIC WG <quic@ietf.org>
Subject: Re: Impact of hardware offloads on network stack performance

Thanks for sharing these numbers.

My guess is that these offloads deal with kernel scheduling, memory cache issues, and interrupt scheduling.

It probably has very little to do with crypto, TCP headers, and any other CPU-sensitive processing.

This is where netmap enters: you can transparently feed the data as it becomes available with very little sync work, and you can also pipeline efficiently, so you pass data to the decryptor, then to the app, with as little bus traffic as possible. No need to copy data or synchronize with kernel space.
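A rough sketch of the transmit side with the netmap user API (error handling omitted; details differ between netmap versions, and a real implementation would build each packet directly in the slot buffer rather than memcpy it):

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>   /* nm_desc, NETMAP_TXRING, nm_ring_* helpers */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Place already-built (encrypted) packets into a netmap TX ring and kick
 * the NIC once per batch, i.e. one kernel crossing for many packets. */
static void tx_batch(struct nm_desc *d, uint8_t **pkts, uint16_t *lens, unsigned n)
{
    struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);
    unsigned space = nm_ring_space(ring);
    unsigned cur = ring->cur;

    if (n > space)
        n = space;                       /* send only what fits this round */
    for (unsigned i = 0; i < n; i++) {
        struct netmap_slot *slot = &ring->slot[cur];
        memcpy(NETMAP_BUF(ring, slot->buf_idx), pkts[i], lens[i]);
        slot->len = lens[i];
        cur = nm_ring_next(ring, cur);
    }
    ring->head = ring->cur = cur;
    ioctl(d->fd, NIOCTXSYNC, NULL);      /* one sync for the whole batch */
}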

For 1K packets you would use about 1000ns (3 cycles/byte) on crypto, and this would happen in L1 cache. It would of course consume a core, which a crypto offload would not, but that can be debated because with 18-core systems your problem is memory and network more than CPU.
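(Sanity check on that estimate - it is also roughly where the 10 Gbps-per-core figure above comes from:)

\[
1000\,\text{B} \times 3\,\tfrac{\text{cycles}}{\text{B}} = 3000\,\text{cycles} \approx 1000\,\text{ns at } 3\,\text{GHz},
\qquad
\frac{1000\,\text{B}}{1000\,\text{ns}} = 1\,\text{GB/s} \approx 8\,\text{Gbit/s}.
\]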

My concern with PNE is for small packets and low latency where
1) an estimated 24ns for PN encryption and decryption becomes measurable if your network is fast enough.
2) any damage to the packet buffer causes all sorts of memory and CPU bus traffic issues.

1) is annoying. 2) is bad, and completely avoidable, but not as PR 1079 is currently formulated.

2) is also likely bad for hardware offload units as well.

As to SR-IOV - I’m not familiar with it, but obviously there is some I/O abstraction layer - the question is how you make it accessible to apps, as opposed to device drivers that do not work with your custom QUIC stack, and netmap is one option here.

Kind Regards,
Mikkel Fahnøe Jørgensen


On 4 April 2018 at 22.15.52, Praveen Balasubramanian (pravb=40microsoft.com@dmarc.ietf.org) wrote:
Some comparative numbers from an out-of-box, default-settings Windows Server 2016 (released version), for a single connection with a microbenchmark tool:

Offloads enabled        TCP gbps    UDP gbps
LSO + LRO + checksum    24          3.6
Checksum only           7.6         3.6
None                    5.6         2.3

This is for a fully bottlenecked CPU core -- if you run lower data rates there is still a significant difference in cycles/byte cost. The same increased CPU cost applies for client systems going over high data rate Wi-Fi and cellular.

This is without any crypto. Once you add crypto, the numbers become much worse, with crypto cost becoming dominant. Adding another crypto step further exacerbates the problem. Hence crypto offload gains in importance, followed by these batch offloads.

If folks need any more numbers I’d be happy to provide them.

Thanks