Re: Impact of hardware offloads on network stack performance

Ian Swett <ianswett@google.com> Fri, 06 April 2018 20:55 UTC

From: Ian Swett <ianswett@google.com>
Date: Fri, 06 Apr 2018 20:55:21 +0000
Message-ID: <CAKcm_gMYf6AbHgvg+WKv-woQCtzz9WztiEDf-iRUjKw63Qx=9Q@mail.gmail.com>
Subject: Re: Impact of hardware offloads on network stack performance
To: Mike Bishop <mbishop@evequefou.be>
Cc: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>, Praveen Balasubramanian <pravb@microsoft.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/I1RcDS_hr8IrfXJEbx8AXlypMZY>

PNE doesn't preclude UDP LRO or pacing offload.

However, in a highly optimized stack (e.g., the Disk|Crypt|Net paper from
SIGCOMM 2017), not touching (i.e., bringing into memory or cache) the content
to be sent until the last possible moment is critical.  PNE likely means
touching that content much earlier.

But personally, I consider the actual crypto we're talking about adding
(2 AES instructions?) to be basically free.
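
To make the "2 AES instructions" point concrete, here is a rough sketch of
the kind of per-packet work PNE adds, assuming a mask derived from a 16-byte
ciphertext sample with a single AES block operation and XORed into the packet
number bytes. This is my own illustration of the idea (the function name and
the OpenSSL usage are mine), not the exact PR 1079 construction:

    // Illustrative only: derive a mask from a 16-byte ciphertext sample
    // with one AES-ECB block operation, then XOR it into the packet number.
    #include <openssl/evp.h>
    #include <cstddef>
    #include <cstdint>

    bool MaskPacketNumber(const uint8_t key[16], const uint8_t sample[16],
                          uint8_t* pn_bytes, size_t pn_len) {
      uint8_t mask[16];
      int out_len = 0;
      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      if (ctx == nullptr) return false;
      bool ok =
          EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), nullptr, key, nullptr) == 1 &&
          EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 &&
          EVP_EncryptUpdate(ctx, mask, &out_len, sample, 16) == 1;
      EVP_CIPHER_CTX_free(ctx);
      if (!ok) return false;
      // Apply the mask to up to 4 packet number bytes.
      for (size_t i = 0; i < pn_len && i < 4; ++i) pn_bytes[i] ^= mask[i];
      return true;
    }

The cipher work there is tiny. The cost is the ordering constraint: as I
understand the proposal, the sample comes from the packet's ciphertext, so
the packet has to be assembled and encrypted before the header can be
finished, which is exactly the "touching it earlier" problem above.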

On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:

> Thanks for the clarification.  I guess what I’m really getting down to is
> how do we quantify the actual *cost* of PNE?  Are we substantially
> increasing the crypto costs?  If Ian is saying that crypto is comparatively
> cheap and the cost is that it’s harder to offload something that’s
> comparatively cheap, what have we lost?  I’d think we want to offload the
> most intensive piece we can, and it seems like we’re talking about crypto
> offloads....  Or are we saying instead that the crypto makes it harder to
> offload other things in the future, like a QUIC equivalent to LRO/LSO?
>
>
>
> *From:* Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
> *Sent:* Thursday, April 5, 2018 9:45 PM
> *To:* Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
> *Cc:* Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <
> quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> To be clear - I don’t think crypto is overshadowing other issues, as Mike
> read my post to suggest. It certainly comes at a cost, but either multiple
> cores or co-processors will deal with this. 1000 ns per 1 KB packet is
> roughly 10 Gbps of crypto speed on one core, and it is highly
> parallelisable and cache friendly.
>
>
>
> But if you have to drip your packets through a traditional send interface
> that is copying, buffering or blocking, and certainly sync’ing with the
> kernel - it is going to be tough.
>
>
>
> For receive, you risk high latency or too much scheduling in receive
> buffers.
>
>
> ------------------------------
>
> *From:* Ian Swett <ianswett@google.com>
> *Sent:* Friday, April 6, 2018 4:06:01 AM
> *To:* Mike Bishop
> *Cc:* Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
>
>
> On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:
>
> That’s interesting data, Praveen – thanks.  There is one caveat there,
> which is that (IIUC, on current hardware) you can’t do packet pacing with
> LSO, and so my understanding is that even for TCP many providers disable
> this in pursuit of better egress management.  So what we’re really saying
> is that:
>
>    - On current hardware and OS setups, UDP achieves less than half the
>    throughput of TCP; that needs some optimization, as we’ve already discussed
>    - TCP will dramatically gain performance once hardware offload catches
>    up and allows *paced* LSO 😊
>
>
>
> However, as Mikkel points out, the crypto costs are likely to overwhelm
> the OS throughput / scheduling issues in real deployments.  So I think the
> other relevant piece to understanding the cost here is this:
>
>
>
> I have a few cases (both client and server) where my UDP send costs are
> more than 30% (in some cases 50%) of CPU consumption, and crypto is less
> than 10%.  So currently, I assume crypto is cheap (especially AES-GCM when
> hardware acceleration is available) and egress is expensive.  UDP ingress
> is not that expensive, but could use a bit of optimization as well.
>
>
>
> The only 'fancy' thing our current code is doing for crypto is encrypting
> in place, which was a ~2% win.  Nice, but not transformative.  See
> EncryptInPlace
> <https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893>
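
For illustration, encrypting in place just means the AEAD writes ciphertext
over the plaintext buffer, so the payload is touched once rather than copied
into a separate output buffer. A minimal standalone sketch with OpenSSL's
EVP interface (my own example, not the Chromium code linked above) would
look something like:

    // Illustrative only: AES-128-GCM encryption where the ciphertext
    // overwrites the plaintext in |buf|; the 16-byte tag is written to |tag|.
    #include <openssl/evp.h>
    #include <cstddef>
    #include <cstdint>

    bool EncryptInPlaceGcm(const uint8_t key[16], const uint8_t iv[12],
                           uint8_t* buf, size_t len, uint8_t tag[16]) {
      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      if (ctx == nullptr) return false;
      int out_len = 0;
      bool ok =
          EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), nullptr, key, iv) == 1 &&
          // Same pointer for input and output: EVP supports in-place operation.
          EVP_EncryptUpdate(ctx, buf, &out_len, buf, static_cast<int>(len)) == 1 &&
          EVP_EncryptFinal_ex(ctx, buf + out_len, &out_len) == 1 &&
          EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1;
      EVP_CIPHER_CTX_free(ctx);
      return ok;
    }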
>
>
>
> I haven't done much work benchmarking Windows, so possibly the Windows UDP
> stack is really fast, and so crypto seems really slow in comparison?
>
>
>
>
>    - What is the throughput and relative cost of TLS/TCP versus QUIC
>    (i.e. how much are the smaller units of encryption hurting us versus, say,
>    16KB TLS records)?
>
>
>    - TLS implementations already vary here:  Some implementations choose
>       large record sizes, some vary record sizes to reduce delay / HoLB, so this
>       probably isn’t a single number.
>
>
>    - How much additional load does PNE add to this difference?
>    - To what extent would PNE make future crypto offloads *impossible*
>    versus *requiring more R&D to develop*?
>
>
>
> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Mikkel Fahnøe
> Jørgensen
> *Sent:* Wednesday, April 4, 2018 1:34 PM
> *To:* Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>;
> IETF QUIC WG <quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> Thanks for sharing these numbers.
>
>
>
> My guess is that these offloads deal with kernel scheduling, memory cache
> issues, and interrupt scheduling.
>
>
>
> It probably has very little to do with crypto, TCP headers, or any other
> CPU-sensitive processing.
>
>
>
> This is where netmap enters: you can transparently feed the data as it
> becomes available with very little sync work, and you can also efficiently
> pipeline so you pass data to the decryptor, then to the app, with as
> little bus traffic as possible. No need to copy data or synchronize with
> the kernel.
>
>
>
> For 1 KB packets you would spend about 1000 ns on crypto (3 cycles/byte ×
> ~1,000 bytes ≈ 3,000 cycles, i.e. about 1000 ns at 3 GHz), and this would
> happen in L1 cache. It would of course consume a core, which a crypto
> offload would not, but that can be debated: with 18-core systems, your
> problem is memory and network more than CPU.
>
>
>
> My concern with PNE is for small packets and low latency where
>
> 1) an estimated 24ns for PN encryption and decryption becomes measurable
> if your network is fast enough.
>
> 2) any damage to the packet buffer causes all sorts of memory and CPU bus
> traffic issues.
>
>
>
> 1) is annoying. 2) is bad, and completely avoidable, but not as PR 1079
> is currently formulated.
>
>
>
> 2) is likely bad for hardware offload units as well.
>
>
>
> As to SRV-IO - I’m not familiar with it, but obviously there is some IO
> abstraction layer. The question is how you make it accessible to apps, as
> opposed to device drivers that do not work with your custom QUIC stack,
> and netmap is one option here.
>
>
>
> Kind Regards,
>
> Mikkel Fahnøe Jørgensen
>
>
>
> On 4 April 2018 at 22.15.52, Praveen Balasubramanian (
> pravb=40microsoft.com@dmarc.ietf.org) wrote:
>
> Some comparative numbers from an out-of-the-box, default-settings Windows
> Server 2016 (released version) for a single connection with a
> microbenchmark tool:
>
>
>
> Offloads enabled       TCP Gbps   UDP Gbps
> LSO + LRO + checksum   24         3.6
> Checksum only          7.6        3.6
> None                   5.6        2.3
>
>
>
> This is for a fully bottlenecked CPU core -- *if you run at lower data
> rates there is still a significant difference in cycles/byte cost*. The
> same increased CPU cost applies to client systems going over high data
> rate Wi-Fi and cellular.
>
>
>
> This is without any crypto. Once you add crypto, the numbers become much
> worse, *with crypto cost becoming dominant*. Adding another crypto step
> further exacerbates the problem. Hence crypto offload gains in importance,
> followed by these batch offloads.
>
>
>
> If folks need any more numbers I’d be happy to provide them.
>
>
>
> Thanks
>
>
>
>