Re: Impact of hardware offloads on network stack performance

Ian Swett <ianswett@google.com> Fri, 06 April 2018 20:55 UTC

From: Ian Swett <ianswett@google.com>
Date: Fri, 06 Apr 2018 20:55:21 +0000
Message-ID: <CAKcm_gMYf6AbHgvg+WKv-woQCtzz9WztiEDf-iRUjKw63Qx=9Q@mail.gmail.com>
Subject: Re: Impact of hardware offloads on network stack performance
To: Mike Bishop <mbishop@evequefou.be>
Cc: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>, Praveen Balasubramanian <pravb@microsoft.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/I1RcDS_hr8IrfXJEbx8AXlypMZY>

PNE doesn't preclude UDP LRO or pacing offload.

However, in a highly optimized stack (e.g., the Disk|Crypt|Net paper from
SIGCOMM 2017), not touching (i.e., bringing into memory or cache) the content
to be sent until the last possible moment is critical.  PNE likely means
touching that content much earlier.

But personally, I consider the actual crypto we're talking about adding
(2 AES instructions?) to be basically free.
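
To make the "2 AES instructions" point concrete, here is a rough sketch of
the kind of per-packet work PNE adds, assuming a mask derived from a 16-byte
ciphertext sample with a single AES block operation and XORed into the packet
number bytes. This is my own illustration of the idea (the function name and
the OpenSSL usage are mine), not the exact PR 1079 construction:

    // Illustrative only: derive a mask from a 16-byte ciphertext sample
    // with one AES-ECB block operation, then XOR it into the packet number.
    #include <openssl/evp.h>
    #include <cstddef>
    #include <cstdint>

    bool MaskPacketNumber(const uint8_t key[16], const uint8_t sample[16],
                          uint8_t* pn_bytes, size_t pn_len) {
      uint8_t mask[16];
      int out_len = 0;
      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      if (ctx == nullptr) return false;
      bool ok =
          EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), nullptr, key, nullptr) == 1 &&
          EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 &&
          EVP_EncryptUpdate(ctx, mask, &out_len, sample, 16) == 1;
      EVP_CIPHER_CTX_free(ctx);
      if (!ok) return false;
      // Apply the mask to up to 4 packet number bytes.
      for (size_t i = 0; i < pn_len && i < 4; ++i) pn_bytes[i] ^= mask[i];
      return true;
    }

The cipher work there is tiny. The cost is the ordering constraint: as I
understand the proposal, the sample comes from the packet's ciphertext, so
the packet has to be assembled and encrypted before the header can be
finished, which is exactly the "touching it earlier" problem above.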

On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:

> Thanks for the clarification.  I guess what I’m really getting down to is
> how do we quantify the actual *cost* of PNE?  Are we substantially
> increasing the crypto costs?  If Ian is saying that crypto is comparatively
> cheap and the cost is that it’s harder to offload something that’s
> comparatively cheap, what have we lost?  I’d think we want to offload the
> most intensive piece we can, and it seems like we’re talking about crypto
> offloads....  Or are we saying instead that the crypto makes it harder to
> offload other things in the future, like a QUIC equivalent to LRO/LSO?
>
>
>
> *From:* Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
> *Sent:* Thursday, April 5, 2018 9:45 PM
> *To:* Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
> *Cc:* Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <
> quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> To be clear - I don’t think crypto is overshadowing other issues, as Mike
> read my post to suggest. It certainly comes at a cost, but either multiple
> cores or co-processors will deal with this. 1000 ns per 1 KB packet is
> roughly 10 Gbps of crypto speed on one core, and it is highly
> parallelisable and cache friendly.
>
>
>
> But if you have to drip your packets through a traditional send interface
> that is copying, buffering or blocking, and certainly sync’ing with the
> kernel - it is going to be tough.
>
>
>
> For receive, you risk high latency or too much scheduling in receive
> buffers.
>
>
> ------------------------------
>
> *From:* Ian Swett <ianswett@google.com>
> *Sent:* Friday, April 6, 2018 4:06:01 AM
> *To:* Mike Bishop
> *Cc:* Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
>
>
> On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:
>
> That’s interesting data, Praveen – thanks.  There is one caveat there,
> which is that (IIUC, on current hardware) you can’t do packet pacing with
> LSO, and so my understanding is that even for TCP many providers disable
> this in pursuit of better egress management.  So what we’re really saying
> is that:
>
>    - On current hardware and OS setups, UDP achieves less than half the
>    throughput of TCP; that needs some optimization, as we’ve already discussed
>    - TCP will dramatically gain performance once hardware offload catches
>    up and allows *paced* LSO 😊
>
>
>
> However, as Mikkel points out, the crypto costs are likely to overwhelm
> the OS throughput / scheduling issues in real deployments.  So I think the
> other relevant piece to understanding the cost here is this:
>
>
>
> I have a few cases (both client and server) where my UDP send costs are
> more than 30% (in some cases 50%) of CPU consumption, and crypto is less
> than 10%.  So currently, I assume crypto is cheap (especially AES-GCM when
> hardware acceleration is available) and egress is expensive.  UDP ingress
> is not that expensive, but could use a bit of optimization as well.
>
>
>
> The only 'fancy' thing our current code is doing for crypto is encrypting
> in place, which was a ~2% win.  Nice, but not transformative.  See
> EncryptInPlace
> <https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893>
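
For illustration, encrypting in place just means the AEAD writes ciphertext
over the plaintext buffer, so the payload is touched once rather than copied
into a separate output buffer. A minimal standalone sketch with OpenSSL's
EVP interface (my own example, not the Chromium code linked above) would
look something like:

    // Illustrative only: AES-128-GCM encryption where the ciphertext
    // overwrites the plaintext in |buf|; the 16-byte tag is written to |tag|.
    #include <openssl/evp.h>
    #include <cstddef>
    #include <cstdint>

    bool EncryptInPlaceGcm(const uint8_t key[16], const uint8_t iv[12],
                           uint8_t* buf, size_t len, uint8_t tag[16]) {
      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      if (ctx == nullptr) return false;
      int out_len = 0;
      bool ok =
          EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), nullptr, key, iv) == 1 &&
          // Same pointer for input and output: EVP supports in-place operation.
          EVP_EncryptUpdate(ctx, buf, &out_len, buf, static_cast<int>(len)) == 1 &&
          EVP_EncryptFinal_ex(ctx, buf + out_len, &out_len) == 1 &&
          EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1;
      EVP_CIPHER_CTX_free(ctx);
      return ok;
    }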
>
>
>
> I haven't done much work benchmarking Windows, so possibly the Windows UDP
> stack is really fast, and so crypto seems really slow in comparison?
>
>
>
>
>    - What is the throughput and relative cost of TLS/TCP versus QUIC
>    (i.e. how much are the smaller units of encryption hurting us versus, say,
>    16KB TLS records)?
>
>
>    - TLS implementations already vary here:  Some implementations choose
>       large record sizes, some vary record sizes to reduce delay / HoLB, so this
>       probably isn’t a single number.
>
>
>    - How much additional load does PNE add to this difference?
>    - To what extent would PNE make future crypto offloads *impossible*
>    versus *requiring more R&D to develop*?
>
>
>
> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Mikkel Fahnøe
> Jørgensen
> *Sent:* Wednesday, April 4, 2018 1:34 PM
> *To:* Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>;
> IETF QUIC WG <quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
>
>
> Thanks for sharing these numbers.
>
>
>
> My guess is that these offloads deal with kernel scheduling, memory cache
> issues, and interrupt scheduling.
>
>
>
> It probably has very little to do with crypto, TCP headers, or any other
> CPU-sensitive processing.
>
>
>
> This is where netmap enters: you can transparently feed the data as it
> becomes available with very little sync work, and you can also efficiently
> pipeline so you pass data to the decryptor, then to the app, with as
> little bus traffic as possible. No need to copy data or synchronize with
> the kernel.
>
>
>
> For 1 KB packets you would spend about 1000 ns on crypto (3 cycles/byte ×
> ~1,000 bytes ≈ 3,000 cycles, i.e. about 1000 ns at 3 GHz), and this would
> happen in L1 cache. It would of course consume a core, which a crypto
> offload would not, but that can be debated: with 18-core systems, your
> problem is memory and network more than CPU.
>
>
>
> My concern with PNE is for small packets and low latency where
>
> 1) an estimated 24ns for PN encryption and decryption becomes measurable
> if your network is fast enough.
>
> 2) any damage to the packet buffer causes all sorts of memory and CPU bus
> traffic issues.
>
>
>
> 1) is annoying. 2) is bad, and completely avoidable, but not as PR 1079
> is currently formulated.
>
>
>
> 2) is likely bad for hardware offload units as well.
>
>
>
> As to SRV-IO - I’m not familiar with it, but obviously there is some IO
> abstraction layer. The question is how you make it accessible to apps, as
> opposed to device drivers that do not work with your custom QUIC stack,
> and netmap is one option here.
>
>
>
> Kind Regards,
>
> Mikkel Fahnøe Jørgensen
>
>
>
> On 4 April 2018 at 22.15.52, Praveen Balasubramanian (
> pravb=40microsoft.com@dmarc.ietf.org) wrote:
>
> Some comparative numbers from an out-of-the-box, default-settings Windows
> Server 2016 (released version) for a single connection with a
> microbenchmark tool:
>
>
>
> Offloads enabled       TCP Gbps   UDP Gbps
> LSO + LRO + checksum   24         3.6
> Checksum only          7.6        3.6
> None                   5.6        2.3
>
>
>
> This is for a fully bottlenecked CPU core -- *if you run at lower data
> rates there is still a significant difference in cycles/byte cost*. The
> same increased CPU cost applies to client systems going over high data
> rate Wi-Fi and cellular.
>
>
>
> This is without any crypto. Once you add crypto, the numbers become much
> worse, *with crypto cost becoming dominant*. Adding another crypto step
> further exacerbates the problem. Hence crypto offload gains in importance,
> followed by these batch offloads.
>
>
>
> If folks need any more numbers I’d be happy to provide them.
>
>
>
> Thanks
>
>
>
>