Re: Impact of hardware offloads on network stack performance
Ian Swett <ianswett@google.com> Fri, 06 April 2018 20:55 UTC
References: <CY4PR21MB0630CE4DE6BA4EDF54A1FEB1B6A40@CY4PR21MB0630.namprd21.prod.outlook.com> <CAN1APde93+S8CP-KcZmqPKCCvsGRiq6ECPUoh_Qk0j9hqs8h0Q@mail.gmail.com> <SN1PR08MB1854E64D7C370BF0A7456977DABB0@SN1PR08MB1854.namprd08.prod.outlook.com> <CAKcm_gNhsHxXM77FjRj-Wh4JxA21NAWKXX3KBT=eZJsCdacM7Q@mail.gmail.com> <DB6PR10MB1766789B25E31EBE70564BD3ACBA0@DB6PR10MB1766.EURPRD10.PROD.OUTLOOK.COM> <SN1PR08MB18548155722C665FFD45CC4DDABA0@SN1PR08MB1854.namprd08.prod.outlook.com>
In-Reply-To: <SN1PR08MB18548155722C665FFD45CC4DDABA0@SN1PR08MB1854.namprd08.prod.outlook.com>
From: Ian Swett <ianswett@google.com>
Date: Fri, 06 Apr 2018 20:55:21 +0000
Message-ID: <CAKcm_gMYf6AbHgvg+WKv-woQCtzz9WztiEDf-iRUjKw63Qx=9Q@mail.gmail.com>
Subject: Re: Impact of hardware offloads on network stack performance
To: Mike Bishop <mbishop@evequefou.be>
Cc: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>, Praveen Balasubramanian <pravb@microsoft.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/I1RcDS_hr8IrfXJEbx8AXlypMZY>
PNE doesn't preclude UDP LRO or pacing offload. However, in a highly
optimized stack (e.g., the Disk|Crypt|Net paper from SIGCOMM 2017), not
touching the content to be sent (i.e., bringing it into memory or cache)
until the last possible moment is critical, and PNE likely means touching
that content much earlier. Personally, though, I consider the actual crypto
we're talking about (2 AES instructions?) basically free to add.

On Fri, Apr 6, 2018 at 4:37 PM Mike Bishop <mbishop@evequefou.be> wrote:

> Thanks for the clarification. I guess what I’m really getting down to is
> how we quantify the actual *cost* of PNE. Are we substantially increasing
> the crypto costs? If Ian is saying that crypto is comparatively cheap and
> the cost is that it’s harder to offload something that’s comparatively
> cheap, what have we lost? I’d think we want to offload the most intensive
> piece we can, and it seems like we’re talking about crypto offloads....
> Or are we saying instead that the crypto makes it harder to offload other
> things in the future, like a QUIC equivalent to LRO/LSO?
>
> *From:* Mikkel Fahnøe Jørgensen [mailto:mikkelfj@gmail.com]
> *Sent:* Thursday, April 5, 2018 9:45 PM
> *To:* Ian Swett <ianswett@google.com>; Mike Bishop <mbishop@evequefou.be>
> *Cc:* Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG
> <quic@ietf.org>
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
> To be clear - I don’t think crypto is overshadowing other issues, as Mike
> read my post. It certainly comes at a cost, but either multiple cores or
> co-processors will deal with it. 1000 ns is 10 Gbps crypto speed on one
> core, and it is highly parallelisable and cache friendly.
>
> But if you have to drip your packets through a traditional send interface
> that is copying, buffering, or blocking, and certainly sync’ing with the
> kernel, it is going to be tough.
>
> For receive, you risk high latency or too much scheduling in receive
> buffers.
> ------------------------------
> *From:* Ian Swett <ianswett@google.com>
> *Sent:* Friday, April 6, 2018 4:06:01 AM
> *To:* Mike Bishop
> *Cc:* Mikkel Fahnøe Jørgensen; Praveen Balasubramanian; IETF QUIC WG
> *Subject:* Re: Impact of hardware offloads on network stack performance
>
> On Thu, Apr 5, 2018 at 5:57 PM Mike Bishop <mbishop@evequefou.be> wrote:
>
>> That’s interesting data, Praveen – thanks. There is one caveat, which
>> is that (IIUC, on current hardware) you can’t do packet pacing with
>> LSO, so my understanding is that even for TCP many providers disable it
>> in pursuit of better egress management. So what we’re really saying is
>> that:
>>
>>   - On current hardware and OS setups, UDP runs at less than half the
>>     throughput of TCP; that needs some optimization, as we’ve already
>>     discussed
>>   - TCP will dramatically gain performance once hardware offload
>>     catches up and allows *paced* LSO 😊
>>
>> However, as Mikkel points out, the crypto costs are likely to overwhelm
>> the OS throughput / scheduling issues in real deployments. So I think
>> the other relevant piece to understanding the cost here is this:
>
> I have a few cases (both client and server) where my UDP send costs are
> more than 30% (in some cases 50%) of CPU consumption, and crypto is less
> than 10%. So currently I assume crypto is cheap (especially AES-GCM when
> hardware acceleration is available) and egress is expensive. UDP ingress
> is not that expensive, but could use a bit of optimization as well.
>
> The only 'fancy' thing our current code does for crypto is encrypting in
> place, which was a ~2% win. Nice, but not transformative. See
> EncryptInPlace
> <https://cs.chromium.org/chromium/src/net/quic/core/quic_framer.cc?sq=package:chromium&l=1893>
>
> I haven't done much work benchmarking Windows, so possibly the Windows
> UDP stack is really fast, and crypto seems really slow in comparison?
>> - What is the throughput and relative cost of TLS/TCP versus QUIC
>>   (i.e., how much are the smaller units of encryption hurting us
>>   versus, say, 16KB TLS records)?
>>   - TLS implementations already vary here: some choose large record
>>     sizes, some vary record sizes to reduce delay / HoLB, so this
>>     probably isn’t a single number.
>> - How much additional load does PNE add to this difference?
>> - To what extent would PNE make future crypto offloads *impossible*
>>   versus *requiring more R&D to develop*?
>>
>> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of* Mikkel
>> Fahnøe Jørgensen
>> *Sent:* Wednesday, April 4, 2018 1:34 PM
>> *To:* Praveen Balasubramanian <pravb=40microsoft.com@dmarc.ietf.org>;
>> IETF QUIC WG <quic@ietf.org>
>> *Subject:* Re: Impact of hardware offloads on network stack performance
>>
>> Thanks for sharing these numbers.
>>
>> My guess is that these offloads deal with kernel scheduling, memory
>> cache issues, and interrupt scheduling.
>>
>> They probably have very little to do with crypto, TCP headers, or any
>> other CPU-intensive processing.
>>
>> This is where netmap enters: you can transparently feed the data as it
>> becomes available with very little sync work, and you can also
>> efficiently pipeline, passing data to the decryptor and then to the app
>> with as little bus traffic as possible. No need to copy data or
>> synchronize with kernel space.
>>
>> For 1K packets you would use about 1000 ns (3 cycles/byte) on crypto,
>> and this would happen in L1 cache. It would of course consume a core,
>> which a crypto offload would not, but that is debatable: with 18-core
>> systems, your problem is memory and network more than CPU.
>>
>> My concerns with PNE are for small packets and low latency, where
>>
>> 1) an estimated 24 ns for PN encryption and decryption becomes
>> measurable if your network is fast enough.
>>
>> 2) any damage to the packet buffer causes all sorts of memory and CPU
>> bus traffic issues.
>> 1) is annoying. 2) is bad and completely avoidable, but not as PR 1079
>> is currently formulated.
>>
>> 2) is also likely bad for hardware offload units as well.
>>
>> As to SRV-IO: I’m not familiar with it, but obviously there is some IO
>> abstraction layer. The question is how you make it accessible to apps,
>> as opposed to device drivers that do not work with your custom QUIC
>> stack; netmap is one option here.
>>
>> Kind Regards,
>> Mikkel Fahnøe Jørgensen
>>
>> On 4 April 2018 at 22.15.52, Praveen Balasubramanian
>> (pravb=40microsoft.com@dmarc.ietf.org) wrote:
>>
>>> Some comparative numbers from an out-of-box, default-settings Windows
>>> Server 2016 (released version) for a single connection with a
>>> microbenchmark tool:
>>>
>>>   Offloads enabled        TCP Gbps   UDP Gbps
>>>   LSO + LRO + checksum    24         3.6
>>>   Checksum only           7.6        3.6
>>>   None                    5.6        2.3
>>>
>>> This is for a fully bottlenecked CPU core -- *if you run lower data
>>> rates there is still a significant difference in cycles/byte cost*.
>>> The same increased CPU cost applies to client systems going over
>>> high-data-rate Wi-Fi and cellular.
>>>
>>> This is without any crypto. Once you add crypto the numbers become
>>> much worse, *with crypto cost becoming dominant*. Adding another
>>> crypto step further exacerbates the problem. Hence crypto offload
>>> gains in importance, followed by these batch offloads.
>>>
>>> If folks need any more numbers I’d be happy to provide them.
>>>
>>> Thanks
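[For concreteness, the per-packet estimates quoted in this thread (3 cycles/byte for AES-GCM, ~24 ns for PN encryption plus decryption) can be sanity-checked with quick arithmetic. A minimal Python sketch; the 3 GHz core clock is an assumption of mine, not a number from the thread:]

```python
# Back-of-envelope check of the crypto costs discussed in this thread.
# Assumption (not from the thread): a 3 GHz core clock.

CLOCK_HZ = 3.0e9           # assumed core clock
CYCLES_PER_BYTE = 3        # quoted AES-GCM cost estimate
PACKET_BYTES = 1000        # "1K packets"

# Bulk AEAD cost for one 1K packet on one core.
payload_cycles = CYCLES_PER_BYTE * PACKET_BYTES
payload_ns = payload_cycles / CLOCK_HZ * 1e9

# Implied single-core encryption rate.
gbps = PACKET_BYTES * 8 * (CLOCK_HZ / payload_cycles) / 1e9

# PNE overhead relative to bulk crypto, using the thread's estimate of
# ~24 ns for packet-number encryption plus decryption.
pne_ns = 24
pne_overhead = pne_ns / payload_ns

print(f"{payload_ns:.0f} ns per 1K packet")    # 1000 ns
print(f"{gbps:.1f} Gbit/s on one core")        # 8.0 Gbit/s
print(f"PNE adds ~{pne_overhead:.1%}")         # ~2.4%
```

[This is consistent with the "1000 ns is 10 Gbps-class crypto on one core" claim above, and it shows why the raw PNE arithmetic looks cheap relative to bulk AEAD; it deliberately ignores the memory-touch and offload concerns, which are the actual point of contention in the thread.]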