Re: Packet Number Encryption Performance

Kazuho Oku <kazuhooku@gmail.com> Sat, 23 June 2018 07:00 UTC

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Sat, 23 Jun 2018 16:00:09 +0900
Message-ID: <CANatvzzS6CZ3a0_xQpsPF4RFpZ7+cfQYB7dsT0oAjpHxXv+9+A@mail.gmail.com>
Subject: Re: Packet Number Encryption Performance
To: Nick Banks <nibanks@microsoft.com>
Cc: Ian Swett <ianswett@google.com>, IETF QUIC WG <quic@ietf.org>, Praveen Balasubramanian <pravb@microsoft.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/S0kM0N-rpXBA2pnHo5X103KgrE0>

IIUC, Ian is presumably talking about the performance numbers you would see
on a server handling a real workload, not the breakdown of numbers within
the QUIC stack.

Let's talk about nanoseconds per packet rather than the ratio, because the
ratio depends on what you use as the "total".

My benchmarks tell me that, with OpenSSL 1.1.0 running on a Core i7 @ 2.5GHz
(MacBook Pro 15" Mid 2015), the numbers are:

without PNE: 585 ns/packet
with PNE: 607 ns/packet (the optimized version, with 3.8% overhead)

Assuming that your peak rate is 1Gbit per CPU core (enough, across cores, to
saturate 25Gb Ethernet), the share of CPU cycles spent running the crypto
will be:

all crypto: 0.125 GB/sec / 1280 bytes/packet * 607 ns/packet = 5.93%
PNE: 0.125 GB/sec / 1280 bytes/packet * 22.4 ns/packet = 0.22%
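
For concreteness, here is that arithmetic as a small C program (a sketch;
the 1Gbit-per-core rate and the 1280-byte packet size are the assumptions
stated above):

    #include <stdio.h>

    int main(void)
    {
        const double bytes_per_sec = 0.125e9; /* 1 Gbit/s = 0.125 GB/s */
        const double packet_size   = 1280;    /* bytes/packet */
        const double crypto_ns     = 607;     /* ns/packet, all crypto */
        const double pne_ns        = 22.4;    /* ns/packet, PNE alone */

        double pps = bytes_per_sec / packet_size;   /* ~97,656 packets/s */
        printf("all crypto: %.2f%%\n", pps * crypto_ns * 1e-7); /* 5.93 */
        printf("PNE:        %.2f%%\n", pps * pne_ns * 1e-7);    /* 0.22 */
        return 0;
    }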

Of course, these numbers are what you would expect when a server is sending
or receiving full-sized packets in one direction. In reality, there would be
at least some flow in the opposite direction, so the number will be higher,
but not as much as 2x. And if you are sending small-sized packets in *both*
directions *and* saturating the link, the ratio will be higher still.

But looking at the numbers, I'd assume that 10% is a logical number for the
workload Ian has, and also for people who are involved in content
distribution.

As shown, I think that sharing the PPS and average packet size our workloads
will have, along with the raw numbers of the crypto (i.e. nsec/packet), will
give us a better understanding of how large the actual overhead is. Do you
mind sharing your expectations?

2018-06-23 9:34 GMT+09:00 Nick Banks <nibanks@microsoft.com>:

> Hey Guys,
>
>
>
> I spent the better part of the day getting some performance traces of
> WinQuic *without PNE*. Our implementation supports both user mode and
> kernel mode, so I got numbers for both. The following table shows the
> relative CPU cost of different parts of the QUIC send path:
>
>
>
>
>
>                             User Mode   Kernel Mode
> UDP Send                      64.7%       41.4%
> Encryption                    22.5%       38.5%
> Packet Building/Framing        7%         15%
> Miscellaneous                  5.8%        5.1%
>
>
>
> These numbers definitely show that encryption is a much larger portion of
> CPU time.
>
>
>
> - Nick
>
>
>
>
>
> *From:* Kazuho Oku <kazuhooku@gmail.com>
> *Sent:* Friday, June 22, 2018 5:02 PM
> *To:* Ian Swett <ianswett@google.com>
> *Cc:* Nick Banks <nibanks@microsoft.com>; Praveen Balasubramanian
> <pravb@microsoft.com>; IETF QUIC WG <quic@ietf.org>
>
> *Subject:* Re: Packet Number Encryption Performance
>
>
>
>
>
>
>
> 2018-06-23 3:51 GMT+09:00 Ian Swett <ianswett@google.com>:
>
> I expect crypto to increase as a fraction of CPU, but I don't expect it to
> go much higher than 10%.
>
>
>
> But who knows, maybe 2 years from now everything else will be very
> optimized and crypto will be 15%?
>
>
>
> Ian, thank you for sharing the proportion of CPU cycles we are likely to
> spend on crypto.
>
>
>
> Your numbers relieve me, because even if the cost of crypto goes to 15%,
> the overhead of PNE will be less than 1% (0.15 * 0.04 = 0.006).
>
>
>
> I would also like to note that it is likely that HyperThreading, when
> used, will eliminate the overhead of PNE.
>
>
>
> This is because, IIUC, PNE is a marginal additional use of the AES-NI
> engine, which has been mostly idle. The overhead of crypto is small enough
> (i.e. 15%) that we will rarely see contention on the engine. While one
> hyperthread does AES, the other hyperthread will run at full speed doing
> other operations.
>
>
>
> Also, considering the fact that the number of CPU cycles spent per QUIC
> packet does not change much with PNE, I would not be surprised to see *no*
> decrease of throughput when PNE is used on a HyperThreading architecture.
> In that case, the only thing we will observe is a rise in the utilization
> ratio of the AES-NI engine.
>
>
>
>
>
> On Fri, Jun 22, 2018 at 12:34 PM Nick Banks
> <nibanks=40microsoft.com@dmarc.ietf.org> wrote:
>
> I just want to add, my implementation already uses ECB from bcrypt (and I
> do the XOR myself). Bcrypt doesn't expose CTR mode directly.
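>
> Roughly, the shape is the following (an illustrative sketch, not the
> actual WinQuic code; it assumes an AES key whose chaining mode was set to
> BCRYPT_CHAIN_MODE_ECB, with error handling omitted):
>
>     #include <windows.h>
>     #include <bcrypt.h>
>
>     /* Derive the PN mask by ECB-encrypting the ciphertext sample, then
>      * apply CTR's only extra step -- the XOR -- by hand. */
>     void MaskPacketNumber(BCRYPT_KEY_HANDLE Key, const UCHAR Sample[16],
>                           UCHAR* Pn, ULONG PnLen /* <= 4 */)
>     {
>         UCHAR Mask[16];
>         ULONG ResultLen;
>         BCryptEncrypt(Key, (PUCHAR)Sample, 16, NULL, NULL, 0,
>                       Mask, sizeof(Mask), &ResultLen, 0);
>         for (ULONG i = 0; i < PnLen; i++)
>             Pn[i] ^= Mask[i];
>     }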
>
>
>
>
>
> ------------------------------
>
> *From:* Praveen Balasubramanian
> *Sent:* Friday, June 22, 2018 9:26:44 AM
> *To:* Ian Swett; Kazuho Oku
> *Cc:* Nick Banks; IETF QUIC WG
> *Subject:* RE: Packet Number Encryption Performance
>
>
>
> Ian, do you expect that fraction of overall cost to hold once the UDP
> stack is optimized? Is your measurement on top of the recent kernel
> improvements? I expect the crypto fraction of overall cost to keep
> increasing over time as the network stack bottlenecks are eliminated.
>
>
>
> Kazuho, should the draft describe the optimizations you are making? Or are
> these too OpenSSL-specific?
>
>
>
> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Ian Swett
> *Sent:* Friday, June 22, 2018 4:45 AM
> *To:* Kazuho Oku <kazuhooku@gmail.com>
> *Cc:* Nick Banks <nibanks@microsoft.com>; IETF QUIC WG <quic@ietf.org>
> *Subject:* Re: Packet Number Encryption Performance
>
>
>
> Thanks for digging into the details of this, Kazuho. A <4% increase in
> crypto cost is a bit more than I originally expected (~2%), but crypto is
> less than 10% of my CPU usage, so it's still less than 0.5% total, which
> is acceptable to me.
>
>
>
> On Fri, Jun 22, 2018 at 2:45 AM Kazuho Oku <kazuhooku@gmail.com> wrote:
>
>
>
>
>
> 2018-06-22 12:22 GMT+09:00 Kazuho Oku <kazuhooku@gmail.com>:
>
>
>
>
>
> 2018-06-22 11:54 GMT+09:00 Nick Banks <nibanks@microsoft.com>:
>
> Hi Kazuho,
>
>
>
> Thanks for sharing your numbers as well! I'm a bit confused where you say
> you can reduce the 10% overhead to 2% to 4%. How do you plan on doing that?
>
>
>
> As stated in my previous mail, the 10% overhead consists of three parts,
> each consuming a comparable number of CPU cycles. Two of the three are
> related to the abstraction layer and how CTR is implemented, while the
> other is the core AES-ECB operation cost.
>
>
>
> It should be possible to remove the costly abstraction layer.
>
>
>
> It should also be possible to remove the overhead of CTR, since in PNE we
> need to XOR at most 4 octets (applying the XOR is the only difference
> between CTR and ECB). That cost should be possible to nullify.
>
>
>
> Considering these aspects, and by looking at the numbers in the OpenSSL
> source code (as well as considering the overhead of GCM), my expectation
> is 2% to 4%.
>
>
>
> Just did some experiments and it seems that the expectation was correct.
>
>
>
> The benchmarks tell me that the overhead goes down from 10.0% to 3.8%, by
> doing the following:
>
>
>
> * remove the overhead of the CTR abstraction (i.e. use the ECB backend and
> do the XOR ourselves)
>
> * remove the overhead of the abstraction layer (i.e. call the method
> returned by EVP_CIPHER_meth_get_do_cipher instead of calling
> EVP_EncryptUpdate)
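>
> Combined, the two changes look roughly like this (a sketch against
> OpenSSL 1.1.0; it assumes that ctx was initialized once with
> EVP_aes_128_ecb() and the PN key, and error handling is omitted):
>
>     #include <openssl/evp.h>
>
>     /* Derive the PN mask by ECB-encrypting the sample, calling the
>      * cipher's do_cipher method directly instead of EVP_EncryptUpdate. */
>     static void pn_mask(EVP_CIPHER_CTX *ctx,
>                         const unsigned char sample[16],
>                         unsigned char *pn, size_t pn_len /* <= 4 */)
>     {
>         unsigned char mask[16];
>         int (*do_cipher)(EVP_CIPHER_CTX *, unsigned char *,
>                          const unsigned char *, size_t) =
>             EVP_CIPHER_meth_get_do_cipher(EVP_CIPHER_CTX_cipher(ctx));
>         do_cipher(ctx, mask, sample, 16); /* the ECB backend */
>         for (size_t i = 0; i < pn_len; i++)
>             pn[i] ^= mask[i];             /* the CTR XOR, done by hand */
>     }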
>
>
>
> Of course the changes are specific to OpenSSL, but I would expect similar
> numbers on other stacks, assuming that you have access to an optimized AES
> implementation.
>
>
>
>
>
>
>
>
>
> ------------------------------
>
> *From:* Kazuho Oku <kazuhooku@gmail.com>
> *Sent:* Thursday, June 21, 2018 7:21:17 PM
> *To:* Nick Banks
> *Cc:* quic@ietf.org
> *Subject:* Re: Packet Number Encryption Performance
>
>
>
> Hi Nick,
>
>
>
> Thank you for bringing the numbers to the list.
>
>
>
> I have just run a small benchmark using Quicly, and I see comparable
> numbers.
>
>
>
> To be precise, I see a 10.0% increase in CPU cycles when encrypting an
> Initial packet of 1,280 octets. I expect that we will see similar numbers
> on other QUIC stacks that also use picotls (with OpenSSL as a backend).
> Note that the number compares only the cost of encryption; the overhead
> ratio will be much smaller if we look at the total number of CPU cycles
> spent by a QUIC stack as a whole.
>
>
>
> Looking at the profile, the overhead consists of three operations, each
> consuming comparable CPU cycles: the core AES operation (using AES-NI),
> the CTR operation overhead, and CTR initialization. Note that picotls at
> the moment provides access to CTR crypto beneath the AEAD interface,
> which is to be used by the QUIC stacks.
>
>
>
> I would assume that we can cut down the overhead to somewhere between 2%
> and 4%, but it might be hard to get down to somewhere near 1%, because we
> cannot parallelize the AES operation of PNE with that of AEAD (see
> https://github.com/openssl/openssl/blob/OpenSSL_1_1_0h/crypto/aes/asm/aesni-x86_64.pl#L24-L39
> about the impact of parallelization).
>
>
>
> I do not think that 2% to 4% of additional overhead on the crypto is an
> issue for QUIC/HTTP, but the current overhead of 10% is something that we
> might want to decrease. I am glad to have learned that now.
>
>
>
>
>
> 2018-06-22 5:48 GMT+09:00 Nick Banks
> <nibanks=40microsoft.com@dmarc.ietf.org>:
>
> Hello QUIC WG,
>
>
>
> I recently implemented PNE for WinQuic (using bcrypt APIs) and I decided
> to get some performance numbers to see what the overhead of PNE was. I
> figured the rest of the WG might be interested.
>
>
>
> My test just encrypts the same buffer (size dependent on the test case)
> 10,000,000 times and measures the time it takes. The test then does the
> same thing, but also encrypts the packet number. I ran all of that 10
> times in total, then collected the best times for each category to produce
> the following graphs and tables (full Excel doc attached; a sketch of the
> measurement loop appears after the table):
>
>
>
>
>
>
>
> Bytes   ---- Time (ms) ----    PNE        --- Rate (Mbps) ---
>         No PNE      PNE        Overhead   No PNE      PNE
> 4       2284.671    3027.657   33%        140.064     105.692
> 16      2102.402    2828.204   35%        608.827     452.584
> 64      2198.883    2907.577   32%        2328.45     1760.92
> 256     2758.3      3490.28    27%        7424.86     5867.72
> 600     4669.283    5424.539   16%        10280       8848.68
> 1000    6130.139    6907.805   13%        13050.3     11581.1
> 1200    6458.679    7229.672   12%        14863.7     13278.6
> 1450    7876.312    8670.16    10%        14727.7     13379.2
>
>
>
> I used a server-grade lab machine I had at my disposal, running the
> latest Windows 10 Server DataCenter build. Again, these numbers are for
> crypto only. No QUIC or UDP is included.
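>
> The measurement loop is roughly the following shape (a sketch, not the
> actual test code; EncryptPacket() and EncryptPacketNumber() stand in for
> the real WinQuic crypto routines):
>
>     LARGE_INTEGER Freq, Start, End;
>     QueryPerformanceFrequency(&Freq);
>     QueryPerformanceCounter(&Start);
>     for (int i = 0; i < 10000000; i++) {
>         EncryptPacket(Buffer, BufferLength); /* AEAD-protect the buffer */
>         /* the PNE run additionally does: EncryptPacketNumber(Buffer); */
>     }
>     QueryPerformanceCounter(&End);
>     double Ms = (End.QuadPart - Start.QuadPart) * 1000.0 /
>                 (double)Freq.QuadPart;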
>
>
>
> Thanks,
>
> - Nick
>
>
>
>
>
>
>
> --
>
> Kazuho Oku
>



-- 
Kazuho Oku