RE: Packet Number Encryption Performance

Nick Banks <nibanks@microsoft.com> Sun, 24 June 2018 15:55 UTC

To: Kazuho Oku <kazuhooku@gmail.com>
CC: IETF QUIC WG <quic@ietf.org>, Ian Swett <ianswett@google.com>, Praveen Balasubramanian <pravb@microsoft.com>

Hi Kazuho,

WinQuic will be a generic QUIC transport, with HTTP being just one of the scenarios we have planned for it. The goal for the transport is always to provide the most performant solution possible, using as few CPU cycles as possible, so that the higher layers can have them. The test I used most closely mimics a large file transfer over QUIC, where most of the CPU would end up being used by the QUIC transport.

So again, the goal here is to measure the cost of encryption (and eventually PNE) compared to other QUIC transport layer work. That gives us the baseline we can provide to other layers on top of us. Obviously, the total CPU percentage of encryption will be smaller once you include the layers on top of QUIC, but that depends on the particular protocols built on QUIC and the particular workload/scenario being executed. We shouldn't assume any particular protocol/scenario and conclude we are fine with using too much CPU at the lower layer just because the higher layer will be using even more CPU anyway.

As far as raw numbers, we are not prepared to share them at this time.

Thanks,
- Nick

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

________________________________
From: Kazuho Oku <kazuhooku@gmail.com>
Sent: Saturday, June 23, 2018 3:10:33 PM
To: Nick Banks
Cc: IETF QUIC WG; Ian Swett; Praveen Balasubramanian
Subject: Re: Packet Number Encryption Performance



2018-06-23 23:54 GMT+09:00 Nick Banks <nibanks@microsoft.com>:
I agree that the numbers for real workloads will be interesting, but I do believe that raw QUIC stack numbers are also important. They represent your best-case scenario.


To clarify, the reason I asked about the raw numbers and the use case is that without them I do not know how to interpret your numbers.

My understanding is that your benchmark is about encrypting and sending large QUIC packets.

Assuming that BCRYPT provides performance comparable to OpenSSL, the numbers you show (i.e. 22.5% in user mode, 38.5% in kernel mode) mean that you want to reduce crypto cost when a single CPU core is emitting somewhere between 5 Gbps and 8 Gbps.

I wonder how much we would be interested in such a case. Let me explain why.

Assuming that we have many connections, each CPU core will only handle somewhere around 1 Gbps to saturate a 25 Gbps link. As I stated in my previous mail, the crypto cost will be around or less than 10% in such a case.

Assuming that we are interested in utilizing a 25 Gbps link for a single connection (or a small number of connections), I think we would generally consider distributing both UDP send and encryption across multiple CPU cores, because even without crypto it is hard to utilize the entire bandwidth using just one core (the same goes for the receive side). And if you distribute the load across your CPU cores, the crypto cost again becomes somewhere around or below 10%.

This is my understanding. But I also understand that it could well be the case that I do not understand what your workload looks like. That's why I wonder if you could share the raw numbers and the use case.

For my testing, I have a fairly simple test that opens a single unidirectional stream and continually sends data until the core gets saturated. From that, I grab the performance trace and specifically look at the composition of the QUIC stack's send path. The send path constitutes essentially all of the CPU for the whole connection (with just a little overhead from receiving and processing ACKs). The machine I used was another server-grade machine running the latest Windows 10 Server Datacenter.

So the percentages I shared are for just the QUIC send path (again, no PNE in these numbers). The numbers are the percentage of total CPU for all packets sent, but since all packets were pretty much the same size, the numbers should still hold for a single packet. And bottom line, encryption is a lot more of an impact than 10%. As we bring more performance improvements to the UDP send path in Windows, we expect encryption to become a higher and higher percentage.

- Nick

From: Kazuho Oku <kazuhooku@gmail.com>
Sent: Saturday, June 23, 2018 12:00 AM
To: Nick Banks <nibanks@microsoft.com>
Cc: Ian Swett <ianswett@google.com>; IETF QUIC WG <quic@ietf.org>; Praveen Balasubramanian <pravb@microsoft.com>

Subject: Re: Packet Number Encryption Performance

IIUC, Ian is talking about the performance numbers you would see on a server handling a real workload, not the breakdown of numbers within the QUIC stack.

Let's talk about nanoseconds per packet rather than the ratio, because the ratio depends on what you use as the "total".

My benchmarks tell me that, on OpenSSL 1.1.0 running on Core i7 @ 2.5GHz (MacBook Pro 15" Mid 2015), the numbers are:

without-PNE: 585ns / packet
with-PNE: 607ns / packet (* this is the optimized version with 3.8% overhead)

Assuming that your peak rate is 1 Gbit per CPU core (which would be enough to saturate 25 Gb Ethernet), the ratio of CPU cycles spent running the crypto will be:

all crypto: 0.125GB/sec / 1280bytes/packet * 607ns/packet = 5.93 %
PNE: 0.125GB/sec / 1280bytes/packet * 22.4ns/packet = 0.22 %
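
(For reference, here is the same arithmetic as a tiny self-contained C program; the constants are taken straight from this thread, nothing else is assumed.)

#include <stdio.h>

int main(void) {
    const double bytes_per_sec = 0.125e9; /* 1 Gbps per core = 0.125 GB/s */
    const double packet_size   = 1280.0;  /* bytes per packet */
    const double pps = bytes_per_sec / packet_size; /* ~97,656 packets/s */

    /* CPU fraction spent in crypto = packets/s * seconds-of-crypto/packet */
    printf("all crypto: %.2f%%\n", pps * 607e-9 * 100.0);  /* -> 5.93 */
    printf("PNE only:   %.2f%%\n", pps * 22.4e-9 * 100.0); /* -> 0.22 */
    return 0;
}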

Of course, these numbers are what is expected when a server is sending or receiving full-sized packets in one direction. In reality, there will be at least some flow in the opposite direction, so the number will be higher, but not as much as 2x. And if you are sending small packets in *both* directions *and* saturating the link, the ratio will be higher still.

But looking at the numbers, I'd assume that 10% is a logical number for the workload Ian has, and also for people who are involved in content distribution.

As shown, I think that sharing the PPS and average packet size that our workloads will have, along with the raw numbers for the crypto (i.e. nsec/packet), will give us a better understanding of how much the actual overhead is. Do you mind sharing your expectations?

2018-06-23 9:34 GMT+09:00 Nick Banks <nibanks@microsoft.com>:
Hey Guys,

I spent the better part of the day getting some performance traces of WinQuic without PNE. Our implementation supports both user mode and kernel mode, so I got numbers for both. The following table shows the relative CPU cost of different parts of the QUIC send path:

                           User Mode    Kernel Mode
  UDP Send                   64.7%         41.4%
  Encryption                 22.5%         38.5%
  Packet Building/Framing     7%           15%
  Miscellaneous               5.8%          5.1%

These numbers definitely show that encryption is a much larger portion of the CPU cost.

- Nick


From: Kazuho Oku <kazuhooku@gmail.com>
Sent: Friday, June 22, 2018 5:02 PM
To: Ian Swett <ianswett@google.com>
Cc: Nick Banks <nibanks@microsoft.com>; Praveen Balasubramanian <pravb@microsoft.com>; IETF QUIC WG <quic@ietf.org>

Subject: Re: Packet Number Encryption Performance



2018-06-23 3:51 GMT+09:00 Ian Swett <ianswett@google.com>:
I expect crypto to increase as a fraction of CPU, but I don't expect it to go much higher than 10%.

But who knows, maybe 2 years from now everything else will be very optimized and crypto will be 15%?

Ian, thank you for sharing the proportion of CPU cycles we are likely to spend on crypto.

Your numbers relieve me, because even if the cost of crypto goes to 15%, the overhead of PNE will be less than 1% (0.15 * 0.04 = 0.006).

I would also like to note that it is likely that HyperThreading, when used, will eliminate the overhead of PNE.

This is because, IIUC, PNE is a marginal additional use of the AES-NI engine, which has been mostly idle. The overhead of crypto is small enough (i.e. 15%) that we will rarely see contention on the engine. While one hyperthread does AES, the other hyperthread will run at full speed doing other operations.

Also, considering that the number of CPU cycles spent per QUIC packet does not change much with PNE, I would not be surprised to see *no* decrease in throughput when PNE is used on a HyperThreading architecture. In that case, all we would observe is a rise in the utilization ratio of the AES-NI engine.


On Fri, Jun 22, 2018 at 12:34 PM Nick Banks <nibanks=40microsoft.com@dmarc.ietf.org> wrote:
I just want to add that my implementation already uses ECB from bcrypt (and I do the XOR myself). Bcrypt doesn't expose CTR mode directly.
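
(For readers unfamiliar with the API, a minimal sketch of that approach; names and structure are illustrative only, not WinQuic's actual code, and key setup is reduced to a comment.)

#include <windows.h>
#include <bcrypt.h>
#pragma comment(lib, "bcrypt.lib")

/* Sketch: bcrypt has no CTR mode, but CTR over a single block is just ECB
 * plus XOR. PnKey is assumed to be an AES key created on a provider whose
 * chaining mode was set to ECB, e.g.:
 *   BCryptSetProperty(AlgHandle, BCRYPT_CHAINING_MODE,
 *       (PUCHAR)BCRYPT_CHAIN_MODE_ECB, sizeof(BCRYPT_CHAIN_MODE_ECB), 0);
 */
static NTSTATUS EncryptPacketNumber(
    BCRYPT_KEY_HANDLE PnKey,   /* AES-ECB packet number protection key */
    UCHAR Sample[16],          /* ciphertext sample used as the CTR input */
    UCHAR* PnBytes,            /* encoded packet number bytes in the header */
    ULONG PnLength)            /* 1 to 4 octets */
{
    UCHAR KeyStream[16];
    ULONG Written = 0;

    /* ECB-encrypt the sample to produce one block of keystream. */
    NTSTATUS Status = BCryptEncrypt(
        PnKey, Sample, sizeof(KeyStream), NULL, NULL, 0,
        KeyStream, sizeof(KeyStream), &Written, 0);
    if (BCRYPT_SUCCESS(Status)) {
        /* The XOR step that bcrypt's missing CTR mode leaves to us. */
        for (ULONG i = 0; i < PnLength; ++i) {
            PnBytes[i] ^= KeyStream[i];
        }
    }
    return Status;
}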

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

________________________________
From: Praveen Balasubramanian
Sent: Friday, June 22, 2018 9:26:44 AM
To: Ian Swett; Kazuho Oku
Cc: Nick Banks; IETF QUIC WG
Subject: RE: Packet Number Encryption Performance

Ian, do you expect that fraction of the overall cost to hold once the UDP stack is optimized? Is your measurement on top of the recent kernel improvements? I expect the crypto fraction of the overall cost to keep increasing over time as the network stack bottlenecks are eliminated.

Kazuho, should the draft describe the optimizations you are making? Or are these too OpenSSL-specific?

From: QUIC [mailto:quic-bounces@ietf.org] On Behalf Of Ian Swett
Sent: Friday, June 22, 2018 4:45 AM
To: Kazuho Oku <kazuhooku@gmail.com>
Cc: Nick Banks <nibanks@microsoft.com>; IETF QUIC WG <quic@ietf.org>
Subject: Re: Packet Number Encryption Performance

Thanks for digging into the details of this, Kazuho. A <4% increase in crypto cost is a bit more than I originally expected (~2%), but crypto is less than 10% of my CPU usage, so it's still less than 0.5% total, which is acceptable to me.

On Fri, Jun 22, 2018 at 2:45 AM Kazuho Oku <kazuhooku@gmail.com> wrote:


2018-06-22 12:22 GMT+09:00 Kazuho Oku <kazuhooku@gmail.com>:


2018-06-22 11:54 GMT+09:00 Nick Banks <nibanks@microsoft.com>:
Hi Kazuho,

Thanks for sharing your numbers as well! I'm a bit confused where you say you can reduce the 10% overhead to 2% to 4%. How do you plan on doing that?

As stated in my previous mail, the 10% overhead consists of three parts, each consuming a comparable number of CPU cycles. Two of the three are related to the abstraction layer and how CTR is implemented, while the remaining one is the core AES-ECB operation cost.

It should be possible to remove the costly abstraction layer.

It should also be possible to remove the overhead of CTR, since in PNE we need to XOR at most 4 octets (applying the XOR is the only difference between CTR and ECB). That cost should be possible to nullify.

Considering these aspects, and looking at the numbers in the OpenSSL source code (as well as considering the overhead of GCM), my expectation is 2% to 4%.

Just did some experiments and it seems that the expectation was correct.

The benchmarks tell me that the overhead goes down from 10.0% to 3.8%, by doing the following:

* remove the overhead of CTR abstraction (i.e. use the ECB backend and do XOR by ourselves)
* remove the overhead of the abstraction layer (i.e. call the method returned by EVP_CIPHER_meth_get_do_cipher instead of calling EVP_EncryptUpdate)

Of course the changes are specific to OpenSSL, but I would expect that you can see similar numbers, assuming that you have access to an optimized AES implementation.
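
(To make the two bullet points concrete, here is a minimal sketch of the idea against OpenSSL 1.1.0's EVP API; the function name and surrounding structure are made up, and this is not quicly's actual patch.)

#include <openssl/evp.h>

/* Setup, once per connection:
 *   EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
 *   EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, pn_key, NULL);
 */
static int pne_apply(EVP_CIPHER_CTX *ecb_ctx, const unsigned char sample[16],
                     unsigned char *pn_bytes, size_t pn_len /* at most 4 */)
{
    unsigned char keystream[16];

    /* Second bullet: skip EVP_EncryptUpdate's per-call bookkeeping by
     * invoking the raw block function registered for the cipher. */
    int (*do_cipher)(EVP_CIPHER_CTX *, unsigned char *,
                     const unsigned char *, size_t) =
        EVP_CIPHER_meth_get_do_cipher(EVP_CIPHER_CTX_cipher(ecb_ctx));

    if (!do_cipher(ecb_ctx, keystream, sample, sizeof(keystream)))
        return 0;

    /* First bullet: for a single block, CTR is just ECB plus XOR. */
    for (size_t i = 0; i < pn_len; ++i)
        pn_bytes[i] ^= keystream[i];
    return 1;
}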



Sent from my Windows 10 phone

________________________________
From: Kazuho Oku <kazuhooku@gmail.com>
Sent: Thursday, June 21, 2018 7:21:17 PM
To: Nick Banks
Cc: quic@ietf.org
Subject: Re: Packet Number Encryption Performance

Hi Nick,

Thank you for bringing the numbers to the list.

I have just run a small benchmark using Quicly, and I see comparable numbers.

To be precise, I see a 10.0% increase in CPU cycles when encrypting an Initial packet of 1,280 octets. I expect that we will see similar numbers on other QUIC stacks that also use picotls (with OpenSSL as a backend). Note that the number compares only the cost of encryption; the overhead ratio will be much smaller if we look at the total number of CPU cycles spent by a QUIC stack as a whole.

Looking at the profile, the overhead consists of three operations, each of which consumes a comparable number of CPU cycles: the core AES operation (using AES-NI), the CTR operation overhead, and CTR initialization. Note that picotls at the moment provides access to the CTR crypto beneath the AEAD interface, which is to be used by the QUIC stacks.

I would assume that we can cut the overhead down to somewhere between 2% and 4%, but it might be hard to go down to somewhere near 1%, because we cannot parallelize the AES operation of PNE with that of AEAD (see https://github.com/openssl/openssl/blob/OpenSSL_1_1_0h/crypto/aes/asm/aesni-x86_64.pl#L24-L39 about the impact of parallelization).

I do not think that 2% to 4% of additional overhead on the crypto is an issue for QUIC/HTTP, but the current overhead of 10% is something that we might want to decrease. I am glad to have learned that now.


2018-06-22 5:48 GMT+09:00 Nick Banks <nibanks=40microsoft.com@dmarc.ietf.org>:
Hello QUIC WG,

I recently implemented PNE for WinQuic (using bcrypt APIs) and I decided to get some performance numbers to see what the overhead of PNE was. I figured the rest of the WG might be interested.

My test just encrypts the same buffer (size dependent on the test case) 10,000,000 times and measures the time it takes. The test then does the same thing, but also encrypts the packet number. I ran all of that 10 times in total, then collected the best times for each category to produce the following graphs and tables (full excel doc attached):

Bytes    No PNE (ms)    PNE (ms)    PNE Overhead    No PNE (Mbps)    PNE (Mbps)
   4      2284.671      3027.657        33%            140.064        105.692
  16      2102.402      2828.204        35%            608.827        452.584
  64      2198.883      2907.577        32%           2328.45        1760.92
 256      2758.3        3490.28         27%           7424.86        5867.72
 600      4669.283      5424.539        16%          10280           8848.68
1000      6130.139      6907.805        13%          13050.3        11581.1
1200      6458.679      7229.672        12%          14863.7        13278.6
1450      7876.312      8670.16         10%          14727.7        13379.2


I used a server-grade lab machine I had at my disposal, running the latest Windows 10 Server Datacenter build. Again, these numbers are for crypto only; no QUIC or UDP is included.
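
(For anyone wanting to reproduce this kind of measurement, a minimal sketch of such a loop is below; the encrypt callback, buffer handling, and output format are illustrative, not WinQuic's actual harness.)

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000ULL  /* same count as the test above */

/* Encrypt the same buffer repeatedly and report ms and Mbps, as in the
 * tables above. encrypt_fn stands in for the AEAD (+ optional PNE) step. */
static void run_case(size_t payload_bytes,
                     void (*encrypt_fn)(unsigned char *, size_t))
{
    static unsigned char buf[1500];
    clock_t start = clock();
    for (unsigned long long i = 0; i < ITERATIONS; ++i)
        encrypt_fn(buf, payload_bytes);
    double ms = (clock() - start) * 1000.0 / CLOCKS_PER_SEC;
    /* bits / (ms * 1000) == megabits per second */
    double mbps = payload_bytes * 8.0 * ITERATIONS / (ms * 1000.0);
    printf("%zu bytes: %.3f ms, %.3f Mbps\n", payload_bytes, ms, mbps);
}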

Thanks,
- Nick




--
Kazuho Oku