Re: Packet Number Encryption Performance

Kazuho Oku <kazuhooku@gmail.com> Sat, 23 June 2018 00:01 UTC

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Sat, 23 Jun 2018 09:01:35 +0900
Message-ID: <CANatvzzpZTh516VB40gHiMZq5oOjNRMPpNouCRd-WosEcKaisA@mail.gmail.com>
Subject: Re: Packet Number Encryption Performance
To: Ian Swett <ianswett@google.com>
Cc: Nick Banks <nibanks=40microsoft.com@dmarc.ietf.org>, Praveen Balasubramanian <pravb@microsoft.com>, IETF QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/VTuVIDj55nVydZMgwk4dMJHd53o>

2018-06-23 3:51 GMT+09:00 Ian Swett <ianswett@google.com>:

> I expect crypto to increase as a fraction of CPU, but I don't expect it to
> go much higher than 10%.
>
> But who knows, maybe 2 years from now everything else will be very
> optimized and crypto will be 15%?
>

Ian, thank you for sharing the proportion of CPU cycles we are likely to
spend on crypto.

Your numbers relieve me, because even if the cost of crypto goes to 15%,
the overhead of PNE will be less than 1% (0.15*0.04=0.006, i.e. 0.6%).
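
For reference, the same arithmetic as a small sketch. The function name
and the grid of values are only illustrative; the inputs are the crypto
fractions and PNE overheads mentioned in this thread, not new
measurements.

```python
def total_overhead(crypto_fraction: float, pne_crypto_overhead: float) -> float:
    """Fraction of total CPU cost added by PNE.

    crypto_fraction:     share of total CPU cycles spent in crypto (e.g. 0.15)
    pne_crypto_overhead: relative increase of the crypto cost caused by PNE
                         (e.g. 0.04 for 4%)
    """
    return crypto_fraction * pne_crypto_overhead


# Crypto fractions and PNE overheads that appear in the discussion.
for crypto_fraction in (0.10, 0.15):
    for pne_overhead in (0.02, 0.038, 0.04, 0.10):
        extra = total_overhead(crypto_fraction, pne_overhead)
        print(f"crypto={crypto_fraction:.0%} pne={pne_overhead:.1%} "
              f"-> total +{extra:.2%}")
```

Even the unoptimized 10% figure stays at 1.5% of the total when crypto
is 15% of the CPU cost.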

I would also like to note that it is likely that HyperThreading, when used,
will eliminate the overhead of PNE.

This is because IIUC PNE is a marginal additional use of the AES-NI engine,
which has been mostly idle. The overhead of crypto is small enough (i.e. 15%)
that we will rarely see contention on the engine. While one hyperthread
does AES, the other hyperthread will run at full speed doing other
operations.

Also considering the fact that the number of CPU cycles spent per QUIC
packet does not change a lot with PNE, I would not be surprised to see *no*
decrease of throughput when PNE is used on a HyperThreading architecture.
In such a case, the only thing we will observe is a rise in the utilization
ratio of the AES-NI engine.


>
> On Fri, Jun 22, 2018 at 12:34 PM Nick Banks <nibanks=40microsoft.com@
> dmarc.ietf.org> wrote:
>
>> I just want to add, my implementation already uses ECB from bcrypt (and I
>> do the XOR). Bcrypt doesn’t expose CTR mode directly.
>>
>>
>> ------------------------------
>> *From:* Praveen Balasubramanian
>> *Sent:* Friday, June 22, 2018 9:26:44 AM
>> *To:* Ian Swett; Kazuho Oku
>> *Cc:* Nick Banks; IETF QUIC WG
>> *Subject:* RE: Packet Number Encryption Performance
>>
>>
>> Ian, do you expect that fraction of overall cost to hold once the UDP
>> stack is optimized? Is your measurement on top of the recent kernel
>> improvements? I expect crypto fraction of overall cost to keep increasing
>> over time as the network stack bottlenecks are eliminated.
>>
>>
>>
>> Kazuho, should the draft describe the optimizations you are making? Or
>> are these too OpenSSL-specific?
>>
>>
>>
>> *From:* QUIC [mailto:quic-bounces@ietf.org] *On Behalf Of *Ian Swett
>> *Sent:* Friday, June 22, 2018 4:45 AM
>> *To:* Kazuho Oku <kazuhooku@gmail.com>
>> *Cc:* Nick Banks <nibanks@microsoft.com>; IETF QUIC WG <quic@ietf.org>
>> *Subject:* Re: Packet Number Encryption Performance
>>
>>
>>
>> Thanks for digging into the details of this, Kazuho.  A <4% increase in
>> crypto cost is a bit more than I originally expected (~2%), but crypto is
>> less than 10% of my CPU usage, so it's still less than 0.5% total, which is
>> acceptable to me.
>>
>>
>>
>> On Fri, Jun 22, 2018 at 2:45 AM Kazuho Oku <kazuhooku@gmail.com> wrote:
>>
>>
>>
>>
>>
>> 2018-06-22 12:22 GMT+09:00 Kazuho Oku <kazuhooku@gmail.com>:
>>
>>
>>
>>
>>
>> 2018-06-22 11:54 GMT+09:00 Nick Banks <nibanks@microsoft.com>:
>>
>> Hi Kazuho,
>>
>>
>>
>> Thanks for sharing your numbers as well! I'm a bit confused where you say
>> you can reduce the 10% overhead to 2% to 4%. How do you plan on doing that?
>>
>>
>>
>> As stated in my previous mail, the 10% of overhead consists of three
>> parts, each consuming a comparable number of CPU cycles. Two of the
>> three are related to the abstraction layer and how CTR is implemented,
>> while the third is the core AES-ECB operation cost.
>>
>>
>>
>> It should be possible to remove the costly abstraction layer.
>>
>>
>>
>> It should also be possible to remove the overhead of CTR, since in PNE,
>> we need to XOR at most 4 octets (applying XOR is the only difference
>> between CTR and ECB). It should be possible to make that cost
>> negligible.
>>
>>
>>
>> Considering these aspects, and by looking at the numbers in the OpenSSL
>> source code (as well as considering the overhead of GCM), my expectation
>> is 2% to 4%.
>>
>>
>>
>> Just did some experiments and it seems that the expectation was correct.
>>
>>
>>
>> The benchmarks tell me that the overhead goes down from 10.0% to 3.8%, by
>> doing the following:
>>
>>
>>
>> * remove the overhead of CTR abstraction (i.e. use the ECB backend and do
>> XOR by ourselves)
>>
>> * remove the overhead of the abstraction layer (i.e. call the method
>> returned by EVP_CIPHER_meth_get_do_cipher instead of calling
>> EVP_EncryptUpdate)
>>
>>
>>
>> Of course the changes are specific to OpenSSL, but I would expect that
>> you can get similar numbers assuming that you have access to an
>> optimized AES implementation.
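
To illustrate the "use ECB and do the XOR ourselves" idea quoted above,
here is a minimal sketch. It is not the picotls or OpenSSL code: the
function names are hypothetical, and standin_encrypt_block is a
hash-based placeholder so that the example runs without a crypto
library. A real stack would call a single AES-ECB block encryption
there. The counter_block argument stands for whatever 16-octet input
the stack would otherwise hand to CTR mode as the initial counter.

```python
import hashlib
from typing import Callable

BLOCK_SIZE = 16  # AES block size in octets


def standin_encrypt_block(key: bytes, block: bytes) -> bytes:
    """Placeholder for one AES-ECB block encryption (this is NOT AES)."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("block must be exactly 16 octets")
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def xor_packet_number(
    encrypt_block: Callable[[bytes, bytes], bytes],
    key: bytes,
    counter_block: bytes,
    pn_bytes: bytes,
) -> bytes:
    """Apply one block of CTR keystream to a packet number of 1 to 4 octets.

    CTR over a single block is the block cipher applied to the counter
    block, followed by XOR, so one block encryption plus a few XORs is
    enough. The same call encrypts and decrypts.
    """
    if not 1 <= len(pn_bytes) <= 4:
        raise ValueError("packet number must be 1 to 4 octets")
    keystream = encrypt_block(key, counter_block)
    return bytes(p ^ k for p, k in zip(pn_bytes, keystream))


key = b"\x00" * 16
counter_block = bytes(range(16))
pn = (0x1234).to_bytes(2, "big")
protected = xor_packet_number(standin_encrypt_block, key, counter_block, pn)
restored = xor_packet_number(standin_encrypt_block, key, counter_block, protected)
assert restored == pn
```

The point of the sketch is only that no CTR state or counter handling
is needed when at most 4 octets of keystream are consumed.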
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> *From:* Kazuho Oku <kazuhooku@gmail.com>
>> *Sent:* Thursday, June 21, 2018 7:21:17 PM
>> *To:* Nick Banks
>> *Cc:* quic@ietf.org
>> *Subject:* Re: Packet Number Encryption Performance
>>
>>
>>
>> Hi Nick,
>>
>>
>>
>> Thank you for bringing the numbers to the list.
>>
>>
>>
>> I have just run a small benchmark using Quicly, and I see comparable
>> numbers.
>>
>>
>>
>> To be precise, I see a 10.0% increase of CPU cycles when encrypting an
>> Initial packet of 1,280 octets. I expect that we will see similar numbers
>> on other QUIC stacks that also use picotls (with OpenSSL as a backend).
>> Note that the number only compares the cost of encryption; the overhead
>> ratio will be much smaller if we look at the total number of CPU cycles
>> spent by a QUIC stack as a whole.
>>
>>
>>
>> Looking at the profile, the overhead consists of three operations that
>> each consume a comparable number of CPU cycles: core AES operation (using
>> AES-NI), CTR operation overhead, and CTR initialization. Note that picotls
>> at the moment provides access to CTR crypto beneath the AEAD interface,
>> which is to be used by the QUIC stacks.
>>
>>
>>
>> I would assume that we can cut down the overhead to somewhere between 2%
>> and 4%, but it might be hard to go down to somewhere near 1%, because we
>> cannot parallelize the AES operation of PNE with that of AEAD (see
>> https://github.com/openssl/openssl/blob/OpenSSL_1_1_0h/crypto/aes/asm/aesni-x86_64.pl#L24-L39
>> about the impact of parallelization).
>>
>>
>>
>> I do not think that 2% to 4% of additional overhead to the crypto is an
>> issue for QUIC/HTTP, but the current overhead of 10% is something that we
>> might want to decrease. I am glad to be able to learn that now.
>>
>>
>>
>>
>>
>> 2018-06-22 5:48 GMT+09:00 Nick Banks <nibanks=40microsoft.com@
>> dmarc.ietf.org>:
>>
>> Hello QUIC WG,
>>
>>
>>
>> I recently implemented PNE for WinQuic (using bcrypt APIs) and I decided
>> to get some performance numbers to see what the overhead of PNE was. I
>> figured the rest of the WG might be interested.
>>
>>
>>
>> My test just encrypts the same buffer (size dependent on the test case)
>> 10,000,000 times and measures the time it takes. The test then does the
>> same thing, but also encrypts the packet number. I ran all of that 10
>> times in total. I then collected the best times for each category to
>> produce the following graphs and tables (full excel doc attached):
>>
>>
>>
>> [image: cid:image003.png@01D40966.7655B6B0]
>>
>>
>>
>>        |            Time (ms)               |    Rate (Mbps)
>> Bytes  | No PNE    PNE        PNE Overhead  | No PNE    PNE
>> -------+------------------------------------+--------------------
>> 4      | 2284.671  3027.657   33%           | 140.064   105.692
>> 16     | 2102.402  2828.204   35%           | 608.827   452.584
>> 64     | 2198.883  2907.577   32%           | 2328.45   1760.92
>> 256    | 2758.3    3490.28    27%           | 7424.86   5867.72
>> 600    | 4669.283  5424.539   16%           | 10280     8848.68
>> 1000   | 6130.139  6907.805   13%           | 13050.3   11581.1
>> 1200   | 6458.679  7229.672   12%           | 14863.7   13278.6
>> 1450   | 7876.312  8670.16    10%           | 14727.7   13379.2
>>
>>
>>
>> I used a server grade lab machine I had at my disposal, running the
>> latest Windows 10 Server DataCenter build. Again, these numbers are for
>> crypto only. No QUIC or UDP is included.
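
The derived columns of the table above can be recomputed from the raw
times. The sketch below assumes that the overhead is the relative
increase in time and that the rate is buffer size times the 10,000,000
iterations divided by the elapsed time. These formulas are not stated
in the quoted mail; they are an assumption that reproduces the reported
values. The names are illustrative.

```python
ITERATIONS = 10_000_000  # encryptions per test case, as stated in the mail

# (buffer size in bytes, time without PNE in ms, time with PNE in ms)
MEASURED = [
    (4, 2284.671, 3027.657),
    (16, 2102.402, 2828.204),
    (64, 2198.883, 2907.577),
    (256, 2758.3, 3490.28),
    (600, 4669.283, 5424.539),
    (1000, 6130.139, 6907.805),
    (1200, 6458.679, 7229.672),
    (1450, 7876.312, 8670.16),
]


def overhead_percent(no_pne_ms: float, pne_ms: float) -> int:
    """Relative increase in elapsed time caused by PNE, in whole percent."""
    return round((pne_ms / no_pne_ms - 1) * 100)


def rate_mbps(size_bytes: int, elapsed_ms: float) -> float:
    """Throughput in Mbps for ITERATIONS encryptions of size_bytes each."""
    total_bits = size_bytes * 8 * ITERATIONS
    return total_bits / (elapsed_ms / 1000) / 1e6


for size, no_pne_ms, pne_ms in MEASURED:
    print(f"{size:>5} bytes: overhead {overhead_percent(no_pne_ms, pne_ms)}%, "
          f"{rate_mbps(size, no_pne_ms):.1f} -> {rate_mbps(size, pne_ms):.1f} Mbps")
```

The fixed per-call cost of the extra AES block explains why the
relative overhead shrinks as the buffer grows.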
>>
>>
>>
>> Thanks,
>>
>> - Nick
>>
>>


-- 
Kazuho Oku