Handshake processing delays and impact on recovery

Jana Iyengar <jri.ietf@gmail.com> Fri, 17 July 2020 09:01 UTC

From: Jana Iyengar <jri.ietf@gmail.com>
Date: Fri, 17 Jul 2020 02:00:51 -0700
Message-ID: <CACpbDcctB0nX2yTxHQEs6UcUCTH0WdfiNAPfpRKe3oH-0aYZYA@mail.gmail.com>
Subject: Handshake processing delays and impact on recovery
To: QUIC WG <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/KBxSt7rv2PIs6RyumXwRJ0KUK6Q>

I'd like to bring this slightly tricky issue under the wg's eyes.
Discussions so far have yielded a strategy that I propose further below.
This summarizes conversations in 3821
<https://github.com/quicwg/base-drafts/issues/3821> and 3874
<https://github.com/quicwg/base-drafts/pull/3874> (and the outcomes might
touch 3889 <https://github.com/quicwg/base-drafts/issues/3889>). The
purpose of this email is to see if we can agree on the strategies outlined
here, so we can move forward on the PR there and/or other associated issues
and PRs.

Problems:

During the handshake, an endpoint can receive packets (carrying data or
acknowledgements) that it cannot decrypt because the requisite keys are not
yet available. This can happen because the crypto machinery at the endpoint
is waiting for earlier packets to be received to yield the keys. Similarly,
a client might not be able to read 1-RTT packets received from the server
because it is waiting for certificate validation to complete.

The assumption in the TLS specification is that when an endpoint is busy
doing something (certificate validation or key generation for instance), it
won't be able to respond to received packets with acknowledgements (Section
4.1.4
<https://quicwg.org/base-drafts/draft-ietf-quic-tls.html#section-4.1.4-1>).

Contrary to the assumption in the spec, Chrome and Firefox both process the
certificate chain asynchronously, and therefore ack handshake packets
immediately, even the ones that arrive after they have started verifying
the certificate chain. But they buffer 1-RTT packets, and therefore the
issue exists for 1-RTT packets.

As a result, there's potential delay, sometimes significant, in responding
to some packets during the handshake. There are two side effects of this.

S1.

First, this means that an endpoint that runs a PTO timer and retransmits
data for 0-RTT and 1-RTT packets while the handshake is still in progress
might be doing this quite unnecessarily. The peer may well have received
this data and have queued it, waiting to be able to decrypt it.

S2.

Second, acks for these packets that are buffered and then later decrypted
and read can be sent much later than when the packets were received at the
peer, causing the ack_delay to potentially be much higher than the
advertised max_ack_delay. The endpoint ignores ack_delay larger than
max_ack_delay, and this inflates the smoothed_rtt and rtt_variance,
potentially inflating the PTO significantly (which can cause handshakes to
fail if the endpoint employs a "handshake timeout" period).
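
To make the inflation in S2 concrete, here is a rough sketch of the RTT
update and PTO computation. The EWMA weights and the PTO formula follow the
recovery draft as I understand it; the names (RttState, update_rtt, pto) and
the sample numbers are made up for illustration, and I model "ignoring" the
excess as capping ack_delay at max_ack_delay, which is an assumption.

```python
from dataclasses import dataclass

K_GRANULARITY = 0.001  # seconds; assumed timer granularity


@dataclass
class RttState:
    min_rtt: float
    smoothed_rtt: float
    rttvar: float


def update_rtt(state, latest_rtt, ack_delay, max_ack_delay, cap_ack_delay=True):
    """Fold one RTT sample into the estimator (all values in seconds)."""
    state.min_rtt = min(state.min_rtt, latest_rtt)
    if cap_ack_delay:
        # Assumption: delay beyond max_ack_delay is not credited to the peer.
        ack_delay = min(ack_delay, max_ack_delay)
    adjusted = latest_rtt
    if latest_rtt >= state.min_rtt + ack_delay:
        adjusted = latest_rtt - ack_delay
    state.rttvar = 0.75 * state.rttvar + 0.25 * abs(state.smoothed_rtt - adjusted)
    state.smoothed_rtt = 0.875 * state.smoothed_rtt + 0.125 * adjusted
    return state


def pto(state, max_ack_delay):
    return state.smoothed_rtt + max(4 * state.rttvar, K_GRANULARITY) + max_ack_delay


# Hypothetical sample: 50 ms path RTT; the peer held the packet for 200 ms
# while waiting for keys, so the measured RTT is 260 ms.
MAX_ACK_DELAY = 0.025
capped = update_rtt(RttState(0.05, 0.05, 0.025), 0.26, 0.2, MAX_ACK_DELAY)
uncapped = update_rtt(RttState(0.05, 0.05, 0.025), 0.26, 0.2, MAX_ACK_DELAY,
                      cap_ack_delay=False)
print(pto(capped, MAX_ACK_DELAY), pto(uncapped, MAX_ACK_DELAY))
```

With the cap applied, most of the 200 ms of buffering lands in the RTT
sample, so both smoothed_rtt and rttvar jump and the PTO comes out more than
twice as large as when the full reported delay is subtracted.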


Proposed Solutions:

S1.

To solve the first problem, we already have a condition in the recovery
draft to _not_ arm the PTO on 1-RTT data until the handshake is
complete (Section
6.2.1
<https://quicwg.org/base-drafts/draft-ietf-quic-recovery.html#section-6.2.1-6>).
This is not quite adequate, since it still allows a client to arm a PTO
timer on 1-RTT data on sending a CFIN (client Finished) and 1-RTT data,
before ensuring that the server has received the CFIN. The proposed
solution here is to change this condition from "handshake complete" to
"handshake confirmed", meaning that an endpoint only arms a PTO timer on
0-RTT or 1-RTT data when it knows that the peer has the keys necessary to
read and write in 1-RTT, and can therefore respond with an ACK.
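
As a rough sketch of the difference between the two conditions (the Space
enum, may_arm_pto, and the `proposed` flag are hypothetical names for
illustration, not anything from the drafts):

```python
from enum import Enum


class Space(Enum):
    INITIAL = 0
    HANDSHAKE = 1
    APPLICATION = 2  # 0-RTT and 1-RTT packets


def may_arm_pto(space, handshake_complete, handshake_confirmed, proposed=True):
    """Return True if a PTO timer may be armed for in-flight data in `space`."""
    if space is not Space.APPLICATION:
        # Initial and Handshake data are not affected by this condition.
        return True
    if proposed:
        # Proposed rule: wait until the peer is known to have 1-RTT keys.
        return handshake_confirmed
    # Current rule: the local handshake being complete is enough.
    return handshake_complete


# Client has just sent CFIN plus 1-RTT data: its handshake is complete, but
# it does not yet know that the server has received the CFIN.
current_rule = may_arm_pto(Space.APPLICATION, True, False, proposed=False)
proposed_rule = may_arm_pto(Space.APPLICATION, True, False, proposed=True)
print(current_rule, proposed_rule)
```

Under the current rule the client arms the timer in this state; under the
proposed rule it waits for confirmation.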

We think this holds up, but we'd like to get this under everyone's eyes, to
ensure that the discussion isn't missing subtle deadlock issues.

S2.

To solve the second problem, the proposal is to make appropriate use of the
ack_delay reported by the endpoint that buffers the data. ack_delay is the
amount of time that the information (whether the packet itself or the
acknowledgement) is intentionally buffered. For this, we need the
following:

- endpoints MUST report an ack_delay that is the sum of the controlled
delays (i.e. the time the data is buffered plus the time the ack is
buffered)

- endpoints ignore max_ack_delay for acknowledgements of packets that were
sent prior to handshake confirmation. Those are the ones that might have
been buffered at the peer until keys were ready.
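
The two requirements above could be sketched roughly as follows. Function
and parameter names are hypothetical, times are in integer milliseconds, and
treating max_ack_delay as a cap after confirmation is my assumption about
how the limit is applied.

```python
def reported_ack_delay(buffered_for_keys_ms, ack_hold_ms):
    """Receiver side: total intentional delay to put in the ACK frame.

    buffered_for_keys_ms: time the packet waited for keys to become ready.
    ack_hold_ms: time the acknowledgement itself was deliberately held back.
    """
    return buffered_for_keys_ms + ack_hold_ms


def usable_ack_delay(ack_delay_ms, max_ack_delay_ms, sent_before_confirmation):
    """Sender side: how much of the reported delay to subtract from the RTT."""
    if sent_before_confirmation:
        # The packet may have been buffered at the peer until keys were
        # ready, so max_ack_delay is not applied.
        return ack_delay_ms
    return min(ack_delay_ms, max_ack_delay_ms)


# Hypothetical numbers: 180 ms waiting for keys, 20 ms of delayed ack.
delay = reported_ack_delay(180, 20)
print(usable_ack_delay(delay, 25, sent_before_confirmation=True))
print(usable_ack_delay(delay, 25, sent_before_confirmation=False))
```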

In addition, we explicitly recommend that endpoints that asynchronously
process handshake messages buffer 1-RTT packets instead of dropping them,
as dropping packets has a significant impact on congestion control. Note
that this matches what those implementations already do.
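
For what it's worth, a minimal sketch of such a buffer, assuming a bounded
queue that remembers when each packet arrived so that the time spent waiting
can later be included in ack_delay. The class name and the size limit are
hypothetical, not taken from any implementation.

```python
from collections import deque


class UndecryptableBuffer:
    """Holds packets that arrived before their keys were available."""

    def __init__(self, max_packets=10):
        self._queue = deque()
        self._max = max_packets  # assumed bound, to limit memory use

    def hold(self, packet, received_at):
        """Queue a packet; returns False if the buffer is full."""
        if len(self._queue) >= self._max:
            return False
        self._queue.append((packet, received_at))
        return True

    def release(self, now):
        """Keys are ready: return (packet, time_buffered) pairs in order."""
        released = []
        while self._queue:
            packet, received_at = self._queue.popleft()
            released.append((packet, now - received_at))
        return released


buf = UndecryptableBuffer(max_packets=2)
buf.hold(b"pkt-a", received_at=1.0)
buf.hold(b"pkt-b", received_at=1.25)
overflow_accepted = buf.hold(b"pkt-c", received_at=1.3)
ready = buf.release(now=1.5)
print(overflow_accepted, ready)
```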



Thoughts?


- jana