[TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3

David Benjamin <davidben@chromium.org> Fri, 12 April 2024 23:00 UTC

Hi all,

This is going to be a bit long. In short, DTLS 1.3 KeyUpdates seem to
conflate the peer *receiving* the KeyUpdate with the peer *processing* the
KeyUpdate, in ways that appear to break some assumptions made by the
protocol design.

*When to switch keys in KeyUpdate*

So, first, DTLS 1.3, unlike TLS 1.3, applies the KeyUpdate on the ACK, not
when the KeyUpdate is sent. This makes sense because KeyUpdate records are
not intrinsically ordered with app data records sent after them:

> As with other handshake messages with no built-in response, KeyUpdates
MUST be acknowledged. In order to facilitate epoch reconstruction (Section
4.2.2), implementations MUST NOT send records with the new keys or send a
new KeyUpdate until the previous KeyUpdate has been acknowledged (this
avoids having too many epochs in active use).
https://www.rfc-editor.org/rfc/rfc9147.html#section-8-1
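
To make that rule concrete, here's a minimal Python sketch of the sender-side gating under my reading; the class and method names are mine, not the RFC's:

    class KeyUpdateSender:
        """RFC 9147, Section 8 gating: don't send records under the new
        keys, or another KeyUpdate, until the previous one is ACKed."""

        def __init__(self, send_epoch=3):
            self.send_epoch = send_epoch       # current sending epoch
            self.key_update_in_flight = False  # sent but not yet ACKed

        def send_key_update(self):
            assert not self.key_update_in_flight, "previous KeyUpdate unACKed"
            self.key_update_in_flight = True
            # ...queue the KeyUpdate; keep sending application data under
            # the *current* epoch until the ACK arrives...

        def on_key_update_acked(self):
            # Only on the ACK does the sender install and switch to N+1.
            self.key_update_in_flight = False
            self.send_epoch += 1
            # ...derive write keys for the new epoch here...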

Now, the parenthetical says this is to avoid having too many epochs in
active use, but the protocol appears to make stronger assumptions than that:

> After the handshake is complete, if the epoch bits do not match those
from the current epoch, implementations SHOULD use the most recent **past**
epoch which has matching bits, and then reconstruct the sequence number for
that epoch as described above.
https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.2-3
(emphasis mine)

> After the handshake, implementations MUST use the highest available
sending epoch [to send ACKs]
https://www.rfc-editor.org/rfc/rfc9147.html#section-7-7
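
For reference, the Section 4.2.2 reconstruction rule amounts to something like this sketch (the record's unified header carries the low two bits of the epoch; the function name is mine):

    def reconstruct_epoch(epoch_bits, current_epoch):
        # Post-handshake: take the most recent epoch at or before
        # current_epoch whose low two bits match the record's epoch bits.
        candidate = (current_epoch & ~0b11) | epoch_bits
        if candidate > current_epoch:
            candidate -= 4
        return candidate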

These two snippets imply the protocol wants the peer to definitely have
installed the new keys before you start using them. This makes sense
because sending stuff the peer can't decrypt is pretty silly. As an aside,
DTLS 1.3 retains this text from DTLS 1.2:

> Conversely, it is possible for records that are protected with the new
epoch to be received prior to the completion of a handshake. For instance,
the server may send its Finished message and then start transmitting data.
Implementations MAY either buffer or discard such records, though when DTLS
is used over reliable transports (e.g., SCTP [RFC4960]), they SHOULD be
buffered and processed once the handshake completes.
https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-2

The text from DTLS 1.2 talks about *a* handshake, which presumably refers
to rekeying via renegotiation. But in DTLS 1.3, the epoch reconstruction
rule and the KeyUpdate rule mean this is only possible during the
handshake, when you see epoch 4 and expect epoch 0-3. The steady state
rekeying mechanism never hits this case. (This is a reasonable change
because there's no sense in unnecessarily introducing blips where the
connection is less tolerant of reordering.)

*Buffered handshake messages*

Okay, so KeyUpdates want to wait for the recipient to install keys, except
we don't seem to actually achieve this! Section 5.2 says:

> DTLS implementations maintain (at least notionally) a next_receive_seq
counter. This counter is initially set to zero. When a handshake message is
received, if its message_seq value matches next_receive_seq,
next_receive_seq is incremented and the message is processed. If the
sequence number is less than next_receive_seq, the message MUST be
discarded. If the sequence number is greater than next_receive_seq, the
implementation SHOULD queue the message but MAY discard it. (This is a
simple space/bandwidth trade-off).
https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-7
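
In code, that rule looks roughly like the following sketch (the types and the process callback are hypothetical; the buffer is the SHOULD-queue branch):

    from dataclasses import dataclass, field

    @dataclass
    class Msg:
        message_seq: int
        msg_type: str  # e.g. "NewSessionTicket", "KeyUpdate"

    @dataclass
    class ReceiveState:
        next_receive_seq: int = 0
        buffer: dict = field(default_factory=dict)  # message_seq -> Msg

    def on_handshake_message(state, msg, process):
        if msg.message_seq < state.next_receive_seq:
            return  # old retransmission: MUST discard
        if msg.message_seq > state.next_receive_seq:
            state.buffer[msg.message_seq] = msg  # SHOULD queue (MAY discard)
            return
        process(msg)
        state.next_receive_seq += 1
        # Drain any buffered messages that are now in order.
        while state.next_receive_seq in state.buffer:
            process(state.buffer.pop(state.next_receive_seq))
            state.next_receive_seq += 1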

I assume this is intended to apply to post-handshake messages too. (See
below for a discussion of the alternative.) But that means that, when you
receive a KeyUpdate, you might not immediately process it. Suppose
next_receive_seq is 5, and the peer sends NewSessionTicket(5),
NewSessionTicket(6), and KeyUpdate(7). Message 5 is lost, but 6 and 7 come
in, perhaps even in the same record, which forces you to ACK both or
neither. Suppose the implementation is willing to buffer three messages
ahead; it then ACKs the 6+7 record under the rules in section 7, which
permit ACKing fragments that were buffered but not yet processed.
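
Plugging that scenario into the sketch above (same hypothetical names):

    # next_receive_seq is 5, message 5 was lost, and one record carries
    # messages 6 and 7.
    state = ReceiveState(next_receive_seq=5)
    on_handshake_message(state, Msg(6, "NewSessionTicket"), process=print)
    on_handshake_message(state, Msg(7, "KeyUpdate"), process=print)
    assert state.next_receive_seq == 5  # both messages merely buffered
    # Section 7 permits ACKing the record anyway -- and on that ACK the
    # peer switches to epoch N+1, even though we haven't processed the
    # KeyUpdate and won't have N+1 read keys until 5 is retransmitted.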

That means the peer will switch keys, and all subsequent records from them
will come from epoch N+1. But the receiver is not ready for N+1 yet, so we
contradict everything above. We also contradict this parenthetical in
section 8:

> Due to loss and/or reordering, DTLS 1.3 implementations may receive a
record with an older epoch than the current one (the requirements above
preclude receiving a newer record).
https://www.rfc-editor.org/rfc/rfc9147.html#section-8-2

I assume then that this was not actually what was intended.

*Options (and non-options)*

Assuming I'm reading this right, we seem to have made a mess of things. The
sender could avoid this by only allowing one active post-handshake
transaction at a time and serializing them, at the cost of taking a
round-trip for each. But the receiver needs to account for all possible
senders, so that doesn't help. Some options that come to mind:

*1. Accept that the sender updates its keys too early*

Apart from contradicting most of the specification text, the protocol
doesn't *break* per se if you just allow the peer to switch keys early in
this buffered KeyUpdate case. We *merely* contradict all of the explanatory
text and introduce a bunch of cases that the specification suggests are
impossible. :-) Also the connection quality is poor.

The sender will use epoch N+1 at a point when the peer is still on N. But
epoch reconstruction will misread it as N-3 instead of N+1, and either way
you won't have the keys to decrypt it yet! The connection is interrupted
(with every packet discarded, because epoch reconstruction fails!) until
the peer retransmits message 5 and you catch up. Until then, not only will
you not receive application data, but you also won't receive ACKs. This
also adds a
subtle corner case on the sender side: the sender cannot discard the old
sending keys because it still has unACKed messages from the previous epoch
to retransmit, but this is not called out in section 8. Section 8 only
discusses the receiver needing to retain the old epoch.

This seems not great. Also it contradicts much of the text in the spec,
including section 8 explicitly saying this case cannot happen.
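
With the reconstruction sketch from earlier and N = 7 for concreteness:

    # The sender jumps to epoch 8 (low bits 00) while the receiver is
    # still at epoch 7. "Most recent past epoch with matching bits"
    # yields 4, i.e. N-3 rather than N+1 -- and the receiver has keys
    # for neither:
    assert reconstruct_epoch(0b00, current_epoch=7) == 4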

*2. Never ACK buffered KeyUpdates*

We can say that KeyUpdates are special and, unless you're willing to
process them immediately, you must not ACK the records containing them.
This means you might under-ACK and the peer might over-retransmit, but
that seems survivable. It also seems a little hairy to implement if you
want to avoid under-ACKing unnecessarily. You might have
NewSessionTicket(6) buffered and then receive a record with
NewSessionTicket(5) and KeyUpdate(7). That record may appear unACKable,
but it's fine, because you'll immediately process 5, then 6, then 7...
unless your NewSessionTicket processing is asynchronous, in which case it
might not be?
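
A sketch of the ACK decision this option implies, reusing the types from the earlier sketch (again, my names, not the RFC's): it has to simulate the drain to see whether a KeyUpdate would be processed immediately.

    def record_ackable(state, record_msgs):
        # Sequence numbers we could process in order, counting both the
        # reassembly buffer and this record's messages.
        seqs = set(state.buffer) | {m.message_seq for m in record_msgs}
        next_seq = state.next_receive_seq
        while next_seq in seqs:
            next_seq += 1
        # ACK only if every KeyUpdate in this record falls inside that
        # contiguous run, i.e. would be processed right away. (If message
        # processing can be asynchronous, even this is not sufficient.)
        return all(m.message_seq < next_seq
                   for m in record_msgs if m.msg_type == "KeyUpdate")

    # The case above: with NewSessionTicket(6) buffered, the record
    # carrying messages 5 and 7 drains 5, 6, 7 in one go, so it is ACKable.
    state = ReceiveState(next_receive_seq=5,
                         buffer={6: Msg(6, "NewSessionTicket")})
    assert record_ackable(state, [Msg(5, "NewSessionTicket"),
                                  Msg(7, "KeyUpdate")])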

Despite all that mess, this seems the most viable option?

*3. Declare this situation a sender error*

We could say this is not allowed and senders MUST NOT send KeyUpdate if
there are any outstanding post-handshake messages. And then the receiver
should fail with unexpected_message if it ever receives KeyUpdate at a
future message_seq. But as the RFC is already published, I don't know if
this is compatible with existing implementations.

*4. Explicit KeyUpdateAck message*

We could have made a KeyUpdateAck message to signal that you've processed a
KeyUpdate, not just sent it. But that's a protocol change and the RFC is
stamped, so it's too late now.

*5. Process KeyUpdate out of order*

We could say that the receiver doesn't buffer KeyUpdate: it just goes ahead
and processes it immediately to install epoch N+1. This seems like it would
address the issue, but it opens more cans of worms. Now the receiver needs
to keep the old epoch around not just to absorb packet reordering, but also
to pick up retransmissions of the missing handshake messages. Also, by
activating the new epoch, the receiver now allows the sender to KeyUpdate
again, and again, and again. Several epochs later, the holes in the message
stream may remain unfilled, so the old keys are still needed. Without
further protocol rules, a sender could force the receiver to retain
arbitrarily old keys. All this is, at best, a difficult case that is
unlikely to be well-tested, and, at worst, gets the implementation into
some broken state where it misbehaves badly.

*6. Post-handshake transactions aren't ordered at all*

It could be that my assumption above was wrong and the next_receive_seq
discussion in 5.2 only applies to the handshake. After all, section 5.8.4
discusses how every post-handshake transaction duplicates the "state
machine". Except it only says to duplicate the 5.8.1 state machine, and
it's ambiguous whether that includes the message_seq logic.

However, going this direction seems to very quickly make a mess. If each
post-handshake transaction handles message_seq independently, you cannot
distinguish a retransmission from a new transaction. That seems quite bad,
so presumably the intent was to use message_seq to distinguish those. (I.e.
the intent can't have been to duplicate the message_seq state.) Indeed, we
have:

> However, in DTLS 1.3 the message_seq is not reset, to allow
distinguishing a retransmission from a previously sent post-handshake
message from a newly sent post-handshake message.
https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-6

But if we distinguish with message_seq AND process transactions out of
order, now receivers need to keep track of fairly complex state in case
they process messages 5, 7, 9, 11, 13, 15, 17, ... but then only get the
even ones later. And we'd need to define some kind of sliding window for
what happens if you receive message_seq 9000 all of a sudden. And we import
all the cross-epoch problems in option 5 above. None of that is in the
text, so I assume this was not the intended reading, and I don't think we
want to go that direction. :-)

*Digression: ACK fate-sharing and flow control*

All this alludes to another quirk that isn't a problem, but is a little
non-obvious and warrants some discussion in the spec. Multiple handshake
fragments may be packed into the same record, but ACKs apply to the whole
record. If you receive a fragment for a message sequence too far into the
future, you are permitted to discard the fragment. But if you discard
*any* fragment,
you cannot ACK the record, *even if there were fragments which you did
process*. During the handshake, an implementation could avoid needing to
make this decision by knowing the maximum size of a handshake flight. After
the handshake, there is no inherent limit on how many NewSessionTickets the
peer may choose to send in a row, and no flow control.
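
One illustrative way to bound the buffer, reusing the earlier types (the cap and logic here are my own sketch, not anything in the RFC):

    MAX_BUFFERED_MESSAGES = 32  # ad-hoc cap; the spec imposes none

    def accept_record(state, record_msgs):
        # Returns whether the record may be ACKed. If the cap forces us
        # to discard any future fragment, the whole record goes unACKed,
        # even if other fragments in it were processed.
        ackable = True
        for m in sorted(record_msgs, key=lambda m: m.message_seq):
            if m.message_seq > state.next_receive_seq:
                if len(state.buffer) >= MAX_BUFFERED_MESSAGES:
                    ackable = False  # fragment discarded
                    continue
                state.buffer[m.message_seq] = m
            # (in-order and stale fragments handled as in the earlier
            # sketch)
        return ackable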

QUIC ran into a similar issue and settled on letting an implementation pick
an ad-hoc limit, after which it may either wedge the post-handshake stream
or return an error.
https://github.com/quicwg/base-drafts/issues/1834
https://github.com/quicwg/base-drafts/pull/2524

I suspect the most practical outcome for DTLS (arguably already supported
by the existing text, though not very obviously) is to say the receiver
simply refuses to ACK such records, accepting that in some weird edge cases
the receiver under-ACKs and the sender over-retransmits until things settle
down. ACKs are more tightly integrated into QUIC, so refusing to ACK a
packet due to one bad frame is less of an option there. Still, I think this
would have been worth calling out in the text.


So... did I read all this right? Did we indeed make a mess of this, or did
I miss something?

David