Re: [TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3

"Tschofenig, Hannes" <hannes.tschofenig@siemens.com> Tue, 16 April 2024 09:19 UTC

From: "Tschofenig, Hannes" <hannes.tschofenig@siemens.com>
To: David Benjamin <davidben@chromium.org>, "<tls@ietf.org>" <tls@ietf.org>
CC: Nick Harper <nharper@chromium.org>
Thread-Topic: [TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3
Date: Tue, 16 Apr 2024 09:18:54 +0000
Message-ID: <AS8PR10MB7427BA4C60115664B4571C0BEE082@AS8PR10MB7427.EURPRD10.PROD.OUTLOOK.COM>
References: <CAF8qwaCAJif0SA+uyZ=vGUZ29bwrFNL2jrS9wTOxjxaA2JLOaw@mail.gmail.com> <CAF8qwaDnvOAeQWMCprs=2xqaFFn7saBQg9mAVwXTSo1MGhfWcA@mail.gmail.com>
In-Reply-To: <CAF8qwaDnvOAeQWMCprs=2xqaFFn7saBQg9mAVwXTSo1MGhfWcA@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tls/Y0BUDO1EKTEfWbEb0k7oOfmcrHg>

Hi David,

This is great feedback. Give me a few days to respond to this issue with my suggestion for moving forward.

Ciao
Hannes

From: TLS <tls-bounces@ietf.org> On Behalf Of David Benjamin
Sent: Saturday, April 13, 2024 7:59 PM
To: <tls@ietf.org> <tls@ietf.org>
Cc: Nick Harper <nharper@chromium.org>
Subject: Re: [TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3

Another issue with DTLS 1.3's state machine duplication scheme:

Section 8 says implementations must not send a new KeyUpdate until the previous KeyUpdate is ACKed, but it says nothing about other post-handshake messages. Suppose KeyUpdate(5) is in flight and the implementation decides to send NewSessionTicket. (E.g. the application called some "send NewSessionTicket" API.) The new epoch doesn't exist yet, so naively one would start sending NewSessionTicket(6) in the current epoch. Now the peer ACKs KeyUpdate(5), so we transition to the new epoch. But retransmissions must retain their original epoch:

> Implementations MUST send retransmissions of lost messages using the same epoch and keying material as the original transmission.
https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-3

This means we must keep sending the NST at the old epoch. But the peer may have no idea there's a message at that epoch due to packet loss! Section 8 does ask the peer to keep the old epoch around for a spell, but eventually the peer will discard the old epoch. If NST(6) didn't get through before then, the entire post-handshake stream is now wedged!
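To make the wedge concrete, here's a toy sender model. This is purely a hypothetical sketch (the Sender class, message numbers, and epoch values are all made up, not any real DTLS stack), showing how the retransmission-epoch rule pins NST(6) to the old epoch:

```python
# Toy model of the RFC 9147 Section 4.2.1 rule: retransmissions keep
# their original epoch. All names and numbers here are illustrative.

class Sender:
    def __init__(self):
        self.epoch = 3     # current sending epoch
        self.unacked = {}  # message_seq -> epoch the message was sent in
        self.pending_key_update = None

    def send(self, seq, is_key_update=False):
        self.unacked[seq] = self.epoch
        if is_key_update:
            self.pending_key_update = seq
        return (seq, self.epoch)

    def on_ack(self, seq):
        del self.unacked[seq]
        if seq == self.pending_key_update:
            self.epoch += 1  # ACK of the KeyUpdate activates the next epoch

    def retransmit(self, seq):
        # Retransmissions MUST use the epoch of the original transmission.
        return (seq, self.unacked[seq])

sender = Sender()
sender.send(5, is_key_update=True)  # KeyUpdate(5) at epoch 3
sender.send(6)                      # NewSessionTicket(6), naively also epoch 3
sender.on_ack(5)                    # peer ACKs the KeyUpdate

assert sender.epoch == 4               # new records now use epoch 4...
assert sender.retransmit(6) == (6, 3)  # ...but NST(6) is pinned to epoch 3
# Once the peer discards epoch 3's keys, (6, epoch 3) can never be
# delivered: the post-handshake stream is wedged.
```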

I think this means we need to amend Section 8 to forbid sending *any* post-handshake message after KeyUpdate. That is, rather than saying you cannot send a new KeyUpdate, a KeyUpdate terminates the post-handshake stream at that epoch and all new post-handshake messages, be they KeyUpdate or anything else, must be enqueued for the new epoch. This is a little unfortunate because a TLS library which transparently KeyUpdates will then inadvertently introduce hiccups where post-handshake messages triggered by the application, like post-handshake auth, are blocked.

That then suggests some more options for fixing the original problem.

7. Fix the sender's KeyUpdate criteria

We tell the sender to wait for all previous messages to be ACKed too. Fix the first paragraph of section 8 to say:

> As with other handshake messages with no built-in response, KeyUpdates MUST be acknowledged. Acknowledgements are used to both control retransmission and transition to the next epoch. Implementations MUST NOT send records with the new keys until the KeyUpdate and all preceding messages have been acknowledged. This facilitates epoch reconstruction (Section 4.2.2) and avoids too many epochs in active use, by ensuring the peer has processed the KeyUpdate and started receiving at the new epoch.
>
> A KeyUpdate message terminates the post-handshake stream in an epoch. After sending KeyUpdate in an epoch, implementations MUST NOT send any new post-handshake messages in that epoch. Note that, if the implementation has sent KeyUpdate but is waiting for an ACK, the next epoch is not yet active. In this case, subsequent post-handshake messages may not be sent until receiving the ACK.
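In sender pseudocode, the amended rule might look like this. To be clear, this is a sketch with assumed names (PostHandshakeSender, may_use_new_epoch) rather than anything normative:

```python
# Sketch of the amended sender rule: new-epoch records may be sent only
# once the KeyUpdate *and every preceding post-handshake message* have
# been ACKed. Names and numbering are illustrative assumptions.

class PostHandshakeSender:
    def __init__(self):
        self.next_seq = 5
        self.unacked = set()
        self.key_update_seq = None

    def queue(self, is_key_update=False):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.add(seq)
        if is_key_update:
            self.key_update_seq = seq
        return seq

    def on_ack(self, seq):
        self.unacked.discard(seq)

    def may_use_new_epoch(self):
        ku = self.key_update_seq
        if ku is None or ku in self.unacked:
            return False
        # Every message preceding the KeyUpdate must be ACKed too.
        return all(seq > ku for seq in self.unacked)

s = PostHandshakeSender()
nst = s.queue()                   # NewSessionTicket(5)
ku = s.queue(is_key_update=True)  # KeyUpdate(6)
s.on_ack(ku)
assert not s.may_use_new_epoch()  # NST(5) still outstanding
s.on_ack(nst)
assert s.may_use_new_epoch()      # now all preceding messages are ACKed
```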

And then on the receiver side, we leave things as-is. If the sender implemented the old semantics AND had multiple post-handshake transactions in parallel, it might update keys too early and then we get into the situation described in (1). We then declare that, if this happens, and the sender gets confused as a result, that's the sender's fault. Hopefully this is rare enough (did anyone even implement 5.8.4, or does everyone just serialize their post-handshake transactions?) not to be a serious protocol break? That risk aside, this option seems the most in spirit with the current design to me.

8. Decouple post-handshake retransmissions from epochs

If we instead say that the same epoch rule only applies for the handshake, and not post-handshake messages, I think option 5 (process KeyUpdate out of order) might become viable? I'm not sure. Either way, this seems like a significant protocol break, so I don't think this is an option until some hypothetical DTLS 1.4.


On Fri, Apr 12, 2024 at 6:59 PM David Benjamin <davidben@chromium.org> wrote:
Hi all,

This is going to be a bit long. In short, DTLS 1.3 KeyUpdates seem to conflate the peer receiving the KeyUpdate with the peer processing the KeyUpdate, in ways that appear to break some assumptions made by the protocol design.

When to switch keys in KeyUpdate

So, first, DTLS 1.3, unlike TLS 1.3, applies the KeyUpdate on the ACK, not when the KeyUpdate is sent. This makes sense because KeyUpdate records are not intrinsically ordered with app data records sent after them:

> As with other handshake messages with no built-in response, KeyUpdates MUST be acknowledged. In order to facilitate epoch reconstruction (Section 4.2.2), implementations MUST NOT send records with the new keys or send a new KeyUpdate until the previous KeyUpdate has been acknowledged (this avoids having too many epochs in active use).
https://www.rfc-editor.org/rfc/rfc9147.html#section-8-1

Now, the parenthetical says this is to avoid having too many epochs in active use, but it appears that there are stronger assumptions on this:

> After the handshake is complete, if the epoch bits do not match those from the current epoch, implementations SHOULD use the most recent *past* epoch which has matching bits, and then reconstruct the sequence number for that epoch as described above.
https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.2-3
(emphasis mine)
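As a toy illustration of that reconstruction rule (simplified: just the 2-bit epoch field and a linear backwards scan, not real header parsing), note how a *future* epoch's bits get misread as a past epoch:

```python
# Toy sketch of Section 4.2.2 epoch reconstruction: given the 2 epoch
# bits in the unified header, pick the most recent *past* epoch whose
# low bits match. Purely illustrative, not a real implementation.

def reconstruct_epoch(epoch_bits, current_epoch):
    # Scan backwards from the current epoch for a match on the low 2 bits.
    for epoch in range(current_epoch, -1, -1):
        if epoch & 0b11 == epoch_bits:
            return epoch
    return None

# At current epoch 4, bits 0b01 resolve backwards to epoch 1...
assert reconstruct_epoch(0b01, 4) == 1
# ...so a record sent from a *future* epoch 5 (bits 5 & 0b11 == 0b01) is
# misread as epoch 1, i.e. N+1 misread as N-3 for N = 4.
assert reconstruct_epoch(5 & 0b11, 4) == 1
```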

> After the handshake, implementations MUST use the highest available sending epoch [to send ACKs]
https://www.rfc-editor.org/rfc/rfc9147.html#section-7-7

These two snippets imply the protocol wants the peer to definitely have installed the new keys before you start using them. This makes sense because sending stuff the peer can't decrypt is pretty silly. As an aside, DTLS 1.3 retains this text from DTLS 1.2:

> Conversely, it is possible for records that are protected with the new epoch to be received prior to the completion of a handshake. For instance, the server may send its Finished message and then start transmitting data. Implementations MAY either buffer or discard such records, though when DTLS is used over reliable transports (e.g., SCTP [RFC4960]), they SHOULD be buffered and processed once the handshake completes.
https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-2

The text from DTLS 1.2 talks about *a* handshake, which presumably refers to rekeying via renegotiation. But in DTLS 1.3, the epoch reconstruction rule and the KeyUpdate rule mean this is only possible during the handshake, when you see epoch 4 and expect epoch 0-3. The steady state rekeying mechanism never hits this case. (This is a reasonable change because there's no sense in unnecessarily introducing blips where the connection is less tolerant of reordering.)

Buffered handshake messages

Okay, so KeyUpdates want to wait for the recipient to install keys, except we don't seem to actually achieve this! Section 5.2 says:

> DTLS implementations maintain (at least notionally) a next_receive_seq counter. This counter is initially set to zero. When a handshake message is received, if its message_seq value matches next_receive_seq, next_receive_seq is incremented and the message is processed. If the sequence number is less than next_receive_seq, the message MUST be discarded. If the sequence number is greater than next_receive_seq, the implementation SHOULD queue the message but MAY discard it. (This is a simple space/bandwidth trade-off).
https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-7

I assume this is intended to apply to post-handshake messages too. (See below for a discussion of the alternative.) But that means that, when you receive a KeyUpdate, you might not immediately process it. Suppose next_receive_seq is 5, and the peer sends NewSessionTicket(5), NewSessionTicket(6), and KeyUpdate(7). 5 is lost, but 6 and 7 come in, perhaps even in the same record, which means that you're forced to ACK both or neither. But suppose the implementation is willing to buffer 3 messages ahead, so it ACKs the 6+7 record, per the rules in section 7, which permit ACKing fragments that were buffered and not yet processed.
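As a sketch, the Section 5.2 rules (with an assumed willingness to buffer 3 messages ahead) look like this; the point is the gap where messages 6 and 7 are ACKable yet unprocessed:

```python
# Sketch of the Section 5.2 next_receive_seq rules. The buffer limit and
# message contents are assumptions for illustration.

class Receiver:
    def __init__(self, buffer_limit=3):
        self.next_receive_seq = 5
        self.buffered = {}   # seq -> message
        self.processed = []
        self.buffer_limit = buffer_limit

    def on_message(self, seq, msg):
        if seq < self.next_receive_seq:
            return False                       # old: MUST discard
        if seq > self.next_receive_seq:
            if seq >= self.next_receive_seq + self.buffer_limit:
                return False                   # too far ahead: MAY discard
            self.buffered[seq] = msg           # SHOULD queue
            return True                        # buffered => ACKable
        # seq == next_receive_seq: process it, then drain the buffer.
        self.processed.append(msg)
        self.next_receive_seq += 1
        while self.next_receive_seq in self.buffered:
            self.processed.append(self.buffered.pop(self.next_receive_seq))
            self.next_receive_seq += 1
        return True

r = Receiver()
# NST(5) is lost; NST(6) and KeyUpdate(7) arrive and are buffered...
assert r.on_message(6, "NST(6)") and r.on_message(7, "KeyUpdate(7)")
assert r.processed == []   # ...so they are ACKed but *not yet processed*
r.on_message(5, "NST(5)")  # the retransmission fills the hole
assert r.processed == ["NST(5)", "NST(6)", "KeyUpdate(7)"]
```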

That means the peer will switch keys and now all subsequent records from them will come from epoch N+1. But the sender is not ready for N+1 yet, so we contradict everything above. We also contradict this parenthetical in section 8:

> Due to loss and/or reordering, DTLS 1.3 implementations may receive a record with an older epoch than the current one (the requirements above preclude receiving a newer record).
https://www.rfc-editor.org/rfc/rfc9147.html#section-8-2

I assume then that this was not actually what was intended.

Options (and non-options)

Assuming I'm reading this right, we seem to have made a mess of things. The sender could avoid this by only allowing one active post-handshake transaction at a time and serializing them, at the cost of taking a round-trip for each. But the receiver needs to account for all possible senders, so that doesn't help. Some options that come to mind:

1. Accept that the sender updates its keys too early

Apart from contradicting most of the specification text, the protocol doesn't break per se if you just allow the peer to switch keys early in this buffered KeyUpdate case. We merely contradict all of the explanatory text and introduce a bunch of cases that the specification suggests are impossible. :-) Also the connection quality is poor.

The sender will use epoch N+1 at a point when the peer is on N. But epoch reconstruction will misread it as N-3 instead of N+1, and either way you won't have the keys to decrypt it yet! The connection is interrupted (and with all packets discarded because epoch reconstruction fails!) until the peer retransmits 5 and you catch up. Until then, not only will you not receive application data, but you also won't receive ACKs. This also adds a subtle corner case on the sender side: the sender cannot discard the old sending keys because it still has unACKed messages from the previous epoch to retransmit, but this is not called out in section 8. Section 8 only discusses the receiver needing to retain the old epoch.

This seems not great. Also it contradicts much of the text in the spec, including section 8 explicitly saying this case cannot happen.

2. Never ACK buffered KeyUpdates

We can say that KeyUpdates are special and, unless you're willing to process them immediately, you must not ACK the records containing them. This means you might under-ACK and the peer might over-retransmit, but that seems not fatal. This also seems a little hairy to implement if you want to avoid under-ACKing unnecessarily. You might have message NewSessionTicket(6) buffered and then receive a record with NewSessionTicket(5) and KeyUpdate(7). That record may appear unACKable, but it's fine because you'll immediately process 5, then 6, then 7... unless your NewSessionTicket processing is asynchronous, in which case it might not be?
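A sketch of what that ACK decision might look like (a hypothetical helper; the gap check is my guess at what "willing to process immediately" means, ignoring the asynchronous-processing caveat):

```python
# Sketch of option 2's ACK rule: a record carrying a KeyUpdate is ACKable
# only if the KeyUpdate will be processed immediately, i.e. there are no
# unfilled gaps before it. All names here are illustrative.

def record_ackable(next_receive_seq, buffered_seqs, fragments):
    """fragments: (message_seq, is_key_update) pairs carried by one record."""
    have = set(buffered_seqs) | {seq for seq, _ in fragments}
    for seq, is_key_update in fragments:
        if is_key_update:
            # The KeyUpdate processes immediately only if every message
            # from next_receive_seq up to it is already in hand.
            if any(s not in have for s in range(next_receive_seq, seq)):
                return False
    return True

# NST(6) already buffered; a record with NST(5) + KeyUpdate(7) arrives.
# Messages 5 and 6 are in hand, so 7 processes immediately: ACKable.
assert record_ackable(5, {6}, [(5, False), (7, True)])
# But a record carrying only KeyUpdate(7) while 5 is still missing is not.
assert not record_ackable(5, {6}, [(7, True)])
```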

Despite all that mess, this seems the most viable option?

3. Declare this situation a sender error

We could say this is not allowed and senders MUST NOT send KeyUpdate if there are any outstanding post-handshake messages. And then the receiver should fail with unexpected_message if it ever receives KeyUpdate at a future message_seq. But as the RFC is already published, I don't know if this is compatible with existing implementations.

4. Explicit KeyUpdateAck message

We could have made a KeyUpdateAck message to signal that you've processed a KeyUpdate, not just sent it. But that's a protocol change and the RFC is stamped, so it's too late now.

5. Process KeyUpdate out of order

We could say that the receiver doesn't buffer KeyUpdate. It just goes ahead and processes it immediately to install epoch N+1. This seems like it would address the issue but opens more cans of worms. Now the receiver needs to keep the old epoch around not just for packet reorder, but also to pick up the retransmissions of the missing handshake messages. Also, by activating the new epoch, the receiver now allows the sender to KeyUpdate again, and again, and again. But, several epochs later, the holes in the message stream may remain unfilled, so we still need the old keys. Without further protocol rules, a sender could force the receiver to keep keys arbitrarily many epochs back. All this is, at best, a difficult case that is unlikely to be well-tested, and at worst gets the implementation into some broken state where it misbehaves badly.
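A toy model of the hazard (the class and its bookkeeping are assumptions, purely illustrative):

```python
# Toy model of option 5's hazard: processing KeyUpdates out of order lets
# the sender chain epochs while gaps persist, so the receiver must retain
# every epoch whose retransmissions might still arrive.

class OutOfOrderReceiver:
    def __init__(self):
        self.current_epoch = 3
        self.live_epochs = {3}  # epochs whose keys must be kept
        self.gaps = set()       # unfilled message_seq holes

    def on_key_update(self, seq, expected_seq):
        # Process the KeyUpdate immediately, even past a gap...
        self.gaps.update(range(expected_seq, seq))
        self.current_epoch += 1
        self.live_epochs.add(self.current_epoch)
        # ...but old epochs can't be dropped while gaps remain, since the
        # retransmissions filling them arrive under their original epochs.

r = OutOfOrderReceiver()
r.on_key_update(7, expected_seq=5)  # KeyUpdate(7); 5 and 6 still missing
r.on_key_update(9, expected_seq=8)  # KeyUpdate(9); 8 still missing
assert r.current_epoch == 5
assert r.live_epochs == {3, 4, 5}   # three epochs' keys pinned at once
```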

6. Post-handshake transactions aren't ordered at all

It could be that my assumption above was wrong and the next_receive_seq discussion in 5.2 only applies to the handshake. After all, section 5.8.4 discusses how every post-handshake transaction duplicates the "state machine". Except it only says to duplicate the 5.8.1 state machine, and it's ambiguous whether that includes the message_seq logic.

However, going this direction seems to very quickly make a mess. If each post-handshake transaction handles message_seq independently, you cannot distinguish a retransmission from a new transaction. That seems quite bad, so presumably the intent was to use message_seq to distinguish those. (I.e. the intent can't have been to duplicate the message_seq state.) Indeed, we have:

> However, in DTLS 1.3 the message_seq is not reset, to allow distinguishing a retransmission from a previously sent post-handshake message from a newly sent post-handshake message.
https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-6

But if we distinguish with message_seq AND process transactions out of order, now receivers need to keep track of fairly complex state in case they process messages 5, 7, 9, 11, 13, 15, 17, ... but then only get the even ones later. And we'd need to define some kind of sliding window for what happens if you receive message_seq 9000 all of a sudden. And we import all the cross-epoch problems in option 5 above. None of that is in the text, so I assume this was not the intended reading, and I don't think we want to go that direction. :-)

Digression: ACK fate-sharing and flow control

All this alludes to another quirk that isn't a problem, but is a little non-obvious and warrants some discussion in the spec. Multiple handshake fragments may be packed into the same record, but ACKs apply to the whole record. If you receive a fragment for a message sequence too far into the future, you are permitted to discard the fragment. But if you discard any fragment, you cannot ACK the record, even if there were fragments which you did process. During the handshake, an implementation could avoid needing to make this decision by knowing the maximum size of a handshake flight. After the handshake, there is no inherent limit on how many NewSessionTickets the peer may choose to send in a row, and no flow control.
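For illustration, the fate-sharing decision with an ad-hoc limit (the BUFFER_WINDOW value is an assumption, echoing the QUIC resolution below) might look like:

```python
# Sketch of the fate-sharing rule: an ACK covers a whole record, so a
# record may be ACKed only if *every* fragment in it was processed or
# buffered. The window size is an assumed ad-hoc limit.

BUFFER_WINDOW = 8

def should_ack(next_receive_seq, fragment_seqs):
    def accepted(seq):
        return seq < next_receive_seq + BUFFER_WINDOW
    # Discarding any fragment (too far ahead) forbids ACKing the record,
    # even if the other fragments were accepted.
    return all(accepted(seq) for seq in fragment_seqs)

assert should_ack(5, [5, 6])         # both fragments accepted: ACK
assert not should_ack(5, [6, 9000])  # 9000 discarded: no ACK, so the
                                     # sender over-retransmits 6 until it
                                     # learns 6 arrived via a later ACK
```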

QUIC ran into a similar issue here and said an implementation can choose an ad-hoc limit, after which it can choose to either wedge the post-handshake stream or return an error.
https://github.com/quicwg/base-drafts/issues/1834
https://github.com/quicwg/base-drafts/pull/2524

I suspect the most practical outcome for DTLS (and arguably already supported by the existing text, but not very obviously) is to instead say the receiver just refuses to ACK stuff and, okay, maybe in some weird edge cases the receiver under-ACKs and the sender over-retransmits, until things settle down. ACKs are a bit more tightly integrated with QUIC, so refusing to ACK a packet due to one bad frame was less of an option there. Still, I think this would have been worth calling out in the text.


So... did I read all this right? Did we indeed make a mess of this, or did I miss something?

David