Re: [TLS] Transport Issues in DTLS 1.3

Eric Rescorla <ekr@rtfm.com> Fri, 26 March 2021 22:08 UTC

References: <CAM4esxR3YPoWaxU9B--oaT9r2bh_QBNH=tt0FsiUKaAT=M6_fg@mail.gmail.com>
In-Reply-To: <CAM4esxR3YPoWaxU9B--oaT9r2bh_QBNH=tt0FsiUKaAT=M6_fg@mail.gmail.com>
From: Eric Rescorla <ekr@rtfm.com>
Date: Fri, 26 Mar 2021 15:08:01 -0700
Message-ID: <CABcZeBMS5fUej0q5XhbxM5sMLQwAAyCgyAfbkTORQjvMM+jb7A@mail.gmail.com>
To: Martin Duke <martin.h.duke@gmail.com>
Cc: draft-ietf-tls-dtls13.all@ietf.org, Mark Allman <mallman@icsi.berkeley.edu>, Lars Eggert <lars@eggert.org>, Gorry Fairhurst <gorry@erg.abdn.ac.uk>, "<tls@ietf.org>" <tls@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tls/56-FTPW_h3acT1t7aC4d53orTqs>
Subject: Re: [TLS] Transport Issues in DTLS 1.3

Hi folks,

This is a combined response to Martin Duke and to Mark Allman.

Before I respond in detail I'd like to level set a bit.

First, DTLS does not provide a generic reliable bulk data transmission
capability. Rather, it provides an unreliable channel (a la UDP).
That channel is set up with a handshake protocol and DTLS provides
reliability for that protocol. However, that protocol is run
infrequently and generally involves relatively small amounts
(typically << 10KB) of data being sent. This means that we have rather
more latitude in terms of how aggressively we retransmit because
it only applies to a small fraction of the traffic.

Second, DTLS 1.2 is already widely deployed. It uses a simple "wait
for the timer to expire and retransmit everything" approach, with the
timer being doubled on each retransmission. This doesn't always
provide ideal results, but also has not caused the network to
collapse. I don't know much about how things are deployed in the IoT
setting (paging Hannes Tschofenig) but at least in the WebRTC context,
we have found the 1000ms guidance to be unduly long (as a practical
matter, video conferencing just won't work with delays over
100-200ms). Firefox uses 50ms and AIUI Chrome uses a value derived
from the ICE handshake (which is probably better because there
are certainly times where 50ms is too short).

Martin Duke's Comments:

> In Sec 5.8.2, it is a significant change from DTLS 1.2 that the
> initial timeout is dropping from 1 sec to 100ms, and this is worthy of
> some discussion. This violation of RFC8961 ought to be explored
> further. For a client first flight of one packet, it seems
> unobjectionable. However, I'm less comfortable with a potentially
> large server first flight, or a client second flight, likely leading
> to a large spurious retransmission. With large flights, not only is a
> short timeout more dangerous, but you are more likely to get an ACK in
> the event of some loss that allows you to shortcut the timer anyway
> (i.e. the cost of long timeout is smaller)

You seem to be implicitly assuming that there is individual packet
loss rather than burst loss. If the entire flight is lost, you want to
just fall back to retransmitting.

> Relatedly, in section 5.8.3 there is no specific recommendation for a
> maximum flight size at all. I would think that applications SHOULD
> have no more than 10 datagrams outstanding unless it has some OOB
> evidence of available bandwidth on the channel, in keeping with de
> facto transport best practice.

I agree that this is a reasonable change.

> Finally, I am somewhat concerned that the lack of any window reduction
> might perform poorly in constrained environments.

I'm skeptical that this is actually the case. As a practical matter,
TLS flights rarely exceed 5 packets. For instance, Fastly's data on
QUIC [0] indicates that the server's first flight (the biggest flight
in the TLS 1.3 handshake) is less than 5 packets for the vast majority
of handshakes, even without certificate compression. Given that
constrained environments have more incentive to reduce bandwidth, I
would expect their flights to typically be smaller, either via using smaller
certificates or using some of the existing techniques for reducing
handshake size such as cert compression or cached info.

> Granted, doubling
> the timeout will reduce the rate, but when retransmission is
> ack-driven there is essentially no reduction of sending rate in
> response to loss.

I don't believe this is correct. Recall that unlike TCP, there's
generally no buffer of queued packets waiting to be transmitted.
Rather, there is a fixed flight of data which must be delivered.  With
one exceptional case [1], an ACK will reflect that some but not all of
the data was delivered and processed; when retransmitting, the
sender will only retransmit the un-ACKed packets, which naturally
reduces the sending rate. Given the quite small flights in play
here, that reduction is likely to be quite substantial. For instance,
if there are three packets and 1 is ACKed, then there will
be a reduction of 1/3.
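
To make the arithmetic concrete, here is a minimal Python sketch of
ACK-driven retransmission for a single flight. The function name and the
use of small integers as record identifiers are illustrative assumptions,
not anything defined by the specification.

```python
def unacked(flight, acked):
    """Return the records of a flight that still need retransmission."""
    return [rec for rec in flight if rec not in acked]


# Hypothetical three-packet flight in which only record 1 was ACKed.
flight = [0, 1, 2]
acked = {1}

to_resend = unacked(flight, acked)

# Fraction by which the retransmission is smaller than the original flight.
reduction = 1 - len(to_resend) / len(flight)
```

Here only records 0 and 2 are sent again, so the retransmission is one
third smaller than the original flight.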

> I want to emphasize that I am not looking to fully recreate TCP here;
> some bounds on this behavior would likely be satisfactory.
>
> Here is an example of something that I think would be workable. It is
> meant to be a starting point for discussion. I've asked for some input
> from the experts in this area who may feel differently.
>
> - In general, the initial timeout is 100ms.
> - The timeout backoff is not reset after successful delivery. This
>   allows the "discovery" in bullet 1 to be safely applied to larger
>   flights.

Note that the timeout is actually only reset after successful loss-free
delivery of a flight:

   Implementations SHOULD retain the current timer value until a
   message is transmitted and acknowledged without having to
   be retransmitted, at which time the value may be
   reset to the initial value.

There seems to be some confusion here (perhaps due to bad writing).
When the text says "resets the retransmission timer" it means "re-arm
it with the current value" not "re-set it to the initial default". For
instance, suppose that I send flight 1 with retransmit timer value
T. After T seconds, I have not received anything and so I retransmit
it, doubling to 2T. After I get a response, I now send a new
flight. The timer should be 2T, not T.
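
The distinction between re-arming and resetting can be sketched in
Python as follows. This is only an illustration of the behavior described
above; the class and method names are hypothetical and the values are in
milliseconds.

```python
class RetransmitTimer:
    """Tracks the retransmit timer value across flights (illustrative only)."""

    def __init__(self, initial):
        self.initial = initial
        self.current = initial

    def on_timeout(self):
        """Timer expired: double the value; the flight is then retransmitted."""
        self.current *= 2
        return self.current

    def rearm(self):
        """Start the timer again with the current (possibly backed-off) value."""
        return self.current

    def on_flight_acked(self, was_retransmitted):
        """Go back to the initial value only after loss-free delivery."""
        if not was_retransmitted:
            self.current = self.initial
        return self.current


timer = RetransmitTimer(100)                    # T = 100 ms
timer.on_timeout()                              # flight 1 retransmitted, timer is 2T
timer.on_flight_acked(was_retransmitted=True)   # response arrives, timer stays 2T
next_flight_timer = timer.rearm()               # new flight is armed with 2T
```

In this sketch the new flight starts with a 200 ms timer, not 100 ms,
because the previous flight needed a retransmission.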

With that said, I think it would be reasonable to re-set to whatever
the measured RTT was, rather than the initial default. This would
avoid potentially resetting to an overly low default (though it's
not clear to me how this could happen because if your RTT estimate
is too low you will never get a delivery without retransmission).

> - For a first flight of > 2 packets, the sender MUST either (a) set
>   the initial timeout to 1 second OR (b) retransmit no more than 2
>   packets after timeout.
> - flights SHOULD be limited to 10 packets
> - on timeout or ack-indicated retransmission, no more than half
>   (minimum one) of the flight should be retransmitted
>
> The theory here is that it's responsive to RTTs > 100ms, but small
> flights can be more aggressive, and large flows are likely to have
> ack-driven retransmission.

I think it would be useful to distinguish two sets of concerns here:

1. That timeout-driven retransmission is too aggressive due to
   too-short timers.

2. That ACK-driven retransmission will be too aggressive (presumably
   due to the ACK indicating congestion-driven loss; if the loss
   is due to burst errors, then we want to retransmit aggressively).

On point (1), I think that the fact that we have extensive deployment
of timeout-driven retransmission in the field with short timers is
fairly strong evidence that it will not destroy the Internet and more
generally that the "retransmit the whole flight" design is safe in
this case. I certainly agree that there might be settings in which
100ms is too short. Rather than litigate the timer value, which I
agree is a judgement call, I suggest we increase the default somewhat
(250 ms? 500 ms?) and then indicate that if the application has information
that a shorter timer is appropriate, it can use one.

As far as point (2) goes, I don't think that any change is indicated
here. As I indicated above, there is a finite amount of data to
transmit and the design of the ACKs is such that you will continue to
make forward progress (and if you're not, you won't be getting
ACKs). Given the small fraction of the network traffic that will be
DTLS handshakes, the primary risk here seems to be that on a very
constrained network, you will get suboptimal performance for your
handshake, but even that should resolve in a small number of round
trips, especially if the receiver buffers out of order packets (which
you obviously want to do in a constrained network). And if you do have
random loss rather than congestion loss, backing off will have a very
negative impact on the handshake for minimal reduction in packets
transmitted [2].

With that said, given that your concern seems to be large flights,
I could maybe live with halving the *window* rather than the
size of the flight. In your example, you suggest an initial window
of 10, so this would give us 10, 5, 3, ... This would have little
practical impact on the vast majority of handshakes, but I suppose
might slightly improve things on the edge cases where you have
a large flight *and* a high congestion network.
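
A rough Python sketch of that window sequence follows. Rounding up when
halving is an assumption chosen to reproduce the 10, 5, 3 progression
above, and the floor of one packet is borrowed from the "minimum one"
wording in the quoted proposal; neither is specified text.

```python
import math


def window_sequence(initial=10, steps=5):
    """Return successive window sizes, halving (rounded up) at each step."""
    windows = [initial]
    for _ in range(steps - 1):
        # Never let the window drop below one packet.
        windows.append(max(1, math.ceil(windows[-1] / 2)))
    return windows


windows = window_sequence()
```

With the defaults this yields 10, 5, 3, 2, 1, so a flight of five packets
or fewer is unaffected by the first halving.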

Mark Allman's comments:

> A few specific things (in addition to what Gorry said, which I
> absolutely agree with):
>
>   - "Though timer values are the choice of the implementation,
>     mishandling of the timer can lead to serious congestion
>     problems"
>
>     + Gorry flagged this and I am flagging it again.  If this is
>       something that can lead to serious problems, let's not just
>       leave it to "choice of the implementation".  Especially if we
>       have some idea how to make it less problematic.

I'm not sure what you'd like here. I think the guidance in this
specification is reasonable, so I'd be happy to just remove this
text.

>   - "Implementations SHOULD use an initial timer value of 100 msec
>     (the minimum defined in RFC 6298 [RFC6298])"
>
>     + I wrote RFC 6298 and I have no idea where this is coming from!
>
>     + Even if this value of 100msec is OK for DTLS it shouldn't lean
>       on RFC 6298 because RFC 6298 doesn't say that is OK.  I.e.,
>       the parenthetical is objectively wrong.
>
>     + RFC 6298 says the INITIAL RTO should be 1sec (point (2.1) in
>       section 2).  RFC 8961 affirms this and also says the INITIAL
>       RTO should be 1sec (requirement (1) in section 4).

Yeah, I'm not sure what happened here. I could go track down the
PRs but I'll just plead editorial error. I suggest we just remove
the parenthetical because it's not helping here.

>   - "Note that a 100 msec timer is recommended rather than the
>     3-second RFC 6298 default in order to improve latency for
>     time-sensitive applications."
>
>     + Again, this mis-states RFC 6298, which says the initial RTO is
>       1sec (not 3sec).  (Previous to RFC 6298 the initial RTO was
>       3sec, which is probably where the notion comes from.  Most of
>       the purpose of RFC 6298 was to drop the initial RTO to 1sec.)

My bad. I'll fix this.

>     + This is a statement of desire, not any sort of principled
>       justification for using 100msec.  At the least this should be
>       much better argued.

See my note to Martin Duke above. What's appropriate in a very low
volume handshake protocol is different from what's appropriate in a
bulk transport protocol. With that said, as I said to Martin, I don't
think litigating the precise value is that helpful, so I propose we
just increase it to a somewhat larger value and explicitly acknowledge
that specific settings may want to use a shorter value.

>   - "The retransmit timer expires: the implementation transitions to
>     the SENDING state, where it retransmits the flight, resets the
>     retransmit timer, and returns to the WAITING state."
>
>     + Maybe this is spec sloppiness, but boy does it sound like the
>       recipe TCP used before VJCC to collapse the network.  I.e.,
>       expire and retransmit the window.  Rinse and repeat.  It may
>       be the intention is for backoff to be involved.  But, that
>       isn't what it says.

It says it elsewhere, in the section you quoted:

   a congested link.  Implementations SHOULD use an initial timer value
   of 100 msec (the minimum defined in RFC 6298 {{RFC6298}}) and double
   the value at each retransmission, up to no less than 60 seconds
   (the RFC 6298 maximum).
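
As a rough illustration of the doubling rule in that text, the following
Python sketch lists the timer value used for each transmission attempt.
Treating 60 seconds as a hard cap is an assumption made for the example
(the text only says the maximum is no less than 60 seconds), and the
function name is hypothetical.

```python
def backoff_schedule(initial_ms, max_ms=60_000, retransmissions=10):
    """Return the timer value (ms) armed before each transmission attempt."""
    timers = []
    timer = initial_ms
    for _ in range(retransmissions + 1):
        timers.append(timer)
        # Double on each retransmission, capped at the assumed maximum.
        timer = min(timer * 2, max_ms)
    return timers


schedule = backoff_schedule(100)
```

Starting from 100 ms, the values are 100, 200, 400, 800 and so on, reaching
the 60-second cap on the tenth retransmission.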

As I said to Martin, I think some of the confusion is that this
specification uses "reset" to mean both "re-arm" and "set the value back
to the initial" and depends on context to clarify that. Obviously that's
not been entirely successful, so I propose to use "re-arm" where I mean
"start a timer with the now current value".

As noted above, this piece of the retransmission algorithm is already
quite widely deployed (it was in DTLS 1.2) so I think there's a reasonably
strong presumption that it is not horribly dangerous, though concededly
suboptimal (hence the addition of ACKs in this specification).

>   - “When they have received part of a flight and do not immediately
>     receive the rest of the flight (which may be in the same UDP
>     datagram). A reasonable approach here is to set a timer for 1/4 the
>     current retransmit timer value when the first record in the flight
>     is received and then send an ACK when that timer expires.”
>
>     + Where does 1/4 come from?  Why is it "reasonable"?  This just
>       feels like a complete WAG that was pulled out of the air.

Yes, it was in fact pulled out of the air (though I did discuss it
with Ian Swett a bit). To be honest, any value here is going to be
somewhat pulled out of the air, especially because during the
handshake the retransmit timer values are incredibly imprecise,
consisting as they do of (at most) one set of samples.  In general,
this value is a compromise between ACKing too aggressively (thus
causing spurious retransmission of in-flight packets) and ACKing too
conservatively (thus causing spurious retransmission of received
packets).

If you have a different proposal, I'm certainly open to it. FWIW,
QUIC's max_ack_delay is 25ms, and that would certainly be fine with
me.
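
For a sense of scale, a small Python sketch of the 1/4 rule is shown
below. The function name is hypothetical and the values are in
milliseconds.

```python
def ack_delay_ms(retransmit_timer_ms, fraction=0.25):
    """Delay before ACKing a partially received flight."""
    return retransmit_timer_ms * fraction


initial_delay = ack_delay_ms(100)      # with the 100 ms initial timer
backed_off_delay = ack_delay_ms(400)   # after two doublings of the timer
```

With a 100 ms retransmit timer the 1/4 rule gives 25 ms, the same figure
as the QUIC value mentioned above; unlike a fixed value, it grows as the
retransmit timer backs off.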

-Ekr

[0] https://www.fastly.com/blog/quic-handshake-tls-compression-certificates-extension-study
[1] When SH is lost.
[2] In fact, there will be *more* packets transmitted because you now will
have ACKs for each chunk of the flight, though of course they will be
transmitted over a longer time scale.

On Thu, Mar 25, 2021 at 9:51 AM Martin Duke <martin.h.duke@gmail.com> wrote:

> Hello all,
>
> The outcome of the telechat was that I agreed to start a thread on how to
> fix the significant transport issues with the DTLS 1.3 draft. If I am
> correct, there was no early TCPM or TSVWG review. A major protocol with
> significant transport-layer functionality would benefit from such review in
> the future.
>
> *Who is in this thread*:
>
> For easy reference, here is my DISCUSS, which goes so far as to express a
> straw man design that would come closer to addressing the concerns:
> https://mailarchive.ietf.org/arch/msg/tls/3g20CQkKWPGX-BAqfuEagR2ppGY/
>
> Besides TLSWG, I've added Lars (RFC8085
> <https://datatracker.ietf.org/doc/rfc8085/>), Mark Allman (RFC8961
> <https://datatracker.ietf.org/doc/rfc8961/>), and Gorry Fairhurst (also
> RFC8085). Mark and Gorry have already sent me private comments that I
> invite them to resend here. To summarize briefly, they amplified my
> DISCUSS, made the new point that 8085 is directly relevant here, and are
> concerned there aren't enough MUSTs.
>
> If people think there would be value in advertising this thread to the
> TCPM and TSVWG working groups, I can do so, at the risk of introducing more
> ancillary document churn.
>
> *Suggested plan:*
>
> Anyway, as a first step perhaps we can have Mark, Gorry, and Lars add
> anything they'd like and then invite the draft authors to either make a
> proposal or push back. If there are non-kosher things that DTLS 1.2 has
> done with no observable problems, that would be an interesting data point:
> within limits, introducing a latency regression into DTLS 1.3 would be
> perverse.
>
> DTLS is a very important protocol and it is worth the time to get these
> things right.
>
> Thanks,
> Martin Duke
> Transport AD
>
