Re: [TLS] Transport Issues in DTLS 1.3

Eric Rescorla <ekr@rtfm.com> Fri, 26 March 2021 23:01 UTC

References: <CAM4esxR3YPoWaxU9B--oaT9r2bh_QBNH=tt0FsiUKaAT=M6_fg@mail.gmail.com> <CABcZeBMS5fUej0q5XhbxM5sMLQwAAyCgyAfbkTORQjvMM+jb7A@mail.gmail.com>
In-Reply-To: <CABcZeBMS5fUej0q5XhbxM5sMLQwAAyCgyAfbkTORQjvMM+jb7A@mail.gmail.com>
From: Eric Rescorla <ekr@rtfm.com>
Date: Fri, 26 Mar 2021 16:00:54 -0700
Message-ID: <CABcZeBNcBCi56t4sjHPOeKhdGSecC+TBQffuYsQisPaLvoUYdA@mail.gmail.com>
To: Martin Duke <martin.h.duke@gmail.com>
Cc: draft-ietf-tls-dtls13.all@ietf.org, Mark Allman <mallman@icsi.berkeley.edu>, Lars Eggert <lars@eggert.org>, Gorry Fairhurst <gorry@erg.abdn.ac.uk>, "<tls@ietf.org>" <tls@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tls/sPsjInRDVS8LtduCVtbDh8f8gIw>
Subject: Re: [TLS] Transport Issues in DTLS 1.3
Precedence: list

On Fri, Mar 26, 2021 at 3:08 PM Eric Rescorla <ekr@rtfm.com> wrote:

> Hi folks,
>
> This is a combined response to Martin Duke and to Mark Allman.
>
> Before I respond in detail I'd like to level set a bit.
>
> First, DTLS does not provide a generic reliable bulk data transmission
> capability. Rather, it provides an unreliable channel (a la UDP).
> That channel is set up with a handshake protocol and DTLS provides
> reliability for that protocol. However, that protocol is run
> infrequently and generally involves relatively small amounts
> (typically << 10KB) of data being sent. This means that we have rather
> more latitude in terms of how aggressively we retransmit because
> it only applies to a small fraction of the traffic.
>
> Second, DTLS 1.2 is already widely deployed. It uses a simple "wait
> for the timer to expire and retransmit everything" approach, with the
> timer being doubled on each retransmission. This doesn't always
> provide ideal results, but also has not caused the network to
> collapse. I don't know much about how things are deployed in the IoT
> setting (paging Hannes Tschofenig) but at least in the WebRTC context,
> we have found the 1000ms guidance to be unduly long (as a practical
> matter, video conferencing just won't work with delays over
> 100-200ms). Firefox uses 50ms and AIUI Chrome uses a value derived
> from the ICE handshake (which is probably better because there
> are certainly times where 50ms is too short).
>
>
>
> Martin Duke's Comments:
>
> > In Sec 5.8.2, it is a significant change from DTLS 1.2 that the
> > initial timeout is dropping from 1 sec to 100ms, and this is worthy of
> > some discussion. This violation of RFC8961 ought to be explored
> > further. For a client first flight of one packet, it seems
> > unobjectionable. However, I'm less comfortable with a potentially
> > large server first flight, or a client second flight, likely leading
> > to a large spurious retransmission. With large flights, not only is a
> > short timeout more dangerous, but you are more likely to get an ACK in
> > the event of some loss that allows you to shortcut the timer anyway
> > (i.e. the cost of long timeout is smaller)
>
> You seem to be implicitly assuming that there is individual packet
> loss rather than burst loss. If the entire flight is lost, you want to
> just fall back to retransmitting.
>
>
> > Relatedly, in section 5.8.3 there is no specific recommendation for a
> > maximum flight size at all. I would think that applications SHOULD
> > have no more than 10 datagrams outstanding unless it has some OOB
> > evidence of available bandwidth on the channel, in keeping with de
> > facto transport best practice.
>
> I agree that this is a reasonable change.
>
>
> > Finally, I am somewhat concerned that the lack of any window reduction
> > might perform poorly in constrained environments.
>
> I'm skeptical that this is actually the case. As a practical matter,
> TLS flights rarely exceed 5 packets. For instance, Fastly's data on
> QUIC [0] indicates that the server's first flight (the biggest flight
> in the TLS 1.3 handshake) is less than 5 packets for the vast majority
> of handshakes, even without certificate compression. Given that
> constrained environments have more incentive to reduce bandwidth, I
> would expect them to typically be smaller, either via using smaller
> certificates or using some of the existing techniques for reducing
> handshake size such as cert compression or cached info.
>
>
>
> > Granted, doubling
> > the timeout will reduce the rate, but when retransmission is
> > ack-driven there is essentially no reduction of sending rate in
> > response to loss.
>
> I don't believe this is correct. Recall that unlike TCP, there's
> generally no buffer of queued packets waiting to be transmitted.
> Rather, there is a fixed flight of data which must be delivered.  With
> one exceptional case [1], an ACK will reflect that some but not all of
> the data was delivered and processed; when retransmitting, the
> sender will only retransmit the un-ACKed packets, which naturally
> reduces the sending rate. Given the quite small flights in play
> here, that reduction is likely to be quite substantial. For instance,
> if there are three packets and 1 is ACKed, then there will
> be a reduction of 1/3.
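
The three-packet example above can be sketched as follows. This is a minimal illustration, not the DTLS 1.3 algorithm as specified: the function name and the integer record numbers are hypothetical, and it assumes the ACK simply lists the records that were received.

```python
from fractions import Fraction


def records_to_retransmit(flight, acked):
    """Return the records of a flight that have not been ACKed, in order."""
    acked = set(acked)
    return [record for record in flight if record not in acked]


# Hypothetical record numbers for a three-packet flight; record 1 was ACKed.
flight = [0, 1, 2]
pending = records_to_retransmit(flight, acked=[1])

# Fraction of the flight that is no longer sent on retransmission.
reduction = Fraction(len(flight) - len(pending), len(flight))

print(pending)    # [0, 2]
print(reduction)  # 1/3
```

Because the flight is fixed and there is no queue of new data behind it, each ACK can only shrink what is sent next.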
>
>
> > I want to emphasize that I am not looking to fully recreate TCP here;
> > some bounds on this behavior would likely be satisfactory.
> >
> > Here is an example of something that I think would be workable. It is
> > meant to be a starting point for discussion. I've asked for some input
> > from the experts in this area who may feel differently.
> >
> > - In general, the initial timeout is 100ms.
> > - The timeout backoff is not reset after successful delivery. This
> >   allows the "discovery" in bullet 1 to be safely applied to larger
> >   flights.
>
> Note that the timeout is actually only reset after successful loss-free
> delivery of a flight:
>
>    Implementations SHOULD retain the current timer value until a
>    message is transmitted and acknowledged without having to
>    be retransmitted, at which time the value may be
>    reset to the initial value.
>
> There seems to be some confusion here (perhaps due to bad writing).
> When the text says "resets the retransmission timer" it means "re-arm
> it with the current value" not "re-set it to the initial default". For
> instance, suppose that I send flight 1 with retransmit timer value
> T. After T seconds, I have not received anything and so I retransmit
> it, doubling to 2T. After I get a response, I now send a new
> flight. The timer should be 2T, not T.
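
The "re-arm" versus "re-set" distinction above can be sketched as a small state machine. This is a sketch under stated assumptions: the class and method names are hypothetical, values are in milliseconds, and the cap on doubling is left out for brevity.

```python
class RetransmitTimer:
    """Tracks the retransmit timer value across handshake flights."""

    def __init__(self, initial_ms):
        self.initial_ms = initial_ms
        self.current_ms = initial_ms
        self.retransmitted = False

    def on_timeout(self):
        """Timer expired: double the value; the caller retransmits and re-arms."""
        self.current_ms *= 2
        self.retransmitted = True
        return self.current_ms

    def on_flight_delivered(self):
        """Flight acknowledged: go back to the initial value only if the
        flight was delivered without any retransmission."""
        if not self.retransmitted:
            self.current_ms = self.initial_ms
        self.retransmitted = False
        return self.current_ms


timer = RetransmitTimer(initial_ms=100)   # T = 100 ms, chosen for illustration
timer.on_timeout()                        # flight 1 is lost once: 2T
after_flight_1 = timer.on_flight_delivered()
after_flight_2 = timer.on_flight_delivered()  # flight 2 arrives loss-free

print(after_flight_1)  # 200: the next flight is armed with 2T, not T
print(after_flight_2)  # 100: reset only after a loss-free delivery
```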
>
> With that said, I think it would be reasonable to re-set to whatever
> the measured RTT was, rather than the initial default. This would
> avoid potentially resetting to an overly low default (though it's
> not clear to me how this could happen because if your RTT estimate
> is too low you will never get a delivery without retransmission).
>

NM on this piece. I see how that happens and I think it's fine to reset
to "measured RTT" whatever that is. As I said, in practice this situation
is very rare with DTLS because there are so few handshake flights.

-Ekr


>
> > - For a first flight of > 2 packets, the sender MUST either (a) set
> >   the initial timeout to 1 second OR (b) retransmit no more than 2
> >   packets after timeout.
> > - flights SHOULD be limited to 10 packets
> > - on timeout or ack-indicated retransmission, no more than half
> >   (minimum one) of the flight should be retransmitted
> >
> > The theory here is that it's responsive to RTTs > 100ms, but small
> > flights can be more aggressive, and large flows are likely to have
> > ack-driven retransmission.
>
> I think it would be useful to distinguish two sets of concerns here:
>
> 1. That timeout-driven retransmission is too aggressive due to
>    too-short timers.
>
> 2. That ACK-driven retransmission will be too aggressive (presumably
>    due to the ACK indicating congestion-driven loss; if the loss
>    is due to burst errors, then we want to retransmit aggressively).
>
> On point (1), I think that the fact that we have extensive deployment
> of timeout-driven retransmission in the field with short timers is
> fairly strong evidence that it will not destroy the Internet and more
> generally that the "retransmit the whole flight" design is safe in
> this case. I certainly agree that there might be settings in which
> 100ms is too short. Rather than litigate the timer value, which I
> agree is a judgement call, I suggest we increase the default somewhat
> (250ms? 500ms?) and then indicate that if the application has information
> that a shorter timer is appropriate, it can use one.
>
> As far as point (2) goes, I don't think that any change is indicated
> here. As I indicated above, there is a finite amount of data to
> transmit and the design of the ACKs is such that you will continue to
> make forward progress (and if you're not, you won't be getting
> ACKs). Given the small fraction of the network traffic that will be
> DTLS handshakes, the primary risk here seems to be that on a very
> constrained network, you will get suboptimal performance for your
> handshake, but even that should resolve in a small number of round
> trips, especially if the receiver buffers out of order packets (which
> you obviously want to do in a constrained network). And if you do have
> random loss rather than congestion loss, backing off will have a very
> negative impact on the handshake for minimal reduction in packets
> transmitted [2].
>
> With that said, given that your concern seems to be large flights,
> I could maybe live with halving the *window* rather than the
> size of the flight. In your example, you suggest an initial window
> of 10, so this would give us 10, 5, 3, ... This would have little
> practical impact on the vast majority of handshakes, but I suppose
> might slightly improve things on the edge cases where you have
> a large flight *and* a high congestion network.
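
The "10, 5, 3, ..." sequence above can be reproduced with a short sketch. It assumes that halving rounds up (inferred from 5 going to 3) and that the window never drops below one packet (taken from the "minimum one" in Martin's proposal); the function name is hypothetical.

```python
def halved_windows(initial, steps):
    """Return the window sizes for `steps` successive transmissions,
    halving (rounding up) each time and never going below one."""
    windows = []
    window = initial
    for _ in range(steps):
        windows.append(window)
        window = max(1, (window + 1) // 2)
    return windows


print(halved_windows(10, 5))  # [10, 5, 3, 2, 1]
```

With flights that rarely exceed 5 packets, an initial window of 10 means the first halving would usually have no visible effect.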
>
>
> Mark Allman's comments:
>
> > A few specific things (in addition to what Gorry said, which I
> > absolutely agree with):
> >
> >   - "Though timer values are the choice of the implementation,
> >     mishandling of the timer can lead to serious congestion
> >     problems"
> >
> >     + Gorry flagged this and I am flagging it again.  If this is
> >       something that can lead to serious problems, let's not just
> >       leave it to "choice of the implementation".  Especially if we
> >       have some idea how to make it less problematic.
>
> I'm not sure what you'd like here. I think the guidance in this
> specification is reasonable, so I'd be happy to just remove this
> text.
>
>
>
> >   - "Implementations SHOULD use an initial timer value of 100 msec
> >     (the minimum defined in RFC 6298 [RFC6298])"
> >
> >     + I wrote RFC 6298 and I have no idea where this is coming from!
> >
> >     + Even if this value of 100msec is OK for DTLS it shouldn't lean
> >       on RFC 6298 because RFC 6298 doesn't say that is OK.  I.e.,
> >       the parenthetical is objectively wrong.
> >
> >     + RFC 6298 says the INITIAL RTO should be 1sec (point (2.1) in
> >       section 2).  RFC 8961 affirms this and also says the INITIAL
> >       RTO should be 1sec (requirement (1) in section 4).
>
> Yeah, I'm not sure what happened here. I could go track down the
> PRs but I'll just plead editorial error. I suggest we just remove
> the parenthetical because it's not helping here.
>
>
> >   - "Note that a 100 msec timer is recommended rather than the
> >     3-second RFC 6298 default in order to improve latency for
> >     time-sensitive applications."
> >
> >     + Again, this mis-states RFC 6298, which says the initial RTO is
> >       1sec (not 3sec).  (Previous to RFC 6298 the initial RTO was
> >       3sec, which is probably where the notion comes from.  Most of
> >       the purpose of RFC 6298 was to drop the initial RTO to 1sec.)
>
> My bad. I'll fix this.
>
>
> >     + This is a statement of desire, not any sort of principled
> >       justification for using 100msec.  At the least this should be
> >       much better argued.
>
> See my note to Martin Duke above. What's appropriate in a very low
> volume handshake protocol is different from what's appropriate in a
> bulk transport protocol. With that said, as I said to Martin, I don't
> think litigating the precise value is that helpful, so I propose we
> just increase it to a somewhat larger value and explicitly acknowledge
> that specific settings may want to use a shorter value.
>
>
>
> >   - "The retransmit timer expires: the implementation transitions to
> >     the SENDING state, where it retransmits the flight, resets the
> >     retransmit timer, and returns to the WAITING state."
> >
> >     + Maybe this is spec sloppiness, but boy does it sound like the
> >       recipe TCP used before VJCC to collapse the network.  I.e.,
> >       expire and retransmit the window.  Rinse and repeat.  It may
> >       be the intention is for backoff to be involved.  But, that
> >       isn't what it says.
>
> It says it elsewhere, in the section you quoted:
>
>    a congested link.  Implementations SHOULD use an initial timer value
>    of 100 msec (the minimum defined in RFC 6298 {{RFC6298}}) and double
>    the value at each retransmission, up to no less than 60 seconds
>    (the RFC 6298 maximum).
>
> As I said to Martin, I think some of the confusion is that this
> specification uses "reset" to mean both "re-arm" and "set the value back
> to the initial" and depends on context to clarify that. Obviously that's
> not been entirely successful, so I propose to use "re-arm" where I mean
> "start a timer with the now current value".
>
> As noted above, this piece of the retransmission algorithm is already
> quite widely deployed (it was in DTLS 1.2) so I think there's a reasonably
> strong presumption that it is not horribly dangerous, though concededly
> suboptimal (hence the addition of ACKs in this specification).
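
For reference, the backoff described in the quoted text (double at each retransmission, up to a maximum) can be sketched as below. The function name and the millisecond units are assumptions; the 60 second cap is the smallest maximum the quoted text allows ("up to no less than 60 seconds"), so an implementation could use a larger one.

```python
def backoff_schedule(initial_ms, retransmissions, cap_ms=60_000):
    """Return the timer value used for the first transmission and for each
    of `retransmissions` retransmissions, doubling up to `cap_ms`."""
    values = []
    timer = initial_ms
    for _ in range(retransmissions + 1):
        values.append(timer)
        timer = min(timer * 2, cap_ms)
    return values


print(backoff_schedule(100, 4))    # [100, 200, 400, 800, 1600]
print(backoff_schedule(1000, 4))   # [1000, 2000, 4000, 8000, 16000]
```

Starting from 100 ms, the tenth retransmission is the first one to hit the 60 second cap.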
>
>
> >   - “When they have received part of a flight and do not immediately
> >     receive the rest of the flight (which may be in the same UDP
> >     datagram). A reasonable approach here is to set a timer for 1/4 the
> >     current retransmit timer value when the first record in the flight
> >     is received and then send an ACK when that timer expires.”
> >
> >     + Where does 1/4 come from?  Why is it "reasonable"?  This just
> >       feels like a complete WAG that was pulled out of the air.
>
> Yes, it was in fact pulled out of the air (though I did discuss it
> with Ian Swett a bit). To be honest, any value here is going to be
> somewhat pulled out of the air, especially because during the
> handshake the retransmit timer values are incredibly imprecise,
> consisting as they do of (at most) one set of samples.  In general,
> this value is a compromise between ACKing too aggressively (thus
> causing spurious retransmission of in-flight packets) and ACKing too
> conservatively (thus causing spurious retransmission of received
> packets).
>
> If you have a different proposal, I'm certainly open to it. FWIW,
> QUIC's max_ack_delay is 25ms, and that would certainly be fine with
> me.
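
The two ACK-delay options discussed above (1/4 of the current retransmit timer, or a fixed value such as 25ms) can be compared with a small sketch. The function and parameter names are hypothetical, and values are in milliseconds.

```python
def ack_delay_ms(retransmit_timer_ms, fixed_ms=None):
    """Return how long to wait after the first record of a partial flight
    before sending an ACK: a fixed delay if one is given, otherwise 1/4 of
    the current retransmit timer value."""
    if fixed_ms is not None:
        return fixed_ms
    return retransmit_timer_ms / 4


print(ack_delay_ms(100))                 # 25.0
print(ack_delay_ms(1000))                # 250.0
print(ack_delay_ms(1000, fixed_ms=25))   # 25
```

With a 100ms retransmit timer the two options coincide; they diverge as the timer grows, since the 1/4 rule scales with it and the fixed value does not.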
>
> -Ekr
>
> [0]
> https://www.fastly.com/blog/quic-handshake-tls-compression-certificates-extension-study
> [1] When SH is lost.
> [2] In fact, there will be *more* packets transmitted because you now will
> have ACKs for each chunk of the flight, though of course they will be
> transmitted over a longer time scale.
>
> On Thu, Mar 25, 2021 at 9:51 AM Martin Duke <martin.h.duke@gmail.com>
> wrote:
>
>> Hello all,
>>
>> The outcome of the telechat was that I agreed to start a thread on how to
>> fix the significant transport issues with the DTLS 1.3 draft. If I am
>> correct, there was no early TCPM or TSVWG review. A major protocol with
>> significant transport-layer functionality would benefit from such review in
>> the future.
>>
>> *Who is in this thread*:
>>
>> For easy reference, here is my DISCUSS, which goes so far as to express a
>> straw man design that would come closer to addressing the concerns:
>> https://mailarchive.ietf.org/arch/msg/tls/3g20CQkKWPGX-BAqfuEagR2ppGY/
>>
>> Besides TLSWG, I've added Lars (RFC8085
>> <https://datatracker.ietf.org/doc/rfc8085/>), Mark Allman (RFC8961
>> <https://datatracker.ietf.org/doc/rfc8961/>), and Gorry Fairhurst (also
>> RFC8085). Mark and Gorry have already sent me private comments that I
>> invite them to resend here. To summarize briefly, they amplified my
>> DISCUSS, made the new point that 8085 is directly relevant here, and are
>> concerned there aren't enough MUSTs.
>>
>> If people think there would be value in advertising this thread to the
>> TCPM and TSVWG working groups, I can do so, at the risk of introducing more
>> ancillary document churn.
>>
>> *Suggested plan:*
>>
>> Anyway, as a first step perhaps we can have Mark, Gorry, and Lars add
>> anything they'd like and then invite the draft authors to either make a
>> proposal or push back. If there are non-kosher things that DTLS 1.2 has
>> done with no observable problems, that would be an interesting data point:
>> within limits, introducing a latency regression into DTLS 1.3 would be
>> perverse.
>>
>> DTLS is a very important protocol and it is worth the time to get these
>> things right.
>>
>> Thanks,
>> Martin Duke
>> Transport AD
>>
>
