Re: [tcpm] Benjamin Kaduk's Discuss on draft-ietf-tcpm-rto-consider-16: (with DISCUSS and COMMENT)

Benjamin and I discussed this briefly in the telechat. I would summarize as
follows:

- We seem to have clear agreement on all but the first DISCUSS point
- We should probably say something about the security implications of
spoofed ACKs but not be too specific. The document doesn't really have a
section (that I can find right now) that discusses the properties of ACKs,
so perhaps the best place to put this is in security considerations.
Something to the effect of

"An attacker could spoof a sufficient number of receiver messages to either
alter the feedback time calculation, or indicate the receipt of packets
that did not actually arrive at the receiver. Endpoints SHOULD include
measures to make it hard for off-path attackers to guess the fields of
these packets to mitigate this attack."

Together with the agreed fixes, this should be enough to move forward.

Martin

On Tue, Jul 7, 2020 at 8:01 AM Mark Allman <mallman@icsi.berkeley.edu>
wrote:

>
> Hi Benjamin!
>
> > ----------------------------------------------------------------------
> > DISCUSS:
> > ----------------------------------------------------------------------
> >
> > If we are providing BCP-level requirements for time-based loss
> > detection via absence of protocol-level acknowledgment, for new
> > protocols, it seems appropriate to mandate that the acknowledgment
> > signal is reliable, i.e., not spoofable by at least an off-path
> > attacker, and ideally not spoofable by an on-path attacker either.
> > I would love for this to be a cryptographically protected
> > mechanism, but expect that I can't get away with mandating
> > something that strong, and that something with "enough bits of
> > entropy" will suffice.  (I'd prefer "enough" to be 128 but could
> > perhaps be persuaded that a lower value is appropriate as a
> > minimum requirement.)
>
> I have been thinking about this since yesterday and have a few
> thoughts ...
>
>   - In some sense, I like the general thought.  It would tighten
>     things up.
>
>   - That said, pragmatically, I think the risk from spoofed ACKs is
>     pretty small.  For instance, if you want to screw up the RTO you
>     need to pretty continuously screw up the FT measurements for
>     anything long running.  So, it isn't enough to guess and inject
>     correctly once, but you have to do so on an ongoing basis.  And,
>     with much less than 128 bits of entropy something like TCP is
>     not particularly susceptible to this sort of thing.
>
>   - And, so we can obviously make up a requirement for future
>     protocols here, but I don't think that will be a "BCP" or even
>     "CP".  It would not be like the rest of the document where the
>     aspects of loss detection are well understood.
>
>   - And, we'd have to decide what we meant by "new protocol".  Is
>     that some wholly new thing or is that a new algorithm for SCTP
>     or TCP or DNS?
>
>   - Perhaps there is some non-normative note that could be added
>     that folks are encouraged to think about making the feedback
>     non-spoofable for FT measurements (among many other reasons, of
>     course).  I dunno ...
>
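
To put rough numbers on the entropy discussion above, here is a minimal
sketch that assumes an off-path attacker makes independent, uniformly
random guesses at a field with a given number of bits of entropy. The
function name and the model are illustrative assumptions, not anything
from the draft; real protocols differ (for example, a TCP receiver
accepts a window of sequence numbers, so the effective entropy is lower
than the field width).

```python
import math


def spoof_success_probability(entropy_bits: int, attempts: int) -> float:
    """Chance that at least one of `attempts` blind guesses is accepted.

    Assumes each guess independently matches with probability
    2**-entropy_bits (a simplifying assumption for illustration).
    """
    p_single = 2.0 ** -entropy_bits
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p does not round to 0.
    return -math.expm1(attempts * math.log1p(-p_single))


if __name__ == "__main__":
    for bits in (16, 32, 64, 128):
        p = spoof_success_probability(bits, attempts=1_000_000)
        print(f"{bits:3d} bits, 1e6 guesses: P(success) = {p:.3e}")
```

Note that this is the chance of landing a single spoofed message; as
pointed out above, skewing the RTO of a long-running flow would require
succeeding repeatedly.
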
> > Point S.3 in Section 3 indicates that "[t]he requirements in this
> > document apply only to endpoint-to-endpoint unicast
> > communication.  Reliable multicast (e.g., [RFC5740]) protocols are
> > explicitly outside the scope of this document."  This limitation
> > of scope should be reflected in the document's title, Abstract,
> > and Introduction.
>
> (A note can be added.)
>
> > I would also like to get an explicit confirmation that the various
> > (non-)requirements on the details of exponential backoff and
> > reduced weighting for old FT samples are as-intended (see
> > COMMENT).  Specifically, are there limitations on the base of the
> > exponent for the exponential backoff, and is there a requirement
> > to give more recent FT samples more precedence than older FT
> > samples when computing an RTO estimate?
>
> Yes- the intent is to not specify the details.  The basic gist of
> the document is that we have learned that the details are not hugely
> important and so getting away from them is fine.
>
> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> >
> > Abstract
> >
> >     of congestion along a network path).  While many mechanisms have
> >     been designed to detect loss, protocols ultimately can only count on
> >     the passage of time without delivery confirmation to declare a
> >     packet "lost".  Each implementation of a time-based loss detection
> >
> > I'm not entirely convinced that the literal meaning of what these
> > words say is universally true for all protocols (specifically "can
> > only count on" and omitting a qualifier such as "many" or "most"
> > protocols or "reliably count on").  Consider, for example, the DTN
> > bundle protocol, which has some circumstances in which a given
> > store-and-forward node will flat-out not attempt to forward a
> > protocol onward to its final recipient, but can send back a status
> > report to the sender noting the confirmed lack of delivery.  I
> > suggest adding some form of qualifier to admit the possibility of
> > such unusual situations.
> >
> > Section 1
> >
> >     missing.  However, despite our best intentions and most robust
> >     mechanisms we cannot place ultimate faith in receiving such
> >     acknowledgments, but can only truly depend on the passage of time.
> >     Therefore, our ultimate backstop to ensuring that we detect all loss
> >     is a timeout.  [...]
> >
> > Despite the superficial similarity of tone, a strict reading of
> > these words does not seem to share the issue I raised about the
> > Abstract.  (That said, this phrasing still seems a bit overblown
> > and my personal preference would be to tone it down some.)
>
> The point is objectively true and actually meant to be absolute.
> Yes, an explicit non-delivery notification can be sent.  Such
> notifications may be useful and even quite important.  But,
> ultimately, we cannot rely upon such things to ensure reliable
> communications because they themselves may get lost, corrupted,
> delayed, etc.  But, if time somehow stops marching forward then
> .... well, WGAS about the packets?! :)
>
> To me this is a crucial protocol building touchstone.  One has to
> think about what we can and cannot count on.  And to what degree.
>
> > Section 2
> >
> >       - This document does not update or obsolete any existing RFC.
> >         These previous specifications---while generally consistent with
> >         the requirements in this document---reflect community consensus
> >         and this document does not change that consensus.
> >
> >       - The requirements in this document are meant to provide for
> >         network safety and, as such, SHOULD be used by all time-based
> >         loss detection mechanisms.
> >
> > I'm not sure that I understand how the combinations of these two
> > statements apply to existing specifications.  Is there some sense that
> > existing specs might leave some flexibility and when there is such
> > flexibility, implementations/deployments SHOULD use the flexibility to
> > match this document as closely as possible?  Or are existing deployments
> > and implementations of existing RFCs also grandfathered out?  If the
> > intent of the second bullet is to only apply to new protocols, please
> > say so explicitly.  (Section 4 seems to imply the "only new protocols"
> > case, but it's hard to be sure.)
>
> The first bullet is intended to be backwards looking and the second
> bullet is intended to be forward looking.  A few extra words there
> to make the second bullet more clear would be (in hindsight!)
> useful.
>
> >       - The requirements in this document may not be appropriate in all
> >         cases and, therefore, inconsistent deviations and variants may
> >         be necessary (hence the "SHOULD" in the last bullet).  However,
> >         inconsistencies MUST be (a) explained and (b) gather consensus.
> >
> > Similarly, is this statement only forward-looking?  Presumably the first
> > bullet ("does not update or obsolete any existing") trumps this one, but
> > it's good to be clear.
>
> Yes- forward looking.  Clarifying words to be added.
>
> > Additionally, I share the GenART reviewer's unease with this
> > strong of a requirement and have low confidence that the IETF will
> > reliably adhere to it in the future.
>
> In fairness to Stewart---even though I disagree with him---I think
> his point is broader than just non-adherence.  (Stewart and I have
> iterated on his comments and in fact a number of his LC comments
> made the document better.  But, on this point we just fundamentally
> disagreed.)
>
> I don't presume to speak for Stewart, but I believe he fears this
> will become a touchstone for area-on-area warfare whereby perfectly
> justifiable deviations won't be well understood as "perfectly
> justifiable" by folks not intimate with the given context.  And,
> this will in turn hold things up.  And, he seems especially
> concerned about cases where we want to update something that has
> been used forever and, yet, deviates from rto-consider.
>
> My take is three-fold ...
>
>   - First, this statement in rto-consider really doesn't change
>     anything, IMO.  If we're not explaining why we're doing things
>     and not getting consensus then we shouldn't be publishing it.
>
>   - Second, this explicit statement was added to get WG consensus.
>     Removing it or softening it would, I believe, run counter to the
>     consensus the document has already gathered.
>
>   - Third, IMO, saying a given mechanism has proven to be fine in a
>     given context should really be enough justification.  That is,
>     yeah, I can see that some protocol lawyer could be pedantic.
>     And, that might be annoying, I guess.  But, I think reasonable
>     people should be able to form a consensus that rebuts such
>     things.
>
> > Section 3
> >
> >     (S.1) The requirements in this document apply only to the primary or
> >           last resort time-based loss detection.
> >
> > I'm not sure I understand what "primary or last resort [sic]" means.
> > Are there time-based loss-detection mechanisms that are neither primary
> > nor last resort?  (nit: hyphenate "last-resort" as a compound adjective,
> > and many other compound adjectives throughout.)
> > (May also affect S.2.)
> > I see S.2 links to [DCCM13,CCDJ20,IS20] as examples of non-primary
> > non-last-resort schemes, however, I think we should be more clear about
> > what we mean when we introduce the term and why we exclude certain
> > classes of time-based loss-detection mechanisms.
> >
> >           detectors.  However, these mechanisms do not obviate the need
> >           for a "retransmission timeout" or "RTO" because---as we
> >           discuss in Section 1---only the passage of time can ultimately
> >           be relied upon to detect loss.  In cases such as these, the
> >
> > [This text is more like the Abstract's, implying that the passage of
> > time is the only way to detect loss, universally]
>
> I dunno.  I would have to think about this a bit.  As you say, we
> tried to sprinkle in references to non-last-resort timers.  Perhaps
> there are some words we can add here to explain things better.
> You're the first one to trip on this.  But, I'd be happy to go
> through this text and think about it again.
>
> > Section 4
> >
> >             In other words, the RTO should represent an empirically-
> >             derived reasonable amount of time that the sender should
> >             wait for delivery confirmation before deciding the given
> >             data is lost.  Network paths are inherently dynamic and
> >             therefore it is crucial to incorporate multiple FT samples
> >             in the RTO to take into account the delay variation across
> >             time.
> >
> > It feels weird to say that the delay will vary over time here but say
> > nothing about the corresponding need to discount very old FT samples.
> > Point (b) talks about adding new samples in, but not about removing or
> > discounting old samples.  That the TCP example does exhibit this
> > property does not excuse a lack of explicit mention as a property of
> > note.
>
> I think a note that says "it's fine to discount old samples" is OK
> and perfectly within the spirit of the guideline.  I don't want to
> overly specify that one has to do this.
>
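
One common way to discount old FT samples is the exponentially weighted
moving average that RFC 6298 uses for TCP. The sketch below assumes the
RFC 6298 constants (alpha = 1/8, beta = 1/4, K = 4); the class and
parameter names are hypothetical. It is shown only as one possible
approach, since the draft deliberately does not mandate any particular
weighting.

```python
class EwmaRtoEstimator:
    """RTO from FT samples, RFC 6298 style (illustrative sketch only)."""

    ALPHA = 1 / 8  # weight of the newest sample in the smoothed FT
    BETA = 1 / 4   # weight of the newest deviation in the variance term
    K = 4          # variance multiplier

    def __init__(self, initial_rto: float = 1.0, granularity: float = 0.0):
        self.srtt = None      # smoothed FT; None until the first sample
        self.rttvar = None    # smoothed FT deviation
        self.rto = initial_rto
        self.granularity = granularity  # clock granularity G, in seconds

    def on_sample(self, ft: float) -> float:
        """Fold one unambiguous FT sample (seconds) into the RTO."""
        if self.srtt is None:
            # First measurement: seed both estimators from it.
            self.srtt = ft
            self.rttvar = ft / 2
        else:
            # Each update multiplies the old state by (1 - weight), so a
            # sample's influence decays geometrically as it ages.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - ft)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * ft
        # No 1-second floor is applied here; that floor is an RFC 6298
        # recommendation and is not part of this sketch.
        self.rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        return self.rto
```

With ALPHA = 1/8, a sample that is 16 updates old contributes with
weight (7/8)^16, roughly 12% of its original weight, without ever being
explicitly removed.
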
> >         (d) An RTO mechanism MUST NOT use ambiguous FT samples.
> >
> >             Assume two copies of some segment X are transmitted at times
> >             t0 and t1 and then at time t2 the sender receives
> >             confirmation that X in fact arrived.  In some cases, it is
> >
> > "segment" feels like use of TCP-specific terminology without disclaimer
> > or generalization.
>
> (I can change this to 'packet'.)
>
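
The ambiguity in requirement (d) can be sketched as follows. This is a
minimal illustration with hypothetical names: it conservatively treats
every retransmitted packet as ambiguous (Karn's algorithm) and does not
attempt the disambiguation that the draft says is possible in some
cases.

```python
class FtSampler:
    """Produces FT samples only for packets that were sent exactly once."""

    def __init__(self):
        # packet id -> (time of first transmission, number of transmissions)
        self._sent = {}

    def on_send(self, pkt_id, now: float) -> None:
        first, count = self._sent.get(pkt_id, (now, 0))
        self._sent[pkt_id] = (first, count + 1)

    def on_ack(self, pkt_id, now: float):
        """Return an FT sample in seconds, or None if no usable sample."""
        entry = self._sent.pop(pkt_id, None)
        if entry is None:
            return None  # unknown or already-acknowledged packet
        first, count = entry
        if count > 1:
            # Sent at t0 and again at t1: the confirmation at t2 could
            # belong to either copy, so neither t2 - t0 nor t2 - t1 is used.
            return None
        return now - first


if __name__ == "__main__":
    sampler = FtSampler()
    sampler.on_send("X", 0.0)
    sampler.on_send("X", 1.0)          # retransmission of X
    print(sampler.on_ack("X", 1.5))    # ambiguous, so no sample
    sampler.on_send("Y", 2.0)
    print(sampler.on_ack("Y", 2.25))   # sent once, so a sample is produced
```
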
> >     (3) Loss detected by the RTO mechanism MUST be taken as an
> >         indication of network congestion and the sending rate adapted
> >         using a standard mechanism (e.g., TCP collapses the congestion
> >         window to one segment [RFC5681]).
> >         [...]
> >         An exception to this rule is if an IETF standardized mechanism
> >         determines that a particular loss is due to a non-congestion
> >         event (e.g., packet corruption).  In such a case a congestion
> >
> > If it's a MUST with an exception, doesn't that make it ... no longer a
> > MUST?
>
> I don't think so.  But, you're no doubt a better protocol lawyer
> than I am.  I believe this says you really have to do this, but
> there is a **very precise way** to get out of it.  If we made this a
> SHOULD then it would be very much looser---i.e., you really ought to
> do this unless you have a good reason.  Here we are telling you what
> the good reason has to look like.
>
> >     (4) Each time the RTO is used to detect a loss, the value of the RTO
> >         MUST be exponentially backed off such that the next firing
> >         requires a longer interval.  [...]
> >
> > Do we expect this to be exponential backoff by doubling, or is it
> > admissible to use an alternative base for the exponent?  (Is there
> > a minimum allowable value?  E.g., 1.000001 does not seem like it
> > would provide much protection for the network.)
>
> The document does not say to double the RTO.  It means to not
> specify the constant.
>
> Again, the document is meant to get away from specifics where we
> can.  I don't know how to set the minimal acceptable backoff.  I
> think we'd all agree 1.000001 is not really going to cut it.  But,
> would 1.8?  How about 1.7?  What about 1.3?  I believe the consensus
> is that "exponential backoff" is enough.
>
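
To make the question about the base concrete, here is a small sketch of
exponential backoff with a configurable factor. The function and
parameter names are hypothetical; the draft intentionally leaves the
constant unspecified, so none of the values shown are endorsed by it.

```python
def backed_off_rto(base_rto: float, backoff_factor: float,
                   consecutive_timeouts: int) -> float:
    """RTO after `consecutive_timeouts` successive RTO-detected losses.

    `backoff_factor` must exceed 1 so that each firing waits longer
    than the previous one.
    """
    if backoff_factor <= 1.0:
        raise ValueError("backoff_factor must be greater than 1")
    if consecutive_timeouts < 0:
        raise ValueError("consecutive_timeouts must be non-negative")
    return base_rto * backoff_factor ** consecutive_timeouts


if __name__ == "__main__":
    # Compare how quickly different factors grow a 1-second RTO.
    for factor in (2.0, 1.8, 1.3, 1.000001):
        series = [backed_off_rto(1.0, factor, n) for n in range(6)]
        print(factor, [round(v, 6) for v in series])
```

A factor of 1.000001 technically satisfies "exponential" but leaves the
interval essentially unchanged after a handful of timeouts, which is the
concern raised above.
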
> >     Further, a number of implementations use a steady-state minimum RTO
> >     that are less than the 1 second specified in [RFC6298] (which is
> >     different from the initial RTO we specify in Section 4, Requirement
> >     1).  While the implication of these deviations from the standard may
> >
> > Just to check: our requirement 1 is the same as RFC 6298, and the "less
> > than 1 second" is what's different.  Perhaps wording this akin to "and
> > as such is in violation of the RTO specified in Requirement 1 from
> > Section 4" would avoid the ambiguity about what is different from what.
>
> Well.  Um.  I read it many times, but cannot understand what you
> wrote about the document's ambiguity! ;-) Maybe that is my lack of
> coffee, I dunno.  But, let me try anyway ...
>
> RFC 6298 says the INITIAL RTO should be no less than 1sec.  That
> agrees with rto-consider (requirement 1).
>
> RFC 6298 further says the MINIMUM RTO over the entire lifetime of a
> TCP connection should be no less than 1sec.  There is no
> corresponding admonishment in rto-consider.
>
> Probably because these two things are both "1sec" in TCP a lot of
> people have gotten them mixed up in their minds during the
> development of this document.  The parenthetical you quote was no
> doubt meant to clarify to someone.  But, this discussion can be
> cleaned up, probably.
>
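
The two "1sec" values can be kept apart with a sketch like the one
below. The names are hypothetical, and the default floor of zero simply
reflects that rto-consider has no steady-state minimum; an RFC 6298
style implementation would pass min_rto=1.0.

```python
INITIAL_RTO = 1.0       # used only while no FT sample exists yet
RFC6298_MIN_RTO = 1.0   # steady-state floor from RFC 6298, a separate knob


def current_rto(computed_rto, min_rto: float = 0.0) -> float:
    """Pick the RTO to arm, in seconds.

    `computed_rto` is None until at least one FT sample has been taken.
    """
    if computed_rto is None:
        # No knowledge of the path yet: fall back to the initial RTO.
        return INITIAL_RTO
    # With samples available, only the (optional) floor applies.
    return max(computed_rto, min_rto)
```

For example, with a computed RTO of 0.25 seconds, current_rto(0.25)
yields 0.25, while current_rto(0.25, min_rto=RFC6298_MIN_RTO) yields
1.0.
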
> > Section 6
> >
> > Noting that we incorporate the security considerations of RFC 6298
> > by reference, and quoting from there:
> >
> >    In addition, even if the attacker can cause the sender's RTO to reach
> >    too small a value, it appears the attacker cannot leverage this into
> >    much of an attack (compared to the other damage they can do if they
> >    can spoof packets belonging to the connection), since the sending TCP
> >    will still back off its timer in the face of an incorrectly
> >    transmitted packet's loss due to actual congestion.
> >
> > Just to check my understanding: this "actual congestion" would
> > need to be between the sender and the attacker, right?  Since the
> > attacker's spoofed ACKs would cause the sender to not detect loss
> > due to congestion that occurs between the attacker and the
> > intended recipient.
>
> Oh boy.  My brain is thrashing to try to page back in RFC6298.
>
> The above text isn't the most clear, so I assume I didn't write
> it. :)
>
> Hmm.
>
> I think the basic point is that if an attacker can spoof they can do
> much worse than impacting the computation of the RTO.  And, if what
> they want to do is mess with a connection's performance and not the
> reliability of the connection---say because that'd be
> user-visible---then they cannot hide real losses.  And, since they
> cannot hide real losses then from a network safety perspective this
> all fails safely because the endpoint will back off when detecting
> those losses.
>
> > References
> >
> > The way we cite [KP87] ("MUST use Karn's algorithm [KP87,RFC6298]")
> > implies that it should be categorized as normative.  Similarly for RFC
> > 6298 itself.
>
> I dunno.  The entirety of the algorithm is given in rto-consider.
> So, the reference is really meant to give credit and point people at
> a place to read about it in more detail instead of as a pointer that
> must be understood to enact the guideline.  Perhaps just a
> restructuring ...
>
>   Therefore, in this situation an implementation MUST use Karn's
>   algorithm and use neither version of the FT sample and hence not
>   update the RTO (see [KP87,RFC6298] for a broader discussion of
>   Karn's algorithm).
>
> Or, put them under "Normative".  I really don't much care.
>
> >     [RFC3124] Balakrishnan, H., S. Seshan, "The Congestion Manager", RFC
> >         2134, June 2001.
> >
> > Fix the typo, please -- RFC 2134 is the articles of incorporation of
> > ISOC.
>
> Well, I do not want to be anywhere near ISOC anything!  Good catch,
> thanks!  (And, what a weird typo ... all the digits are there, but
> in such a weird order ...)
>
> allman