Re: [tcpm] Genart last call review of draft-ietf-tcpm-rto-consider-14

> On 6 Jun 2020, at 08:19, Gorry Fairhurst <gorry@erg.abdn.ac.uk> wrote:
> 
> Please see below.
> 
> On 05/06/2020 17:43, Mark Allman wrote:
>> Hi Stewart!
>> 
>> Thanks for the feedback.  Sorry for the long RTT.  I had a recent
>> deadline and am now trying to dig out.
>> 
>>> Major issues:
>>> 
>>> As far as I can see this text only applies to exchanges between
>>> applications and network support applications such as
>>> DNS. I.e. this is targeted at layer 4 and above. Given the
>>> religious nature of BCPs in the eyes of some reviewers, and to
>>> prevent endless explanations by those that design routing
>>> protocols, OAM and other lower layer sub-systems I think there
>>> needs to be a scoping text in block capitals at the very start
>>> of the document.
>> I am not entirely sure what you're suggesting here.  Per note to
>> Tom, I am going to add a few words to the intro.  Maybe that will
>> help.  I think it's unlikely I'll use block capitals! :-)
>> 
>>> =========
>>> 
>>>       - The requirements in this document may not be appropriate in all
>>>         cases and, therefore, inconsistent deviations may be necessary
>>>         (hence the "SHOULD" in the last bullet).  However,
>>>         inconsistencies MUST be (a) explained and (b) gather consensus.
>>> 
>>> SB> That can be quite an onerous obligation and provides scope for
>>> SB> endless argument when reviewers are not domain experts in the
>>> SB> protocol being designed.
>> This was added because another reviewer thought it was for sure
>> necessary.
>> 
>> I guess I don't understand why you'd call this 'an onerous
>> obligation' since presumably you'd do it anyway without this
>> document.  Are we ramming things through without consensus?  If not
>> (my assumption), (b) is no sweat.  Are we ramming things through
>> without thought?  If not (my assumption), (a) is straightforward and
>> hopefully is being done anyway.  In other words, I don't understand
>> the complaint here because if you don't want to use the guidelines
>> then that is fine, but in going through the standard process to
>> define a loss detector you'll end up meeting this bullet.  Even if
>> this document doesn't get published or didn't exist our documents
>> should still be meeting this bullet.
>> 
>>> =======
>>> 
>>>           While there are a bevy of uses for timers in protocols---from
>>>           rate-based pacing to connection failure detection and
>>>           beyond---these are outside the scope of this document.
>>> 
>>> SB> I am not sure what that means for the applicability of this
>>> SB> document.
>> This was added at some point along the way because someone thought
>> something like rate-based pacing could be covered by the guidelines
>> and the intent is to say it is not.  I have zero love for this bit
>> and would happily remove it, but am loath to do so because the old
>> comment will then come back.
> I think Mark is correct, there are many transport uses of timers, and calling out a small number of other uses was important to scope this within the transport discussions, even if it just says "timers also do other stuff".

If the scope of this is explicitly transport and above I have no issues.

If it has a greater scope the scope of study and recommendations really needs to increase accordingly.

>>> =========
>>> 
>>>     (1) As we note above, loss detection happens when a sender does not
>>>         receive delivery confirmation within an some expected period of
>>>         time.  In the absence of any knowledge about the latency of a
>>>         path, the initial RTO MUST be conservatively set to no less than
>>>         1 second.
>>> 
>>> SB> This issue may be addressed by the scoping text, but 1s is no
>>> SB> use when you are trying to detect sub 50ms of packet loss in
>>> SB> the infrastructure.
>> We have to start somewhere when we know nothing.
>> 
>> I think in my thread with Tom we hit upon this notion that the
>> document is really about sort of arbitrary, unknown and therefore
>> presumed unreliable networks.  I am going to add some words to this
>> effect.  Does this help?
>> 
>> Again, for specific environments where things are more nailed down
>> and known, deviations are fine and explicitly OK.  But, as a general
>> default I think saying "when you don't know anything < 50msec is
>> cool" is unlikely to be appropriate.  Well, no, I think it would be
>> quite inappropriate, actually.
> This is, I think, a natural discussion based on a different perspective. The 1 second initial starting value for a transport path has been there for a long time, and transport reviewers will frequently quote it, be it for transports such as SCTP and TCP or for UDP-based apps (BCP 145, Sect. 3.1.1). I'd expect this is about the assumed starting position for an Internet path.
> 
> True, if we're talking about a link between adjacent peers, this is something very different.
> 

We do multi-hop OAM in RTG to hold the infrastructure together.

Again, my point is that if the scope is L4 and above I have no issue, but the scope seems to be wider.

>>> =============
>>> 
>>>     (3) Each time the RTO is used to detect a loss, the value of the RTO
>>>         MUST be exponentially backed off such that the next firing
>>>         requires a longer interval.  The backoff SHOULD be removed after
>>>         either (a) the subsequent successful transmission of
>>>         non-retransmitted data, or (b) an RTO passes without detecting
>>>         additional losses.  The former will generally be quicker.  The
>>>         latter covers cases where loss is detected, but not repaired.
>>> 
>>>         A maximum value MAY be placed on the RTO.  The maximum RTO MUST
>>>         NOT be less than 60 seconds (as specified in [RFC6298]).
>>> 
>>>         This ensures network safety.
>>> 
>>> SB> This does not work in OAM applications.
>> Well, OK, get consensus to do something different---which is
>> completely fine.  I think retransmission timers have shown
>> themselves to be crucial for preventing collapse and, again, as a
>> default I think this is our best advice.
>> 
> It should be applicable for OAM applications that use a path across the Internet that can change, and certainly could be bad advice for a controlled environment. It's actually not new; BCP 145 also speaks of backoff.

A common rule in OAM-type situations is three fast packets and then back-off.
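
To make the two behaviours under discussion concrete, here is a rough sketch of the timer schedules each one produces. It is only an illustration under stated assumptions, not text from the draft or from any OAM specification: the function names, the doubling factor of 2, the 50 ms fast interval and the probe counts are all hypothetical. The draft only says "exponentially backed off", sets a 1 second floor on the initial RTO when nothing is known about the path, and says an optional maximum must not be below 60 seconds.

```python
# Illustrative sketch only. Names and default values below are assumptions
# chosen for the example; they do not come from the draft or from BFD.

INITIAL_RTO = 1.0   # seconds; floor from requirement (1) when latency is unknown
MIN_MAX_RTO = 60.0  # seconds; an optional cap may not be lower than this


def draft_style_schedule(losses, initial_rto=INITIAL_RTO,
                         max_rto=MIN_MAX_RTO, factor=2.0):
    """RTO used for each successive timeout under requirement (3).

    The RTO grows by `factor` (assumed to be 2) after every timeout-detected
    loss and is clamped to `max_rto` when a cap is configured.
    """
    if max_rto is not None and max_rto < MIN_MAX_RTO:
        raise ValueError("a maximum RTO must not be less than 60 seconds")
    schedule = []
    rto = initial_rto
    for _ in range(losses):
        schedule.append(rto)
        rto *= factor
        if max_rto is not None:
            rto = min(rto, max_rto)
    return schedule


def oam_style_schedule(probes, fast_interval=0.05, fast_count=3, factor=2.0):
    """Intervals for 'three fast packets and then back-off'.

    The first `fast_count` probes use the same short interval (50 ms is a
    hypothetical value); after that the interval grows by `factor`.
    """
    schedule = []
    interval = fast_interval
    for i in range(probes):
        schedule.append(interval)
        if i + 1 >= fast_count:
            interval *= factor
    return schedule


if __name__ == "__main__":
    print(draft_style_schedule(8))
    print(oam_style_schedule(6))
```

With these assumed defaults the first schedule starts at 1 second and saturates at 60 seconds, while the second sends three probes at the short interval before it starts doubling, which is the difference in starting position being argued about here.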

>>> Minor issues:
>>> 
>>>  "By waiting long enough that we are unambiguously
>>>   certain a packet has been lost we cannot repair losses in a timely
>>>   manner and we risk prolonging network congestion."
>>> 
>>> I have a concern here that the emphasis is on classical
>>> operation. We are beginning to see applications run over the
>>> network where the timely delivery of a packet is critical for
>>> correct operation of even SoL. As a BCP the text needs to
>>> recognise that the scope and purpose of IP is changing and that
>>> classical learning and rules derived from them may not apply.
>>> 
>>> Also if not ruled out of scope earlier we need to be clear at this
>>> point that things like BFD have different considerations.
> Isn't BFD a link protocol between adjacent systems?

No, not always, you can have multiple-hop BFD.

This is infrastructure and not user data, and there is a school of thought that in data planes where
the same data path is used for both control and user data, the user data is sacrificial to maintaining
the infrastructure. The reason that you do backoff in these cases is not to avoid congestion but
instead to avoid overloading the control peer, i.e. the route processor in the peer router.

- Stewart
>> I am going to suggest we revisit this after I hack out a little
>> extra text for the intro.  You can see if that helps.
>> 
>>> ==========
>>> 
>>>       "- This document does not update or obsolete any existing RFC.
>>>         These previous specifications---while generally consistent with
>>>         the requirements in this document---reflect community consensus
>>>         and this document does not change that consensus."
>>> 
>>> I think it needs to be clear that adherence to this RFC is not
>>> required for minor updates and extensions to existing RFCs. Having
>>> seen minor routing extension held up by security concerns related
>>> to underlying protocols rather than the extension itself there is
>>> a lot of sensitivity on this point in some quarters of the IETF.
>> Um.  Do you have suggested words?  I am not much of a protocol
>> lawyer (thankfully!), but I am not really conjuring the case you're
>> concerned about.  Something like ...
>> 
>>   (1) RFC XXXX was published 10 years ago and violates
>>       rto-consider.
>>   (2) We want to do a XXXXbis.
>>   (3) The bis has to then explain why it's cool to violate
>>       rto-consider.
>> 
>> .... ?
>> 
>> I would say if XXXX has a loss detector that had consensus and has
>> been in use for a while it'd be pretty easy to get consensus for
>> XXXXbis that we can still use it as it has worked fine.
>> 
>>> It might be useful to make it clear that there are some
>>> applications that would prefer no data to late data.
>> This document is about loss detection, not what one does after
>> detecting.  So, we do say ...
>> 
>>     However, as discussed above, the detected loss need not be
>>     repaired
>> 
>> I am happy to reinforce this point.  Text suggestions welcome.
>> 
>>> Nits/editorial comments:
>>> 
>>> The terminology section confuses ID-nits - I think it should be a
>>> section in its own right later in the document.
>> Yeah, id-nits as it is run when submitting doesn't flag this.  It
>> was flagged by someone else in LC.  Because I am old school it's
>> hard to renumber everything and so I was just leaving this for the
>> rfc-ed to do something reasonable here.
>> 
>>> The following nits issues need looking at
>>> 
>>>   == Missing Reference: 'RFC5681' is mentioned on line 377, but not defined
>>> 
>>>   == Unused Reference: 'RFC3940' is defined on line 515, but no explicit
>>>      reference was found in the text
>>> 
>>>   == Unused Reference: 'RFC4340' is defined on line 519, but no explicit
>>>      reference was found in the text
>>> 
>>>   == Unused Reference: 'RFC6582' is defined on line 540, but no explicit
>>>      reference was found in the text
>> I will fix all these.  Again, I was trusting the id-nits when I
>> submitted and these were not flagged (or, if they were it wasn't in
>> a way that foisted them on my screen).  But, they're easy fixes, so
>> thanks!
>> 
>> allman