Re: [tcpm] Last Call: <draft-ietf-tcpm-rack-13.txt> (The RACK-TLP loss detection algorithm for TCP) to Proposed Standard

Markku Kojo <kojo@cs.helsinki.fi> Tue, 15 December 2020 18:57 UTC

Date: Tue, 15 Dec 2020 20:56:50 +0200
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Neal Cardwell <ncardwell@google.com>
cc: Yuchung Cheng <ycheng@google.com>, last-call@ietf.org, "tcpm@ietf.org Extensions" <tcpm@ietf.org>, draft-ietf-tcpm-rack@ietf.org, Michael Tuexen <tuexen@fh-muenster.de>, draft-ietf-tcpm-rack.all@ietf.org, tcpm-chairs <tcpm-chairs@ietf.org>

Hi Neal, all,

(My apologies that this message got so long, but I did not want to drop
any parts of it in case someone wants to read the entire discussion in one
piece. Also, apologies for repeating some of my concerns inline below;
those who are already clear about the problems may well skip those parts.)

Thanks Neal for the reply. I believe we are slowly converging on many
of my comments, but two of the important issues still remain open. That is,
where can an implementer find advice for correct congestion control
actions with RACK-TLP, when:
(1) a loss of a rexmitted segment is detected
(2) an entire flight of data gets dropped (and detected),
     that is, when there is no feedback available and a timeout
     is needed to detect the loss

I fully understand the intent of the RACK-TLP draft not to include
congestion control actions in this document (or the intent to minimise
such advice). However, when new ways to detect loss are introduced and
published (as this document would do), IMHO it is imperative to have such
congestion control actions readily available for an implementer. AFAIK
the RFC series does not provide correct actions for these two
cases.

I do not have a strong opinion whether the necessary actions should be in 
this document or in another document, as long as we have them. This is 
more or less a process decision, I think.

Please see more details inline.

I also looked a bit more closely at RACK-TLP as specified in the draft, and
there seems to be a problem with the pseudocode as currently
presented (some sequence numbers seem to be off by one, resulting in a
wrong outcome in some, possibly many, cases). Please see inline at the end
of my reply.


  On Sat, 12 Dec 2020, Neal Cardwell wrote:

> Hi Markku,
>
> Thanks for your many detailed and thoughtful comments. Please see
> below in-line for our comments based on discussions among RACK-TLP
> draft authors.
>
> On Tue, Dec 8, 2020 at 11:21 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>
>> Hi,
>>
>> please see inline.
>>
>> On Mon, 7 Dec 2020, Yuchung Cheng wrote:
>>
>>> On Mon, Dec 7, 2020 at 8:06 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>>>
>>>> Hi Yuchung,
>>>>
>>>> thanks for your reply. My major point is that IMHO this RACK-TLP
>>>> specification should give the necessary advice w.r.t congestion control
>>>> in cases when such advice is not available in the RFC series or is not
>>>> that easy to interpret correctly from the existing RFCs.
>>>>
>>>> Even though the CC principles were available in RFC series I believe you
>>>> agree with me that getting detailed CC actions correct is sometimes quite
>>>> hard, especially if one needs to interpret and apply them from another
>>>> spec than what one is implementing. Please see more details inline below.
>>>>
>>>> In addition, we need to remember that this document is not meant only for
>>>> TCP experts following the tcpm list and having deep understanding of
>>>> congestion control but also those implementing TCP, for the very first
>>>> time, for various appliances, for example. They do not have first hand
>>>> experience in implementing TCP congestion control and deserve clear
>>>> advice what to do.
>>>>
>>>> On Fri, 4 Dec 2020, Yuchung Cheng wrote:
>>>>
>>>>> On Fri, Dec 4, 2020 at 5:02 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I know this is a bit late but I didn't have time earlier to take a look
>>>>>> at this draft.
>>>>>>
>>>>>> Given that this RFC to be is standards track and RECOMMENDED to replace
>>>>>> current DupAck-based loss detection, it is important that the spec is
>>>>>> clear on its advice to those implementing it. Current text seems to
>>>>>> lack important advice w.r.t congestion control, and even though
>>>>>> the spec tries to decouple loss detection from congestion control
>>>>>> and does not intend to modify existing standard congestion control
>>>>>> some of the examples advise incorrect congestion control actions.
>>>>>> Therefore, I think it is worth correcting the mistakes and taking
>>>>>> yet another look at a few implications of this specification.
>>>>> As you noted, the intention is to decouple the two as much as possible
>>>>>
>>>>> Unlike 20 years ago, when TCP loss detection and congestion
>>>>> control were essentially glued in one piece, the decoupling of the two
>>>>> (including modularizing congestion controls in implementations) has
>>>>> helped fuel many great inventions of new congestion controls.
>>>>> Codifying so-called-default C.C. reactions in the loss detection is a
>>>>> step backward that the authors try their best to avoid.
>>>>
>>>> While I fully agree with the general principle of decoupling loss
>>>> detection from congestion control when it is possible without leaving
>>>> open questions, I find it hard to get congestion control right with this
>>>> spec in certain cases I raised just by following the current standards
>>>> track CC specifications. The reason for this is that RACK-TLP introduces
>>>> new ways to detect loss (i.e., not present in any earlier standard track
>>>> RFC) and the current CC specifications do not provide correct CC actions
>>>> for such cases as I try to point out below.
>>>>
>>>>> To keep the
>>>>> document less "abstract / unclear" as many WGLC reviewers commented,
>>>>> we use examples that include CC actions for illustration. But the
>>>>> details of these CC actions are likely to become obsolete as CC
>>>>> hopefully continues to advance.
>>>>
>>>> Agreed. But I would appreciate it if the CC actions in the examples
>>>> correctly followed what is specified in the current CC RFCs. And, I
>>>> would suggest explicitly citing the RFC(s) that each of the examples is
>>>> illustrating so that there is no doubt which CC variant the example
>>>> is valid with. Then there is no problem with the correctness of the
>>>> example either, even if the cited RFC is later obsoleted.
>>>>
>>>>>>
>>>>>> Sec. 3.4 (and elsewhere when discussing recovering a dropped
>>>>>> retransmission):
>>>>>>
>>>>>> It is very useful that RACK-TLP allows for recovering dropped rexmits.
>>>>>> However, it seems that the spec ignores the fact that loss of a
>>>>>> retransmission is a loss in a successive window that requires reacting
>>>>>> to congestion twice as per RFC 5681. This advice must be included in
>>>>>> the specification because with RACK-TLP recovery of dropped rexmit
>>>>>> takes place during the fast recovery which is very different
>>>>>> from the other standard algorithms and therefore easy to miss
>>>>>> when implementing this spec.
>>>>>
>>>>> per RFC5681 sec 4.3 https://tools.ietf.org/html/rfc5681#section-4.3
>>>>> "Loss in two successive windows of data, or the loss of a
>>>>>  retransmission, should be taken as two indications of congestion and,
>>>>>   therefore, cwnd (and ssthresh) MUST be lowered twice in this case."
>>>>>
>>>>> RACK-TLP is a loss detection algorithm. RFC5681 is crystal clear on
>>>>> this so I am not sure what clause you suggest to add to RACK-TLP.
>>>>
>>>> Right, this is the CC *principle* in RFC 5681 I am referring to, but I am
>>>> afraid it is not enough to help one correctly implement such lowering
>>>> of cwnd (and ssthresh) twice when a loss of a retransmission is detected
>>>> during Fast Recovery. Nor do the RFCs clearly advise *when* this reduction
>>>> must take place.
>>>>
>>>> Congestion control principles tell us that congestion must be reacted to
>>>> immediately when detected. But at the same time, standards track CC
>>>> specifications react to congestion only once during Fast Recovery
>>>> because the losses in the current window, which Fast Recovery repairs,
>>>> occurred during the same RTT. That is, the current CC specifications do
>>>> not handle lost rexmits during Fast Recovery but, instead, the correct CC
>>>> reaction to a loss of a rexmit is automatically achieved by those
>>>> specifications via RTO when cwnd is explicitly reset upon RTO.
>>>>
>>>> Furthermore, the problem I am trying to point out is that there is no
>>>> correct rule/formula available in the standards track RFCs that would
>>>> give the correct way to reduce cwnd when the loss of rexmit is detected
>>>> with RACK-TLP.
>>>>
>>>> I suggest everyone reading this message stops reading at this point
>>>> for a while before continuing reading and figures out themselves what
>>>> they think would be the correct equation to use in the standards RFC
>>>> series to find the new halved value for cwnd when RACK-TLP detects a loss
>>>> of a rexmit during Fast Recovery. I would appreciate a comment on the
>>>> tcpm list from those who think they found the correct answer immediately
>>>> and easily, and what was the formula to use.
>>>>
>>>> ...
>>>>
>>>> I think the best advice one may find by looking at RFC 6675 (and RFC
>>>> 5681) is to set
>>>>
>>>>   ssthresh = cwnd = (FlightSize / 2) (RFC 6675, Sec 5, algorithm step 4.2)
>>>>
>>>> Now, let's modify the example in Sec 3.4 of the draft:
>>>>
>>>> 1. Send P0, P1, P2, ..., P20
>>>>    [Assume P1, ..., P20 dropped by network]
>>>>
>>>> 2.   Receive P0, ACK P0
>>>> 3a.  2RTTs after (2), TLP timer fires
>>>> 3b.  TLP: retransmits P20
>>>> ...
>>>> 5a.  Receive SACK for P20
>>>> 5b.  RACK: marks P1, ..., P20 lost
>>>>       set cwnd=ssthresh=FlightSize/2=10
>>>> 5c.  Retransmit P1, P2 (and some more depending on how CC implemented)
>>>>       [P1 retransmission dropped by network]
>>>>
>>>>       Receive SACK P2 & P3
>>>> 7a.  RACK: marks P1 retransmission lost
>>>>       As per RFC 6675: set cwnd=ssthresh=FlightSize/2=20/2=10
>>>> 7b.  Retransmit P1
>>>>       ...
>>>>
>>>> So, is the new value of the cwnd (10MSS) correct and halved twice? If not,
>>>> where is the correct formula to do it?
>>>>
>>>> Before RFC 5681 was published we had a discussion on FlightSize and that
>>>> during Fast Recovery it does not reflect the amount of segments in
>>>> flight correctly but may also have a too large value. It was decided not
>>>> to try to correct it because it only has an impact when RTO fires during
>>>> Fast Recovery and in such a case cwnd is reset to 1 MSS. Having a too large
>>>> ssthresh for RTO recovery in some cases was not considered that bad
>>>> because a TCP sender anyway takes the most conservative CC action with the
>>>> cwnd and would slow start from cwnd = 1 MSS. But now that RACK-TLP enables
>>>> detecting loss of a rexmit during Fast Recovery we have an unresolved
>>>> problem.
>>>
>>> To account for your points, IMO clearly stating the existing CC RFC
>>> interactions w/o mandating any C.C. actions is the best way to move
>>> forward. Here is the text diff I proposed:
>>>
>>> "Figure 1, above, illustrates  ...
>>> Notice a subtle interaction with existing congestion control actions
>>> on event 7a. It essentially starts another new episode of congestion
>>> due to the detection of lost retransmission. Per RFC5681 (section 4.3)
>>> that loss in two successive windows of data, or the loss of a
>>> retransmission, should be taken as two indications of congestion as a
>>> principle. But RFC6675 that introduces the pipe concept does not
>>> specify such a second window reduction. This document reminds RACK-TLP
>>
>> This is not quite a correct characterization of RFC 6675.
>> RFC 6675 does not repeat all guidelines of RFC 5681. RFC 6675 DOES specify
>> the two indications of congestion implicitly by clearly stating in the
>> intro that it follows the guidelines set in RFC  5681.
>> And RFC 5681 articulates a set of MUSTs which ALL alternative loss
>> recovery algorithms MUST follow in order to become RFCs.
>>
>> (*) RFC 5681, Sec 4.3 (pp. 12-13):
>>
>>   "While this document does not standardize any of the
>>    specific algorithms that may improve fast retransmit/fast recovery,
>>    ...
>>    That is, ...
>>    Loss in two successive windows of data, or the loss of a
>>    retransmission, should be taken as two indications of congestion and,
>>    therefore, cwnd (and ssthresh) MUST be lowered twice in this case."
>>
>>
>>> implementation to carefully consider the new multiple congestion
>>> episode cases in the corresponding congestion control."
>>
>> I am sorry to say that this is very fuzzy and leaves an implementer
>> all alone to figure out what to do.
>>
>> More importantly, AFAIK we have had no discussion nor consensus in the IETF
>> on what is the correct CC action when a lost rexmit is detected.
>> Should one reset cwnd like current CCs do, or would "halving again" be
>> fine, or something else? IMHO this requires experimentation to decide.
>>
>> I assumed there is an implementation of RACK-TLP with corresponding CC
>> actions and experimental results to devise what are the implications of
>> the selected CC action(s) with various levels of congestion. But it seems
>> we do not have such implementation nor experimental evidence? Am I wrong?
>
> [RACK-TLP-team:]
>
> We do have such implementations of RACK-TLP with corresponding CC
> actions, and experimental results were reported in RACK-TLP
> presentations and earlier revisions of the RACK-TLP draft. With

Yes, that was my understanding of what is implemented. Given there is this
implementation that reacts appropriately to congestion when RACK-TLP
detects loss of a retransmitted segment, could you please describe the
congestion control actions taken in the implementation with which there
is experimental experience?

If I understood it correctly, the RACK-TLP in question is implemented
with PRR. Let's assume another implementer wants to add RACK-TLP with PRR
to the stack she/he is working on. I believe that the implementer would
think that detecting a loss of a rexmit during fast recovery starts
a new congestion episode in PRR (and that is what Yuchung also indicated
in the new proposed text in his previous reply shown above). With
that advice and reading RFC 6937, an implementer would reinitialize the PRR
state for the new congestion episode. So how should the implementer
calculate new values for ssthresh and RecoverFS? Should she/he use what
is currently described in RFC 6937, Sec 3 (to keep the discussion simple,
assume the implementer wants to have an IETF Standards Track conformant
implementation, so use CongCtrlAlg() = FlightSize/2):

  ssthresh = CongCtrlAlg()  // Target cwnd after recovery
  prr_delivered = 0         // Total bytes delivered during recovery
  prr_out = 0               // Total bytes sent during recovery
  RecoverFS = snd.nxt-snd.una // FlightSize at the start of recovery

Or should one use something else? AFAIK, depending on the loss pattern,
neither FlightSize nor "snd.nxt-snd.una" would give correct values for
the intended behavior. Depending on the loss pattern, at the time
of detecting a loss of a rexmit FlightSize may be ~50% larger than at the
beginning of fast recovery. This would result in increasing ssthresh
(send rate) instead of decreasing it. It seems that something else in
the PRR algorithm would also need to be modified to get sndcnt correct, but I
cannot figure it out right now.
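
To make the concern concrete, here is a minimal sketch (Python, in units of
segments) of re-running the RFC 6937 initialization quoted above when a lost
rexmit is detected. The function names and the numbers 20 and 30 are
hypothetical; 30 merely encodes the "~50% larger" case mentioned above, and
FlightSize is assumed to equal snd.nxt - snd.una for simplicity.

```python
def cong_ctrl_alg(flight_size: int) -> int:
    """Standards-track target used in the text: FlightSize / 2 (segments)."""
    return flight_size // 2


def init_prr_episode(flight_size: int) -> dict:
    """Re-run the RFC 6937 Sec 3 initialization for a new congestion episode."""
    return {
        "ssthresh": cong_ctrl_alg(flight_size),  # target cwnd after recovery
        "prr_delivered": 0,                      # delivered during recovery
        "prr_out": 0,                            # sent during recovery
        "RecoverFS": flight_size,                # FlightSize at start of recovery
    }


# Hypothetical numbers: 20 segments outstanding when fast recovery starts,
# 30 outstanding (~50% more) when the lost rexmit is detected.
first = init_prr_episode(20)
second = init_prr_episode(30)

print(first["ssthresh"], second["ssthresh"])  # 10 15
```

With these assumed values the second episode raises ssthresh from 10 to 15
instead of lowering it, which is the opposite of reacting to congestion twice.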

If one wants to implement RACK-TLP without PRR, that is, with RFC 6675,
we have the same problem with FlightSize in determining a new value for
cwnd and ssthresh.
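
The same effect can be sketched for the modified Sec 3.4 example quoted
earlier in this thread. The sketch assumes that FlightSize is taken as
snd.nxt - snd.una in segments and that no new data is sent between steps 5b
and 7a; both are simplifying assumptions of this sketch, not statements from
RFC 6675, and the helper names are made up.

```python
def flight_size(snd_una: int, snd_nxt: int) -> int:
    """FlightSize approximated as outstanding sequence space, in segments."""
    return snd_nxt - snd_una


def rfc6675_step_4_2(snd_una: int, snd_nxt: int) -> int:
    """ssthresh = cwnd = FlightSize / 2 (RFC 6675, Sec 5, step 4.2)."""
    return flight_size(snd_una, snd_nxt) // 2


# P0..P20 sent, P0 cumulatively ACKed: snd.una = 1, snd.nxt = 21.
cwnd_at_5b = rfc6675_step_4_2(snd_una=1, snd_nxt=21)

# At 7a only P2 and P3 have been SACKed; P1 is still unacknowledged, so
# snd.una has not advanced and FlightSize is unchanged.
cwnd_at_7a = rfc6675_step_4_2(snd_una=1, snd_nxt=21)

halved_twice = cwnd_at_5b // 2  # one reading of "lowered twice"

print(cwnd_at_5b, cwnd_at_7a, halved_twice)  # 10 10 5
```

Under these assumptions the formula yields 10 segments both times, so the
second "reduction" does not reduce anything.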

My point is that using what we currently have out there in the RFC 
series gives incorrect CC advice. And, having correct CC actions in this 
case is far from trivial so IMHO high quality standards specifications 
should not leave an implementer of RACK-TLP alone to figure it out.

In addition, if a new episode of fast recovery is started for CC purposes,
how would that play together with RACK-TLP loss detection? In Sec 7.1 the
text says to initialize (reset) TLP.end_seq and TLP.is_retrans when
initiating fast recovery. But detecting a lost rexmit seemingly should
not reinitialize TLP, because PTO is disabled during fast recovery as per
Sec 7.2. If CC actions are in another document and loss detection in this
document, it would be of high importance to clarify when each of the
fast recovery episodes (CC vs. loss detection) is supposed to end: is
the end condition the same or different?

> respect to implementations, there is the Linux TCP stack, which has
> been using RACK-TLP as the default loss recovery algorithm since Linux
> v4.18 in August 2018. The exact commit is:
>
>  b38a51fec1c1 tcp: disable RFC6675 loss detection
>
> With Linux, the default sender behavior since that commit has been
> governed by CUBIC, PRR, and RACK-TLP.
>
> RACK-TLP is used by all Google TCP traffic, including internal RPC
> traffic and external YouTube and google.com traffic over the public
> Internet. At Google we have years of high-traffic-volume experience
> with RACK-TLP with three different CC schemes: CUBIC+PRR, BBRv1, and
> BBRv2.
>
> My understanding is that Netflix FreeBSD TCP traffic uses RACK-TLP,
> and Microsoft Windows TCP traffic has also been using RACK-TLP.

And I wonder what CC actions are taken in each of these
implementations when a loss of a rexmitted segment is detected?

>> I sincerely apologize that this got raised this late in the process and
>> I know how irritating it may be. I like the idea of RACK-TLP and by no
>> means is it my intention to hold up the process, but the lack of evidence
>> makes me quite concerned. In particular, when this document says that
>> the current loss detection SHOULD be replaced in all implementations with
>> RACK-TLP.
>>
>>> and to emphasize in section 9.3 Interaction with congestion control
>>>
>>> "9.3.  Interaction with congestion control
>>>
>>> RACK-TLP intentionally decouples loss detection ... this appropriate.
>>> As mentioned in Figure 1 caption, RFC5681 mandates a principle that
>>> Loss in two successive windows of data, or the loss of a
>>> retransmission, should be taken as two indications of congestion, and
>>> therefore reacted to separately. However, implementations of the RFC6675
>>> pipe algorithm may not directly account for these newly detected
>>> congestion events properly. Therefore this document reminds RACK-TLP
>>> implementations to carefully consider these implications in their
>>> corresponding congestion control.
>>>
>>> ..."
>>>
>>>
>>>
>>>>
>>>>
>>>>>> Sec 9.3:
>>>>>>
>>>>>> In Section 9.3 it is stated that the only modification to the existing
>>>>>> congestion control algorithms is that one outstanding loss probe
>>>>>> can be sent even if the congestion window is fully used. This is
>>>>>> fine, but the spec lacks the advice that if a new data segment is sent
>>>>>> this extra segment MUST NOT be included when calculating the new value
>>>>>> of ssthresh as per the equation (4) of RFC 5681. Such segment is an
>>>>>> extra segment not allowed by cwnd, so it must be excluded from
>>>>>> FlightSize, if the TLP probe detects loss or if there is no ack
>>>>>> and RTO is needed to trigger loss recovery.
>>>>>
>>>>> Why exclude TLP (or any data) from FlightSize? The congestion control
>>>>> needs precise accounting of the flight size to react to congestion
>>>>> properly.
>>>>
>>>> Because FlightSize does not always reflect the correct amount of data
>>>> allowed by cwnd. When a TCP sender is not already in loss recovery and
>>>> it detects loss, this loss indicates the congestion point for the TCP
>>>> sender, i.e., how much data it can have outstanding. It is this amount of
>>>> data that it must use in calculating the new value of cwnd (and
>>>> ssthresh), so it must not include any data sent beyond the congestion
>>>> point. When TLP sends a new data segment, it is beyond the congestion
>>>> point and must not be included. The same holds for the segments sent via
>>>> Limited Transmit: they are allowed to be sent out by the packet
>>>> conservation rule (DupAck indicates a pkt has left the network, but does
>>>> not allow increasing cwnd), i.e., the actual amount of data in flight
>>>> remains the same.
>>>>
>>>>>> In these cases the temporary over-commit is not accounted for as DupAck
>>>>>> does not decrease FlightSize and in case of an RTO the next ACK comes too
>>>>>> late. This is similar to the rule in RFC 5681 and RFC 6675 that prohibits
>>>>>> including the segments transmitted via Limited Transmit in the
>>>>>> calculation of ssthresh.
>>>>>>
>>>>>> In Section 9.3 a few example scenarios are used to illustrate the
>>>>>> intended operation of RACK-TLP.
>>>>>>
>>>>>>   In the first example a sender has a congestion window (cwnd) of 20
>>>>>>   segments on a SACK-enabled connection.  It sends 10 data segments
>>>>>>   and all of them are lost.
>>>>>>
>>>>>> The text claims that without RACK-TLP the ending cwnd would be 4 segments
>>>>>> due to congestion window validation. This is incorrect.
>>>>>> As per RFC 7661 the sender MUST exit the non-validated phase upon an
>>>>>> RTO. Therefore the ending cwnd would be 5 segments (or 5 1/2 segments if
>>>>>> the TCP sender uses the equation (4) of RFC 5681).
>>>>>>
>>>>>> The operation with RACK-TLP would inevitably result in congestion
>>>>>> collapse if RACK-TLP behaves as described in the example because
>>>>>> it restores the previous cwnd of 10 segments after the fast recovery
>>>>>> and would not react to congestion at all! I think this is not the
>>>>>> intended behavior by this spec but a mistake in the example.
>>>>>> The ssthresh calculated in the beginning of loss recovery should
>>>>>> be 5 segments as per RFC 6675 (and RFC 5681).
>>>>> To clarify, would this text look more clear?
>>>>>
>>>>> 'an ending cwnd set to the slow start threshold of 5 segments (half of
>>>>> the original congestion window of 10 segments)'
>>>>
>>>> This is correct, but please replace:
>>>>
>>>>   (half of the original congestion window of 10 segments)
>>>> -->
>>>>   (half of the original FlightSize of 10 segments)
>>>
>>> sure will do
>>>
>>>>
>>>> cwnd in the example was 20 segments.
>>>>
>>>> Please also correct the ending cwnd for the "without RACK" scenario.
>>>> I pointed out the wrong equation number in RFC 5681 and an incorrect cwnd
>>>> value, my apologies. I meant equation (3), and that results in an ending
>>>> cwnd of 5 and 2/5 MSS (not 5 and 1/2 MSS).
>>>> NB: if a TCP sender implements entering CA when cwnd > ssthresh, then
>>>> the ending cwnd would be 6 and 1/6 MSS.
>>>>
>>>>>>
>>>>>> Furthermore, it seems that this example with RACK-TLP refers to using
>>>>>> PRR_SSRB which effectively implements regular slow start in this
>>>>>> case(?). From congestion control point of view this is correct because
>>>>>> the entire flight of data as well as ack clock was lost.
>>>>>>
>>>>>> However, as correctly discussed in Sec 2, the congestion window must be
>>>>>> reset to 1 MSS when an entire flight of data and the Ack clock are lost.
>>>>>> But how can an implementor know what to do if she/he is not implementing
>>>>>> the experimental PRR algorithm? This spec articulates specifying an
>>>>>> alternative for DupAck counting, indicating that TLP is used to trigger
>>>>>> Fast Retransmit & Fast Recovery only, not a loss recovery in slow start.
>>>>>> This means that without additional advice an implementation of this
>>>>>> spec would just halve the cwnd and ssthresh and send a potentially very
>>>>>> large burst of segments in the beginning of the Fast Recovery because
>>>>>> there is no ack clock. So, this spec begs for advice (MUST) on when to
>>>>>> slow start and reset cwnd and when not, or at least a discussion of
>>>>>> this problem and some sort of advice on what to do and what to avoid.
>>>>>> And, maybe a recommendation to implement it with PRR?
>>>>>
>>>>> It's wise to decouple loss detection (RACK-TLP) vs congestion/burst
>>>>> control (when to slow-start). The use of PRR is just an example to
>>>>> illustrate and not meant for a recommendation.
>>>>
>>>> I understand the use of PRR was just an example, but my point is that if
>>>> one wants to implement RACK-TLP but does not intend to implement PRR but
>>>> RFC 6675 then we do not have a rule in RFC 6675 to correctly implement
>>>> CC for the case when an entire flight is lost and loss is detected with
>>>> TLP. Congestion control principle for this is clear and also stated in
>>>> this draft but IMHO it is not enough to ensure correct implementation.
>>>>
>>>> To my understanding we have implementation experience for RACK-TLP
>>>> only together with PRR, which has the necessary rule to handle this kind
>>>> of scenario correctly.
>>>>
>>>> So, my question is how can one implement CC correctly without PRR in such
>>>> a scenario where the entire inflight is lost?
>>>> Which rule, and where in the RFC series, gives the necessary guidance to
>>>> reset cwnd and slow start when TCP detects loss of an entire flight?
>>>
>>> I think we're going in loops. To move forward it'd help if you suggest
>>> some text you like to change.
>>
>> Unfortunately I do not have an immediate solution. I assumed there is an
>> implementation and you could enlighten what was the solution there and
>> experimental results showing how it seems to work. If I have understood
>> it correctly there is a solution/implementation and experience of RACK-TLP
>> that works together with PRR but no solution/implementation nor experience
>> how it works without PRR. Am I correct?
>
> [RACK-TLP-team:]
>
> The implementation solutions and experiments were mentioned above. Our
> team's experience is with RACK-TLP in combination with either
> CUBIC+PRR, BBRv1, or BBRv2. We are not sure if there are other TCP
> implementations that implement RACK-TLP without PRR.

So, in addition to the lack of experimental evidence using RACK-TLP 
without PRR we have a specification problem for which we have no answer:

The RACK-TLP text suggests that if PTO fires and TLP detects a loss, the 
TCP sender enters fast recovery. But

PTO = timeout that expires if there is no feedback (ACKs) from
       the network, i.e., entire flight of data is lost.

TLP = a single segment that is sent when the timer expires. What is
       different compared to RTO is that PTO may expire faster
       and a different segment is (re)transmitted, which allows
       detecting more than one lost segment with a single Ack w/ SACK
       triggered by TLP (with regular RTO recovery only the next
       unacknowledged segment can reliably be considered lost)

From a congestion control point of view this means that PTO is no different
from RTO. The congestion control principle as stated in Sec 2.2 of this
document tells us that a loss of an entire flight deserves a cautious
response, i.e., reset cwnd and slow start.

When RACK-TLP is implemented with PRR we have appropriate CC actions
specified in RFC 6937 that effectively reset cwnd and force the TCP sender
to slow start, even though it technically enters fast recovery. This is
safe.

However, if an implementer wants to implement RACK-TLP without PRR, 
she/he has to consult RFC 6675 which gives an incorrect CC response for 
this case, resulting in an unsafe implementation. The problem is not in 
the fast recovery algorithm in RFC 6675 because it was not intended to be 
initiated with TLP-kind of loss detection.
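
As a rough illustration of why the response matters, the sketch below
compares the window right after a TLP-detected loss of an entire flight under
two policies: halving as per equation (4) of RFC 5681, and resetting to 1 MSS
as after an RTO. It assumes (hypothetically) that without an ACK clock and
without pacing the sender may emit up to cwnd segments back to back; the
policy names and the flight sizes are made up for illustration.

```python
def cwnd_after_flight_loss(lost_flight: int, policy: str) -> int:
    """Window (in segments) right after loss of a whole flight is detected."""
    if policy == "halve":
        # RFC 5681 equation (4): ssthresh = max(FlightSize / 2, 2 * SMSS)
        return max(lost_flight // 2, 2)
    if policy == "reset":
        # RTO-style response: cwnd = 1 MSS, then slow start
        return 1
    raise ValueError(f"unknown policy: {policy}")


def initial_burst(lost_flight: int, policy: str) -> int:
    """Assumed worst case: no ACK clock, no pacing, so burst == cwnd."""
    return cwnd_after_flight_loss(lost_flight, policy)


for flight in (4, 20, 100):
    print(flight, initial_burst(flight, "halve"), initial_burst(flight, "reset"))
# 4 2 1
# 20 10 1
# 100 50 1
```

Under these assumptions the burst grows linearly with the size of the lost
flight when halving, while it stays at one segment when cwnd is reset.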

So the question is how to ensure an implementer has the necessary advice 
to implement correct congestion control actions for RACK-TLP once this 
document is published. The lack of correct advice becomes even more 
prominent as this document requires (all SACK capable) TCPs to implement 
RACK-TLP.

>> Given my understanding of PRR, the problem of an entire flight being
>> dropped is quite nicely solved with it. However, implementing RACK-TLP
>> without PRR begs for a solution. Here is my current understanding:
>>
>> (1) RACK part is what can be called to "replace" DupAck counting
>> (2) TLP part is effectively a new retransmission timeout to detect
>>      a loss of an entire flight of data (i.e., it is only invoked
>>      when a tail of the current flight becomes lost which equals to
>>      the case of losing an entire flight since all segments before
>>      the tail loss will get cumulatively Acked and hence these segments
>>      are not anymore a part of the current flight at the time the loss
>>      is detected). And we can assume an application limited TCP sender.
>> (3) Current CC principles require resetting cwnd in such a case
>>      and entering slow start (and so effectively does PRR though it is
>>      not explicitly stated in RFC 6937). Slow start avoids a big burst
>>      if the lost flight is big.
>> (4) RACK-TLP would possibly like to allow that cwnd is set to a half
>>      of the lost flight and not to slow start. This means that the bigger
>>      the lost flight is, the bigger is the burst that gets transmitted at
>>      the beginning of the recovery which is bad. So, this approach would
>>      need at least a rule/advice for burst avoidance (slow start, pacing,
>>      ...).
>> (5) When the lost flight is small (<= 4 segments) there is no difference
>>      in recovery efficiency between (3) and (4). If the lost flight is
>>     > 5 segments, then (4) takes less RTTs to complete the recovery
>>      but generates a burst. Note, even if pacing over one SRTT would be
>>      used, it is still a burst.
>>
>> Now, it would be useful to have experimental data to know how the size of
>> the lost flight is distributed. Is it typically just a few segments as
>> illustrated in the examples of the draft or is it often larger?
>>
>> My advice would be to add a rule that cwnd MUST be reset to 1 MSS
>> and the sender MUST enter slow start if TLP detects loss of an entire
>> flight. This would be safe. Otherwise, without experimental evidence from
>> a wide range of different network conditions and workloads it feels
>> unsafe to allow more aggressive approach.
>
> [RACK-TLP-team:]
>
> Our team's position is that the RACK-TLP draft documents a loss
> detection algorithm, and normative declarations about congestion
> control behavior (like your "cwnd MUST be reset to 1 MSS", above)
> should be in a separate document focused on congestion control.

I do not disagree with you when it comes to the role of congestion control 
in the RACK-TLP draft. It may well be in another document, but IMHO such a 
document must be available at the time this document gets published.
How can we otherwise ensure correct implementations?

So this is more of a process question than critique towards the draft.


>>>>> Section 3 has a lengthy section to elaborate the key point of RACK-TLP
>>>>> is to maximize the chance of fast recovery. How C.C. governs the
>>>>> transmission dynamics after losses are detected are out of scope of
>>>>> this document in our authors' opinions.
>>>>>
>>>>>
>>>>>>
>>>>>> Another question relates to the use of TLP and adjusting timer(s) upon
>>>>>> timeout. In the same example discussed above, it is clear that PTO
>>>>>> that fires TLP is just a more aggressive retransmit timer with
>>>>>> an alternative data segment to (re)transmit.
>>>>>>
>>>>>> Therefore, as per RFC 2914 (BCP 41), Sec 9.1, when PTO expires, it is in
>>>>>> effect a retransmission timeout and the timer(s) must be backed-off.
>>>>>> This is not advised in this specification. Whether it is the TCP RTO
>>>>>> or PTO that should be backed-off is an open question.  Otherwise,
>>>>>> if the congestion is persistent and further transmission are also lost,
>>>>>> RACK-TLP would not react to congestion properly but would keep
>>>>>> retransmitting with "constant" timer value because new RTT estimate
>>>>>> cannot be obtained.
>>>>>> On a buffer bloated and heavily congested bottleneck this would easily
>>>>>> result in sending at least one unnecessary retransmission per one
>>>>>> delivered segment which is not advisable (e.g., when there are a huge
>>>>>> number of applications sharing a constrained bottleneck and these
>>>>>> applications are sending only one (or a few) segments and then
>>>>>> waiting for an reply from the peer before sending another request).
>>>>>
>>>>> Thanks for pointing to the RFC.  After TLP, RTO timers will
>>>>> exp-backoff (as usual) for stability reasons mentioned in sec 9.3
>>>>> (didn't find 9.1 relevant).
>>>>
>>>> My apologies for referring to the wrong section of RFC 2914. Yes, I meant
>>>> Sec 9.3.
>>>>
>>>>> In your scenario, you presuppose the
>>>>> retransmission is unnecessary so obviously TLP is not good. Consider
>>>>> what happens without TLP where all the senders fire RTO spuriously and
>>>>> blow up the network. It is equally unfortunate behavior. "bdp
>>>>> insufficient of many flows" is a congestion control problem
>>>>
>>>> If (without TLP) RTO is spurious, it may result in unnecessary
>>>> retransmissions. But we have F-RTO (RFC 5682) and Eifel (RFC 3522) to
>>>> detect and resolve it without TLP, so I don't find it as a problem.
>>>>
>>>> To clarify more, what I am concerned about. Think about a scenario where a
>>>> (narrow) bottleneck becomes heavily congested by a huge number of
>>>> competing senders such that the available capacity per sender is less
>>>> than 1 segment (or << 1 MSS).
>>>> This is a situation that network first enters before congestion collapse
>>>> gets realized. So, it is extremely important that all CC and timer
>>>> mechanisms handle it properly. Regular TCP handles it via RFC 6298 by
>>>> backing off RTO exponentially and keeping this backed-off RTO until a
>>>> new ACK is received for new data. This saves RACK-TLP from full congestion
>>>> collapse. But consider what happens: even though RTO is backed off, each
>>>> time a TCP sender manages to get one segment through (with cwnd = 1 MSS)
>>>> it always first arms PTO with more or less constant value of 2*SRTT. If
>>>> the bottleneck is buffer bloated the actual RTT easily exceeds 2*SRTT and
>>>> TLP becomes spurious. After a spurious TLP, RTO expires (maybe more than
>>>> once before exponential back-off of RTO results in large enough value)
>>>> and a new RTT sample is not received. So, SRTT remains unchanged and even
>>>> if sometimes a new sample is received, SRTT gets very slowly adjusted. As
>>>> a result, each TCP sender would keep on sending a spurious TLP for each
>>>> new segment resulting in at least 50% of the packets being unnecessarily
>>>> rexmitted and the utilization of the bottleneck is < 50%. This would not
>>>> be a full congestion collapse but has unwanted symptoms towards
>>>> congestion collapse (Note: there is no clear line for what level of
>>>> reduction in delivery of useful data is considered as congestion
>>>> collapse).
>>>
>>> AFAIK you are saying: under extreme congestion shared by many short
>>> flows, RACK-TLP can cause more packet losses because of the more
>>> aggressive PTO timer. I agree and can add this to the "section 9.3".
>>
>> No, that was not what I meant. This happens simply when enough flows are
>> competing on the same bottleneck. It may be long flows or maybe
>> applications with more or less continuous request-reply exchanges.
>> E.g., if the bottleneck bit rate is 1 Mbits/s, RTT is 500msecs and PMTU
>> is 1500B, then a bit over 40 simultaneous flows would mean that the
>> bottleneck becomes fully utilized with the equal share of roughly 1 MSS
>> per flow. With a quite typical bottleneck buffer size roughly equaling to
>> BDP, about 80+ flows would fill up the buffer as well and increase
>> RTT >= 1 secs.  A buffer bloated bottleneck buffer would mean even larger
>> RTT and allow more flows sharing the bottleneck without loss.
>>
>> If the number of TCP flows is > 90 (or >> 90) RACK-TLP would end up
>> (almost) always unnecessarily retransmitting each new segment once (PTO
>> being 1 sec or around 1 sec). RTO back-off after a few rounds saves each
>> TCP sender from additional unnecessary rexmits but still ~ 50% of the
>> delivered packets are not making any useful progress.
>>
>> If the bottleneck buffer is small, it would result in more losses as you
>> suggest but would not be that much of a problem because the majority of the
>> unnecessary rexmits would not get delivered over the bottleneck link.
>> Instead, they would get dropped at the bottleneck. Unnecessary rexmits
>> would just create extra load on the path before the bottleneck (which of
>> course is not a non-problem either).
>
> [RACK-TLP-team:]
>
> If we are correctly understanding the scenario you are outlining, the
> PTO would likely not be 1 sec in this case. In this scenario, the RTT
> is around 1sec, so the PTO would be around 2*srtt, which would be
> around 2 secs. With a 1sec RTT, and even considering delayed ACKs, it
> seems very unlikely that a 2 sec PTO would cause a significant number
> of unnecessary TLP probes.

Well. Roughly the first initiated 40 flows will have SRTT ~ 0.5 sec. If 
the next ~ 40 flows arrive within an RTT or so they will get RTT samples 
roughly between 0.5 - 1.x secs. The remaining flows (from 80 up to N 
hundred) will not get an RTT sample and will use the initial/default 
PTO = 1 sec (unless using timestamps). The larger the buffer bloat the 
worse things get.
Once ~200 flows are competing over the bottleneck, RTT will be well 
beyond 2 secs and grows when more flows join, guaranteeing spurious TLP 
for each new segment (unless timestamps are in use).
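
As a rough cross-check of these numbers, here is a back-of-the-envelope 
sketch. It assumes the example figures used in this thread (1 Mbit/s 
bottleneck, 500 ms base RTT, 1500-byte packets), exactly one packet in 
flight per flow, a buffer large enough that nothing is dropped, and a 
simplified PTO of 2*SRTT (or 1 sec when no RTT sample exists). The function 
names are made up for illustration and are not taken from the draft.

```python
# Back-of-the-envelope model of the example bottleneck (all values assumed).
BOTTLENECK_BPS = 1_000_000  # 1 Mbit/s
BASE_RTT = 0.5              # seconds, unloaded path
PKT_BYTES = 1500            # PMTU-sized packet


def bdp_packets():
    """Number of packets that fit 'in the pipe' at the base RTT."""
    return BOTTLENECK_BPS * BASE_RTT / (8 * PKT_BYTES)


def rtt_with_flows(n_flows):
    """RTT when every flow keeps one packet in flight and nothing is dropped."""
    per_packet = 8 * PKT_BYTES / BOTTLENECK_BPS  # serialization time of one packet
    return max(BASE_RTT, n_flows * per_packet)


def pto(srtt=None):
    """Simplified PTO: 2*SRTT, or 1 second when no RTT sample exists yet."""
    return 1.0 if srtt is None else 2 * srtt


def tlp_is_spurious(n_flows, srtt=None):
    """True if the PTO fires before the ACK of the original can arrive."""
    return pto(srtt) < rtt_with_flows(n_flows)


for n in (40, 80, 90, 200):
    print(n, round(rtt_with_flows(n), 2),
          tlp_is_spurious(n), tlp_is_spurious(n, srtt=1.0))
```

Under these assumptions a flow without any RTT sample starts probing 
spuriously at around 90 flows, and a flow whose SRTT is stuck at 1 sec does 
so at around 200 flows.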

> Perhaps your postulated PTO of 1 sec is coming from the pseudocode
> path in TLP_calc_PTO() where, if SRTT is not available, the PTO is 1
> second? However, this case where there is no SRTT only happens if (a)
> the connection is *not* using TCP timestamps (enabled for most
> connections on the public Internet, due to default-enabled support in
> Linux/Android and iOS/MacOS), and (b) the connection has *never* had a
> single RTT sample from a non-retransmitted packet. To meet both of
> those conditions is quite rare in practice. Is this the path that is
> causing your concerns about a PTO of 1 sec?

Sure. Having timestamps helps a lot in case the bottleneck buffer is 
buffer bloated (though RTT sample will not include delayed Ack effect for 
rexmits that are immediately acked). But RACK-TLP does not require 
timestamps and there are a lot of TCP stacks out there in addition to 
Linux/Android and iOS/MacOS. And more are coming with IoT expansion, 
for example. In addition, it is quite surprising how old the (mobile) 
devices are that many people are using, some of which have not been 
eligible for an OS upgrade for many years.

But my quick example scenario was just to illustrate the problem, maybe 
typical in some areas where the only Internet access is over a 
geostationary satellite connection and people are quite likely not using 
mobile/cellular devices. And, I selected a buffer-bloated scenario 
because that generates unnecessary rexmits which are a prerequisite for 
classical congestion collapse.

It may well be in some environments that these hundreds of competing 
flows are application limited (request-reply continuously) and get 
started over a longer period of time, each getting RTT sample(s) over a 
lightly loaded bottleneck and then becoming idle or sending data only 
sporadically. Then later they all become active more or less at the same 
time (maybe by a common trigger). Base RTT may then be much lower than 
500 ms as long as there is enough buffer bloat.

Furthermore, timestamps are not that helpful in case of a heavily 
congested small buffer bottleneck. Having approximately accurate SRTT 
does not alone help in reducing the sending rate below 1 segment/RTT in 
heavily congested scenarios. Exponential backoff of timer is a must.

I believe that neither of us knows what's really out there employing 
the Internet for communication purposes. It does not matter whether it is a 
notable percentage or not as long as there is a group of users who are 
encountering the envisioned problem.
It may very well be a small minority, but if this kind of a problem 
escalates for this group very often (maybe almost always), it means we 
are disabling use of Internet technology for them.

> Furthermore, in cases like this where the number of flows is roughly
> equal to or greater than the BDP of the path, Reno/[RFC5681] and CUBIC
> will spend the majority of their time in loss recovery (the minimum
> ssthresh from [RFC5681] being 2*SMSS).

Sure, there is a surprising amount to improve with TCP when the window needs 
to be kept lower than 2 MSS or when the available share would be even 
less than 1 MSS/RTT. RFC 5681 would exponentially back-off on every RTO, 
leaving opportunity for other flows to make progress during the 
backed-off interval. Once the sender is lucky and gets an Ack for a new 
segment, min ssthresh of 2*MSS unfortunately allows it to accelerate to 
2*MSS/RTT and maybe more until RTO expires again. We cannot avoid some 
drops or unnecessary rexmits. Actually, in such a heavily congested 
scenario the combination of Karn's exponential back-off and slow start 
results in Multiplicative Increase Multiplicative Decrease (MIMD) 
behavior which does not converge that well to fairness. But it avoids 
congestion collapse. It is kind of a tradeoff between safe and stable 
behavior during heavy congestion and being able to ramp up fast when the 
congestion fades away. Even though the current behavior is not ideal 
for fairness, we should not make it even worse.

Currently the only way to avoid spending most of the time in loss 
recovery seems to be ECN, which would make a TCP sender become rate 
controlled once cwnd hits the minimum cwnd of 1 MSS(*). But there is a 
lot to optimise also with ECN in this kind of scenario ...

(*) to avoid anyone repeating the misconception that minimum cwnd is
     2 MSS per equation (4) in RFC 5681: equation (4) sets ssthresh,
     not cwnd. RFC 3168 clearly states that cwnd must be halved on
     an arrival of ECE; this holds until cwnd = 1 MSS. After this,
     the sending rate of new segments is (RTO) timer controlled and
     the timer value exponentially backed-off if more ECEs arrive.
     At least two major OSs implement this incorrectly unless now
     already fixed.
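
To make the footnote concrete, here is a minimal sketch of the ECE 
response as I read it. It is an illustration only: the function name, the 
MSS value and the calling convention are assumptions, and the "at most 
once per RTT" rule for reacting to ECE is left out for brevity.

```python
MSS = 1460  # bytes, illustrative value


def on_ece(cwnd, flight_size, rto):
    """Return (cwnd, ssthresh, rto) after an ACK carrying ECE (sketch)."""
    # RFC 5681 equation (4): the 2*MSS floor applies to ssthresh only.
    ssthresh = max(flight_size // 2, 2 * MSS)
    if cwnd > MSS:
        # Halve cwnd, but never go below 1 MSS.
        cwnd = max(cwnd // 2, MSS)
    else:
        # Already at 1 MSS: sending is now timer controlled, so back off.
        rto *= 2
    return cwnd, ssthresh, rto


cwnd, rto = 4 * MSS, 1.0
for _ in range(4):
    cwnd, ssthresh, rto = on_ece(cwnd, cwnd, rto)
    print(cwnd // MSS, ssthresh // MSS, rto)
```

The point of the sketch is that cwnd reaches 1 MSS while ssthresh stays at 
2 MSS, and that further ECEs only lengthen the timer.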

> And the RACK-TLP ID specifies
> that TLP probes are not sent while in loss recovery.

Maybe I missed it but I couldn't find an end condition of RTO recovery 
for RACK-TLP purposes in the document. Regular RTO does not have such an end 
condition because it is not needed. Congestion control and loss recovery 
are quite decoupled during RTO recovery; slow start follows its own 
rules and ends when cwnd > ssthresh and loss recovery has its own 
rules to select segments to rexmit and just moves on sending new data 
once all lost segments have been retransmitted. Some implementations may 
have such variable/condition but it didn't help me.

Without timestamps, if TLP is not backed-off, a TLP is likely to be sent 
for each new segment (assuming certain conditions). Each not backed-off 
rexmit is potentially stealing an opportunity from another flow to get 
its segment being delivered, thereby contributing to unfair use of 
bottleneck resources.
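
The effect can be illustrated with a toy model. It assumes no timestamps, 
a constant actual RTT of 2.4 sec, an SRTT that starts at 0.5 sec, a 
simplified PTO of 2*SRTT, and an RFC 6298 style SRTT update with gain 1/8. 
It ignores RTO expiry, delayed Acks and real losses, and the function name 
and the clamp_backoff switch are hypothetical, so take it as a sketch of 
the argument and not as a model of any implementation.

```python
def simulate(n_segments, actual_rtt, srtt, clamp_backoff):
    """Count transmissions needed to deliver n_segments one at a time."""
    sent = 0
    backoff = 1
    for _ in range(n_segments):
        pto = 2 * srtt * backoff
        sent += 1                      # original transmission
        if pto < actual_rtt:
            sent += 1                  # spurious TLP probe
            # The Ack is ambiguous (Karn), so no RTT sample is taken.
            if clamp_backoff:
                backoff *= 2           # keep the backed-off timer for the next segment
        else:
            # Unambiguous Ack: take an RTT sample and clear the backoff.
            srtt = 0.875 * srtt + 0.125 * actual_rtt
            backoff = 1
    return sent


fixed = simulate(100, actual_rtt=2.4, srtt=0.5, clamp_backoff=False)
clamped = simulate(100, actual_rtt=2.4, srtt=0.5, clamp_backoff=True)
print("fixed PTO:", fixed, "clamped backoff:", clamped)
```

In this model the fixed timer sends every segment twice and never obtains 
an RTT sample, whereas the clamped variant probes spuriously only a handful 
of times before SRTT has caught up.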

>>> What authors disagree is that RTO must be back-off on the first
>>> instance if TLP is not acked. While your suggestion helps the
>>> congestion case, it may also hurt the recovery in other cases when the
>>> TLP is dropped due to light/burst/transient congestion. Arguing which
>>> scenarios matter more subjectively is not productive.
>>
>> When RACK-TLP is required to be implemented by all TCP stacks, it
>> is extremely important that it always works in a safe way and congestion
>> is always the primary concern to get properly handled. Without arguing
>> more, I just want to point out that TCP must work reasonably for all
>> Internet users. That means it must not generate a situation where even a
>> small minority of the Internet users often or almost always encounter
>> severe problems with their connectivity.
>>
>> For example, also users in developing countries where possibly an entire
>> village shares just one mobile (cellular, maybe 3G possibly only GPRS)
>> connection for their Internet access and pays per amount of data should
>> get reasonable TCP behavior. In such a case, I cannot agree with
>> engineering that results in almost always getting only half of the
>> already scarce bandwidth while paying double price for the useful data.
>>
>> But thinking about a way forward. Karn's algorithm would require backing
>> off PTO even if you get Ack of PTO. Relaxing this does not sound that
>> bad to me at all, because there is often a pause before TLP and a TCP sender
>> gets feedback, so apparently conditions are not that bad and loss
>> recovery gets triggered (hopefully in slow start).
>> If TLP is not acked, RTO is needed and recovery is completed in using RTO
>> recovery in slow start. Now, if the RTO recovery is successful (no
>> losses during RTO recovery), it should be quite likely that the TCP
>> sender is able to successfully send also one new segment, because it
>> enters CA when it is sending at the half of the previous rate. So, once
>> you get Ack for the new segment, the TLP back off can be removed and it
>> is unlikely that the TLP back off slowed down next loss detection.
>> On the other hand, if the RTO recovery after an unsuccessful TLP is not
>> successful (more losses are detected), it is quite likely that congestion
>> has not been resolved. So, it is important to be conservative and have a
>> backed-off PTO (or even turn it off) to avoid (further) unnecessary
>> rexmits. If PTO is not backed off, I'd envision PTO mainly failing to get
>> an Ack in such a case and thereby it not being that useful.
>>
>> This certainly would require experimental data in a heavily congested
>> setting to really figure out the actual impact of different alternatives.
>>
>>> So the question we need to look at is if RACK-TLP 2RTT-PTO + regular
>>> RTO-backoff is going to cause major stability issues in extreme
>>> congestion. My understanding based on my officemate Van Jacobson's
>>> explanation is, as long as there's eventual exponential backoff, we'll
>>> avoid the repeated shelling the network.
>>
>> Right. But the problem is that with TLP as specified we do not have full
>> exponential back off. It lacks Karn's clamped retransmit backoff (as it
>> is called in Van Jacobson's seminal paper) which requires keeping
>> backed-off timer of a retransmitted segment for the next (new) segment and
>> ensures that there is eventual exponential backoff.
>> Backing off just RTO is not enough, because "fixed" PTO for each new
>> segment breaks this "back-off chain", that is, the exponential backoff
>> is not continued until there is evidence that congestion has been
>> resolved (a cumulative Ack arrives for new data). But as I said, backing
>> off RTO saves RACK-TLP from a full congestion collapse. Still, wasting
>> 50% of the available network capacity in certain usage scenarios does not
>> sound acceptable for me.
>
> [RACK-TLP-team:]
>
> To address concerns related to Karn's algorithm, perhaps it would be
> sufficient to specify that after a RACK-TLP sender sends a TLP probe
> that is a retransmitted segment, it temporarily disables further TLP
> retransmissions until the sender obtains a valid RTT sample (i.e. from
> a non-retransmitted segment, or a segment with [RFC7323] TCP
> timestamps)?

As a basic rule: disabling TLP probes for the period of loss recovery as 
currently specified sounds like a correct approach provided that the end 
condition

  "until the sender obtains a valid RTT sample"

is added (this would be without timestamps, see some considerations with 
timestamps later below).

Then some justification and maybe some exceptions can also be specified 
(all these without timestamps):

If fast recovery is not entered via detecting loss using TLP, the 
sender does not need to wait for a valid RTT sample. It should be enough 
that RecoveryPoint is reached, i.e., use normal exit condition for fast 
recovery. However, I am not sure whether detecting a loss of a rexmit 
during fast recovery might change this?

I think it does not matter whether a TLP probe is a retransmitted segment
or not:

(a) if there is no ack after TLP probe and RTO expires it should be clear 
that TLP must be disabled until an ack for a new previously not rexmitted 
segment is received, regardless of the probe segment sent by TLP.
This is Karn's algorithm.

(b) if TLP triggers an Ack that is used to detect a loss of an entire 
flight of data, this may later during loss recovery result in an RTO 
which definitely requires disabling TLP as per the basic rule. There 
might be some exceptions when the basic rule is not necessary but it is 
hard for me to figure one right now, except

(c) if TLP rexmit detects and repairs a single loss, then CC is anyway 
invoked and Ack is received. If the arriving Ack is triggered by the TLP 
probe segment, it indicates that the conditions are likely not that bad 
(RTO not needed). So it might be ok to rearm PTO for the next time 
without any backoff (without disabling it). However,

(d) if TLP was spurious rexmit, then PTO value obviously was too low 
(effective RTT may have increased suddenly and Acks got just delayed). 
This is possibly detected only one RTT later when the DupAck of the TLP 
rexmit arrives. TLP may have got rearmed by that time again with too 
low value, so it should be rearmed with doubled value to avoid another 
spurious TLP PTO (or disabled).

Then with timestamps. I think the simple rule for disabling TLP "until 
the sender obtains a valid RTT sample" is not necessarily giving the 
desired outcome in all cases. For example, suppose a TLP probe is sent and 
then a late ack arrives for a segment sent before the TLP probe and gives 
a valid RTT sample with the help of timestamps. This would allow a new TLP 
even though the sender has not got an Ack for a new segment sent after 
the TLP probe?

So, also with timestamps the requirement must be that the sender obtains 
a valid RTT sample for a segment sent after the TLP probe. And, at least, 
if RTO recovery is entered, then the Ack must be for a new segment.
Cannot figure out now all potential exceptions when a valid RTT sample 
with timestamps would allow re-enabling TLP earlier. Too complex 
algorithm with too many possible scenarios to investigate right now ;)

A single, simple rule would be advisable here. Otherwise, this may end up 
being too complex to get it right.
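
One possible shape of such a single rule, as a minimal sketch: after a TLP 
probe, no new probe is armed until a valid RTT sample arrives for a segment 
sent after the probe, and if an RTO fired in between, that segment must 
also carry new data. The class and method names are hypothetical and not 
taken from the draft or from any implementation.

```python
class TlpGate:
    """Hypothetical helper deciding whether a new TLP probe may be armed."""

    def __init__(self):
        self.probe_time = None   # send time of the last unresolved TLP probe
        self.rto_fired = False   # did an RTO expire after that probe?

    def on_probe_sent(self, now):
        self.probe_time = now
        self.rto_fired = False

    def on_rto(self):
        if self.probe_time is not None:
            self.rto_fired = True

    def on_rtt_sample(self, seg_sent_at, seg_is_new_data):
        """Called for every valid RTT sample (with or without timestamps)."""
        if self.probe_time is None:
            return
        if seg_sent_at <= self.probe_time:
            return               # late Ack for a segment sent before the probe
        if self.rto_fired and not seg_is_new_data:
            return               # after an RTO only an Ack for new data counts
        self.probe_time = None   # evidence obtained: TLP may be armed again
        self.rto_fired = False

    def tlp_allowed(self):
        return self.probe_time is None


gate = TlpGate()
gate.on_probe_sent(now=10.0)
gate.on_rtt_sample(seg_sent_at=9.5, seg_is_new_data=True)   # late Ack, ignored
print(gate.tlp_allowed())
gate.on_rtt_sample(seg_sent_at=10.5, seg_is_new_data=True)  # sent after the probe
print(gate.tlp_allowed())
```

The exceptions discussed in (c) and (d) above are deliberately not 
modelled; the sketch only covers the conservative basic rule.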


>>> As a matter of fact some
>>> major TCP (!= Linux) implementation has implemented linear backoff of
>>> first N RTOs before exp-backoff.
>>
>> But that's quite ok as long as there is eventual exponential backoff,
>> including Karn's clamped retransmit backoff. Linear back-off in the
>> beginning just makes resolving (heavy) congestion a bit slower.
>>
>>>>
>>>>>>
>>>>>> Additional notes:
>>>>>>
>>>>>> Sec 2.2:
>>>>>>
>>>>>> Example 2:
>>>>>> "Lost retransmissions cause a  resort to RTO recovery, since
>>>>>>   DUPACK-counting does not detect the loss of the retransmissions.
>>>>>>   Then the slow start after RTO recovery could cause burst losses
>>>>>>   again that severely degrades performance [POLICER16]."
>>>>>>
>>>>>> RTO recovery is done in slow start. The last sentence is confusing as
>>>>>> there is no (new) slow-start after RTO recovery (or more precisely
>>>>>> slow start continues until cwnd > ssthresh). Do you mean: if/when slow
>>>>>> start still continues after RTO Recovery has repaired lost segments,
>>>>>> it may cause burst losses again?
>>>>> I mean the slow start after (the start of) RTO recovery. HTH
>>>>
>>>> Tnx. I'd appreciate if the text could be clarified to reflect this more
>>>> accurately. Maybe something along the lines(?):
>>>>
>>>>   "Then the RTO recovery in slow start could cause burst
>>>>   losses again that severely degrades performance [POLICER16]."
>>>>
>>>>>>
>>>>>> Example 3:
>>>>>>   "If the reordering degree is beyond DupThresh, the DUPACK-
>>>>>>    counting can cause a spurious fast recovery and unnecessary
>>>>>>    congestion window reduction.  To mitigate the issue, [RFC4653]
>>>>>>    adjusts DupThresh to half of the inflight size to tolerate the
>>>>>>    higher degree of reordering.  However if more than half of the
>>>>>>    inflight is lost, then the sender has to resort to RTO recovery."
>>>>>>
>>>>>> This seems to be somewhat incorrect description of TCP-NCR specified in
>>>>>> RFC 4653. TCP-NCR uses Extended Limited Transmit that keeps on sending
>>>>>> new data segments on DupAcks that makes it likely to avoid an RTO in
>>>>>> the given example scenario, if not too many of the new data
>>>>>> segments triggered by Extended Limited Transmit are lost.
>>>>> sorry I don't see how the text is wrong describing RFC4653,
>>>>> specifically the algorithm in adjusting ssthresh
>>>>
>>>> To my understanding RFC4653 initializes DupThresh to half of the inflight
>>>> size in the beginning of the Extended Limited Transmit. Then on each
>>>> DupAck it adjusts (recalculates) DupThresh again such that ideally a cwnd
>>>> worth of DupAcks are received before packet loss is declared (or
>>>> reordering detected). So, if I am not incorrect, loss of a half of the
>>>> inflight does not necessarily result in RTO recovery with TCP-NCR.
>>> Could you suggest the text you'd like on NCR description.
>>
>> I'm not an expert nor closely acquainted with NCR. There might be many
>> different packet loss patterns that may affect the behavior. So, my
>> advice is to simply drop the last sentence starting with "However ...",
>> because it seems incorrect and replace the second but last sentence:
>>
>>    To mitigate the issue, [RFC4653]
>>    adjusts DupThresh to half of the inflight size to tolerate the
>>    higher degree of reordering.
>>
>> -->
>>
>>    To mitigate the issue, TCP-NCR [RFC4653]
>>    increases the DupThresh from the current fixed value of three duplicate
>>    ACKs [RFC5681] to approximately a congestion window of data having left
>>    the network.
>
> [RACK-TLP-team:]
>
> Sure, we will revise accordingly.

Tnx.

>
>>>>>>
>>>>>> Sec. 3.5:
>>>>>>
>>>>>>   "For example, consider a simple case where one
>>>>>>   segment was sent with an RTO of 1 second, and then the application
>>>>>>   writes more data, causing a second and third segment to be sent right
>>>>>>   before the RTO of the first segment expires.  Suppose only the first
>>>>>>   segment is lost.  Without RACK, upon RTO expiration the sender marks
>>>>>>   all three segments as lost and retransmits the first segment.  When
>>>>>>   the sender receives the ACK that selectively acknowledges the second
>>>>>>   segment, the sender spuriously retransmits the third segment."
>>>>>>
>>>>>> This seems incorrect. When the sender receives the ACK that selectively
>>>>>> acknowledges the second segment, it is a DupAck as per RFC 6675 and does
>>>>>> not increase cwnd and cwnd remains as 1 MSS and pipe is 1 MSS. So, the
>>>>>> rexmit of the third segment is not allowed until the cumulative ACK of
>>>>>> the first segment arrives.
>>>>> I don't see where RFC6675 forbids growing cwnd. Even if it does, I
>>>>> don't think it's a good thing (in RTO-slow-start) as DUPACK clearly
>>>>> indicates a delivery has been made.
>>>>
>>>> SACKed sequences with DupAcks indicate that those sequences were
>>>> delivered but it does not tell when they were sent. The basic principle
>>>> of slow start is to reliably determine the available network capacity
>>>> during slow start. Therefore, slow start must ensure it uses only
>>>> segments sent during the slow start to increase cwnd. Otherwise, a TCP
>>>> sender may encounter exactly the problem of unnecessary retransmission
>>>> envisioned in this example of RACK-TLP draft (and increase cwnd on not
>>>> valid Acks).
>>>>
>>>> RFC 6675 does re-specify DupAck with SACK option but it does not include
>>>> the rule for slow start. Slow start is specified in RFC 5681. It is
>>>> crystal clear in allowing increase in cwnd only on cumulative
>>>> Acks, i.e., forbidding to increase cwnd on DupAcks (RFC 5681, Sec 3,
>>>> page 6:
>>>>
>>>>    During slow start, a TCP increments cwnd by at most SMSS bytes for
>>>>    each ACK received that cumulatively acknowledges new data.
>>>>
>>>> Maybe this example in the RACK-TLP draft was inspired by an incorrect
>>>> implementation of SACK-based loss recovery?
>>>>
>>>> FYI: when we were finalizing RFC 6675 I suggested including also an
>>>> algorithm for RTO recovery with SACK in RFC 6675. The reason was exactly
>>>> that it might be not easy to gather info from multiple documents and
>>>> hence help the implementor to have all necessary advice in a single
>>>> document. This unfortunately did not get realized though.
>>>
>>> I am honestly lost quibbling these RFC6675 implementations and feeling
>>> pedantic to these standards largely serve as guiding principles
>>> instead of line-by-line code. Is cwnd one packet bigger or smaller in
>>> this example making any different in advancing the Internet's
>>> capability to do better loss detection. I do not think so.
>>
>> Sorry, I don't understand what was quibbling here. I believe I did not
>> argue anything about cwnd size with this example at hand in Sec. 3.5.
>> My point is that it describes incorrect slow start behavior. Correctly
>> implemented slow start does not have the problem illustrated in the
>> example.
>>
>>> At this point please suggest a text you like to change.
>>
>> That is simple. Please remove the description of the behavior without
>> RACK. Or correct it: the behavior would be exactly the same as with RACK,
>> but the reason for not unnecessarily retransmitting third segment can be
>> described to be different.
>> And, please correct also the description with RACK. With RACK, third
>> segment does not get unnecessarily rexmitted for the reason already
>> indicated and also because cwnd=1 MSS. And, no new segments are allowed
>> either by cwnd=1.
>>
>
> [RACK-TLP-team:]
>
> Thanks. Perhaps it makes sense to consider a slightly different RTO
> scenario to illustrate the point about how RACK RTO behavior can avoid
> spurious retransmissions.
>
> Here is the proposed new text:
>
> "3.5.  An Example of RACK-TLP in Action: RTO
>
>   In addition to enhancing fast recovery, RACK improves the accuracy of
>   RTO recovery by reducing spurious retransmissions.
>
>   Without RACK, upon RTO timer expiration the sender marks all the
>   unacknowledged segments lost.  This approach can lead to spurious
>   retransmissions.  For example, consider a simple case where one
>   segment was sent with an RTO of 1 second, and then the application
>   writes more data, causing a second and third segment to be sent right
>   before the RTO of the first segment expires.  Suppose none of the
>   segments were lost.  Without RACK, if there is a spurious RTO then
>   the sender marks all three segments as lost and retransmits the
>   first segment. If the ACK for the original copy of the first segment
>   arrives right after the spurious RTO retransmission, then the
>   sender continues slow start and spuriously retransmits the second
>   and third segments, since it (erroneously) presumed they are lost.

That seems fine. Would it be useful to add an additional note saying that 
if F-RTO is in use, then further unnecessary retransmissions can be 
avoided?

>   With RACK, upon RTO timer expiration the only segment automatically
>   marked lost is the first segment (since it was sent an RTO ago); for
>   all the other segments RACK only marks the segment lost if at least
>   one round trip has elapsed since the segment was transmitted.
>   Consider the previous example scenario, this time with RACK.  With
>   RACK, when the RTO expires the sender only marks the first segment as
>   lost, and retransmits that segment.  The other two very recently sent
>   segments are not marked lost, because they were sent less than one
>   round trip ago and there were no ACKs providing evidence that they
>   were lost. Upon receiving the ACK for the RTO retransmission the
>   RACK sender would not yet retransmit the second or third segment, but
>   rather would rearm the RTO timer and wait for a new RTO interval to elapse
>   before marking the second or third segments as lost.
> "
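
To make the difference between the two marking rules in the proposed text
concrete, here is a minimal Python sketch. It only illustrates the text as
quoted and is not the draft's pseudocode: the Segment type, the function
names, the 0.99 s send time for the second and third segments, and the 0.2 s
RTT estimate are all my assumptions, and the reordering window is ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    xmit_ts: float  # time the segment was (last) transmitted, in seconds


def lost_without_rack(segments):
    """Classic RTO: every unacknowledged segment is presumed lost."""
    return [seg.name for seg in segments]


def lost_with_rack(segments, now, rtt):
    """Rule from the proposed text: the first unacknowledged segment is
    always marked lost; any other segment only if at least one RTT has
    elapsed since it was transmitted."""
    return [
        seg.name
        for i, seg in enumerate(segments)
        if i == 0 or now - seg.xmit_ts >= rtt
    ]


# Assumed timeline: S1 sent at t=0 with a 1 s RTO; S2 and S3 sent at 0.99 s.
flight = [Segment("S1", 0.0), Segment("S2", 0.99), Segment("S3", 0.99)]
rto_fires_at = 1.0
rtt_estimate = 0.2  # assumed value, not taken from the draft

print(lost_without_rack(flight))
print(lost_with_rack(flight, rto_fires_at, rtt_estimate))
```

Under these assumptions the first call reports all three segments as lost,
while the second reports only S1, which is the behavior the proposed text
describes.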

This behavior with RACK-TLP does not quite match my reading of the draft 
in all respects, so I have some questions.

First, wouldn't the PTO timer initially be running for the first segment, 
and when the sender sends the 2nd and 3rd segments, wouldn't it rearm the 
PTO for the 3rd segment? My understanding based on Sec 8 is that the RTO 
cannot be armed until the PTO fires. And when the PTO fires, wouldn't the 
sender retransmit the 3rd segment? Maybe I am missing something?

Anyways, an RTO timer may expire in this scenario (e.g., if there is no 
Ack for TLP).
Let's assume the RTO timer expires as described. When the first ACK after 
the rexmit of the 1st segment arrives, isn't that the late ACK for the 
original copy of the first segment, not the ACK for the RTO 
retransmission? Anyways, when this first ACK arrives after the RTO, the 
second and third segments are not marked as lost, as described, which 
nicely prevents unnecessary retransmissions of those segments.
However, if the 2nd and 3rd segments were lost, wouldn't rearming the RTO 
with an exponentially backed-off timer value (2*RTO) make the recovery of 
the 2nd and 3rd segments very slow? After 2*RTO the timer expires and 
the 2nd segment is rexmitted, and only one additional RTT later, when the 
ACK for the 2nd segment arrives, does the sender rexmit the 3rd segment?
Wouldn't it be much more efficient to disable RACK loss detection during 
an RTO recovery and employ F-RTO? Maybe I am missing something again?
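
To put rough numbers on this concern, here is a small sketch of the
timeline as I read it. Only the 1 second RTO comes from the example; the
100 ms RTT, the backoff factor of 2, and the function name are assumptions
for illustration, and the sketch simply encodes the sequence of events
described in the question above.

```python
def recovery_timeline_ms(rto_ms, rtt_ms, backoff=2):
    """Times, measured from the first ACK after the RTO retransmission,
    at which segments 2 and 3 are retransmitted if both were lost and the
    sender only rearms a backed-off RTO timer."""
    seg2_rexmit = backoff * rto_ms      # backed-off timer expires
    seg3_rexmit = seg2_rexmit + rtt_ms  # ACK for segment 2 arrives one RTT later
    return seg2_rexmit, seg3_rexmit


seg2_at, seg3_at = recovery_timeline_ms(rto_ms=1000, rtt_ms=100)
print(seg2_at, seg3_at)
```

With these values the 2nd segment is rexmitted 2000 ms and the 3rd segment
2100 ms after that first ACK, i.e., about 20 RTTs to repair two segments.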


Then to the problem with the pseudocode I mentioned at the beginning of 
this message. If I am not mistaken, the value assigned to TLP.end_seq in 
TLP_send_probe() is off by one byte. Maybe something similar is wrong 
somewhere else too; I didn't check. My apologies if I missed something.

Let's use an example with RACK-TLP recovery:

Send P1, ..., P9, P10
[P1,.., P9 delayed, P10 dropped]
/* I believe P1-9 delayed is not important here */

After 2*SRTT PTO expires
P10 rexmitted by TLP
  [ TLP_send_probe(): set TLP.end_seq = SND.NXT (= SEG.P11) ]

  [after TLP, the  application possibly provides more data to send]

Ack for original P1 arrives
(P11 sent, if more data by application)
Ack for original P2 arrives
(P12 sent, if more data by application)
...
Ack for original P9 arrives
(P19 sent, if more data by application)

Ack for P10 triggered by TLP rexmit of P10 arrives
  [TLP_process_ack(ACK):
   ...
   If ACK's ack. number > TLP.end_seq
    returns False
    since ACK's ack. number = SEG.ACK = TLP.end_seq = SEG.P11

   Loss of P10 is undetected and no reaction to congestion]
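
The trace above can be reproduced with a small Python sketch. It models
only the single comparison quoted in the trace, not the full
TLP_process_ack() logic of the draft. The 1000-byte segment size, the
helper names, and the variant that sets TLP.end_seq one byte lower are my
assumptions for illustration.

```python
MSS = 1000  # assumed segment size in bytes, for illustration only


def seq_start(n):
    """First sequence number of segment Pn, with P1 starting at 0."""
    return (n - 1) * MSS


def loss_detected(ack_number, tlp_end_seq):
    """The one comparison from the trace: ACK's ack. number > TLP.end_seq."""
    return ack_number > tlp_end_seq


# P1..P10 have been sent when the probe goes out, so SND.NXT is the
# first byte of P11.
snd_nxt_at_probe = seq_start(11)

# The ACK triggered by the TLP rexmit of P10 cumulatively acknowledges
# everything up to, but not including, the first byte of P11.
ack_for_probe = seq_start(11)

# As in the trace: TLP.end_seq = SND.NXT
as_written = loss_detected(ack_for_probe, snd_nxt_at_probe)

# Hypothetical variant: TLP.end_seq is the last byte covered by the probe
one_byte_lower = loss_detected(ack_for_probe, snd_nxt_at_probe - 1)

print(as_written, one_byte_lower)
```

With TLP.end_seq = SND.NXT the comparison is False and the loss of P10 goes
unnoticed; with the value one byte lower it is True.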


My understanding of the draft text is that a loss of P10 should not be concealed.
Please advise.

Best regards,

/Markku

> best regards,
>
> neal
>
>> Please remove also the last paragraph of the Sec. 3.5, it being also
>> incorrect description of behavior.
>>
>> BR,
>>
>> /Markku
>>
>>>>
>>>> BR,
>>>>
>>>> /Markku
>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> /Markku
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 16 Nov 2020, The IESG wrote:
>>>>>>
>>>>>>>
>>>>>>> The IESG has received a request from the TCP Maintenance and Minor Extensions
>>>>>>> WG (tcpm) to consider the following document: - 'The RACK-TLP loss detection
>>>>>>> algorithm for TCP'
>>>>>>>  <draft-ietf-tcpm-rack-13.txt> as Proposed Standard
>>>>>>>
>>>>>>> The IESG plans to make a decision in the next few weeks, and solicits final
>>>>>>> comments on this action. Please send substantive comments to the
>>>>>>> last-call@ietf.org mailing lists by 2020-11-30. Exceptionally, comments may
>>>>>>> be sent to iesg@ietf.org instead. In either case, please retain the beginning
>>>>>>> of the Subject line to allow automated sorting.
>>>>>>>
>>>>>>> Abstract
>>>>>>>
>>>>>>>
>>>>>>>   This document presents the RACK-TLP loss detection algorithm for TCP.
>>>>>>>   RACK-TLP uses per-segment transmit timestamps and selective
>>>>>>>   acknowledgements (SACK) and has two parts: RACK ("Recent
>>>>>>>   ACKnowledgment") starts fast recovery quickly using time-based
>>>>>>>   inferences derived from ACK feedback.  TLP ("Tail Loss Probe")
>>>>>>>   leverages RACK and sends a probe packet to trigger ACK feedback to
>>>>>>>   avoid retransmission timeout (RTO) events.  Compared to the widely
>>>>>>>   used DUPACK threshold approach, RACK-TLP detects losses more
>>>>>>>   efficiently when there are application-limited flights of data, lost
>>>>>>>   retransmissions, or data packet reordering events.  It is intended to
>>>>>>>   be an alternative to the DUPACK threshold approach.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The file can be obtained via
>>>>>>> https://datatracker.ietf.org/doc/draft-ietf-tcpm-rack/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> No IPR declarations have been submitted directly on this I-D.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>