Re: [tcpm] [Last-Call] Last Call: <draft-ietf-tcpm-rack-13.txt> (The RACK-TLP loss detection algorithm for TCP) to Proposed Standard

Markku Kojo <kojo@cs.helsinki.fi> Tue, 08 December 2020 18:06 UTC

Date: Tue, 08 Dec 2020 20:06:43 +0200
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Michael Welzl <michawe@ifi.uio.no>
cc: Yuchung Cheng <ycheng@google.com>, "tcpm@ietf.org Extensions" <tcpm@ietf.org>, draft-ietf-tcpm-rack@ietf.org, Michael Tuexen <tuexen@fh-muenster.de>, draft-ietf-tcpm-rack.all@ietf.org, last-call@ietf.org, tcpm-chairs <tcpm-chairs@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/6A_vpePLkWcfQCpw9ZzatmHqavg>

Hi Michael,

I am top posting, too (as this is quite a short answer to one point only).

Thanks for the suggestion. Using pipe was also among the first options I 
considered. Unfortunately, it is not that usable either, if you wish the 
outcome to be "halving again" as you envisioned.

In the example scenario, RACK-TLP first declares segments P1..P19 lost and 
segment P20 SACKed (my quick example incorrectly marked segment P20 lost, 
sorry). By the time the loss of the P1 retransmission is detected, P2 and 
P3 have been marked SACKed. This means that at this point pipe is empty, 
so it would result in resetting cwnd (= 1), which is safe and the same as 
when detecting the lost rexmit with RTO, but not what you are aiming for 
(halving again).
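
To spell that out with the pipe estimator of RFC 6675 (my rough 
accounting, one segment = one SMSS): pipe counts outstanding octets that 
are neither SACKed nor marked lost, plus retransmitted octets not 
themselves marked lost. At the point where the P1 retransmission is 
declared lost:

   P1:       marked lost, its rexmit now also marked lost  -> 0
   P2, P3:   SACKed                                        -> 0
   P4..P19:  marked lost                                   -> 0
   P20:      SACKed                                        -> 0
   --------------------------------------------------------------
   pipe = 0

so cwnd = ssthresh = pipe / 2 collapses to the minimum rather than giving 
the "half of half" you were after.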

Cheers,

/Markku

On Mon, 7 Dec 2020, Michael Welzl wrote:

> Hi all,
>
> thanks for this discussion, and Markku for this long email, I found it very interesting to read.  Sorry for top posting but:
>
> Sent from my iPhone
> :-)
>
> ....   so, just a comment about the backoff upon a lost rexmit in recovery: I'd say that halving again is the right thing to do, as it's certainly in the spirit of the congestion avoidance & control paper which, iirc, motivates halving ssthresh with an exponential backoff. this just allows to truly back off exponentially instead of halving once, followed by cwnd=1.
>
> regarding how to do it because flightsize may be wrong: perhaps i'm missing something, i'm not writing this with rfcs and rack-tlp open on the side....  but: surely rack-tlp also still has the "pipe" variable?  wouldn't cwnd = ssthresh = pipe / 2 be the right thing to do?
>
> cheers,
> michael
>
>> On 7 Dec 2020, at 17:07, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>>
>> Hi Yuchung,
>>
>> thanks for your reply. My major point is that IMHO this RACK-TLP specification should give the necessary advice w.r.t. congestion control in cases where such advice is not available in the RFC series or is not that easy to interpret correctly from the existing RFCs.
>>
>> Even though the CC principles are available in the RFC series, I believe you agree with me that getting detailed CC actions correct is sometimes quite hard, especially if one needs to interpret and apply them from another spec than the one being implemented. Please see more details inline below.
>>
>> In addition, we need to remember that this document is not meant only for TCP experts following the tcpm list with a deep understanding of congestion control, but also for those implementing TCP for the very first time, for various appliances, for example. They do not have first-hand experience in implementing TCP congestion control and deserve clear advice on what to do.
>>
>>> On Fri, 4 Dec 2020, Yuchung Cheng wrote:
>>>
>>>> On Fri, Dec 4, 2020 at 5:02 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I know this is a bit late, but I didn't have time earlier to take a look at
>>>> this draft.
>>>>
>>>> Given that this RFC-to-be is standards track and RECOMMENDED to replace
>>>> the current DupAck-based loss detection, it is important that the spec is
>>>> clear on its advice to those implementing it. The current text seems to
>>>> lack important advice w.r.t. congestion control, and even though
>>>> the spec tries to decouple loss detection from congestion control
>>>> and does not intend to modify existing standard congestion control,
>>>> some of the examples advise incorrect congestion control actions.
>>>> Therefore, I think it is worth correcting the mistakes and taking
>>>> yet another look at a few implications of this specification.
>>> As you noted, the intention is to decouple the two as much as possible
>>>
>>> Unlike 20 years ago, when TCP loss detection and congestion
>>> control were essentially glued together in one piece, the decoupling of the two
>>> (including modularizing congestion controls in implementations) has
>>> helped fuel many great inventions of new congestion controls.
>>> Codifying so-called default C.C. reactions in the loss detection is a
>>> step backward that the authors try their best to avoid.
>>
>> While I fully agree with the general principle of decoupling loss detection from congestion control when it is possible without leaving open questions, I find it hard to get congestion control right with this spec in certain cases I raised, just by following the current standards-track CC specifications. The reason for this is that RACK-TLP introduces new ways to detect loss (i.e., ways not present in any earlier standards-track RFC), and the current CC specifications do not provide correct CC actions for such cases, as I try to point out below.
>>
>>> To keep the
>>> document less "abstract / unclear", as many WGLC reviewers commented,
>>> we use examples to illustrate, and those include CC actions. But the
>>> details of these CC actions are likely to become obsolete as CC
>>> hopefully continues to advance.
>>
>> Agreed. But I would appreciate it if the CC actions in the examples correctly followed what is specified in the current CC RFCs. And I would suggest explicitly citing the RFC(s) that each of the examples is illustrating, so that there is no doubt which CC variant the example is valid with. Then there is no problem with the correctness of the example either, even if the cited RFC is later obsoleted.
>>
>>>>
>>>> Sec. 3.4 (and elsewhere when discussing recovering a dropped
>>>> retransmission):
>>>>
>>>> It is very useful that RACK-TLP allows for recovering dropped rexmits.
>>>> However, it seems that the spec ignores the fact that loss of a
>>>> retransmission is a loss in a successive window, which requires reacting
>>>> to congestion twice as per RFC 5681. This advice must be included in
>>>> the specification because with RACK-TLP the recovery of a dropped rexmit
>>>> takes place during fast recovery, which is very different
>>>> from the other standard algorithms and therefore easy to miss
>>>> when implementing this spec.
>>>
>>> per RFC5681 sec 4.3 https://tools.ietf.org/html/rfc5681#section-4.3
>>> "Loss in two successive windows of data, or the loss of a
>>> retransmission, should be taken as two indications of congestion and,
>>>  therefore, cwnd (and ssthresh) MUST be lowered twice in this case."
>>>
>>> RACK-TLP is a loss detection algorithm. RFC5681 is crystal clear on
>>> this so I am not sure what clause you suggest to add to RACK-TLP.
>>
>> Right, this is the CC *principle* in RFC 5681 I am referring to, but I am afraid it is not enough to help one correctly implement such lowering of cwnd (and ssthresh) twice when a loss of a retransmission is detected during Fast Recovery. Nor do the RFCs clearly advise *when* this reduction must take place.
>>
>> Congestion control principles tell us that congestion must be reacted to immediately when detected. But at the same time, standards-track CC specifications react to congestion only once during Fast Recovery, because the losses in the current window, which Fast Recovery repairs, occurred during the same RTT. That is, the current CC specifications do not handle lost rexmits during Fast Recovery; instead, the correct CC reaction to the loss of a rexmit is automatically achieved by those specifications via the RTO, when cwnd is explicitly reset upon RTO.
>>
>> Furthermore, the problem I am trying to point out is that there is no correct rule/formula available in the standards track RFCs that would give the correct way to reduce cwnd when the loss of rexmit is detected with RACK-TLP.
>>
>> I suggest everyone reading this message pause at this point and figure out for themselves what they think would be the correct equation to use, from the standards RFC series, to find the new halved value for cwnd when RACK-TLP detects the loss of a rexmit during Fast Recovery. I would appreciate a comment on the tcpm list from those who think they found the correct answer immediately and easily, along with the formula to use.
>>
>> ...
>>
>> I think the best advice one may find by looking at RFC 6675 (and RFC 5681) is to set
>>
>> ssthresh = cwnd = (FlightSize / 2) (RFC 6675, Sec 5, algorithm step 4.2)
>>
>> Now, let's modify the example in Sec 3.4 of the draft:
>>
>> 1. Send P0, P1, P2, ..., P20
>>  [Assume P1, ..., P20 dropped by network]
>>
>> 2.   Receive P0, ACK P0
>> 3a.  2RTTs after (2), TLP timer fires
>> 3b.  TLP: retransmits P20
>> ...
>> 5a.  Receive SACK for P20
>> 5b.  RACK: marks P1, ..., P20 lost
>>     set cwnd=ssthresh=FlightSize/2=10
>> 5c.  Retransmit P1, P2 (and some more depending on how CC implemented)
>>     [P1 retransmission dropped by network]
>>
>> 6.   Receive SACK for P2 & P3
>> 7a.  RACK: marks P1 retransmission lost
>>     As per RFC 6675: set cwnd=ssthresh=FlightSize/2=20/2=10
>> 7b.  Retransmit P1
>>     ...
>>
>> So, is the new value of cwnd (10 MSS) correct and halved twice? If not, where is the correct formula for doing it?
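>>
>> (To make the arithmetic explicit, under my reading of FlightSize in RFC 5681, i.e., data sent but not yet cumulatively acknowledged:
>>
>>   at step 5b: FlightSize = P1..P20 = 20 segments -> ssthresh = 10
>>   at step 7a: FlightSize = P1..P20 = 20 segments -> ssthresh = 10
>>
>> The SACKed segments do not reduce FlightSize because nothing has been cumulatively ACKed yet, so the "second halving" never happens: the sender ends up with the very same cwnd of 10 MSS.)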
>>
>> Before RFC 5681 was published we had a discussion on FlightSize and the fact that during Fast Recovery it does not reflect the amount of segments in flight correctly but may have a too large value. It was decided not to try to correct it because it only has an impact when the RTO fires during Fast Recovery, and in such a case cwnd is reset to 1 MSS. Having a too large ssthresh for RTO recovery in some cases was not considered that bad because a TCP sender anyway takes the most conservative CC action with cwnd and would slow start from cwnd = 1 MSS. But now that RACK-TLP enables detecting the loss of a rexmit during Fast Recovery, we have an unresolved problem.
>>
>>
>>>> Sec 9.3:
>>>>
>>>> In Section 9.3 it is stated that the only modification to the existing
>>>> congestion control algorithms is that one outstanding loss probe
>>>> can be sent even if the congestion window is fully used. This is
>>>> fine, but the spec lacks the advice that if a new data segment is sent,
>>>> this extra segment MUST NOT be included when calculating the new value
>>>> of ssthresh as per equation (4) of RFC 5681. Such a segment is an
>>>> extra segment not allowed by cwnd, so it must be excluded from
>>>> FlightSize if the TLP probe detects loss or if there is no ACK
>>>> and an RTO is needed to trigger loss recovery.
>>>
>>> Why exclude TLP (or any data) from FlightSize? The congestion control
>>> needs precise accounting of the flight size to react to congestion
>>> properly.
>>
>> Because FlightSize does not always reflect the correct amount of data allowed by cwnd. When a TCP sender is not already in loss recovery and it detects loss, this loss indicates the congestion point for the TCP sender, i.e., how much data it can have outstanding. It is this amount of data that it must use in calculating the new value of cwnd (and ssthresh), so it must not include any data sent beyond the congestion point. When TLP sends a new data segment, that segment is beyond the congestion point and must not be included. The same holds for the segments sent via Limited Transmit: they are allowed to be sent out by the packet conservation rule (a DupAck indicates a packet has left the network, but does not allow increasing cwnd), i.e., the actual amount of data in flight remains the same.
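>>
>> A minimal sketch in C of the calculation I have in mind (hypothetical names, not from the draft or any stack):
>>
>>   /* ssthresh on loss detection as per RFC 5681 eq (4), but excluding
>>    * the extra new-data segment that TLP sent beyond cwnd
>>    * (tlp_new_data_segs is 0 or 1). */
>>   static unsigned int ssthresh_on_loss(unsigned int bytes_out,
>>                                        unsigned int tlp_new_data_segs,
>>                                        unsigned int smss)
>>   {
>>       unsigned int flight = bytes_out - tlp_new_data_segs * smss;
>>       unsigned int half = flight / 2;
>>       return (half > 2 * smss) ? half : 2 * smss;
>>   }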
>>
>>>> In these cases the temporary over-commit is not accounted for, as a DupAck
>>>> does not decrease FlightSize, and in case of an RTO the next ACK comes too
>>>> late. This is similar to the rule in RFC 5681 and RFC 6675 that prohibits
>>>> including the segments transmitted via Limited Transmit in the
>>>> calculation of ssthresh.
>>>>
>>>> In Section 9.3 a few example scenarios are used to illustrate the
>>>> intended operation of RACK-TLP.
>>>>
>>>>  In the first example a sender has a congestion window (cwnd) of 20
>>>>  segments on a SACK-enabled connection.  It sends 10 data segments
>>>>  and all of them are lost.
>>>>
>>>> The text claims that without RACK-TLP the ending cwnd would be 4 segments
>>>> due to congestion window validation. This is incorrect.
>>>> As per RFC 7661 the sender MUST exit the non-validated phase upon an
>>>> RTO. Therefore the ending cwnd would be 5 segments (or 5 1/2 segments if
>>>> the TCP sender uses the equation (4) of RFC 5681).
>>>>
>>>> The operation with RACK-TLP would inevitably result in congestion
>>>> collapse if RACK-TLP behaves as described in the example because
>>>> it restores the previous cwnd of 10 segments after the fast recovery
>>>> and would not react to congestion at all! I think this is not the
>>>> intended behavior by this spec but a mistake in the example.
>>>> The ssthresh calculated in the beginning of loss recovery should
>>>> be 5 segments as per RFC 6675 (and RFC 5681).
>>> To clarify, would this text be clearer?
>>>
>>> 'an ending cwnd set to the slow start threshold of 5 segments (half of
>>> the original congestion window of 10 segments)'
>>
>> This is correct, but please replace:
>>
>> (half of the original congestion window of 10 segments)
>> -->
>> (half of the original FlightSize of 10 segments)
>>
>> cwnd in the example was 20 segments.
>>
>> Please also correct the ending cwnd for the "without RACK" scenario.
>> I pointed out the wrong equation number in RFC 5681 and an incorrect cwnd value, my apologies. I meant equation (3), and that results in an ending cwnd of 5 and 2/5 MSS (not 5 and 1/2 MSS).
>> NB: and if a TCP sender enters CA when cwnd > ssthresh, then the ending cwnd would be 6 and 1/6 MSS.
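>>
>> (For reference, the two RFC 5681 formulas in question, quoted from memory, so please double-check against the RFC text:
>>
>>    cwnd += SMSS*SMSS/cwnd                    (3)
>>    ssthresh = max (FlightSize / 2, 2*SMSS)   (4)
>>
>> The 2/5 and 1/6 fractions come from eq (3) adding SMSS*SMSS/cwnd per ACK once cwnd has reached 5, respectively 6, segments.)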
>>
>>>>
>>>> Furthermore, it seems that this example with RACK-TLP refers to using
>>>> PRR_SSRB which effectively implements regular slow start in this
>>>> case(?). From a congestion control point of view this is correct because
>>>> the entire flight of data, as well as the ACK clock, was lost.
>>>>
>>>> However, as correctly discussed in Sec 2, the congestion window must be reset
>>>> to 1 MSS when an entire flight of data and the ACK clock are lost. But how
>>>> can an implementor know what to do if she/he is not implementing the
>>>> experimental PRR algorithm? This spec presents itself as specifying an
>>>> alternative to DupAck counting, indicating that TLP is used to trigger
>>>> Fast Retransmit & Fast Recovery only, not a loss recovery in slow start.
>>>> This means that without additional advice an implementation of this
>>>> spec would just halve the cwnd and ssthresh and send a potentially very
>>>> large burst of segments in the beginning of the Fast Recovery because
>>>> there is no ACK clock. So, this spec begs for advice (a MUST) on when to
>>>> slow start and reset cwnd and when not, or at least a discussion of
>>>> this problem and some sort of advice on what to do and what to avoid.
>>>> And, maybe a recommendation to implement it with PRR?
>>>
>>> It's wise to decouple loss detection (RACK-TLP) vs congestion/burst
>>> control (when to slow-start). The use of PRR is just an example to
>>> illustrate, not meant as a recommendation.
>>
>> I understand the use of PRR was just an example, but my point is that if one wants to implement RACK-TLP and does not intend to implement PRR but RFC 6675, then we do not have a rule in RFC 6675 to correctly implement CC for the case when an entire flight is lost and the loss is detected with TLP. The congestion control principle for this is clear and also stated in this draft, but IMHO that is not enough to ensure a correct implementation.
>>
>> To my understanding, we have implementation experience of RACK-TLP only together with PRR, which has the necessary rule to handle this kind of scenario correctly.
>>
>> So, my question is: how can one implement CC correctly without PRR in such a scenario where the entire inflight is lost?
>> Which rule, and where in the RFC series, gives the necessary guidance to reset cwnd and slow start when TCP detects the loss of an entire flight?
>>
>>> Section 3 has a lengthy discussion elaborating that the key point of RACK-TLP
>>> is to maximize the chance of fast recovery. How C.C. governs the
>>> transmission dynamics after losses are detected is out of scope of
>>> this document in our authors' opinions.
>>>
>>>
>>>>
>>>> Another question relates to the use of TLP and adjusting timer(s) upon
>>>> timeout. In the same example discussed above, it is clear that the PTO
>>>> that fires the TLP is just a more aggressive retransmit timer with
>>>> an alternative data segment to (re)transmit.
>>>>
>>>> Therefore, as per RFC 2914 (BCP 41), Sec 9.1, when the PTO expires, it is in
>>>> effect a retransmission timeout and the timer(s) must be backed off.
>>>> This is not advised in this specification. Whether it is the TCP RTO
>>>> or the PTO that should be backed off is an open question.  Otherwise,
>>>> if the congestion is persistent and further transmissions are also lost,
>>>> RACK-TLP would not react to congestion properly but would keep
>>>> retransmitting with a "constant" timer value because a new RTT estimate
>>>> cannot be obtained.
>>>> On a buffer-bloated and heavily congested bottleneck this would easily
>>>> result in sending at least one unnecessary retransmission per
>>>> delivered segment, which is not advisable (e.g., when there are a huge
>>>> number of applications sharing a constrained bottleneck and these
>>>> applications are sending only one (or a few) segments and then
>>>> waiting for a reply from the peer before sending another request).
>>>
>>> Thanks for pointing to the RFC.  After TLP, RTO timers will
>>> back off exponentially (as usual) for the stability reasons mentioned in sec 9.3
>>> (didn't find 9.1 relevant).
>>
>> My apologies for referring to the wrong section of RFC 2914. Yes, I meant Sec 9.3.
>>
>>> In your scenario, you presuppose the
>>> retransmission is unnecessary, so obviously TLP is not good. Consider
>>> what happens without TLP, where all the senders fire RTOs spuriously and
>>> blow up the network. It is equally unfortunate behavior. "Insufficient
>>> BDP for many flows" is a congestion control problem.
>>
>> If (without TLP) the RTO is spurious, it may result in unnecessary retransmissions. But we have F-RTO (RFC 5682) and Eifel (RFC 3522) to detect and resolve that without TLP, so I don't see it as a problem.
>>
>> To clarify more what I am concerned about: think about a scenario where a (narrow) bottleneck becomes heavily congested by a huge number of competing senders, such that the available capacity per sender is less than 1 segment (or << 1 MSS).
>> This is the situation a network first enters before congestion collapse gets realized. So, it is extremely important that all CC and timer mechanisms handle it properly. Regular TCP handles it via RFC 6298 by backing off the RTO exponentially and keeping this backed-off RTO until a new ACK is received for new data. This is what saves RACK-TLP from full congestion collapse.
>>
>> But consider what happens: even though the RTO is backed off, each time a TCP sender manages to get one segment through (with cwnd = 1 MSS), it always first arms the PTO with a more or less constant value of 2*SRTT. If the bottleneck is buffer bloated, the actual RTT easily exceeds 2*SRTT and the TLP becomes spurious. After a spurious TLP, the RTO expires (maybe more than once before the exponential backoff of the RTO results in a large enough value) and a new RTT sample is not obtained. So, SRTT remains unchanged, and even if a new sample is occasionally received, SRTT gets adjusted very slowly. As a result, each TCP sender would keep on sending a spurious TLP for each new segment, resulting in at least 50% of the packets being unnecessarily retransmitted and the utilization of the bottleneck being < 50%. This would not be a full congestion collapse, but it has unwanted symptoms heading towards congestion collapse. (Note: there is no clear line for what level of reduction in the delivery of useful data is considered congestion collapse.)
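>>
>> A rough sketch of the PTO arming I am describing, following my reading of the draft (simplified, and the names are mine, not the draft's):
>>
>>   /* Probe timeout armed when new data goes out; roughly constant at
>>      ~2*SRTT while SRTT is stale, unlike the exponentially backed-off
>>      RTO of RFC 6298. */
>>   static unsigned int tlp_pto_ms(unsigned int srtt_ms,
>>                                  unsigned int flight_segs,
>>                                  unsigned int wcdelackt_ms)
>>   {
>>       unsigned int pto = 2 * srtt_ms;
>>       if (flight_segs == 1)
>>           pto += wcdelackt_ms;   /* allow for a delayed ACK */
>>       return pto;
>>   }
>>
>>   /* On a bloated bottleneck, actual_rtt_ms > tlp_pto_ms(...) holds,
>>      so the probe fires before the ACK arrives: one spurious rexmit
>>      per useful segment delivered. */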
>>
>>
>>>>
>>>> Additional notes:
>>>>
>>>> Sec 2.2:
>>>>
>>>> Example 2:
>>>> "Lost retransmissions cause a  resort to RTO recovery, since
>>>>  DUPACK-counting does not detect the loss of the retransmissions.
>>>>  Then the slow start after RTO recovery could cause burst losses
>>>>  again that severely degrades performance [POLICER16]."
>>>>
>>>> RTO recovery is done in slow start. The last sentence is confusing, as
>>>> there is no (new) slow start after RTO recovery (or, more precisely,
>>>> slow start continues until cwnd > ssthresh). Do you mean: if/when slow
>>>> start still continues after RTO recovery has repaired the lost segments,
>>>> it may cause burst losses again?
>>> I mean the slow start after (the start of) RTO recovery. HTH
>>
>> Tnx. I'd appreciate it if the text could be clarified to reflect this more accurately. Maybe something along these lines(?):
>>
>> "Then the RTO recovery in slow start could cause burst
>> losses again that severely degrades performance [POLICER16]."
>>
>>>>
>>>> Example 3:
>>>>  "If the reordering degree is beyond DupThresh, the DUPACK-
>>>>   counting can cause a spurious fast recovery and unnecessary
>>>>   congestion window reduction.  To mitigate the issue, [RFC4653]
>>>>   adjusts DupThresh to half of the inflight size to tolerate the
>>>>   higher degree of reordering.  However if more than half of the
>>>>   inflight is lost, then the sender has to resort to RTO recovery."
>>>>
>>>> This seems to be a somewhat incorrect description of TCP-NCR as specified in
>>>> RFC 4653. TCP-NCR uses Extended Limited Transmit, which keeps on sending
>>>> new data segments on DupAcks, making it likely to avoid an RTO in
>>>> the given example scenario, as long as not too many of the new data
>>>> segments triggered by Extended Limited Transmit are lost.
>>> sorry, I don't see how the text is wrong in describing RFC 4653,
>>> specifically the algorithm for adjusting ssthresh
>>
>> To my understanding, RFC 4653 initializes DupThresh to half of the inflight size at the beginning of Extended Limited Transmit. Then, on each DupAck, it adjusts (recalculates) DupThresh again such that ideally a cwnd's worth of DupAcks is received before packet loss is declared (or reordering detected). So, if I am not mistaken, the loss of half of the inflight does not necessarily result in RTO recovery with TCP-NCR.
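>>
>> Roughly, as I read it (a sketch with my own names; please check RFC 4653 for the exact terms):
>>
>>   /* Extended Limited Transmit: DupThresh is recomputed as DupAcks
>>      arrive, aiming at roughly half the (previous) flight, and never
>>      below the classic threshold of 3. */
>>   static unsigned int ncr_dupthresh(unsigned int flight_size_prev,
>>                                     unsigned int smss)
>>   {
>>       unsigned int dt = (flight_size_prev / smss) / 2;
>>       return (dt > 3) ? dt : 3;
>>   }
>>
>> Because new data keeps flowing out on each DupAck, the DupAck stream itself is sustained, which is why the loss of half the flight need not end in an RTO.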
>>
>>>>
>>>> Sec. 3.5:
>>>>
>>>>  "For example, consider a simple case where one
>>>>  segment was sent with an RTO of 1 second, and then the application
>>>>  writes more data, causing a second and third segment to be sent right
>>>>  before the RTO of the first segment expires.  Suppose only the first
>>>>  segment is lost.  Without RACK, upon RTO expiration the sender marks
>>>>  all three segments as lost and retransmits the first segment.  When
>>>>  the sender receives the ACK that selectively acknowledges the second
>>>>  segment, the sender spuriously retransmits the third segment."
>>>>
>>>> This seems incorrect. When the sender receives the ACK that selectively
>>>> acknowledges the second segment, it is a DupAck as per RFC 6675 and does
>>>> not increase cwnd; cwnd remains 1 MSS and pipe is 1 MSS. So, the
>>>> rexmit of the third segment is not allowed until the cumulative ACK of
>>>> the first segment arrives.
>>> I don't see where RFC 6675 forbids growing cwnd. Even if it does, I
>>> don't think it's a good thing (in RTO slow start), as a DUPACK clearly
>>> indicates a delivery has been made.
>>
>> The SACKed sequences in DupAcks indicate that those sequences were delivered, but they do not tell when they were sent. The basic principle of slow start is to reliably determine the available network capacity. Therefore, slow start must ensure it uses only segments sent during the slow start to increase cwnd. Otherwise, a TCP sender may encounter exactly the problem of unnecessary retransmission envisioned in this example of the RACK-TLP draft (and increase cwnd on invalid ACKs).
>>
>> RFC 6675 does re-specify DupAck with the SACK option, but it does not include the rule for slow start. Slow start is specified in RFC 5681. It is crystal clear in allowing an increase in cwnd only on cumulative ACKs, i.e., forbidding increasing cwnd on DupAcks (RFC 5681, Sec 3,
>> page 6):
>>
>>  During slow start, a TCP increments cwnd by at most SMSS bytes for
>>  each ACK received that cumulatively acknowledges new data.
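>>
>> In code form, the rule I am relying on looks roughly like this (a sketch, with illustrative names):
>>
>>   /* Slow-start growth per RFC 5681 eq (2): only an ACK that advances
>>      snd_una, i.e., cumulatively acknowledges new data, grows cwnd;
>>      a DupAck must not. */
>>   if (ack_seq > snd_una) {                  /* cumulative ACK */
>>       unsigned int n = ack_seq - snd_una;   /* newly acked bytes */
>>       if (cwnd < ssthresh)
>>           cwnd += (n < smss) ? n : smss;    /* cwnd += min(N, SMSS) */
>>   }
>>   /* else: DupAck -> loss-detection bookkeeping only, no cwnd growth */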
>>
>> Maybe this example in the RACK-TLP draft was inspired by an incorrect implementation of SACK-based loss recovery?
>>
>> FYI: when we were finalizing RFC 6675, I suggested also including an algorithm for RTO recovery with SACK in RFC 6675. The reason was exactly that it might not be easy to gather the info from multiple documents, and hence it would help the implementor to have all the necessary advice in a single document. Unfortunately, this did not get realized.
>>
>> BR,
>>
>> /Markku
>>
>>>>
>>>> Best regards,
>>>>
>>>> /Markku
>>>>
>>>>
>>>>
>>>>> On Mon, 16 Nov 2020, The IESG wrote:
>>>>
>>>>>
>>>>> The IESG has received a request from the TCP Maintenance and Minor Extensions
>>>>> WG (tcpm) to consider the following document: - 'The RACK-TLP loss detection
>>>>> algorithm for TCP'
>>>>> <draft-ietf-tcpm-rack-13.txt> as Proposed Standard
>>>>>
>>>>> The IESG plans to make a decision in the next few weeks, and solicits final
>>>>> comments on this action. Please send substantive comments to the
>>>>> last-call@ietf.org mailing lists by 2020-11-30. Exceptionally, comments may
>>>>> be sent to iesg@ietf.org instead. In either case, please retain the beginning
>>>>> of the Subject line to allow automated sorting.
>>>>>
>>>>> Abstract
>>>>>
>>>>>
>>>>>  This document presents the RACK-TLP loss detection algorithm for TCP.
>>>>>  RACK-TLP uses per-segment transmit timestamps and selective
>>>>>  acknowledgements (SACK) and has two parts: RACK ("Recent
>>>>>  ACKnowledgment") starts fast recovery quickly using time-based
>>>>>  inferences derived from ACK feedback.  TLP ("Tail Loss Probe")
>>>>>  leverages RACK and sends a probe packet to trigger ACK feedback to
>>>>>  avoid retransmission timeout (RTO) events.  Compared to the widely
>>>>>  used DUPACK threshold approach, RACK-TLP detects losses more
>>>>>  efficiently when there are application-limited flights of data, lost
>>>>>  retransmissions, or data packet reordering events.  It is intended to
>>>>>  be an alternative to the DUPACK threshold approach.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The file can be obtained via
>>>>> https://datatracker.ietf.org/doc/draft-ietf-tcpm-rack/
>>>>>
>>>>>
>>>>>
>>>>> No IPR declarations have been submitted directly on this I-D.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
>