Re: [tcpm] Review (Re: WGLC for draft-ietf-tcpm-rack-08)

Yuchung Cheng <ycheng@google.com> Mon, 13 July 2020 22:07 UTC

Return-Path: <ycheng@google.com>
X-Original-To: tcpm@ietfa.amsl.com
Delivered-To: tcpm@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 917513A0888 for <tcpm@ietfa.amsl.com>; Mon, 13 Jul 2020 15:07:24 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -17.589
X-Spam-Level:
X-Spam-Status: No, score=-17.589 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, ENV_AND_HDR_SPF_MATCH=-0.5, HTML_MESSAGE=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, URIBL_BLOCKED=0.001, USER_IN_DEF_DKIM_WL=-7.5, USER_IN_DEF_SPF_WL=-7.5] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=google.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id p6KHHr-4JdYv for <tcpm@ietfa.amsl.com>; Mon, 13 Jul 2020 15:07:18 -0700 (PDT)
Received: from mail-ua1-x936.google.com (mail-ua1-x936.google.com [IPv6:2607:f8b0:4864:20::936]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 038F73A0D35 for <tcpm@ietf.org>; Mon, 13 Jul 2020 15:07:00 -0700 (PDT)
Received: by mail-ua1-x936.google.com with SMTP id u33so2446009uad.9 for <tcpm@ietf.org>; Mon, 13 Jul 2020 15:07:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=7q7lu0uAH5eozFtgmT9IAAaKvrjg8AlrbP0v/YaGSYo=; b=LTlAek5A3IPbhESPPYOwys8616ZoI74Az6kkHYa0kHd92xs5I8mm3Xm7+2EkxuPHTL 6d9gfHSvWnArsBYIw5hcQS5VDALArTMAmpWHfmv6YT3u32yIqso2E8zl779Fb7FqAViF hM/ZlrXT+ZpBqnoXxetNVV/8JjDkPN0ud6a+8T7HiajYeynPeERBw8aIBtr/rh6cElFj p/xNlEJTmK64IGh+U4mg36KeCpbDy1XYRXRtXJWpA98nFG+h+deiLnlVvscAht9MPdN6 /cAJHU5qKaj6EVnaVoJfI6VWpivzamoZAT6Rtl7BlYGYqIFLDdIUjzrU2tj4RnlrfsdO HZjQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=7q7lu0uAH5eozFtgmT9IAAaKvrjg8AlrbP0v/YaGSYo=; b=ifwa8lfZV0SW7K4px+aNIeb5HSgBrtt4x45PvWA+xCCn4knjZ6Qb4cMfclQhzYeQS0 /U35FX5tLcGh9hjZb1CrrvsKumGEB6HeKjCkOzydALNCj866qctnkTzghk1KVNBFQlur i7FT5x+Pf6oZQHe73IqSA9J7U6NS/QCIlqt8SkcyGSd4GeCf54BenMgtbNdOCPfGx6mR Ud3C4gATef+eBnOlwZb2z6Mwmyburmhu8omx+qVoHrVJi6n2rzMxUurixy1y5uEywDTg y8jlsh9n4fpGvtsVhJy85pu25BnNPub0+RiEJe6125AHhTWWsxbvaPpZ95MIxPymcXUG ps0w==
X-Gm-Message-State: AOAM530fOYccO0CQBfNoPpjl0NHSiRrdnRAVazhlttBAwzveDDQl52aT kKZX46nS++HSn6ihj/s4D26Xd+njThlj68QROXpBuYlF
X-Google-Smtp-Source: ABdhPJyfyvkTiDZwNsQ/z05NPas82LP+2X263di6V4M1508OgD3rbNJPa4rNG9NX7KxjQcBtT0qp2F0XWSoVz/iD7ps=
X-Received: by 2002:a9f:31b3:: with SMTP id v48mr1297962uad.87.1594678019254; Mon, 13 Jul 2020 15:06:59 -0700 (PDT)
MIME-Version: 1.0
References: <3D4D034B-7A72-4313-8FB6-CB689A167E91@fh-muenster.de> <2c0558c37ec99da2b98f73eb5a79d8e6@tenghardt.net>
In-Reply-To: <2c0558c37ec99da2b98f73eb5a79d8e6@tenghardt.net>
From: Yuchung Cheng <ycheng@google.com>
Date: Mon, 13 Jul 2020 15:06:21 -0700
Message-ID: <CAK6E8=dPsV_ADZ9iADLo32aPA=g1FZH2g9c1VSjPU8nk7QQOpg@mail.gmail.com>
To: Theresa Enghardt <ietf@tenghardt.net>
Cc: tcpm IETF list <tcpm@ietf.org>
Content-Type: multipart/alternative; boundary="000000000000dee6db05aa59e896"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/OBkgFf3O2M3aCTOtVB_mntQYaTs>
X-Mailman-Approved-At: Tue, 14 Jul 2020 08:03:40 -0700
Subject: Re: [tcpm] Review (Re: WGLC for draft-ietf-tcpm-rack-08)
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tcpm>, <mailto:tcpm-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tcpm/>
List-Post: <mailto:tcpm@ietf.org>
List-Help: <mailto:tcpm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Jul 2020 22:07:25 -0000

On Sat, Apr 4, 2020 at 9:28 AM Theresa Enghardt <ietf@tenghardt.net> wrote:

>

> Dear TCPM,

>

> I reviewed draft-ietf-tcpm-rack-08 focusing on whether it is clear and

> easy to follow when seeking to understand how one would implement this

> and/or how an existing implementation works. As a secondary concern, I

> noted some editorial comments and suggestions of how one could make the

> document more accessible.

>

> Overall, I think the document explains the scope and idea of the

> algorithms really well and the algorithm details that are provided are

> already quite useful. So thank you for this draft.

> However, I did find a couple of points where some more clarification

> would help, and a few consistency as well as editorial issues. I would

> appreciate it if these were fixed before this document is published as a

> Proposed Standard RFC.

Thank you for such an excellent super detailed review. Sorry for taking

such a long time to respond. We have made significant edits to our draft

and published a newer version.

https://tools.ietf.org/html/draft-ietf-tcpm-rack-09


detailed responses inline.




>

> Please find my review below:

>

>

> Section 2, Introduction:

>

> “In recent years we have been observing several increasingly common loss

> and reordering patterns in the Internet: …”

> As far as I can tell, this list actually combines the description of the

> loss patterns with the limitations that existing loss recovery

> mechanisms have, rather than describe limitations that are inherent to

> the patterns or to TCP. So I think it makes sense to add something like

> the following to this introductory sentence: “Given these patterns,

> existing loss recovery mechanisms have the following limitations:”

> Alternatively, you could move the paragraph starting with “Despite TCP

> stacks …” above the list, so the list would already be in the context of

> limitations of existing loss recovery mechanisms.

We like your suggestion and want to make further changes to clarify. We
rewrote the first couple sections and split into existing mechanisms, their
shortcoming in modern networks, and the design rationale of RACK-TLP.


>

> Please consider adding a half-sentence here explaining what a tail drop

> is to make the document more accessible.

We avoid that confusing word 'tail drop' altogether now b/c it may mean
different things in different context (router queues vs application data
frame). instead we describe the drop behavior directly "Packet drops at the
end of an application data flight. "

>

> “Also, these algorithms, including RFCs, rarely address the interactions

> with other algorithms.”

> I think it’s worth clarifying the scope of the problem: Is this about

> using multiple different algorithms at the same time on the same sender?

> Or is this about interaction between different senders using different

> algorithms?

As other reviewers have also reflected this is unclear and not
informational. We remove it.

>

> “The goal of RACK is to solve all the problems above …”

> Perhaps this paragraph should mention that RACK has limitations, too.

> For example, this paragraph could already touch upon the problem of

> reordering and spurious retransmissions, or it could briefly summarize

> some of the disadvantages discussed in Section 8.2.

Good suggestion. We now have a whole new subsection 'Reordering resilience
with a time threshold' detailing the pros and cons of RACK design.


>

> The introduction, as well as the abstract, need to mention that this

> document actually introduces two algorithms, RACK and TLP, and what

> their relationship is. For example, while they can be implemented

> separately, are there any dependencies between them? Why is it

> beneficial to use them together?

Other reviewers have expressed similar comments. We now mention them
together in abstract and the high level design section together, along with
their relationship in the Requirement section


> Note that Section 8.5.1 has wording that suggests that using RACK

> implies TLP and Section 9 implies that it’s possible to use TLP without

> RACK. Please consider adding some of these points to the introduction

> and make sure they’re consistent in the document.

> Moreover, Section 8.4 discusses the relationship between RACK and TLP at

> the very end. Please consider moving (some of) this discussion to

> Section 2 or include a reference to the discussion here.

This gets cleaned up now by removing section 9 entirely and rewriting the
design section. Please take a look.

>

> Section 3, Overview:

>

> "RACK can be applied to both fast recovery and timeout recovery ..."

> Timeout recovery refers to recovery after an RTO, correct? Does RACK

> work the same way for both fast recovery and timeout recovery? This

> reads a bit confusing to me, because the whole point of RACK seems to be

> to avoid RTOs, and if an RTO happens, we already know that there has

> been a packet loss.

To help explain, we added a new subsection 'An Example of RACK-TLP in
Action: RTO' in the design, along with a subsection 'Upon RTO expiration'
in the RACK detailed algorithm section. Let us know if that helps.

>

> Section 4, Design Rationale for Reordering Tolerance:

>

> "TCP congestion control implicitly assumes the feedback from ACKs are

> from the same bottleneck"

> Please consider making the wording a bit clearer: ”... feedback from all

> ACKs in a connection are from the same bottleneck” - unless I’m missing

> something here.

We removed this unclear sentence since it's really not critical to explain
RACK reordering. We replace this with a new subsection of 'RACK reordering
window adaptation'

>

> "How long the sender waits for such potential reordering events to

> settle is determined by the current reordering window."

> In this paragraph, it seems to me like the document jumps from general

> considerations on the impact of RACK on packet reordering to specific

> design considerations for RACK to mitigate these potential impacts. I

> think it’s a good idea to make this transition clearer, e.g., by adding

> text like: "RACK deals with reordering by estimating a reordering

> window. The way RACK deals with reordering disincentivizes excessive

> reordering because of the following constraints on the reordering

> window: ..."

> If the reordering window is a new concept that RACK introduces, or a

> concept that was not widely known before, it might be worth explicitly

> introducing it as a concept.

Sorry for the messy text. We split RACK into the general design plus a
specific section on reordering. We do not claim the reordering window is a
new concept as many people have proposed similar idea. Rather we explain
the new part is to combine the per-segment timestamping and reordering
window to achieve better loss / reordering handling. This is detailed in
the beginning of the new 'RACK-TLP high-level design' section


>

> “If no reordering has been observed, then RACK SHOULD honor the classic

> 3-DUPACK rule for initiating fast recovery.  One simple way to implement

> this is to temporarily override the reorder window to 0.”

> Just to make sure I understand correctly: Honoring the classic 3-DUPACK

> rule means detecting a packet lost after receiving 3 dup acks, in

> addition to detecting packets lost based on timestamps, basically

> whichever occurs first? It’s not obvious to me how this can be

> implemented by overriding the reorder window to 0. So perhaps I’m

> misunderstanding something here and some clarification would help.

Agreed the text is confusing without reading the code. The new subsection
'RACK reordering window adaptation explicitly lists the rules of 3-dupacks
and reordering window.

When the reordering window is zero, RACK will mark the first unacked packet
lost on the first DUPACK. The rule is to hold up until the third DUPACK (or
3 MSS worth of bytes been sacked).

>

> "...to adaptively estimate the duration of reordering events."

> Perhaps it makes sense to add “and adjust the reordering window

> accordingly” to this sentence, as this part talks about mandates on the

> reordering window.

>

> "As a flow starts, either condition 1 or condition 2 or both would

> trigger RACK to start the recovery process quickly."

> Prior to this sentence, I assumed that the above points were mandates,

> i.e., all of these criteria must be satisfied by the implementation. But

> now the text says they are in fact conditions when to declare a packet

> lost? Are 3 and 4 also conditions?

We removed this sentence. What it meant originally was that at startup,
RACK tends to error on the side of spurious retransmission with a lower
initial reordering window. Since the majority of flows are short, the
spurious rtx isn't a big concern if their paths have high reordering. But
we realize the following sentences 'For short flows, the low initial ...'
delivers that message already.

>

> “It also relaxes ordering constraints to allow sending flights of TCP

> packets on different paths dynamically for better load-balancing (e.g.

> flowlets).”

> Please consider adding a reference for “flowlets”.

After reviewing it again, we realize it does not help as there are many
ways to achieve load-balancing which is out of scope of this document. So
we remove this sentence.

>

> As you discuss design rationale for the reordering window in this

> section, perhaps it’s worth adding a reference to Section 8.3, which

> contains more discussion of the reordering window and its default

> values.

Added

>

> Section 6.1, Definitions of variables:

>

> (Editorial) This appears to be the only subsection of Section 6. Can

> this subsection be collapsed into Section 6?

>

> As you’re discussing packets and sequence numbers, I think it’s worth

> adding a few words on how to map TCP segments to packets, e.g., that an

> implementation might want to store the first and last sequence number of

> each TCP segment and consider this a packet. This might also be helpful

> to make the connection between definitions that are based on sequence

> numbers and others that are based on packets.

>

> Please add a definition of Packet.end_seq here, as this variable is used

> in later sections, but not defined here.

We replaced all references of a packet carrying sequence numbers to
'segment', as Gorry suggested.

We added Segment.end_seq definition. Thanks for catching that



>

> For better overview, it might make sense to group variables, e.g., all

> RTTs, all variables directly related to reordering, all sequence

> numbers.

>

> “The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and

> RACK.min_RTT variables are kept in the per-connection TCP control block.

> RACK.packet and RACK.ack_ts are used as local variables in the

> algorithm.”

> Aren’t RACK.packet and RACK.ack_ts also per-connection? If so, shouldn’t

> the choice of whether to put these in the per-connection TCP control

> block or to use them as local variables up to the implementer? (Unless

> I’m missing something here, in which case please clarify what the

> difference is between these variables.)

Sure we now split into per-connection variables and per segment variables
and grouped related states together.


>

> Section 7.2, Upon receiving an ACK:

>

> Step 1:

>

> "Use the RTT measurements obtained via [RFC6298] or [RFC7323] to update

> the estimated minimum RTT in RACK.min_RTT."

> Does this document mandate how to collect individual RTT samples, in

> which case maybe it’s worth highlighting this with a MUST or SHOULD?

> Especially as one of the following sentences says "This document does

> not specify an exact approach", but this might only refer to how to

> derive the minimum RTT, not the individual RTT samples. Please clarify.

We added a SHOULD: the sender SHOULD track a simple ...

>

> Do you perform Step 1 even for packets that you later find to be a

> spurious retransmission, as this step occurs before the check for

> spurious retransmissions?

> Text in later sections reads as if RACK.min_RTT is not updated for

> spurious retransmissions. In this case, it might be worth merging this

> step with Step 2 and updating all stats after the check for spurious

> retransmissions.

For spuriously retransmitted packet, the ACK with TCP timestamp can still
generate a valid RTT measurement, hence it should be incorporated in the
min RTT update in Step 1.

note that in Step 2, RACK.min_RTT is not updated as it is updated in Step
1. The RTT that's not updated is the RACK.rtt, which keeps the most recent
RTT measurement.

>

> Step 2:

>

> The document first says to record the most recent Packet.xmit_ts if it

> is ahead of RACK.xmit_ts. Again this is before the check for spurious

> retransmissions, at least in the text. The pseudocode performs the check

> first, which makes more sense to me. I think the text should be changed

> to reflect the pseudocode, if this is indeed the intended order of

> steps.

Sure, we've reversed the text order to match the pseudo code better


>

> "... the sender would not be able to update its RACK.min_rtt using the

> (ambiguous) RTT samples from retransmissions"

> Possibly this whole example is intended as an exercise for the reader,

> but just to be clear: Here the point is that the sender is unable to

> tell whether an ACK for M is from the original transmission or the

> retransmission, right? And therefore it can’t update the RTT? Please

> consider making this example a bit more explicit to make it easier to

> understand.

Sorry about the verbose confusing text. We replaced it with "Note that the
second check is a heuristic when Timestamp option is either not available
or when the round trip time is less than the timestamp clock granularity.

>

> Step 3:

>

> Maybe consider substituting “false retransmission” with “spurious

> retransmission”, as these terms seem to refer to the same concept.

done. replaced all such words throughput

>

> Step 4:

>

> “RACK starts initially with a conservative window of min_RTT/4"

> This is RACK.min_RTT, right? Please use the variable name.

>

> What is the design rationale for increasing reo_wnd by min_RTT/4, and

> not more or less, and for persisting for 16 loss recoveries, and not

> more or less?


>

> SRTT refers to the connection's SRTT, defined similarly as in Section

> 7.5 for TLP, right? Maybe it’s worth adding this as a definition in

> Section 6 already.

SRTT is now referenced at the Definitions section

>

> "Note that extensions that require additional TCP features (e.g. DSACK)

> would work if the feature functions simply return false."

> I’m taking this to basically mean “If you don't implement DSACK or if

> you don't get DSACK, just return false from these functions and the

> algorithm below still works”, but I’m unsure if this is correct. Does

> “extensions that require additional TCP features” refer to the

> DSACK-based increases of the reordering window? Because “extensions” to

> me sounds like something that would be defined in another section or

> another RFC, and that would be explicitly marked as an extension.

> If the meaning above is correct, please consider making this sentence

> specific to the example of the DSACK-based increase in reordering

> window.

We now clarified that in the subsection of the design rationale section

'RACK reordering window adaptation'

'If the connection supports Duplicate Selective Acknowledgement (DSACK)
information [RFC2883], the RACK reordering window SHOULD leverage that to
adaptively estimate the duration of reordering events.'

>

> The pseudocode says that RACK.min_RTT is updated here - but wasn't this

> step 1 already? Or is RACK.min_RTT updated twice?

Good catch. We removed the redundant pseudo code line.



>

> "If SND.UNA < RACK.rtt_seq:

>                                  RACK.dsack = false /* React to DSACK
once per round trip */"

> It seems like step 4 is the only part that uses RACK.rtt_seq, but the

> definition in Section 6.1 says “SND.NXT when RACK.rtt is updated.” Isn’t

> RACK.rtt updated in Step 2 already? Is this definition (still) correct?

>

> Another question about the above pseudocode: Why does this mean that the

> algorithm reacts once per round trip? What does the round trip have to

> do with SND.UNA moving above RACK.rtt_seq?

>

> Why do you set RACK.reo_wnd_persist to -1 here, and not to 0?

>


> Step 5:

>

> "the sender MAY install a "reordering settling" timer set to fire at the

> earliest moment at which it is safe to conclude that some packet is

> lost"

> This sounds like a really good idea, so why not make this a SHOULD?

Changed to RECOMMENDED now. A few reviewers made the same comment too.

"For timely loss detection, the sender is RECOMMENDED to install a
reordering settling timer. This timer expires at the earliest moment when
RACK would conclude that all the unacknowledged segments within the
reordering window were lost."


>

> "To be more robust to reordering, RACK uses a more conservative RTT

> value to decide if an unacknowledged packet should be considered lost,

> RACK.rtt"

> Is it a SHOULD or a MUST to be this conservative? Or are implementations

> also free to use RACK.min_RTT?

We removed this sentence entirely since it is now obsolete. The reordering
window setting is now discussed in the new design rationale section.

>

> s/RACK.time_ts/RACK.xmit_ts/

Nice catch, revised.

>

> For the implementation optimization, maybe just briefly say explicitly

> that the outcome of your optimization is that you only check packets

> that have been sent after RACK.xmit_ts, so the advantage is immediately

> obvious to the reader.

Great idea. Added "checking only segments that have been sent after
RACK.xmit_ts is more efficient than scanning the entire SACK scoreboard,
especially when the inflight is large. " at the beginning of the paragraph
now.

>

> Section 7.3, Tail Loss Probe

>

> (editorial) Should this be its own section? After all, it is its own

> separate algorithm?

We like that suggestion. So TLP is now a whole separate section

>

> Please consider substituting  "Common causes of RTOs

> include:" with "Common causes of RTOs, which TLP is intended to

> mitigate, include:" to make the purpose of this section more clear.

This section should be about TLP details. So we've integrated the RTO
causes into the new Motivation subsection upfront.


>

> Section 7.4, Tail Loss Probe: An Example

>

> Please consider substituting "Following is an example of

> TLP" with "Following is an example of TLP used with RACK".

We've changed and moved the example to a RACK-TLP at the "An Example of
RACK-TLP in Action: Fast Recovery" subsection

>

> Section 7.5, Tail Loss Probe Algorithm Details

>

> TLPRxtOut, TLPHighRxt and WCDelAckT seem very different from the

> variable names used otherwise the document, i.e., they are CamelCase,

> while underscores are used in most other variables of the document. Also

> these variable names are really hard to read and pronounce. Unless

> there’s a good reason to call them exactly that (have they been defined

> and used in other documents, in which case maybe it’s worth adding a

> reference?), maybe consider making these variable names more similar.

We renamed these variables to follow the styles of RACK variable names.


>

> (editorial) The PTO definition should go on a new line, otherwise it's

> easy to overlook.

Fixed


>

> Section 7.5.1, Phase 1: Scheduling a loss probe:

>

> (editorial) There is a lot of nesting here, i.e., “phase 1”, “step 1”,

> point 1., 2., 3.. Maybe consider restructuring this part, e.g., let

> phases be different subsections, not subsubsections.

>

> (editorial) In the RACK algorithm, you put the description first and

> then the pseudocode. Here, it's reversed. Maybe it's worth unifying your

> approach? FWIW, I prefer putting the description first because it makes

> the pseudocode easier to follow.

>

> "A sender should schedule a PTO only if all of the following conditions

> are met"

> Is this a “SHOULD NOT schedule a PTO unless all of the following

> conditions are met”?

yes

>

> "2. The connection has no SACKed sequences in the SACK scoreboard"

> What is the reason for this constraint?


>

> "The RECOMMENDED value for WCDelAckT is 200ms."

> Why this value and not another?

Added an explanation. It’s a value used by Linux for many years, likely
inherited from BSD (but not certain).

>

> s/If FlightSize = 1/If FlightSize == 1/

>

> Section 7.5.2, Phase 2: Sending a loss probe:

>

> SMSS is used here, but not defined. Please expand and add a definition.

Added

>

> Please clarify here already that while you allow only one TLP

> retransmission to be in flight at a time, you allow multiple TLP new

> data to be in flight. This is said later, but it is already relevant for

> understanding this part.

>

> As you don't allow multiple TLP retransmits to be in flight: Does this

> mean the PTO could fire but then you do not send any data, because there

> is no new data available but there is already a TLP retransmit in

> flight?

Thanks for catching this -- we decided that there should be one outstanding
TLP regardless of new or retransmission. This is safer from congestion
control point of view. As a result we’ve modified this part of the
algorithm. Therefore PTO can fire but not send any new (or old) data, by
the guard of TLP.end_seq.

>

> "the sender MUST arm the RTO timer, not the PTO timer, at the end of

> TLP_send_probe() if FlightSize is not zero."

> Please add this to the pseudocode.

The pseudo code TLP_send_probe has that (in the bottom)

>

> "Checking TLPRxtOut prior to sending the loss probe is also critical to

> avoid TLP loops if an application writes periodically at an interval

> less than PTO."

> What is a TLP loop? And why would it occur if an application writes

> periodically at an interval less than PTO? Please consider adding a

> definition.

This text is removed since TLP is now restricted to one outstanding probe
at most.

>

> Section 7.5.3: Phase 3: ACK processing

>

> Isn’t this phase a special case of phase 1? “Phase” implies that these

> occur one after the other.

We removed all the phases words for clarity

>

> Section 7.6: TLP recovery detection

>

> Is this still part of the TLP algorithm? If so, why is it a new

> subsection and not in the “TLP algorithm details” section, and not

> referenced from it? Or is this an optional addition/extension?

>

> To make the purpose of this algorithm clearer early on, please consider

> briefly stating in the first paragraph that this introduces an algorithm

> to detect whether a packet loss occurred or not, i.e., whether both the

> original packet and the TLP arrived, or whether one of them got lost,

> therefore, it's a loss, therefore, you have to reduce cwnd.

It was. Now we have folded this part into the main TLP algorithm / Section.


>

> Section 7.6.3, Detecting recoveries accomplished by loss probes:

>

> What is the relationship between this Step 1, 2, 3, and Phase 1, 2, 3

> above? Consider rethinking your structure to make clear what happens

> when. If “Step” and “Phase” refer to the same concept, please consider

> unifying your terminology.

>

> For condition 1: Why does this include "the segment contains no

> data" and "the segment is not a window update"? (Why) Is it not possible

> to detect recovery if either of these is true?

>

> Condition 1 refers to a dup ack for the TLP retransmission, right? And

> receiving this dup ack implies the TLP retransmission was spurious? (I’m

> just trying to make sure I understand correctly, but maybe it's worth

> adding this to the draft to make it easier to follow.)

>

> Condition 2 uses DSACK, but you previously say "Since a significant

> fraction of the hosts that support SACK do not support duplicate

> selective acknowledgments (D-SACKs) [RFC2883] the TLP algorithm for

> detecting such lost segments relies only on basic SACK support

> [RFC2018]."

> This seems to be at odds. Isn't it rather "with DSACK you can make this

> detection more accurate, if you don't have it you have to rely on

> condition 1 only"? Please consider rephrasing the sentence starting with

> “Since a significant fraction…”

>

> The last sentence of step 1 says:  "... so the sender considers the TLP

> episode to be done, and records that fact by setting TLPRxtOut to

> false."

> Isn't this already part of step 2, "Mark the end of a TLP retransmission

> episode and detect losses"? Should this part be moved to step 2?

Your comments suggest the TLP algorithm part is poorly written. We’ve
completely rewritten it so hopefully it’s easier to read. The part that was
confusing is the TLP recovery is focused on a special case when only one
(last) packet was lost and repaired by TLP. This triggers particular type
of ACKs depending on DSACK support or not. Let us know if the new section
helps.


>

> "...it SHOULD invoke a congestion control response equivalent to fast

> recovery."

> Is this really a SHOULD or is this a MUST? Because the document also

> says, "If the TLP sender does not receive such an indication before the

> end of the TLP retransmission episode, then it MUST estimate that either

> the original data segment or the TLP retransmission were lost, and

> congestion control MUST react appropriately to that loss as it would any

> other loss."

We have revised and changed to SHOULD

>

> Section 8.1, Advantages:

>

> "Suppose the transmission of each packet is at least RACK.reo_wnd (1

> millisecond by default)"

> Isn't the window RTT_min/4 by default? I think this is what it says

> above. I am not seeing the 1 millisecond mentioned in any other part of

> the document.

1msec was used in the older drafts that we've removed since. We've removed
all mentions of it.

>

> "This example illustrates how RACK is able to repair certain drops at

> the tail of a transaction without any timer."

> What do you mean by “without any timer”? RACK does potentially use

> timers, doesn't it? Did you mean a specific timer here, like "without

> PTO" or "without RTO"?

again old text while RACK timer was not (yet) introduced. s/any timer/RTO
recovery/

>

> Section 8.3, Adjusting the reordering window:

>

> "... the Performance evaluation section ..."

> Maybe consider adding a reference to this section.

This subsection is now folded into the detailed RACK algorithm to give the
rationale behind RACK.min_RTT/4 to avoid forward reference.

>

> Section 8.4, Relationships with other loss recovery algorithms

>

> At the very end, this section discusses the relationship between RACK

> and TLP. As this document introduces both algorithms, consider moving

> this discussion somewhere further up in the document. Alternatively,

> please at least mention the existence of this discussion earlier in the

> document and add a reference to the section of this discussion.

We've moved this part and folded into the new 'High level design' section
of RACK-TLP

>

> Section 8.5, Interaction with congestion control

>

> "RACK may detect losses faster or slower than the conventional duplicate

> ACK threshold approach does.  RACK can detect losses faster by not

> requiring three DUPACKs, so congestion control may reduce the congestion

> window earlier."

> Does this text imply that RACK can be used without the three dup ack

> threshold? This seems to be at odds with other parts of the document,

> which clearly state that RACK implies reacting to three dup acks.

> If RACK indeed implies using the three dup ack threshold, is this

> paragraph still true?

>

> Similar to my comment on Section 2, please consider briefly clarifying

> whether “interaction” means using two algorithms within the same sender,

> or whether this is about coexistence of different senders.

Sorry about the confusion. This text was written before we baked the
3-dupack rule into the RACK reordering window without a proper update here.
We've revised to

"If the packet losses happen after the reordering window has been increased
by DSACK, RACK-TLP may take longer to detect losses than the pure
DUPACK-counting approach. In this case TCP with pure loss-based congestion
control may continue to increase the congestion window upon receiving ACKs
during this time, making the sender more aggressive."

To further clarify: RACK-TLP in theory does NOT need to respect the
3-DUPACK rule. The 3-DUPACK rule is added simply for better compatibility.

>

> Section 8.5.1, Example: interactions with congestion control

>

> "With RACK, a sender would send the TLP after 2*RTT"

> This reads as if RACK does imply TLP?

yes we make this clear in the Requirement section now.

>

> Section 8.7, RACK for other transport protocols

>

> "The algorithm can be simplified by skipping step 3 "

> Please consider just briefly clarifying if you mean step 3 of RACK,

> "Detect packet reordering". In fact, why does it mean that? Even if you

> have unique transmission or packet identifiers, reordering might still

> occur, right? So don’t we still have to detect packet reordering?

Sorry about the confusion. Again the step 3 was mis-cited from a very old
version of the draft. We've cleaned this part up.

>

> Section 9, Experiments and Performance Evaluations

>

> “We plan to expand our experiments to regions with worse connectivity,

> in particular on networks with strong traffic policing."

> Just curious: Have these experiments actually taken place?

We have expanded the experiment and eventually rolled out RACK-TLP to the
entire Google (inside and Internet-facing) a few years ago. The results
were fairly similar to the initial smaller-scale experiment. The values
used in this draft is based on the result.

>

> Section 10, Security Considerations

>

> "An interesting scenario is ACK-splitting attacks"

> Maybe start the paragraph with something saying "In fact, RACK lowers

> the risk profile because it prevents the following attack"? But then it

> does change the risk profile, which seems to be at odds with the first

> sentence of this section.

Good point. Revised.

>

>

> This concludes my review. Again, thank you for this draft.

We (all the authors) sincerely appreciate your superb thorough and detailed
review!