[tcpm] Review (Re: WGLC for draft-ietf-tcpm-rack-08)

Theresa Enghardt <ietf@tenghardt.net> Sat, 04 April 2020 16:28 UTC

Return-Path: <ietf@tenghardt.net>
X-Original-To: tcpm@ietfa.amsl.com
Delivered-To: tcpm@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 952383A0F9C for <tcpm@ietfa.amsl.com>; Sat, 4 Apr 2020 09:28:09 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.099
X-Spam-Level:
X-Spam-Status: No, score=-2.099 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=tenghardt.net
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id y_2TTju5mNgc for <tcpm@ietfa.amsl.com>; Sat, 4 Apr 2020 09:28:06 -0700 (PDT)
Received: from mail.hemio.de (mail.hemio.de [136.243.12.180]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 8DB043A0F9A for <tcpm@ietf.org>; Sat, 4 Apr 2020 09:28:06 -0700 (PDT)
Received: from user.client.invalid (localhost [136.243.12.180]) by mail.hemio.de (Postfix) with ESMTPSA id D0395C5 for <tcpm@ietf.org>; Sat, 4 Apr 2020 18:28:04 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tenghardt.net; s=20170414; t=1586017684; bh=JgfXnEn8WrPVGdkLQvzq8GnezV03p1F7vzTMgGps4WI=; h=Date:From:To:Subject:In-Reply-To:References:From; b=MnSNgS5Yw8qRRpaB53H1mInw9g00qaMg5vmyP+dLgCheNscpsNz6tfm/MBCkXw5ER RyE7G4/IHog9FO8Me933pEavuav2J9fZgPGOIzcEfdyXPwM8R0DTHFfXSLA/NEE54Y rOkFovsijGQORg8ebtiqtD7P6zgju1gG1oDptmP1G0zmA+jDmtBTwCNyNsFwzJiARS K2QmHpTBMgA61aLz+t3k/EAk3C5Mx70So6w5abkTYr5Xo5kRqYN7kn8ibvRgzdDcUd bsVD7whO1xHym1LAcXb0ZSBUL8Vs35e66MOeqU/frOtxavtT4DC5ZiHZd1dpvTeT9H WG1PcFzMGbupw==
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Date: Sat, 04 Apr 2020 18:28:04 +0200
From: Theresa Enghardt <ietf@tenghardt.net>
To: tcpm IETF list <tcpm@ietf.org>
In-Reply-To: <3D4D034B-7A72-4313-8FB6-CB689A167E91@fh-muenster.de>
References: <3D4D034B-7A72-4313-8FB6-CB689A167E91@fh-muenster.de>
Message-ID: <2c0558c37ec99da2b98f73eb5a79d8e6@tenghardt.net>
X-Sender: ietf@tenghardt.net
User-Agent: Roundcube Webmail/1.3.10
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/tsAGkzW_ALpODRQqhs0fw_9KOjM>
Subject: [tcpm] Review (Re: WGLC for draft-ietf-tcpm-rack-08)
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tcpm>, <mailto:tcpm-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tcpm/>
List-Post: <mailto:tcpm@ietf.org>
List-Help: <mailto:tcpm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sat, 04 Apr 2020 16:28:10 -0000

Dear TCPM,

I reviewed draft-ietf-tcpm-rack-08 focusing on whether it is clear and 
easy to follow when seeking to understand how one would implement this 
and/or how an existing implementation works. As a secondary concern, I 
noted some editorial comments and suggestions of how one could make the 
document more accessible.

Overall, I think the document explains the scope and idea of the 
algorithms really well and the algorithm details that are provided are 
already quite useful. So thank you for this draft.
However, I did find a couple of points where some more clarification 
would help, and a few consistency as well as editorial issues. I would 
appreciate it if these were fixed before this document is published as a 
Proposed Standard RFC.

Please find my review below:


Section 2, Introduction:

“In recent years we have been observing several increasingly common loss 
and reordering patterns in the Internet: …”
As far as I can tell, this list actually combines the description of the 
loss patterns with the limitations that existing loss recovery 
mechanisms have, rather than describe limitations that are inherent to 
the patterns or to TCP. So I think it makes sense to add something like 
the following to this introductory sentence: “Given these patterns, 
existing loss recovery mechanisms have the following limitations:” 
Alternatively, you could move the paragraph starting with “Despite TCP 
stacks …” above the list, so the list would already be in the context of 
limitations of existing loss recovery mechanisms.

Please consider adding a half-sentence here explaining what a tail drop 
is to make the document more accessible.

“Also, these algorithms, including RFCs, rarely address the interactions 
with other algorithms.”
I think it’s worth clarifying the scope of the problem: Is this about 
using multiple different algorithms at the same time on the same sender? 
Or is this about interaction between different senders using different 
algorithms?

“The goal of RACK is to solve all the problems above …”
Perhaps this paragraph should mention that RACK has limitations, too. 
For example, this paragraph could already touch upon the problem of 
reordering and spurious retransmissions, or it could briefly summarize 
some of the disadvantages discussed in Section 8.2.

The introduction, as well as the abstract, need to mention that this 
document actually introduces two algorithms, RACK and TLP, and what 
their relationship is. For example, while they can be implemented 
separately, are there any dependencies between them? Why is it 
beneficial to use them together?
Note that Section 8.5.1 has wording that suggests that using RACK 
implies TLP and Section 9 implies that it’s possible to use TLP without 
RACK. Please consider adding some of these points to the introduction 
and make sure they’re consistent in the document.
Moreover, Section 8.4 discusses the relationship between RACK and TLP at 
the very end. Please consider moving (some of) this discussion to 
Section 2 or include a reference to the discussion here.

Section 3, Overview:

"RACK can be applied to both fast recovery and timeout recovery ..."
Timeout recovery refers to recovery after an RTO, correct? Does RACK 
work the same way for both fast recovery and timeout recovery? This 
reads a bit confusing to me, because the whole point of RACK seems to be 
to avoid RTOs, and if an RTO happens, we already know that there has 
been a packet loss.

Section 4, Design Rationale for Reordering Tolerance:

"TCP congestion control implicitly assumes the feedback from ACKs are 
from the same bottleneck"
Please consider making the wording a bit clearer: ”... feedback from all 
ACKs in a connection are from the same bottleneck” - unless I’m missing 
something here.

"How long the sender waits for such potential reordering events to 
settle is determined by the current reordering window."
In this paragraph, it seems to me like the document jumps from general 
considerations on the impact of RACK on packet reordering to specific 
design considerations for RACK to mitigate these potential impacts. I 
think it’s a good idea to make this transition clearer, e.g., by adding 
text like: "RACK deals with reordering by estimating a reordering 
window. The way RACK deals with reordering disincentivizes excessive 
reordering because of the following constraints on the reordering 
window: ..."
If the reordering window is a new concept that RACK introduces, or a 
concept that was not widely known before, it might be worth explicitly 
introducing it as a concept.

“If no reordering has been observed, then RACK SHOULD honor the classic 
3-DUPACK rule for initiating fast recovery.  One simple way to implement 
this is to temporarily override the reorder window to 0.”
Just to make sure I understand correctly: Honoring the classic 3-DUPACK 
rule means detecting a packet lost after receiving 3 dup acks, in 
addition to detecting packets lost based on timestamps, basically 
whichever occurs first? It’s not obvious to me how this can be 
implemented by overriding the reorder window to 0. So perhaps I’m 
misunderstanding something here and some clarification would help.

"...to adaptively estimate the duration of reordering events." 
Perhaps it makes sense to add “and adjust the reordering window 
accordingly” to this sentence, as this part talks about mandates on the 
reordering window.

"As a flow starts, either condition 1 or condition 2 or both would 
trigger RACK to start the recovery process quickly." 
Prior to this sentence, I assumed that the above points were mandates, 
i.e., all of these criteria must be satisfied by the implementation. But 
now the text says they are in fact conditions when to declare a packet 
lost? Are 3 and 4 also conditions?

“It also relaxes ordering constraints to allow sending flights of TCP 
packets on different paths dynamically for better load-balancing (e.g. 
flowlets).”
Please consider adding a reference for “flowlets”.

As you discuss design rationale for the reordering window in this 
section, perhaps it’s worth adding a reference to Section 8.3, which 
contains more discussion of the reordering window and its default 
values.

Section 6.1, Definitions of variables:

(Editorial) This appears to be the only subsection of Section 6. Can 
this subsection be collapsed into Section 6?

As you’re discussing packets and sequence numbers, I think it’s worth 
adding a few words on how to map TCP segments to packets, e.g., that an 
implementation might want to store the first and last sequence number of 
each TCP segment and consider this a packet. This might also be helpful 
to make the connection between definitions that are based on sequence 
numbers and others that are based on packets.

Please add a definition of Packet.end_seq here, as this variable is used 
in later sections, but not defined here.

For better overview, it might make sense to group variables, e.g., all 
RTTs, all variables directly related to reordering, all sequence 
numbers.

“The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and 
RACK.min_RTT variables are kept in the per-connection TCP control block. 
RACK.packet and RACK.ack_ts are used as local variables in the 
algorithm.”
Aren’t RACK.packet and RACK.ack_ts also per-connection? If so, shouldn’t 
the choice of whether to put these in the per-connection TCP control 
block or to use them as local variables up to the implementer? (Unless 
I’m missing something here, in which case please clarify what the 
difference is between these variables.)

Section 7.2, Upon receiving an ACK:

Step 1: 

"Use the RTT measurements obtained via [RFC6298] or [RFC7323] to update 
the estimated minimum RTT in RACK.min_RTT."
Does this document mandate how to collect individual RTT samples, in 
which case maybe it’s worth highlighting this with a MUST or SHOULD?
Especially as one of the following sentences says "This document does 
not specify an exact approach", but this might only refer to how to 
derive the minimum RTT, not the individual RTT samples. Please clarify.

Do you perform Step 1 even for packets that you later find to be a 
spurious retransmission, as this step occurs before the check for 
spurious retransmissions?
Text in later sections reads as if RACK.min_RTT is not updated for 
spurious retransmissions. In this case, it might be worth merging this 
step with Step 2 and updating all stats after the check for spurious 
retransmissions.

Step 2:

The document first says to record the most recent Packet.xmit_ts if it 
is ahead of RACK.xmit_ts. Again this is before the check for spurious 
retransmissions, at least in the text. The pseudocode performs the check 
first, which makes more sense to me. I think the text should be changed 
to reflect the pseudocode, if this is indeed the intended order of 
steps.

"... the sender would not be able to update its RACK.min_rtt using the 
(ambiguous) RTT samples from retransmissions"
Possibly this whole example is intended as an exercise for the reader, 
but just to be clear: Here the point is that the sender is unable to 
tell whether an ACK for M is from the original transmission or the 
retransmission, right? And therefore it can’t update the RTT? Please 
consider making this example a bit more explicit to make it easier to 
understand.

Step 3:

Maybe consider substituting “false retransmission” with “spurious 
retransmission”, as these terms seem to refer to the same concept.

Step 4:

“RACK starts initially with a conservative window of min_RTT/4"
This is RACK.min_RTT, right? Please use the variable name.

What is the design rationale for increasing reo_wnd by min_RTT/4, and 
not more or less, and for persisting for 16 loss recoveries, and not 
more or less?

SRTT refers to the connection's SRTT, defined similarly as in Section 
7.5 for TLP, right? Maybe it’s worth adding this as a definition in 
Section 6 already.

"Note that extensions that require additional TCP features (e.g. DSACK) 
would work if the feature functions simply return false."
I’m taking this to basically mean “If you don't implement DSACK or if 
you don't get DSACK, just return false from these functions and the 
algorithm below still works”, but I’m unsure if this is correct. Does 
“extensions that require additional TCP features” refer to the 
DSACK-based increases of the reordering window? Because “extensions” to 
me sounds like something that would be defined in another section or 
another RFC, and that would be explicitly marked as an extension.
If the meaning above is correct, please consider making this sentence 
specific to the example of the DSACK-based increase in reordering 
window.

The pseudocode says that RACK.min_RTT is updated here - but wasn't this 
step 1 already? Or is RACK.min_RTT updated twice?

"If SND.UNA < RACK.rtt_seq:
				 RACK.dsack = false /* React to DSACK once per round trip */"
It seems like step 4 is the only part that uses RACK.rtt_seq, but the 
definition in Section 6.1 says “SND.NXT when RACK.rtt is updated.” Isn’t 
RACK.rtt updated in Step 2 already? Is this definition (still) correct?

Another question about the above pseudocode: Why does this mean that the 
algorithm reacts once per round trip? What does the round trip have to 
do with SND.UNA moving above RACK.rtt_seq?

Why do you set RACK.reo_wnd_persist to -1 here, and not to 0?

Step 5:

"the sender MAY install a "reordering settling" timer set to fire at the 
earliest moment at which it is safe to conclude that some packet is 
lost"
This sounds like a really good idea, so why not make this a SHOULD?

"To be more robust to reordering, RACK uses a more conservative RTT 
value to decide if an unacknowledged packet should be considered lost, 
RACK.rtt"
Is it a SHOULD or a MUST to be this conservative? Or are implementations 
also free to use RACK.min_RTT?

s/RACK.time_ts/RACK.xmit_ts/

For the implementation optimization, maybe just briefly say explicitly 
that the outcome of your optimization is that you only check packets 
that have been sent after RACK.xmit_ts, so the advantage is immediately 
obvious to the reader.

Section 7.3, Tail Loss Probe

(editorial) Should this be its own section? After all, it is its own 
separate algorithm?

Please consider substituting  "Common causes of RTOs 
include:" with "Common causes of RTOs, which TLP is intended to 
mitigate, include:" to make the purpose of this section more clear.

Section 7.4, Tail Loss Probe: An Example

Please consider substituting "Following is an example of 
TLP" with "Following is an example of TLP used with RACK".

Section 7.5, Tail Loss Probe Algorithm Details

TLPRxtOut, TLPHighRxt and WCDelAckT seem very different from the 
variable names used otherwise the document, i.e., they are CamelCase, 
while underscores are used in most other variables of the document. Also 
these variable names are really hard to read and pronounce. Unless 
there’s a good reason to call them exactly that (have they been defined 
and used in other documents, in which case maybe it’s worth adding a 
reference?), maybe consider making these variable names more similar.

(editorial) The PTO definition should go on a new line, otherwise it's 
easy to overlook.

Section 7.5.1, Phase 1: Scheduling a loss probe:

(editorial) There is a lot of nesting here, i.e., “phase 1”, “step 1”, 
point 1., 2., 3.. Maybe consider restructuring this part, e.g., let 
phases be different subsections, not subsubsections.

(editorial) In the RACK algorithm, you put the description first and 
then the pseudocode. Here, it's reversed. Maybe it's worth unifying your 
approach? FWIW, I prefer putting the description first because it makes 
the pseudocode easier to follow.

"A sender should schedule a PTO only if all of the following conditions 
are met"
Is this a “SHOULD NOT schedule a PTO unless all of the following 
conditions are met”?

"2. The connection has no SACKed sequences in the SACK scoreboard"
What is the reason for this constraint?

"The RECOMMENDED value for WCDelAckT is 200ms."
Why this value and not another?

s/If FlightSize = 1/If FlightSize == 1/

Section 7.5.2, Phase 2: Sending a loss probe:

SMSS is used here, but not defined. Please expand and add a definition.

Please clarify here already that while you allow only one TLP 
retransmission to be in flight at a time, you allow multiple TLP new 
data to be in flight. This is said later, but it is already relevant for 
understanding this part.

As you don't allow multiple TLP retransmits to be in flight: Does this 
mean the PTO could fire but then you do not send any data, because there 
is no new data available but there is already a TLP retransmit in 
flight?

"the sender MUST arm the RTO timer, not the PTO timer, at the end of 
TLP_send_probe() if FlightSize is not zero."
Please add this to the pseudocode.

"Checking TLPRxtOut prior to sending the loss probe is also critical to 
avoid TLP loops if an application writes periodically at an interval 
less than PTO."
What is a TLP loop? And why would it occur if an application writes 
periodically at an interval less than PTO? Please consider adding a 
definition.

Section 7.5.3: Phase 3: ACK processing

Isn’t this phase a special case of phase 1? “Phase” implies that these 
occur one after the other.

Section 7.6: TLP recovery detection

Is this still part of the TLP algorithm? If so, why is it a new 
subsection and not in the “TLP algorithm details” section, and not 
referenced from it? Or is this an optional addition/extension?

To make the purpose of this algorithm clearer early on, please consider 
briefly stating in the first paragraph that this introduces an algorithm 
to detect whether a packet loss occurred or not, i.e., whether both the 
original packet and the TLP arrived, or whether one of them got lost, 
therefore, it's a loss, therefore, you have to reduce cwnd.

Section 7.6.3, Detecting recoveries accomplished by loss probes:

What is the relationship between this Step 1, 2, 3, and Phase 1, 2, 3 
above? Consider rethinking your structure to make clear what happens 
when. If “Step” and “Phase” refer to the same concept, please consider 
unifying your terminology.

For condition 1: Why does this include "the segment contains no 
data" and "the segment is not a window update"? (Why) Is it not possible 
to detect recovery if either of these is true?

Condition 1 refers to a dup ack for the TLP retransmission, right? And 
receiving this dup ack implies the TLP retransmission was spurious? (I’m 
just trying to make sure I understand correctly, but maybe it's worth 
adding this to the draft to make it easier to follow.)

Condition 2 uses DSACK, but you previously say "Since a significant 
fraction of the hosts that support SACK do not support duplicate 
selective acknowledgments (D-SACKs) [RFC2883] the TLP algorithm for 
detecting such lost segments relies only on basic SACK support 
[RFC2018]."
This seems to be at odds. Isn't it rather "with DSACK you can make this 
detection more accurate, if you don't have it you have to rely on 
condition 1 only"? Please consider rephrasing the sentence starting with 
“Since a significant fraction…”

The last sentence of step 1 says:  "... so the sender considers the TLP 
episode to be done, and records that fact by setting TLPRxtOut to 
false." 
Isn't this already part of step 2, "Mark the end of a TLP retransmission 
episode and detect losses"? Should this part be moved to step 2?

"...it SHOULD invoke a congestion control response equivalent to fast 
recovery."
Is this really a SHOULD or is this a MUST? Because the document also 
says, "If the TLP sender does not receive such an indication before the 
end of the TLP retransmission episode, then it MUST estimate that either 
the original data segment or the TLP retransmission were lost, and 
congestion control MUST react appropriately to that loss as it would any 
other loss."

Section 8.1, Advantages:

"Suppose the transmission of each packet is at least RACK.reo_wnd (1 
millisecond by default)"
Isn't the window RTT_min/4 by default? I think this is what it says 
above. I am not seeing the 1 millisecond mentioned in any other part of 
the document.

"This example illustrates how RACK is able to repair certain drops at 
the tail of a transaction without any timer."
What do you mean by “without any timer”? RACK does potentially use 
timers, doesn't it? Did you mean a specific timer here, like "without 
PTO" or "without RTO"?

Section 8.3, Adjusting the reordering window:

"... the Performance evaluation section ..."
Maybe consider adding a reference to this section.

Section 8.4, Relationships with other loss recovery algorithms

At the very end, this section discusses the relationship between RACK 
and TLP. As this document introduces both algorithms, consider moving 
this discussion somewhere further up in the document. Alternatively, 
please at least mention the existence of this discussion earlier in the 
document and add a reference to the section of this discussion.

Section 8.5, Interaction with congestion control

"RACK may detect losses faster or slower than the conventional duplicate 
ACK threshold approach does.  RACK can detect losses faster by not 
requiring three DUPACKs, so congestion control may reduce the congestion 
window earlier."
Does this text imply that RACK can be used without the three dup ack 
threshold? This seems to be at odds with other parts of the document, 
which clearly state that RACK implies reacting to three dup acks.
If RACK indeed implies using the three dup ack threshold, is this 
paragraph still true?

Similar to my comment on Section 2, please consider briefly clarifying 
whether “interaction” means using two algorithms within the same sender, 
or whether this is about coexistence of different senders.

Section 8.5.1, Example: interactions with congestion control

"With RACK, a sender would send the TLP after 2*RTT"
This reads as if RACK does imply TLP?

Section 8.7, RACK for other transport protocols

"The algorithm can be simplified by skipping step 3 "
Please consider just briefly clarifying if you mean step 3 of RACK, 
"Detect packet reordering". In fact, why does it mean that? Even if you 
have unique transmission or packet identifiers, reordering might still 
occur, right? So don’t we still have to detect packet reordering?

Section 9, Experiments and Performance Evaluations

“We plan to expand our experiments to regions with worse connectivity, 
in particular on networks with strong traffic policing." 
Just curious: Have these experiments actually taken place?

Section 10, Security Considerations

"An interesting scenario is ACK-splitting attacks"
Maybe start the paragraph with something saying "In fact, RACK lowers 
the risk profile because it prevents the following attack"? But then it 
does change the risk profile, which seems to be at odds with the first 
sentence of this section.


This concludes my review. Again, thank you for this draft.

Best,
Theresa Enghardt