Re: [tcpm] TLP questions

Praveen Balasubramanian <pravb@microsoft.com> Thu, 10 May 2018 23:29 UTC

From: Praveen Balasubramanian <pravb@microsoft.com>
To: Neal Cardwell <ncardwell@google.com>
CC: Yuchung Cheng <ycheng@google.com>, "tcpm@ietf.org" <tcpm@ietf.org>, Nandita Dukkipati <nanditad@google.com>, Priyaranjan Jha <priyarjha@google.com>
Thread-Topic: TLP questions
Date: Thu, 10 May 2018 23:29:01 +0000
Message-ID: <CY4PR21MB06301845A5898A725A9E5FD5B6980@CY4PR21MB0630.namprd21.prod.outlook.com>
References: <CY4PR21MB063011EB9ABCD23BABC2EDC0B6990@CY4PR21MB0630.namprd21.prod.outlook.com> <CY4PR21MB0630AF5B03B8C260AD72E366B6990@CY4PR21MB0630.namprd21.prod.outlook.com> <CADVnQyk04js7VaFdUKFYg6h8yE2ZzoDMG_EPeS_hKYb_tnesww@mail.gmail.com>
In-Reply-To: <CADVnQyk04js7VaFdUKFYg6h8yE2ZzoDMG_EPeS_hKYb_tnesww@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/29cNCFUyU4gRj6qiukiiW0oAKtQ>

Thanks Neal for the detailed response along with the historical context. Looking forward to draft 04 updates.

A few more questions and comments.

> then typically RACK will install a timer based on the reordering window, and when that timer fires it will mark some packets lost and enter fast recovery
The "reordering settling" timer is defined as optional in the draft. The Windows implementation currently does not use this timer. Until we add such a timer, we plan not to suppress TLP even if the SACK scoreboard is not empty. However, if recovery is triggered, we’ll cancel the PTO and arm an RTO. Do you see any issues with this approach? Since the draft makes the "reordering settling" timer optional, I think it should suggest this alternative approach.
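
A minimal sketch of the timer handling described above, assuming a sender that keeps one retransmission timer per connection. The names `Timer` and `next_timer` are hypothetical and are not taken from the Windows or Linux stacks. The sketch also assumes that no timer is needed when nothing is in flight, and it ignores the other PTO preconditions discussed later in the thread.

```python
from enum import Enum


class Timer(Enum):
    NONE = "none"
    PTO = "pto"   # probe timeout (tail loss probe)
    RTO = "rto"   # retransmission timeout


def next_timer(in_recovery: bool, data_in_flight: bool) -> Timer:
    """Pick the timer to arm after an ACK has been processed.

    A non-empty SACK scoreboard alone does not block the PTO here; only
    being in loss recovery switches the connection over to the RTO.
    """
    if not data_in_flight:
        return Timer.NONE   # assumption: nothing outstanding to protect
    if in_recovery:
        return Timer.RTO    # cancel any pending PTO and arm the RTO
    return Timer.PTO
```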

> So the 2ms is to allow for real-world jitter in the network and end hosts
Currently the Windows implementation doesn’t use TLP and RACK for connections with an RTT below 10 msec. So until we change this logic, we will skip adding the 2 ms jitter protection.

> By "a previously unsent segment" we mean basically "the next segment (of MSS or fewer bytes) that the sender would normally send if it had available cwnd at this time
This is better but still not crisp enough to determine what exactly the Linux implementation does. Since you said that Linux does not roll back SND.NXT upon RTO, does this imply that "a previously unsent segment" is the one starting at SND.NXT?

> By "the last segment" we mean "the highest-sequence segment (of MSS or fewer bytes) that has already been transmitted and not ACKed or SACKed
Again, given that SND.NXT is not rolled back, this would presumably be the first UN(S)ACKed MSS or fewer bytes walking back from SND.NXT (I am not suggesting an actual walk back; this is just for illustration). I am a bit confused that you used “or SACKed”, since you previously required that the PTO not be armed if any SACK blocks were present. If that is followed, then "the last segment" would pretty much always be the MSS or fewer bytes just before the (non-rolled-back) SND.NXT.

Thanks

From: Neal Cardwell [mailto:ncardwell@google.com]
Sent: Wednesday, May 9, 2018 1:27 PM
To: Praveen Balasubramanian <pravb@microsoft.com>
Cc: Yuchung Cheng <ycheng@google.com>; tcpm@ietf.org; Nandita Dukkipati <nanditad@google.com>; Priyaranjan Jha <priyarjha@google.com>
Subject: Re: TLP questions

On Wed, May 9, 2018 at 2:29 PM Praveen Balasubramanian <pravb@microsoft.com> wrote:
Including tcpm as this can be a review of the TLP portions of draft-ietf-tcpm-rack-03.

From: Praveen Balasubramanian
Sent: Tuesday, May 8, 2018 6:13 PM
To: Yuchung Cheng <ycheng@google.com>; Neal Cardwell <ncardwell@google.com>
Subject: TLP questions

Hey folks, I have some questions on draft-ietf-tcpm-rack-03 mainly around TLP.

Praveen, thank you for all of these excellent questions, and thanks for bringing this to the tcpm list for discussion!

  1.  “Open state: the sender's loss recovery state machine is in its normal, default state: there are no SACKed sequence ranges in the SACK scoreboard, and neither fast recovery, timeout-based recovery, nor ECN-based cwnd reduction are underway.”

Open state is defined and then never used.
Yes, thanks for noticing. The phrase "Open state" was used in earlier revisions of the TLP and RACK drafts as a precondition for sending a Tail Loss Probe. We removed its use in draft-ietf-tcpm-rack-03 but forgot to remove the definition. We noticed shortly after shipping the -03 text and have already removed that definition in our internal draft for draft-ietf-tcpm-rack-04.

I assume you require that TLP be scheduled only for connections in open state?
That was the original (and long-time) implementation. The "Open" state is the Linux TCP stack term for the default congestion control state where there's nothing in the SACK scoreboard, no ECN cwnd reduction is underway, and we're not doing Fast Recovery or RTO recovery. Originally the Linux TCP code would only schedule a TLP in the "Open" state. But recently the TCP team at Google has analyzed scenarios where this limitation seemed too strict. In particular, we were analyzing a case in the public Linux netdev mailing list ("Re: Linux ECN Handling") where the connection was in the middle of a DCTCP ECN-triggered cwnd reduction, and then suffered a ~300ms timeout that would have been vastly shorter if there had been a TLP for recovery. So we changed the Linux TCP code to allow TLP in ECN-triggered cwnd reductions (b4f70c3d4ec32 "tcp: allow TLP in ECN CWR").

This means that the Linux TCP code is allowing TLP in the ECN cwnd reduction state as well as the "Open" state.  We have tried to update the draft in draft-ietf-tcpm-rack-03 and, based on your feedback, will try to make this more crisp/clear in draft-ietf-tcpm-rack-04. Please see below for a proposed update to this part of the text.

I wonder why there is the requirement on “no SACKed sequence ranges in the SACK scoreboard”.
This condition dated from the original TLP design, where originally the motivation was to keep the behavior cautious and simple. Now that we have RACK, as a practical matter, we still don't really want a TLP in such cases. Rather, if there is some SACKed sequence range but we are not yet in loss recovery, then typically RACK will install a timer based on the reordering window, and when that timer fires it will mark some packets lost and enter fast recovery.

For this reason, the Linux TCP implementation of TLP+RACK still has "no SACKed sequence ranges in the SACK scoreboard" as a precondition for scheduling a TLP, and IMHO this still makes sense, so I would propose that in draft-ietf-tcpm-rack-04 we document that by adding a clause to the list of preconditions for scheduling a TLP ("5.4.1. Phase 1: Scheduling a loss probe... A sender should schedule a PTO only if all of the following conditions are met"):

     The connection has no SACKed sequences in the SACK scoreboard

Please see below for where this proposed addition fits into the surrounding text.

  2.  PTO += 2ms

If we already are being conservative by waiting 2*SRTT, then how is adding 2 msec going to help? Was this added due to a real world issue?
Yes, this is due to real-world issues. In datacenters, with Linux TCP the SRTT could be much lower than 100usec, so even 2*SRTT is a very tight timer, and could result in lots of spurious TLPs, and is not even supported by current retransmission timer mechanisms in common OSes. So the 2ms is to allow for real-world jitter in the network and end hosts. We tried to describe some of these issues in the -03 draft:

   Similarly, current end-system processing latencies and timer
   granularities can easily delay ACKs, so senders SHOULD add at least
   2ms to a computed PTO value (and MAY add more if the sending host OS
   timer granularity is more coarse than 1ms).

I guess we can add network jitter to this as well, so I'd propose something like the following for -04:

   Similarly, network delay variations and end-system processing
   latencies and timer granularities can easily delay ACKs beyond 2*SRTT,
   so senders SHOULD add at least 2ms to a computed PTO value
   (and MAY add more if the sending host OS timer granularity is more
   coarse than 1ms).

Instead of 2ms we could add something like RTTVAR from RFC 6298, but that would add complexity and delay, especially if the receiver is manifesting ~200ms delayed ACKs. And so we have not yet seen evidence that that would be a good trade-off.
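
To make the arithmetic concrete, here is a small sketch of the PTO computation discussed above: 2*SRTT plus at least 2ms of slack. The function name `compute_pto` and the `slack_s` parameter are hypothetical, and the sketch covers only the terms mentioned in this thread; it ignores any other adjustments the draft makes to the PTO.

```python
MIN_PTO_SLACK_S = 0.002  # the "at least 2ms" from the quoted draft text


def compute_pto(srtt_s: float, slack_s: float = MIN_PTO_SLACK_S) -> float:
    """Return a probe timeout in seconds: 2*SRTT plus jitter slack.

    slack_s may be raised on hosts with coarse timer granularity, but is
    clamped so that it never drops below 2ms.
    """
    if srtt_s < 0:
        raise ValueError("SRTT must be non-negative")
    return 2.0 * srtt_s + max(slack_s, MIN_PTO_SLACK_S)
```

With a datacenter SRTT of 100usec, 2*SRTT alone would be only 200usec; the added slack brings the PTO to about 2.2ms, so the slack dominates the timer at such small RTTs.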

  3.  “If a previously unsent segment exists AND
         the receive window allows new data to be sent:
           Transmit that new segment
           FlightSize += SMSS
       Else:
           Retransmit the last segment”
This needs to be crisper about what is meant by “previously unsent” and “last segment”. For example, the sender could have sent a large amount of data and then taken a full RTO. In this case, if the PTO fires, do “previously unsent” and “last segment” refer to MSS-size segments straddling just before and after SND.NXT? Or do they straddle the largest sent sequence number ever in the connection lifetime?

Very good questions. I like your suggestion to be crisper in this section. The Linux TCP stack does not "rewind" SND.NXT upon RTO, so in the Linux TCP stack the SND.NXT point and "largest sent sequence number ever in the connection lifetime" are basically the same point. That is the framework in which we were thinking for those lines.

In my mind...

By "a previously unsent segment" we mean basically "the next segment (of MSS or fewer bytes) that the sender would normally send if it had available cwnd at this time." That is something that presumably every production-quality TCP stack has very quick access to.

By "the last segment" we mean "the highest-sequence segment (of MSS or fewer bytes) that has already been transmitted and not ACKed or SACKed." I imagine this should also be generally very quick to access (at least it can be quickly accessed in two generations of the Linux TCP write queue). Let us know if not, and we can discuss.
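
The two definitions above can be sketched as follows, assuming SND.NXT is never rewound and the SACK scoreboard is empty when the probe fires. The function `choose_probe` and its parameters are hypothetical, sequence-number wraparound is ignored, and the retransmitted range is approximated as the final MSS-or-fewer bytes before SND.NXT rather than following the stack's real segment boundaries.

```python
from typing import Optional, Tuple

Probe = Tuple[str, int, int]  # (kind, start_seq, end_seq), end exclusive


def choose_probe(snd_una: int, snd_nxt: int, write_seq: int,
                 mss: int, rwnd: int) -> Optional[Probe]:
    """Pick the byte range a tail loss probe would carry.

    write_seq is the end of the data buffered by the application;
    rwnd is the receive window measured from snd_una.
    """
    window_edge = snd_una + rwnd
    new_end = min(write_seq, snd_nxt + mss, window_edge)
    if new_end > snd_nxt:
        # A previously unsent segment exists and the window allows it.
        return ("new", snd_nxt, new_end)
    if snd_nxt > snd_una:
        # Otherwise retransmit the highest-sequence outstanding data.
        return ("retransmit", max(snd_una, snd_nxt - mss), snd_nxt)
    return None  # nothing unsent and nothing outstanding
```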

Does that help clarify those parts? If so, we can update the text to incorporate something like that (suggestions?).

  4.  “On each incoming ACK, the sender should cancel any existing loss probe timer.”

Even on duplicate ACKs?
Excellent question. This line is out of date, and does not reflect recent fixes our team made in the Linux TCP stack (df92c8394e6e "tcp: fix xmit timer to only be reset if data ACKed/SACKed").  In fact the TLP/RTO timers should only be rearmed (cancelled and reinstalled at a later time) if certain kinds of real "forward progress" are made. On our internal draft for -04 I have proposed we change this to:

  Phase 3: ACK processing

  On each incoming ACK, the sender should check the conditions in Step 1 of Phase 1 to see if it should schedule (or reschedule) the loss probe timer.

And then, putting this all together, I would propose something like the following text for Step 1 of Phase 1:

----
Phase 1: Scheduling a loss probe

Step 1: Check conditions for scheduling a PTO.

A sender should check to see if it should schedule a PTO in the following situations:

  1.  After transmitting new data
  2.  Upon receiving an ACK that cumulatively acknowledges data
  3.  Upon receiving a SACK that selectively acknowledges data that was last sent before the segment with SEG.SEQ=SND.UNA was last (re)transmitted

A sender should schedule a PTO only if all of the following conditions are met:

  1.  The connection supports SACK [RFC2018]
  2.  The connection has no SACKed sequences in the SACK scoreboard
  3.  The connection is not in loss recovery
  4.  The most recently transmitted data was not itself a TLP probe (i.e. a sender MUST NOT send consecutive or back-to-back TLP probes).

If a PTO can be scheduled according to these conditions, the sender should schedule a PTO. If there was a previously scheduled PTO or RTO pending, then that pending PTO or RTO should first be cancelled, and then the new PTO should be scheduled.

If a PTO cannot be scheduled according to these conditions, then the sender MUST arm the RTO timer if there is unacknowledged data in flight.

----
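
The proposed conditions could be checked along these lines. This is only a sketch under stated assumptions: `ConnState`, `can_schedule_pto`, and `timer_to_arm` are hypothetical names, and returning no timer when nothing is in flight is an assumption that the proposed text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnState:
    sack_enabled: bool        # condition 1: connection supports SACK
    sacked_ranges: int        # condition 2: ranges in the SACK scoreboard
    in_loss_recovery: bool    # condition 3: loss recovery underway
    last_sent_was_tlp: bool   # condition 4: last transmission was a probe
    bytes_in_flight: int      # unacknowledged data outstanding


def can_schedule_pto(c: ConnState) -> bool:
    """True only if all four proposed preconditions hold."""
    return (c.sack_enabled
            and c.sacked_ranges == 0
            and not c.in_loss_recovery
            and not c.last_sent_was_tlp)


def timer_to_arm(c: ConnState) -> Optional[str]:
    """Return "PTO", "RTO", or None after cancelling any pending timer."""
    if c.bytes_in_flight == 0:
        return None  # assumption: no timer without outstanding data
    return "PTO" if can_schedule_pto(c) else "RTO"
```

A sender would call `timer_to_arm` at each of the three trigger points listed in the proposed text: after transmitting new data, on a cumulative ACK of data, and on a qualifying SACK.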

How does that sound?

Praveen, thank you again for all these excellent questions and points!

cheers,
neal
