Re: [tcpm] TLP questions

Neal Cardwell <ncardwell@google.com> Wed, 09 May 2018 20:27 UTC

From: Neal Cardwell <ncardwell@google.com>
Date: Wed, 09 May 2018 20:27:06 +0000
Message-ID: <CADVnQyk04js7VaFdUKFYg6h8yE2ZzoDMG_EPeS_hKYb_tnesww@mail.gmail.com>
To: Praveen Balasubramanian <pravb@microsoft.com>
Cc: Yuchung Cheng <ycheng@google.com>, "tcpm@ietf.org" <tcpm@ietf.org>, Nandita Dukkipati <nanditad@google.com>, Priyaranjan Jha <priyarjha@google.com>
Content-Type: multipart/alternative; boundary="000000000000fdf30c056bcbba45"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/V-kgHeSOusdSQIxXViwL_iGH9qU>
Subject: Re: [tcpm] TLP questions

On Wed, May 9, 2018 at 2:29 PM Praveen Balasubramanian <pravb@microsoft.com>
wrote:

> Including tcpm as this can be a review of the TLP portions of
> draft-ietf-tcpm-rack-03.
>
>
>
> *From:* Praveen Balasubramanian
> *Sent:* Tuesday, May 8, 2018 6:13 PM
> *To:* Yuchung Cheng <ycheng@google.com>; Neal Cardwell <
> ncardwell@google.com>
> *Subject:* TLP questions
>
>
>
> Hey folks, I have some questions on draft-ietf-tcpm-rack-03 mainly around
> TLP.
>

Praveen, thank you for all of these excellent questions, and thanks for
bringing this to the tcpm list for discussion!


>
>
>    1. “Open state: the sender's loss recovery state machine is in its
>     normal, default state: there are no SACKed sequence ranges in the  SACK
>    scoreboard, and neither fast recovery, timeout-based recovery, nor
>    ECN-based cwnd reduction are underway. “
>
> Open state is defined and then never used.
>
Yes, thanks for noticing. The phrase "Open state" was used in earlier
revisions of the TLP and RACK drafts as a precondition for sending a Tail
Loss Probe. We removed its use in draft-ietf-tcpm-rack-03 but forgot to
remove the definition. We noticed shortly after shipping the -03 text and
have already removed that definition in our internal draft for
draft-ietf-tcpm-rack-04.


> I assume you require that TLP be scheduled only for connections in open
> state?
>
That was the original (and long-time) implementation. The "Open" state is
the Linux TCP stack term for the default congestion control state where
there's nothing in the SACK scoreboard, no ECN cwnd reduction is underway,
and we're not doing Fast Recovery or RTO recovery. Originally the Linux TCP
code would only schedule a TLP in the "Open" state. But recently the TCP
team at Google has analyzed scenarios where this limitation seemed too
strict. In particular, we were analyzing a case in the public Linux netdev
mailing list ("Re: Linux ECN Handling") where the connection was in the
middle of a DCTCP ECN-triggered cwnd reduction, and then suffered a ~300ms
timeout that would have been vastly shorter if there had been a TLP for
recovery. So we changed the Linux TCP code to allow TLP in ECN-triggered
cwnd reductions (b4f70c3d4ec32 "tcp: allow TLP in ECN CWR").

This means that the Linux TCP code is allowing TLP in the ECN cwnd
reduction state as well as the "Open" state.  We have tried to update the
draft in draft-ietf-tcpm-rack-03 and, based on your feedback, will try to
make this more crisp/clear in draft-ietf-tcpm-rack-04. Please see below for
a proposed update to this part of the text.

> I wonder why there is the requirement on “no SACKed sequence ranges in the
>  SACK scoreboard”.
>
This condition dates from the original TLP design, where the motivation
was to keep the behavior cautious and simple. Now that we have
RACK, as a practical matter, we still don't really want a TLP in such
cases. Rather, if there is some SACKed sequence range but we are not yet in
loss recovery, then typically RACK will install a timer based on the
reordering window, and when that timer fires it will mark some packets lost
and enter fast recovery.

For this reason, the Linux TCP implementation of TLP+RACK still has "no
SACKed sequence ranges in the SACK scoreboard" as a precondition for
scheduling a TLP, and IMHO this still makes sense, so I would propose that
in draft-ietf-tcpm-rack-04 we document that by adding a clause to the list
of preconditions for scheduling a TLP ("5.4.1. Phase 1: Scheduling a loss
probe... A sender should schedule a PTO only if all of the following
conditions are met"):

     The connection has no SACKed sequences in the SACK scoreboard

Please see below for where this proposed addition fits into the surrounding
text.

>
>    1. PTO += 2ms
>
> If we already are being conservative by waiting 2*SRTT, then how is adding
> 2 msec going to help? Was this added due to a real world issue?
>
Yes, this is due to real-world issues. In datacenters, with Linux TCP the
SRTT can be much lower than 100usec, so even 2*SRTT is a very tight timer.
A timer that short could result in lots of spurious TLPs, and is not even
supported by current retransmission timer mechanisms in common OSes. So the
2ms is to allow for real-world jitter in the network and end hosts. We tried
to describe some of these issues in the -03 draft:

   Similarly, current end-system processing latencies and timer
   granularities can easily delay ACKs, so senders SHOULD add at least
   2ms to a computed PTO value (and MAY add more if the sending host OS
   timer granularity is more coarse than 1ms).


I guess we can add network jitter to this as well, so I'd propose something
like the following for -04:

   Similarly, network delay variations and end-system processing
   latencies and timer granularities can easily delay ACKs beyond 2*SRTT,
   so senders SHOULD add at least 2ms to a computed PTO value
   (and MAY add more if the sending host OS timer granularity is more
   coarse than 1ms).

Instead of 2ms we could add something like RTTVAR from RFC 6298, but that
would add complexity and delay, especially if the receiver is manifesting
~200ms delayed ACKs. So far we have not seen evidence that this would
be a good trade-off.
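
To make the arithmetic concrete, here is a rough sketch in C of the
2*SRTT-plus-slack calculation discussed above. This is only an
illustration, not the Linux code: the names tlp_pto_us, TLP_MIN_SLACK_US
and extra_slack_us are made up, and the sketch deliberately ignores any
other adjustments the draft applies to the PTO value.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. All times are in microseconds. */
#define TLP_MIN_SLACK_US 2000u /* the "at least 2ms" from the draft text */

/*
 * Compute a probe timeout as 2*SRTT plus slack for network jitter,
 * end-host processing latency, and timer granularity.
 * extra_slack_us models the "MAY add more" clause for hosts whose
 * timer granularity is coarser than 1ms; pass 0 otherwise.
 */
uint32_t tlp_pto_us(uint32_t srtt_us, uint32_t extra_slack_us)
{
    /* Use 64-bit arithmetic so a very large SRTT cannot wrap around. */
    uint64_t pto = 2ull * srtt_us + TLP_MIN_SLACK_US + extra_slack_us;

    return pto > UINT32_MAX ? UINT32_MAX : (uint32_t)pto;
}
```

For example, with a datacenter SRTT of 100usec, 2*SRTT alone would be
200usec, while tlp_pto_us(100, 0) gives 2200usec. With a 50ms SRTT the
2ms slack is small next to the 100ms from 2*SRTT.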


>    1. “If a previously unsent segment exists AND
>
>          the receive window allows new data to be sent:
>
>            Transmit that new segment
>
>            FlightSize += SMSS
>
>        Else:
>
>            Retransmit the last segment”
>
> This needs to be crisper about what is meant by “previously unsent” and
> “last segment”. For example the sender could have sent a large amount of
> data and then taken a full RTO. In this case if PTO fires, do “previously
> unsent” and “last segment” refer to MSS size segments straddling just
> before and  after SND.NXT? OR do they straddle around the largest sent
> sequence number ever in the connection lifetime?
>

Very good questions. I like your suggestion to be crisper in this section.
The Linux TCP stack does not "rewind" SND.NXT upon RTO, so in the Linux TCP
stack the SND.NXT point and "largest sent sequence number ever in the
connection lifetime" are basically the same point. That is the framework in
which we were thinking for those lines.

In my mind...

By "a previously unsent segment" we mean basically "the next segment (of
MSS or fewer bytes) that the sender would normally send if it had available
cwnd at this time." That is something that presumably every
production-quality TCP stack has very quick access to.

By "the last segment" we mean "the highest-sequence segment (of MSS or
fewer bytes) that has already been transmitted and not ACKed or SACKed." I
imagine this should also be generally very quick to access (at least it can
be quickly accessed in two generations of the Linux TCP write queue). Let
us know if not, and we can discuss.

Does that help clarify those parts? If so, we can update the text to
incorporate something like that (suggestions?).
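
As a rough illustration of the two definitions above, here is a small C
sketch of the probe selection. The struct and its fields are hypothetical
(they are not Linux data structures), and "the receive window allows new
data" is simplified to "there is at least one byte of room".

```c
#include <stdbool.h>
#include <stdint.h>

/* Which segment a firing PTO should send. Hypothetical names throughout. */
enum tlp_probe {
    TLP_PROBE_NONE,            /* nothing to probe with */
    TLP_PROBE_NEW,             /* send the next previously unsent segment */
    TLP_PROBE_RETRANSMIT_LAST  /* resend the highest-sequence outstanding segment */
};

struct tlp_sender {
    uint32_t unsent_bytes;  /* queued data that has never been transmitted */
    uint32_t rwnd_room;     /* bytes the receive window still allows (simplified) */
    bool has_outstanding;   /* some transmitted data is neither ACKed nor SACKed */
};

enum tlp_probe tlp_choose_probe(const struct tlp_sender *s)
{
    /* Prefer new data: the segment the sender would send next if cwnd allowed. */
    if (s->unsent_bytes > 0 && s->rwnd_room > 0)
        return TLP_PROBE_NEW;

    /* Otherwise fall back to the highest-sequence unacknowledged segment. */
    if (s->has_outstanding)
        return TLP_PROBE_RETRANSMIT_LAST;

    return TLP_PROBE_NONE;
}
```

Either way the probe is a single segment of MSS or fewer bytes. How a
stack locates that segment in its write queue is outside this sketch.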


>
>    1. “On each incoming ACK, the sender should cancel any existing loss
>    probe timer.”
>
> Even on duplicate ACKs?
>
Excellent question. This line is out of date, and does not reflect recent
fixes our team made in the Linux TCP stack (df92c8394e6e "tcp: fix xmit
timer to only be reset if data ACKed/SACKed").  In fact the TLP/RTO timers
should only be rearmed (cancelled and reinstalled at a later time) if
certain kinds of real "forward progress" are made. In our internal draft
for -04 I have proposed we change this to:

  Phase 3: ACK processing

  On each incoming ACK, the sender should check the conditions in Step 1 of
Phase 1 to see if it should schedule (or reschedule) the loss probe timer.

And then, putting this all together, I would propose something like the
following text for Step 1 of Phase 1:

----

Phase 1: Scheduling a loss probe

Step 1: Check conditions for scheduling a PTO.

A sender should check to see if it should schedule a PTO in the
following situations:

  1. After transmitting new data
  2. Upon receiving an ACK that cumulatively acknowledges data
  3. Upon receiving a SACK that selectively acknowledges data that was
     last sent before the segment with SEG.SEQ=SND.UNA was last
     (re)transmitted

A sender should schedule a PTO only if all of the following conditions
are met:

  1. The connection supports SACK [RFC2018]
  2. The connection has no SACKed sequences in the SACK scoreboard
  3. The connection is not in loss recovery
  4. The most recently transmitted data was not itself a TLP probe
     (i.e. a sender MUST NOT send consecutive or back-to-back TLP
     probes).

If a PTO can be scheduled according to these conditions, the sender
should schedule a PTO. If there was a previously scheduled PTO or RTO
pending, then that pending PTO or RTO should first be cancelled, and
then the new PTO should be scheduled.

If a PTO cannot be scheduled according to these conditions, then the
sender MUST arm the RTO timer if there is unacknowledged data in flight.

----
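
To show how I read the four conditions and the RTO fallback in the
proposed text, here is a rough C sketch of the decision. It is not the
Linux implementation, and all names are hypothetical. The sketch assumes
that no timer is needed when nothing is in flight, which the proposed
text does not spell out for the PTO case.

```c
#include <stdbool.h>

/* Which timer the sender should have armed after the check. */
enum timer_kind { TIMER_NONE, TIMER_PTO, TIMER_RTO };

/* Hypothetical per-connection flags, for illustration only. */
struct conn_state {
    bool sack_enabled;          /* condition 1: connection supports SACK */
    bool sacked_ranges_present; /* condition 2 fails if the scoreboard is non-empty */
    bool in_loss_recovery;      /* condition 3 fails during fast or RTO recovery */
    bool last_xmit_was_tlp;     /* condition 4 fails right after a TLP probe */
    bool data_in_flight;        /* unacknowledged data is outstanding */
};

/*
 * Run after transmitting new data, or on an ACK/SACK that makes forward
 * progress. The caller is expected to cancel any pending PTO or RTO
 * before arming the timer returned here.
 */
enum timer_kind tlp_pick_timer(const struct conn_state *c)
{
    /* Assumption of this sketch: with nothing in flight, no timer is needed. */
    if (!c->data_in_flight)
        return TIMER_NONE;

    /* An ECN cwnd reduction is not loss recovery, so it does not block a PTO. */
    if (c->sack_enabled &&
        !c->sacked_ranges_present &&
        !c->in_loss_recovery &&
        !c->last_xmit_was_tlp)
        return TIMER_PTO;

    /* Conditions not met but data is in flight: fall back to the RTO. */
    return TIMER_RTO;
}
```

So a connection with an empty SACK scoreboard and data in flight gets a
PTO, while the same connection with a SACKed range gets the RTO (and, in
practice, RACK's reordering timer does the loss detection).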

How does that sound?

Praveen, thank you again for all these excellent questions and points!

cheers,
neal