Re: [tcpm] TLP questions

Yoshifumi Nishida <nishida@sfc.wide.ad.jp> Thu, 10 May 2018 23:08 UTC

From: Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
Date: Thu, 10 May 2018 16:08:00 -0700
Message-ID: <CAO249ydhwbnGvJNdHwJGBO6h_++mHKY6Xe+n+vFX4vg9rsvuhQ@mail.gmail.com>
To: Neal Cardwell <ncardwell@google.com>
Cc: Praveen Balasubramanian <pravb@microsoft.com>, Priyaranjan Jha <priyarjha@google.com>, "tcpm@ietf.org" <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/aBb4iZlegdsMpyGbEyQHlw_XKGY>
Subject: Re: [tcpm] TLP questions

Hi Neal,

On Wed, May 9, 2018 at 1:27 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, May 9, 2018 at 2:29 PM Praveen Balasubramanian <pravb@microsoft.com>
> wrote:
>>
>> Including tcpm, as this can serve as a review of the TLP portions of
>> draft-ietf-tcpm-rack-03.
>>
>>
>>
>> From: Praveen Balasubramanian
>> Sent: Tuesday, May 8, 2018 6:13 PM
>> To: Yuchung Cheng <ycheng@google.com>; Neal Cardwell
>> <ncardwell@google.com>
>> Subject: TLP questions
>>
>>
>>
>> Hey folks, I have some questions on draft-ietf-tcpm-rack-03 mainly around
>> TLP.
>
>
> Praveen, thank you for all of these excellent questions, and thanks for
> bringing this to the tcpm list for discussion!
>
>>
>>
>>
>> “Open state: the sender's loss recovery state machine is in its normal,
>> default state: there are no SACKed sequence ranges in the SACK scoreboard,
>> and neither fast recovery, timeout-based recovery, nor ECN-based cwnd
>> reduction are underway.”
>>
>> Open state is defined and then never used.
>
> Yes, thanks for noticing. The phrase "Open state" was used in earlier
> revisions of the TLP and RACK drafts as a precondition for sending a Tail
> Loss Probe. We removed its use in draft-ietf-tcpm-rack-03 but forgot to
> remove the definition itself. We noticed shortly after shipping the -03
> text and have already removed that definition in our internal draft for
> draft-ietf-tcpm-rack-04.
>
>>
>> I assume you require that TLP be scheduled only for connections in open
>> state?
>
> That was the original (and long-time) implementation. The "Open" state is
> the Linux TCP stack's term for the default congestion control state, where
> there is nothing in the SACK scoreboard, no ECN cwnd reduction is underway,
> and we're not doing Fast Recovery or RTO recovery. Originally the Linux TCP
> code would only schedule a TLP in the "Open" state. But recently the TCP
> team at Google has analyzed scenarios where this limitation seemed too
> strict. In particular, we were analyzing a case on the public Linux netdev
> mailing list ("Re: Linux ECN Handling") where the connection was in the
> middle of a DCTCP ECN-triggered cwnd reduction and then suffered a ~300ms
> timeout that would have been vastly shorter if there had been a TLP for
> recovery. So we changed the Linux TCP code to allow TLP in ECN-triggered
> cwnd reductions (b4f70c3d4ec32 "tcp: allow TLP in ECN CWR").
>
> This means that the Linux TCP code now allows TLP in the ECN cwnd reduction
> state as well as the "Open" state. We have tried to update the draft in
> draft-ietf-tcpm-rack-03 and, based on your feedback, will try to make this
> crisper and clearer in draft-ietf-tcpm-rack-04. Please see below for a
> proposed update to this part of the text.
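
To make sure I follow, here is a minimal C sketch of the resulting
precondition (the types and names here are my own, not the actual kernel
code):

    #include <stdbool.h>

    /* Hypothetical state for illustration only. */
    enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

    struct tlp_sock {
        enum ca_state state;      /* congestion control state machine */
        unsigned int  sacked_out; /* SACKed ranges in the scoreboard  */
    };

    /* Allow scheduling a TLP in the Open state and, per b4f70c3d4ec32,
     * in the ECN CWR state, but only with an empty SACK scoreboard. */
    static bool tlp_schedulable(const struct tlp_sock *sk)
    {
        return (sk->state == CA_OPEN || sk->state == CA_CWR) &&
               sk->sacked_out == 0;
    }

Is that a fair summary of the intended -04 behavior?
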
>>
>> I wonder why there is the requirement on “no SACKed sequence ranges in the
>> SACK scoreboard”.
>
> This condition dates from the original TLP design, where the motivation
> was to keep the behavior cautious and simple. Now that we have RACK, as a
> practical matter we still don't really want a TLP in such cases. Rather,
> if there is some SACKed sequence range but we are not yet in loss
> recovery, then typically RACK will install a timer based on the reordering
> window, and when that timer fires it will mark some packets lost and enter
> fast recovery.
>
> For this reason, the Linux TCP implementation of TLP+RACK still has "no
> SACKed sequence ranges in the SACK scoreboard" as a precondition for
> scheduling a TLP, and IMHO this still makes sense, so I would propose that
> in draft-ietf-tcpm-rack-04 we document that by adding a clause to the list
> of preconditions for scheduling a TLP ("5.4.1. Phase 1: Scheduling a loss
> probe... A sender should schedule a PTO only if all of the following
> conditions are met"):
>
>      The connection has no SACKed sequences in the SACK scoreboard
>
> Please see below for where this proposed addition fits into the surrounding
> text.
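
To check my reading of the division of labor between the two timers, a
small sketch (the function names are mine, for illustration):

    #include <stdio.h>

    static void arm_rack_reorder_timer(void) { puts("RACK timer armed"); }
    static void arm_tlp_timer(void)          { puts("TLP timer armed"); }

    /* With RACK, SACKed-but-not-yet-lost data is covered by the
     * reordering window timer, so the TLP timer only covers the tail
     * case where nothing has been SACKed yet. */
    static void arm_loss_timer(unsigned int sacked_out)
    {
        if (sacked_out > 0)
            arm_rack_reorder_timer();
        else
            arm_tlp_timer();
    }

Is that the intended relationship between the two timers?
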
>>
>> PTO += 2ms
>>
>> If we already are being conservative by waiting 2*SRTT, then how is adding
>> 2 msec going to help? Was this added due to a real-world issue?
>
> Yes, this is due to real-world issues. In datacenters, with Linux TCP the
> SRTT can be much lower than 100usec, so even 2*SRTT is a very tight timer
> that could result in lots of spurious TLPs and is not even supported by
> the retransmission timer mechanisms in common OSes. So the 2ms is to
> allow for real-world jitter in the network and end hosts. We tried to
> describe some of these issues in the -03 draft:
>
>    Similarly, current end-system processing latencies and timer
>    granularities can easily delay ACKs, so senders SHOULD add at least
>    2ms to a computed PTO value (and MAY add more if the sending host OS
>    timer granularity is more coarse than 1ms).
>
>
> I guess we can add network jitter to this as well, so I'd propose something
> like the following for -04:
>
> Similarly, network delay variations, end-system processing
> latencies, and timer granularities can easily delay ACKs beyond 2*SRTT,
> so senders SHOULD add at least 2ms to a computed PTO value
> (and MAY add more if the sending host OS timer granularity is
> coarser than 1ms).
>
> Instead of 2ms we could add something like RTTVAR from RFC 6298, but that
> would add complexity and delay, especially if the receiver is manifesting
> ~200ms delayed ACKs, and we have not yet seen evidence that it would be a
> good trade-off.
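
Just to check my understanding of the computation, a small sketch (the
function and parameter names are mine; all values in microseconds):

    /* Base the timer on 2*SRTT, pad by at least 2ms for network jitter
     * and delayed ACKs, and pad further if the host OS timer
     * granularity is coarser than 1ms. */
    static unsigned int compute_pto_us(unsigned int srtt_us,
                                       unsigned int timer_gran_us)
    {
        unsigned int pto_us = 2 * srtt_us;

        pto_us += 2000;              /* SHOULD: add at least 2ms */
        if (timer_gran_us > 1000)    /* MAY: coarse host timers  */
            pto_us += timer_gran_us;
        return pto_us;
    }

With a 100usec datacenter SRTT this yields roughly 2.2ms, which seems
consistent with the motivation you describe.
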
>
>> “If a previously unsent segment exists AND
>>
>>          the receive window allows new data to be sent:
>>
>>            Transmit that new segment
>>
>>            FlightSize += SMSS
>>
>>        Else:
>>
>>            Retransmit the last segment”
>>
>> This needs to be crisper about what is meant by “previously unsent” and
>> “last segment”. For example, the sender could have sent a large amount of
>> data and then taken a full RTO. In this case, if the PTO fires, do
>> “previously unsent” and “last segment” refer to MSS-size segments
>> straddling just before and after SND.NXT? Or do they straddle the largest
>> sequence number ever sent in the connection's lifetime?
>
>
> Very good questions. I like your suggestion to be crisper in this section.
> The Linux TCP stack does not "rewind" SND.NXT upon RTO, so in Linux the
> SND.NXT point and the "largest sequence number ever sent in the
> connection's lifetime" are basically the same point. That is the framework
> in which we were thinking for those lines.
>
> In my mind...
>
> By "a previously unsent segment" we mean basically "the next segment (of MSS
> or fewer bytes) that the sender would normally send if it had available cwnd
> at this time." That is something that presumably every production-quality
> TCP stack has very quick access to.
>
> By "the last segment" we mean "the highest-sequence segment (of MSS or fewer
> bytes) that has already been transmitted and not ACKed or SACKed." I imagine
> this should also be generally very quick to access (at least it can be
> quickly accessed in two generations of the Linux TCP write queue). Let us
> know if not, and we can discuss.
>
> Does that help clarify those parts? If so, we can update the text to
> incorporate something like that (suggestions?).

I am thinking that we can just say "send one segment (of SMSS or fewer
bytes) that contains data up to the highest sequence number the sender
has sent."
I am also wondering whether the "not ACKed or SACKed" requirement should
be mandatory, although it would be a bit redundant.
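
Roughly, I picture the selection like this (a sketch; the names and the
receive-window check are my assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlp_snd {
        uint32_t snd_nxt;     /* next new sequence; == highest ever sent,
                                 since SND.NXT is not rewound on RTO */
        uint32_t rwnd_right;  /* right edge of the receive window   */
        bool     have_unsent; /* previously unsent data is queued   */
    };

    /* Probe with new data if available and the receive window allows;
     * otherwise retransmit the last (highest-sequence) segment that
     * was sent but not yet ACKed or SACKed. A real stack would clamp
     * the retransmit case so it does not reach below SND.UNA. */
    static uint32_t tlp_probe_seq(const struct tlp_snd *s, uint32_t smss)
    {
        if (s->have_unsent && s->snd_nxt + smss <= s->rwnd_right)
            return s->snd_nxt;     /* transmit a new segment */
        return s->snd_nxt - smss;  /* retransmit up to highest sent seq */
    }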

>>
>> “On each incoming ACK, the sender should cancel any existing loss  probe
>> timer.”
>>
>> Even on duplicate ACKs?
>
> Excellent question. This line is out of date and does not reflect recent
> fixes our team made in the Linux TCP stack (df92c8394e6e "tcp: fix xmit
> timer to only be reset if data ACKed/SACKed"). In fact, the TLP/RTO timers
> should only be rearmed (cancelled and reinstalled at a later time) if
> certain kinds of real "forward progress" are made. In our internal draft
> for -04 I have proposed we change this to:
>
>   Phase 3: ACK processing
>
>   On each incoming ACK, the sender should check the conditions in Step 1 of
> Phase 1 to see if it should schedule (or reschedule) the loss probe timer.
>
> And then, putting this all together, I would propose something like the
> following text for Step 1 of Phase 1:
>
> ----
>
> Phase 1: Scheduling a loss probe
>
>
> Step 1: Check conditions for scheduling a PTO.
>
> A sender should check to see if it should schedule a PTO in the following
> situations:
>
> After transmitting new data
>
> Upon receiving an ACK that cumulatively acknowledges data
>
> Upon receiving a SACK that selectively acknowledges data that was last sent
> before the segment with SEG.SEQ=SND.UNA was last (re)transmitted
>
>
> A sender should schedule a PTO only if all of the following conditions are
> met:
>
> The connection supports SACK [RFC2018]

Do we need to check this? I mean, RACK already presumes SACK support,
and I thought TLP does as well.
Or do we want to be able to use TLP with non-SACK nodes?

> The connection has no SACKed sequences in the SACK scoreboard
>
> The connection is not in loss recovery
>
> The most recently transmitted data was not itself a TLP probe (i.e. a sender
> MUST NOT send consecutive or back-to-back TLP probes).

I think the last condition (no back-to-back probes) might be a bit
ambiguous and may require another variable. Would checking TLPRtxOut
and TLPHighRxt be sufficient?
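
Something like this sketch is what I have in mind (TLPRtxOut and
TLPHighRxt are the draft's variables; the struct is hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlp_state {
        uint32_t tlp_rtx_out;  /* TLPRtxOut: outstanding TLP rtx count  */
        uint32_t tlp_high_rxt; /* TLPHighRxt: highest seq sent as probe */
    };

    /* ACK processing would reset TLPRtxOut to 0 once SND.UNA advances
     * past TLPHighRxt; until then the previous probe is outstanding
     * and no new PTO is scheduled, ruling out back-to-back probes. */
    static bool tlp_may_schedule_pto(const struct tlp_state *t)
    {
        return t->tlp_rtx_out == 0;
    }
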
--
Yoshi