Re: [tcpm] TLP questions
Yoshifumi Nishida <nishida@sfc.wide.ad.jp> Thu, 10 May 2018 23:08 UTC
From: Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
Date: Thu, 10 May 2018 16:08:00 -0700
To: Neal Cardwell <ncardwell@google.com>
Cc: Praveen Balasubramanian <pravb@microsoft.com>, Priyaranjan Jha <priyarjha@google.com>, "tcpm@ietf.org" <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/aBb4iZlegdsMpyGbEyQHlw_XKGY>
Hi Neal,

On Wed, May 9, 2018 at 1:27 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, May 9, 2018 at 2:29 PM Praveen Balasubramanian
> <pravb@microsoft.com> wrote:
>>
>> Including tcpm as this can be a review of the TLP portions of
>> draft-ietf-tcpm-rack-03.
>>
>> From: Praveen Balasubramanian
>> Sent: Tuesday, May 8, 2018 6:13 PM
>> To: Yuchung Cheng <ycheng@google.com>; Neal Cardwell <ncardwell@google.com>
>> Subject: TLP questions
>>
>> Hey folks, I have some questions on draft-ietf-tcpm-rack-03, mainly
>> around TLP.
>
> Praveen, thank you for all of these excellent questions, and thanks for
> bringing this to the tcpm list for discussion!
>
>> "Open state: the sender's loss recovery state machine is in its normal,
>> default state: there are no SACKed sequence ranges in the SACK
>> scoreboard, and neither fast recovery, timeout-based recovery, nor
>> ECN-based cwnd reduction are underway."
>>
>> Open state is defined and then never used.
>
> Yes, thanks for noticing. The phrase "Open state" was used in earlier
> revisions of the TLP and RACK drafts as a precondition for sending a Tail
> Loss Probe. We removed its use in draft-ietf-tcpm-rack-03 but forgot to
> remove the definition there as well. We noticed shortly after shipping
> the -03 text and have already removed that definition in our internal
> draft for draft-ietf-tcpm-rack-04.
>
>> I assume you require that TLP be scheduled only for connections in Open
>> state?
>
> That was the original (and long-time) implementation. "Open" is the
> Linux TCP stack's term for the default congestion control state, where
> there is nothing in the SACK scoreboard, no ECN cwnd reduction is
> underway, and we are not doing Fast Recovery or RTO recovery. Originally
> the Linux TCP code would only schedule a TLP in the "Open" state, but
> recently the TCP team at Google has analyzed scenarios where this
> limitation seemed too strict.
> In particular, we were analyzing a case on the public Linux netdev
> mailing list ("Re: Linux ECN Handling") where the connection was in the
> middle of a DCTCP ECN-triggered cwnd reduction and then suffered a
> ~300ms timeout that would have been vastly shorter if there had been a
> TLP for recovery. So we changed the Linux TCP code to allow TLP in
> ECN-triggered cwnd reductions (b4f70c3d4ec32 "tcp: allow TLP in ECN
> CWR").
>
> This means that the Linux TCP code now allows TLP in the ECN cwnd
> reduction state as well as the "Open" state. We have tried to update the
> draft in draft-ietf-tcpm-rack-03 and, based on your feedback, will try
> to make this more crisp/clear in draft-ietf-tcpm-rack-04. Please see
> below for a proposed update to this part of the text.
>
>> I wonder why there is the requirement on "no SACKed sequence ranges in
>> the SACK scoreboard".
>
> This condition dates from the original TLP design, where the motivation
> was to keep the behavior cautious and simple. Now that we have RACK, as
> a practical matter, we still don't really want a TLP in such cases.
> Rather, if there is some SACKed sequence range but we are not yet in
> loss recovery, then typically RACK will install a timer based on the
> reordering window, and when that timer fires it will mark some packets
> lost and enter fast recovery.
>
> For this reason, the Linux TCP implementation of TLP+RACK still has "no
> SACKed sequence ranges in the SACK scoreboard" as a precondition for
> scheduling a TLP, and IMHO this still makes sense, so I would propose
> that in draft-ietf-tcpm-rack-04 we document that by adding a clause to
> the list of preconditions for scheduling a TLP ("5.4.1. Phase 1:
> Scheduling a loss probe ... A sender should schedule a PTO only if all
> of the following conditions are met"):
>
>   The connection has no SACKed sequences in the SACK scoreboard
>
> Please see below for where this proposed addition fits into the
> surrounding text.
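[Editor's sketch, not part of the thread: the state check Neal describes, i.e. allowing TLP in "Open" and, after b4f70c3d4ec32, in ECN-triggered cwnd reduction, while still requiring an empty SACK scoreboard, can be illustrated roughly as follows. The state names mirror Linux's tcp_ca_state; the helper itself is hypothetical.]

```python
# Congestion-control states in which a TLP may be scheduled, per the
# discussion above: "Open" (default state) and "CWR" (ECN-triggered cwnd
# reduction, allowed since b4f70c3d4ec32 "tcp: allow TLP in ECN CWR").
TLP_ALLOWED_STATES = {"Open", "CWR"}

def tlp_schedulable(ca_state, sacked_ranges):
    # Still require an empty SACK scoreboard, the precondition Neal
    # proposes keeping for draft-ietf-tcpm-rack-04.
    return ca_state in TLP_ALLOWED_STATES and not sacked_ranges
```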
>> "PTO += 2ms"
>>
>> If we are already being conservative by waiting 2*SRTT, then how is
>> adding 2 msec going to help? Was this added due to a real-world issue?
>
> Yes, this is due to real-world issues. In datacenters, with Linux TCP
> the SRTT can be much lower than 100 usec, so even 2*SRTT is a very tight
> timer that could result in lots of spurious TLPs, and it is not even
> supported by current retransmission timer mechanisms in common OSes. So
> the 2ms is to allow for real-world jitter in the network and end hosts.
> We tried to describe some of these issues in the -03 draft:
>
>   Similarly, current end-system processing latencies and timer
>   granularities can easily delay ACKs, so senders SHOULD add at least
>   2ms to a computed PTO value (and MAY add more if the sending host OS
>   timer granularity is more coarse than 1ms).
>
> I guess we can add network jitter to this as well, so I'd propose
> something like the following for -04:
>
>   Similarly, network delay variations, end-system processing latencies,
>   and timer granularities can easily delay ACKs beyond 2*SRTT, so
>   senders SHOULD add at least 2ms to a computed PTO value (and MAY add
>   more if the sending host OS timer granularity is more coarse than
>   1ms).
>
> Instead of 2ms we could add something like RTTVAR from RFC 6298, but
> that would add complexity and delay, especially if the receiver is
> manifesting ~200ms delayed ACKs, and we have not yet seen evidence that
> it would be a good trade-off.
>
>> "If a previously unsent segment exists AND
>>    the receive window allows new data to be sent:
>>        Transmit that new segment
>>        FlightSize += SMSS
>>    Else:
>>        Retransmit the last segment"
>>
>> This needs to be crisper about what is meant by "previously unsent" and
>> "last segment". For example, the sender could have sent a large amount
>> of data and then taken a full RTO.
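[Editor's sketch, not part of the thread: the PTO arithmetic being discussed, i.e. 2*SRTT plus at least 2ms of slack, can be written out as below. The helper name and the seconds-based units are illustrative, not from the draft.]

```python
def compute_pto(srtt, timer_granularity=0.001):
    """Sketch of the PTO computation discussed above (times in seconds):
    start from the draft's 2*SRTT baseline, then add at least 2ms of
    slack for network jitter, delayed ACKs, and coarse OS timers, per the
    proposed -04 wording (SHOULD add >= 2ms; MAY add more if the host
    timer granularity is coarser than 1ms)."""
    slack = max(0.002, timer_granularity)
    return 2 * srtt + slack
```

With a datacenter SRTT of 100 usec, 2*SRTT alone is only 200 usec; the 2ms slack dominates, which is exactly the point of the addition.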
>> In this case, if the PTO fires, do "previously unsent" and "last
>> segment" refer to MSS-size segments straddling just before and after
>> SND.NXT? Or do they straddle the largest sequence number ever sent in
>> the connection's lifetime?
>
> Very good questions. I like your suggestion to be crisper in this
> section. The Linux TCP stack does not "rewind" SND.NXT upon RTO, so in
> the Linux TCP stack the SND.NXT point and the "largest sequence number
> ever sent in the connection's lifetime" are basically the same point.
> That is the framework in which we were thinking for those lines.
>
> In my mind...
>
> By "a previously unsent segment" we mean basically "the next segment
> (of MSS or fewer bytes) that the sender would normally send if it had
> available cwnd at this time." That is something that presumably every
> production-quality TCP stack has very quick access to.
>
> By "the last segment" we mean "the highest-sequence segment (of MSS or
> fewer bytes) that has already been transmitted and not ACKed or
> SACKed." I imagine this should also generally be very quick to access
> (at least it can be quickly accessed in two generations of the Linux
> TCP write queue). Let us know if not, and we can discuss.
>
> Does that help clarify those parts? If so, we can update the text to
> incorporate something like that (suggestions?).

I am thinking that we can just say "send one segment (of SMSS or fewer
bytes) that contains up to the highest sequence number the sender has
sent." I wonder a bit whether the "not ACKed or SACKed" requirement
should be mandatory, although it would be a bit redundant.

>> "On each incoming ACK, the sender should cancel any existing loss
>> probe timer."
>>
>> Even on duplicate ACKs?
>
> Excellent question. This line is out of date, and does not reflect
> recent fixes our team made in the Linux TCP stack (df92c8394e6e "tcp:
> fix xmit timer to only be reset if data ACKed/SACKed").
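[Editor's sketch, not part of the thread: the probe-selection rule quoted above, using Neal's clarified meanings of "previously unsent segment" and "last segment". All names here are illustrative, not from the draft.]

```python
def choose_tlp_probe(next_unsent, rwnd_allows_new_data, highest_unsacked):
    # "Previously unsent segment": the next new segment the sender would
    # normally transmit if it had available cwnd.
    # "Last segment": the highest-sequence segment already transmitted
    # but not ACKed or SACKed.
    if next_unsent is not None and rwnd_allows_new_data:
        return ("transmit_new", next_unsent)   # FlightSize += SMSS
    return ("retransmit", highest_unsacked)
```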
> In fact, the TLP/RTO timers should only be rearmed (cancelled and
> reinstalled at a later time) if certain kinds of real "forward
> progress" are made. In our internal draft for -04 I have proposed we
> change this to:
>
>   Phase 3: ACK processing
>
>   On each incoming ACK, the sender should check the conditions in Step
>   1 of Phase 1 to see if it should schedule (or reschedule) the loss
>   probe timer.
>
> And then, putting this all together, I would propose something like the
> following text for Step 1 of Phase 1:
>
> ----
>
>   Phase 1: Scheduling a loss probe
>
>   Step 1: Check conditions for scheduling a PTO.
>
>   A sender should check to see if it should schedule a PTO in the
>   following situations:
>
>     After transmitting new data
>
>     Upon receiving an ACK that cumulatively acknowledges data
>
>     Upon receiving a SACK that selectively acknowledges data that was
>     last sent before the segment with SEG.SEQ=SND.UNA was last
>     (re)transmitted
>
>   A sender should schedule a PTO only if all of the following
>   conditions are met:
>
>     The connection supports SACK [RFC2018]

Do we need to check this? I mean, RACK already presumes SACK-capable
nodes, and I thought TLP does as well. Or do we want to use TLP with
non-SACK nodes?

>     The connection has no SACKed sequences in the SACK scoreboard
>
>     The connection is not in loss recovery
>
>     The most recently transmitted data was not itself a TLP probe
>     (i.e., a sender MUST NOT send consecutive or back-to-back TLP
>     probes).

I think the third condition might be a bit ambiguous and may require
another variable. Would checking TLPRtxOut and TLPHighRxt be sufficient?
--
Yoshi
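[Editor's sketch, not part of the thread: the precondition list proposed for Phase 1, Step 1 condenses to a small boolean check. Field names are hypothetical, not from the draft or Linux.]

```python
def should_schedule_pto(conn):
    # Proposed preconditions for scheduling a PTO, in order:
    return (conn["supports_sack"]               # connection supports SACK [RFC2018]
            and not conn["sacked_ranges"]       # no SACKed sequences in the scoreboard
            and not conn["in_loss_recovery"]    # not in loss recovery
            and not conn["last_send_was_tlp"])  # no back-to-back TLP probes
```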