Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Wei Wang <weiwan@google.com> Fri, 04 August 2017 16:55 UTC

In-Reply-To: <d2570431-8c01-d7fc-5aa3-581d69836923@bobbriscoe.net>
References: <8abadc4d-4165-a5bc-23bb-e4f9258c695b@bobbriscoe.net> <CAK6E8=c4D0QTzMobMQXLZMU5JiBRXXPdYJ0KTqvg08t+G0VDxQ@mail.gmail.com> <CANn89iL+TC6sh=e+keb4Psxz+E6oHV3Mcvsay6UYL2qEKUT6bw@mail.gmail.com> <2131135f-b123-70f0-d464-dac6640d6cd2@bobbriscoe.net> <d2570431-8c01-d7fc-5aa3-581d69836923@bobbriscoe.net>
From: Wei Wang <weiwan@google.com>
Date: Fri, 04 Aug 2017 09:55:20 -0700
Message-ID: <CAEA6p_CN+w6XH-A=zNEc3SL9gnRF-oH5jKD4Kvkxb3=p_PTBUg@mail.gmail.com>
To: Bob Briscoe <ietf@bobbriscoe.net>
Cc: Eric Dumazet <edumazet@google.com>, Yuchung Cheng <ycheng@google.com>, Neal Cardwell <ncardwell@google.com>, tcpm IETF list <tcpm@ietf.org>
Content-Type: multipart/alternative; boundary="001a11411bcec82dc20555f05c98"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/-a8QLrwjzva1_CPN3FkvyAztkn4>
Subject: Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Hi Bob,

Thanks a lot for your review and detailed feedback on the draft.
Please see my comments inline below:

On Wed, Aug 2, 2017 at 8:54 AM, Bob Briscoe <ietf@bobbriscoe.net> wrote:

> Wei, Yuchung, Neal and Eric, as authors of
> draft-wang-tcpm-low-latency-opt-00,
>
> I promised a review. It questions the technical logic behind the draft, so
> I haven't bothered to give a detailed review of the wording of the draft,
> because that might be irrelevant if you agree with my arguments.
>
> *1/ MAD by configuration?*
>
>    o  If the user does not specify a MAD value, then the implementation
>       SHOULD NOT specify a MAD value in the Low Latency option.
>
> That sentence triggered my "anti-human-intervention" reflex. My train of
> thought went as follows:
>
> * Let's consider what advice we would give on what MAD value ought to be
> configured.
> * You say that MAD can be smaller in DCs. So I assume your advice would be
> that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.
> * So why configure one value of MAD for all RTTs? That only makes sense in
> DC environments where the range of RTTs is small.
> * However, for the range of RTTs on the public Internet, why not calculate
> MAD from RTT and granularity, then standardize the calculation so that both
> ends arrive at the same result when starting from the same RTT and
> granularity parameters? (The sender and receiver might measure different
> smoothed (SRTT) values, but they will converge as the flow progresses.)
>
> Then the receiver only needs to communicate its clock granularity to the
> sender, and the fact that it is driving MAD off its SRTT. Then the sender
> can use a formula for RTO derived from the value of MAD that it calculates
> the receiver will be using. Then its RTO will be completely tailored to the
> RTT of the flow.
>

First of all, we recommend that the operating system provide both a
per-route and a per-connection MAD configuration API, so that different
connections can have different MAD values configured. It is not one value
for all.

Also, in my opinion, the right MAD value does not depend only on RTT and
clock granularity. It also depends on the delayed ACK behavior the
application wants. Some applications might send data only every 1ms, say,
so they would delay their ACKs by up to 2ms in order to always piggyback
the ACK on the data.
That is why a per-connection MAD configuration makes sense: the
application can fine-tune MAD according to its own needs.

When a user tries to set a new MAD value, we do a boundary check to make
sure it is less than the current default MAD value. This is a safety check
to make sure the user does not configure something worse than the current
default.
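
A minimal sketch of such a boundary check, for illustration only. The
function name, the microsecond unit and the 40ms default are placeholders
I made up here, not something the draft specifies, and an implementation
could equally well clamp the value instead of rejecting it:

```python
# Hypothetical names and values; nothing here is taken from the draft.
DEFAULT_MAD_US = 40_000  # assumed placeholder for the system default MAD (40ms)


def validate_user_mad(requested_us: int, default_us: int = DEFAULT_MAD_US) -> int:
    """Return the requested MAD if it passes the boundary check, else raise."""
    if requested_us <= 0:
        raise ValueError("MAD must be positive")
    if requested_us >= default_us:
        # Safety check: the user may only configure a MAD below the default.
        raise ValueError("MAD must be less than the current default")
    return requested_us
```

With these placeholder values, validate_user_mad(1_000) is accepted, while
a request equal to or above the default is rejected.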

As for your question in {Note 2} about why the receiver does not
communicate its clock granularity to the sender: I don't really see why
the receiver-side clock granularity is needed, because the MAD value sent
by the receiver is already rounded to its clock granularity. For example,
if a user wants to set MAD to 1ms and the clock granularity is 10ms, the
receiver will send a MAD value of 10ms. In the draft, we specify that:

      If specified, then the MAD value in the Low Latency option MUST be
      set, as close as possible, to the implementation's actual delayed
      ACK timeout for the connection.  Note that the actual maximum
      delayed ACK timeout of the connection may be larger than the
      actual user specified value because of implementation constraints
      (e.g. timer granularity limitations).
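
To make the 1ms / 10ms example concrete, here is a small sketch. It
assumes the implementation rounds the requested value up to the next
multiple of its timer granularity; the draft only says "as close as
possible", so the rounding rule and the function name are my assumptions:

```python
def effective_mad_us(requested_us: int, granularity_us: int) -> int:
    """Round the requested MAD up to the next multiple of the timer granularity."""
    if requested_us <= 0 or granularity_us <= 0:
        raise ValueError("requested MAD and granularity must be positive")
    ticks = -(-requested_us // granularity_us)  # ceiling division on integers
    return ticks * granularity_us


# A user asks for 1ms on a host with 10ms timer granularity:
advertised = effective_mad_us(1_000, 10_000)  # 10_000 us, i.e. 10ms
```

The value advertised in the option is the rounded one, so the sender never
needs to know the receiver's granularity separately.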



> Note: There are two different uses for the min RTO that need to be
> separated:
>     a) Before an initial RTT value has been measured, to determine the RTO
> during the 3WHS.
>     b) Once either end has measured the RTT for a connection.
> (a) needs to cope with the whole range of possible RTTs, whereas (b) is
> the subject of this email, because it can be tailored for the measured RTT.
>

Again, we don't think the MAD value is a function of RTT and clock
granularity alone.



>
> *2/ The problem, and its prevalence*
>
> With gradual removal of bufferbloat and more prevalent usage of CDNs,
> typical base RTTs on the public Internet now make the value of minRTO and
> of MAD look silly.
>
> As can be seen above, the problem is indeed that each end only has partial
> knowledge of the config of the other end.
> However, the problem is not just that MAD needs to be communicated to the
> other end so it can be hard-coded to a lower value.
> The problem is that MAD is hard-coded in the first place.
>
> The draft needs to say how prevalent the problem is (on the public
> Internet) where the sender has to wait for the receiver's delayed ACK timer
> at the end of a flow or between the end of a volley of packets and the
> start of the next.
>

Noted. We will add more context on how delayed ACKs work and why a long
delayed ACK timeout hurts performance. We also plan to add some history on
why the delayed ACK timeout was configured as a constant in the first place
and why the current constant value was chosen.


>
> The draft also needs to say what tradeoff is considered acceptable between
> a residual level of spurious retransmissions and lower timeout delay.
> Eliminating all spurious retransmissions is not the goal.
>

Noted.


>
> The draft also needs to say that introducing a new TCP Option is itself a
> problem (on the public Internet), because of middleboxes particularly
> proxies. Therefore a solution that does not need a new TCP Option would be
> preferable....
>
>
The draft already has a section that addresses the middlebox issue:
        5. Middlebox Considerations
Does that section explain this well enough?


> Perhaps the solution for communicating timestamp resolution in
> draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites
> draft-trammell-tcpm-timestamp-interval-01) could be modified to also
> communicate:
> * TCP's clock granularity (closely related to TCP timestamp resolution),
> *  and the fact that the host is calculating MAD as a function of RTT and
> granularity.
> Then the existing timestamp option could be repurposed, which should
> drastically reduce deployment problems.
>

I am not sure whether this is doable, but I will look into it.



>
> *3/ Only DC?*
>
> All the related work references are solely in the context of a DC. Pls
> include refs about this problem in a public Internet context. You will find
> there is a pretty good search engine at www.google.com.
>
> The only non-DC ref I can find about minRTO is [Psaras07], which is mainly
> about a proposal to apply minRTO if the sender expects the next ACK to be
> delayed. Nonetheless, the simulation experiment in Section 5.1 provides
> good evidence for how RTO latency is dependent on uncertainty about the MAD
> that the other end is using.
>
> [Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited,"
> In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks,
> Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991
> Springer-Verlag (2007)
> https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited
>

Noted. Thanks a lot for the pointers. We will look into them and add them
to the draft.


>
>
> *4/ Status*
>
> Normally, I wouldn't want to hold up a draft that has been proven over
> years of practice, such as the technique in low-latency-opt, which has been
> proven in Google's DCs over the last few years. Whereas, my ideas are just
> that: ideas, not proven. However, the technique in low-latency-opt has only
> been proven in DC environments where the range of RTTs is limited. So, now
> that you are proposing to transplant it onto the public Internet, it also
> only has the status of an unproven idea.
>
> To be clear, as it stands, I do not think low-latency-opt is applicable to
> the public Internet.
>


Hmm... I think that overall this approach should do no harm to the
network. It gives users an additional way to configure MAD if they care
about it. If they do not, they can leave the default behavior as it is
today.
As for your concern about RTT variation on the Internet: first, as I
explained, the MAD value is set per connection or per route.
Second, I think it would be feasible to add a bounds check or error
correction on the user-supplied MAD value if we find that it is far below
the RTT and does not make sense. But again, we don't think the MAD value
is a function of RTT alone. Users should be able to configure it to a
value that suits their needs.
We want to standardize this so that all operating systems implement it
in the same way and can understand each other. One use case is a cloud
environment where different operating systems run in the same DC: they
should all be able to interpret this option without issue.



>
> *5/ Nits*
> These nits depart from my promise not to comment on details that could
> become irrelevant if you agree with my idea. Hey, whatever,...
>
> S.3.5:
>
> 	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
>
> My immediate reaction to this was that G should not appear twice. However,
> perhaps you meant them to be G_s and G_r (sender and receiver)
> respectively. {Note 2}
>
>
As explained earlier, the receiver's clock granularity is already
accounted for in the MAD value itself. In the above formula, both
occurrences of G are the clock granularity on the sender side.
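
As an illustration of how the sender would evaluate that formula, here is
a small sketch. The function and parameter names are mine, all times are
in microseconds, and K=4 is the value from RFC 6298, which I am assuming
carries over here:

```python
def rto_us(srtt_us: float, rttvar_us: float, max_ack_delay_us: float,
           g_us: float, k: float = 4) -> float:
    """RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay).

    g_us is the sender's clock granularity in both max() terms;
    max_ack_delay_us is the MAD advertised by the receiver, which the
    receiver has already rounded to its own granularity.
    """
    return srtt_us + max(g_us, k * rttvar_us) + max(g_us, max_ack_delay_us)


# Example: SRTT=100us, RTTVAR=10us, advertised MAD=1ms, sender G=1ms
example_rto = rto_us(100, 10, 1_000, 1_000)  # 100 + 1000 + 1000 = 2100 us
```

In this example the sender's granularity dominates the variance term,
while the last term equals the advertised MAD.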



> S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than
> the default (given some companies are already investing in commercial
> public space flight programmes, so TCP could need to routinely support RTTs
> that are longer than typical not just shorter).
>
>

Noted. We will take this into consideration.


>
>
> Cheers
>
>
>
> Bob
>
>
> *{Note 1}*: On average, if not app-limited, the time between ACKs will be
> d_r*R_r/W_s where:
>    R is SRTT
>    d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
>    W is the window in units of segments
>    subscripts X_r or X_s denote receiver or sender for the half-connection.
>
> So as long as the receiver can estimate the varying value of W at the
> sender, the receiver's MAD could be
>     MAD_r = max(k*d_r*R_r / W_s, G_r),
> The factor k (lower case) allows for some bunching of packets e.g. due to
> link layer aggregation or the residual effects of slow-start, which leaves
> some bunching even if SS uses pacing. Let's say k=2, but it would need to
> be checked empirically.
>
> For example, take R=100us, d=2, W=8 and G = 1us.
> Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to
> be greater, but there would certainly be no need for MAD to be 5ms, which
> is perhaps 100 times greater than necessary.
>
> *{Note 2}*: Why is there no field in the Low Latency option to
> communicate receiver clock granularity to the sender?
>
>
> Bob
>
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/
>
>
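
For reference, the arithmetic in the quoted {Note 1} can be checked with a
few lines. This is only a sketch of Bob's suggested formula
MAD_r = max(k*d_r*R_r / W_s, G_r); the function name is made up, and k=2
is the tentative value from the note, not something validated:

```python
def mad_from_rtt_us(srtt_us: float, window_segments: int, granularity_us: float,
                    d: int = 2, k: float = 2) -> float:
    """MAD_r = max(k * d_r * R_r / W_s, G_r), with all times in microseconds."""
    if window_segments <= 0:
        raise ValueError("window must be at least one segment")
    return max(k * d * srtt_us / window_segments, granularity_us)


# Values from the note: R=100us, d=2, W=8, G=1us, k=2
note1_mad = mad_from_rtt_us(100, 8, 1)  # 50.0 us
ratio_to_5ms = 5_000 / note1_mad        # 100.0
```

This reproduces the 50us figure in the note and the factor of 100 relative
to a 5ms MAD.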