Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Neal Cardwell <ncardwell@google.com> Fri, 04 August 2017 22:21 UTC

From: Neal Cardwell <ncardwell@google.com>
Date: Fri, 04 Aug 2017 18:20:24 -0400
To: Bob Briscoe <ietf@bobbriscoe.net>
Cc: Eric Dumazet <edumazet@google.com>, Yuchung Cheng <ycheng@google.com>, Wei Wang <weiwan@google.com>, tcpm IETF list <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/FmwyMH68zabbCRge9fimKvysgW4>
Subject: Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Thanks, Bob, for your detailed and thoughtful review! This is very
insightful and useful.

Sorry I'm coming to this discussion a little late. I wanted to add a few
points, beyond what Wei has already noted.

On Wed, Aug 2, 2017 at 11:54 AM, Bob Briscoe <ietf@bobbriscoe.net> wrote:

> Wei, Yuchung, Neal and Eric, as authors of
> draft-wang-tcpm-low-latency-opt-00,
>
> I promised a review. It questions the technical logic behind the draft, so
> I haven't bothered to give a detailed review of the wording of the draft,
> because that might be irrelevant if you agree with my arguments.
>
> *1/ MAD by configuration?*
>
>    o  If the user does not specify a MAD value, then the implementation
>       SHOULD NOT specify a MAD value in the Low Latency option.
>
> That sentence triggered my "anti-human-intervention" reflex. My train of
> thought went as follows:
>

Bob's remark about his "anti-human-intervention" reflex being
triggered got me thinking.

I, too, would like to minimize the amount of human (application)
intervention this proposal involves (to avoid errors, maintenance,
etc).

Our experience at Google has actually shown that apps have repeatedly
made mistakes with this value, and we have found it convenient to
progressively narrow their freedom in tuning this knob, to the point
where in our deployment there is very little freedom left. In reality
the OS and TCP stack developers know the timer granularity
considerations, and the apps don't (and tend to use values that are 5
years out of date). So we've found it useful to have the OS tightly
clamp the app's request for a MAD value.

So in the interests of simplicity and avoiding human intervention,
what if we do not have the MAD value as part of the API, but rather
just allow the API to express a single "please use MAD" bit? And then
the transport implementation uses the smallest value that it can
support on this end host.

Can we go further, and make MAD an automatic feature of the TCP
implementation (so the transport implementation hard-wires MAD to "on"
or "off")? My sense is that we don't want to go that far, and that
instead we want to still allow apps to decide whether to use the
"please use MAD" bit. Why? There may be middlebox or remote host
compatibility issues with MAD. So we want apps (like browsers) to be
able to do A/B experiments to validate that sending the MAD option on
SYNs does not cause problems. We don't want to turn on MAD in Linux
and then find compatibility issues, and have to wait for a client OS
upgrade on everyone's cell phone to turn off MAD; instead we want to
only have to wait for an app update.

So... suppose an app decides it is latency-sensitive and wants to
reduce ACK delays and negotiate a MAD value. And furthermore, the app
is either (a) doing A/B experiments, or (b) has already convinced
itself that MAD will work on this path.

Then the app could enable MAD with a simple API like:
   int mad = 1; // enable
   err = setsockopt(fd, SOL_TCP, TCP_MAD, &mad, sizeof(mad));

For better or for worse, that makes the TCP_MAD option much like the
TCP_NODELAY option: latency-sensitive apps need to remember to set
this bit if they want low-latency behavior, and the APIs would look
very similar. TCP_NODELAY and TCP_MAD would also be sort of
complementary: TCP_NODELAY is the app saying "I want low latency for
my sends" and TCP_MAD is the app saying "I want low latency for my
ACKs". My guess is that most low-latency apps will want both.
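
To make that concrete, here is a minimal sketch of what an app might
do. TCP_NODELAY is a real option; TCP_MAD is just the hypothetical
name from this thread, so the constant and its value below are
placeholders for illustration only:

   #include <netinet/in.h>
   #include <netinet/tcp.h>
   #include <sys/socket.h>

   #ifndef TCP_MAD
   #define TCP_MAD 0x7f00  /* placeholder number; no kernel defines this yet */
   #endif

   static int enable_low_latency(int fd)
   {
       int one = 1;

       /* "I want low latency for my sends": disable Nagle.
        * (IPPROTO_TCP is equivalent to SOL_TCP in the snippet above.) */
       if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
           return -1;

       /* "I want low latency for my ACKs": ask the stack to negotiate MAD. */
       if (setsockopt(fd, IPPROTO_TCP, TCP_MAD, &one, sizeof(one)) < 0)
           return -1;

       return 0;
   }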

For the MAD API, I think this might be the "as simple as possible, but
no simpler" point.

That said, that's an API issue. And I think for TCPM we should focus
more on the wire protocol issues.


> * Let's consider what advice we would give on what MAD value ought to be
> configured.
>

I would suggest that the advice be that when an app requests TCP_MAD,
the transport implementation uses the lowest feasible value based on
the end host's hardware/OS/app capabilities and workload. Our sense
from our deployment at Google is that for many current technologies
and workloads this is probably in the range of 5 ms to 10 ms.
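
As a rough sketch of what I mean (my own illustration, not text from
the draft; the inputs and numbers are assumptions):

   /* Illustrative only: how a stack might pick the MAD value it
    * advertises once an app has set the hypothetical TCP_MAD bit. */
   static unsigned int tcp_choose_advertised_mad_ms(void)
   {
       unsigned int timer_granularity_ms = 1;   /* e.g. 1 ms timer resolution */
       unsigned int sched_and_app_slack_ms = 4; /* CPU, softirq, app wakeup slack */
       unsigned int floor_ms = 5;               /* never advertise below this */

       unsigned int mad_ms = timer_granularity_ms + sched_and_app_slack_ms;
       return mad_ms < floor_ms ? floor_ms : mad_ms;
   }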

But I don't think we should get bogged down in a discussion of what this
configured value ought to be. I think we should focus on the simplest
protocol mechanism that can convey to the remote host the minimum
info needed for the remote transport endpoint to achieve excellent
performance.

Here I think of the MSS option as a good analogy (and that's why we
suggested the name "MAD").

For MSS, the point is not to spend time discussing what MSS should be
used, or to come up with complicated formulas to derive MSS. The point
is to have a simple but general mechanism so that, no matter what the
MSS value is (or the underlying hardware constraints are), there is a
simple option that can convey a hint to the remote host. Then the
remote host can use that hint to tune its sending behavior to achieve
good performance.

Now substitute "MAD" in the place of "MSS" in the preceding paragraph. :-)


> * You say that MAD can be smaller in DCs. So I assume your advice would be
> that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.
>

Personally I do not think that MAD should depend on RTT. And I don't think
the draft says that it should (though let me know if there is some spot I
didn't notice).

I'd vote for keeping MAD as simple as possible, which means keeping RTT out
of it. :-)

> * So why configure one value of MAD for all RTTs? That only makes sense in
> DC environments where the range of RTTs is small.
>

I'd recommend one value of MAD for all RTTs for the sake of
simplicity. If we keep MAD as simple as possible, then it remains
purely about the practical delay limitations of the end host (OS
timers, CPU power, CPU load, app behavior, end host queuing delays,
etc). That is what we have found makes sense in our deployment. And
note that our deployment of a MAD-like option covers RTTs that span
quite a range, from <1 ms up to hundreds of ms.

Most OSes I know already have a constant that defines the maximum interval
over which they can delay their ACKs. We are basically just suggesting a
simple wire format for transport endpoints to advertise this existing value
as a hint.
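
For example (to the best of my knowledge; the exact values depend on
HZ and kernel version), Linux expresses its delayed-ACK bounds as
constants along these lines:

   /* From include/net/tcp.h in Linux: the stack already knows its own
    * delayed-ACK ceiling, which is essentially what MAD would advertise. */
   #define TCP_DELACK_MAX ((unsigned)(HZ / 5))   /* max delay, ~200 ms at HZ=1000 */
   #define TCP_DELACK_MIN ((unsigned)(HZ / 25))  /* min delay, ~40 ms at HZ=1000 */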


> * However, for the range of RTTs on the public Internet, why not calculate
> MAD from RTT and granularity, then standardize the calculation so that both
> ends arrive at the same result when starting from the same RTT and
> granularity parameters? (The sender and receiver might measure different
> smoothed (SRTT) values, but they will converge as the flow progresses.)
>
> Then the receiver only needs to communicate its clock granularity to the
> sender, and the fact that it is driving MAD off its SRTT. Then the sender
> can use a formula for RTO derived from the value of MAD that it calculates
> the receiver will be using. Then its RTO will be completely tailored to the
> RTT of the flow.
>

A couple questions here:

- Why should we add the complexity of making MAD dependent on RTT? I'm
not clear on what the benefit of introducing this complexity would be.

- Even if the receiver only communicates its clock granularity to the
sender, plus the fact that it is driving MAD off its SRTT, there is
still the question of *how* it is deriving MAD. Presumably this could
change as we come up with better ideas, so we would then want a
version number field to indicate which calculation is being used. It
seems much simpler to me to let the endpoint communicate a numerical
delay value than to negotiate the version number of a formula that
takes a clock granularity and RTT as input and produces a delay as
output (see the sketch after this list).

- Introducing a dependence on RTT also raises the question of what to
do when there is no RTT estimate (because all packets so far have been
retransmitted, with no timestamps). And as we discussed in Prague and
you mention here, the two sides often have slightly different RTT
estimates. There are probably other wrinkles as well.
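
Here is a minimal sketch of the sender-side arithmetic I have in mind
when I say that conveying a number is simpler, following the S.3.5
formula Bob quotes further down (RTO <- SRTT + max(G, K*RTTVAR) +
max(G, max_ACK_delay)); the function and variable names are mine:

   /* Sketch: fold a peer-advertised MAD value straight into the RTO.
    * K = 4 as in RFC 6298; all times are in microseconds. */
   static unsigned int rto_us(unsigned int srtt_us, unsigned int rttvar_us,
                              unsigned int clock_gran_us,
                              unsigned int peer_mad_us)
   {
       unsigned int var_term = 4 * rttvar_us;        /* K * RTTVAR */

       if (var_term < clock_gran_us)
           var_term = clock_gran_us;                 /* max(G, K*RTTVAR) */
       if (peer_mad_us < clock_gran_us)
           peer_mad_us = clock_gran_us;              /* max(G, max_ACK_delay) */

       return srtt_us + var_term + peer_mad_us;
   }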


>
> Note: There are two different uses for the min RTO that need to be
> separated:
>     a) Before an initial RTT value has been measured, to determine the RTO
> during the 3WHS.
>     b) Once either end has measured the RTT for a connection.
> (a) needs to cope with the whole range of possible RTTs, whereas (b) is
> the subject of this email, because it can be tailored for the measured RTT.
>
> *2/ The problem, and its prevalence*
>
> With gradual removal of bufferbloat and more prevalent usage of CDNs,
> typical base RTTs on the public Internet now make the value of minRTO and
> of MAD look silly.
>
> As can be seen above, the problem is indeed that each end only has partial
> knowledge of the config of the other end.
> However, the problem is not just that MAD needs to be communicated to the
> other end so it can be hard-coded to a lower value.
> The problem is that MAD is hard-coded in the first place.
>
> The draft needs to say how prevalent the problem is (on the public
> Internet) where the sender has to wait for the receiver's delayed ACK timer
> at the end of a flow or between the end of a volley of packets and the
> start of the next.
>
> The draft also needs to say what tradeoff is considered acceptable between
> a residual level of spurious retransmissions and lower timeout delay.
> Eliminating all spurious retransmissions is not the goal.
>
> The draft also needs to say that introducing a new TCP Option is itself a
> problem (on the public Internet), because of middleboxes particularly
> proxies. Therefore a solution that does not need a new TCP Option would be
> preferable....
>
> Perhaps the solution for communicating timestamp resolution in
> draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites
> draft-trammell-tcpm-timestamp-interval-01) could be modified to also
> communicate:
> * TCP's clock granularity (closely related to TCP timestamp resolution),
> *  and the fact that the host is calculating MAD as a function of RTT and
> granularity.
> Then the existing timestamp option could be repurposed, which should
> drastically reduce deployment problems.
>
> *3/ Only DC?*
>
> All the related work references are solely in the context of a DC. Pls
> include refs about this problem in a public Internet context. You will find
> there is a pretty good search engine at www.google.com.
>
> The only non-DC ref I can find about minRTO is [Psaras07], which is mainly
> about a proposal to apply minRTO if the sender expects the next ACK to be
> delayed. Nonetheless, the simulation experiment in Section 5.1 provides
> good evidence for how RTO latency is dependent on uncertainty about the MAD
> that the other end is using.
>
> [Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited,"
> In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks,
> Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991
> Springer-Verlag (2007)
> https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited
>

All great points. Thanks!


>
> *4/ Status*
>
> Normally, I wouldn't want to hold up a draft that has been proven over
> years of practice, such as the technique in low-latency-opt, which has been
> proven in Google's DCs over the last few years. Whereas, my ideas are just
> that: ideas, not proven. However, the technique in low-latency-opt has only
> been proven in DC environments where the range of RTTs is limited. So, now
> that you are proposing to transplant it onto the public Internet, it also
> only has the status of an unproven idea.
>
> To be clear, as it stands, I do not think low-latency-opt is applicable to
> the public Internet.
>

Can you please elaborate on this? Is this because you think there ought to
be a dependence on RTT?


>
>
> *5/ Nits*
> These nits depart from my promise not to comment on details that could become
> irrelevant if you agree with my idea. Hey, whatever,...
>
> S.3.5:
>
> 	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
>
> My immediate reaction to this was that G should not appear twice. However,
> perhaps you meant them to be G_s and G_r (sender and receiver)
> respectively. {Note 2}
>
> S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than
> the default (given some companies are already investing in commercial
> public space flight programmes, so TCP could need to routinely support RTTs
> that are longer than typical, not just shorter).
>
>
> Cheers
>
>
>
> Bob
>
>
> *{Note 1}*: On average, if not app-limited, the time between ACKs will be
> d_r*R_r/W_s where:
>    R is SRTT
>    d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
>    W is the window in units of segments
>    subscripts X_r or X_s denote receiver or sender for the half-connection.
>
> So as long as the receiver can estimate the varying value of W at the
> sender, the receiver's MAD could be
>     MAD_r = max(k*d_r*R_r / W_s, G_r),
> The factor k (lower case) allows for some bunching of packets e.g. due to
> link layer aggregation or the residual effects of slow-start, which leaves
> some bunching even if SS uses pacing. Let's say k=2, but it would need to
> be checked empirically.
>
> For example, take R=100us, d=2, W=8 and G = 1us.
> Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to
> be greater, but there would certainly be no need for MAD to be 5ms, which
> is perhaps 100 times greater than necessary.
>

With the currently popular OS implementations I'm aware of, a 50 us
delayed ACK timer is infeasible. Most have a minimum granularity of
1 ms, 10 ms, or even larger for delayed ACKs. And part of the point of
delayed ACKs is to wait for the application to respond, so that data
can be combined with the ACK; 50 us does not give the app much time to
respond.

Again, IMHO the MAD value needs to incorporate the hardware, software,
and workload constraints on the receiving end host.


>
> *{Note 2}*: Why is there no field in the Low Latency option to
> communicate receiver clock granularity to the sender?
>
>
The idea is that the MAD value is a function of many parameters on the end
host. The clock granularity is only one of them. The simplest way to convey
on the wire a MAD parameter that is a function of many other parameters is
just to convey the MAD value itself.

Bob, thanks again for your detailed and insightful feedback!

neal