Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Bob Briscoe <ietf@bobbriscoe.net> Sun, 06 August 2017 17:39 UTC

To: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>, Yuchung Cheng <ycheng@google.com>, Wei Wang <weiwan@google.com>, tcpm IETF list <tcpm@ietf.org>

Neal,

On 04/08/17 23:20, Neal Cardwell wrote:
> Thanks, Bob, for your detailed and thoughtful review! This is very 
> insightful and useful.
>
> Sorry I'm coming to this discussion a little late. I wanted to add a 
> few points, beyond what Wei has already noted.
>
> On Wed, Aug 2, 2017 at 11:54 AM, Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
>     Wei, Yuchung, Neal and Eric, as authors of
>     draft-wang-tcpm-low-latency-opt-00,
>
>     I promised a review. It questions the technical logic behind the
>     draft, so I haven't bothered to give a detailed review of the
>     wording of the draft, because that might be irrelevant if you
>     agree with my arguments.
>
>     1/ MAD by configuration?
>
>         o  If the user does not specify a MAD value, then the implementation
>            SHOULD NOT specify a MAD value in the Low Latency option.
>
>     That sentence triggered my "anti-human-intervention" reflex. My
>     train of thought went as follows:
>
>
> Bob's remark about his "anti-human-intervention" reflex being
> triggered got me thinking.
>
> I, too, would like to minimize the amount of human (application)
> intervention this proposal involves (to avoid errors, maintenance,
> etc).
>
> It occurs to me that at Google our experience has shown that apps
> have indeed repeatedly made mistakes with this value, and we have
> found it convenient to progressively narrow their freedom in tuning
> this knob, to the point where in our deployment there is very little
> freedom left. In reality the OS and TCP stack developers know the
> timer granularity considerations, and the apps don't (and tend to use
> values 5 years out of date). So we've found it useful to have the OS
> tightly clamp the app's request for a MAD value.
>
> So in the interests of simplicity and avoiding human intervention,
> what if we do not have the MAD value as part of the API, but rather
> just allow the API to express a single "please use MAD" bit? And then
> the transport implementation uses the smallest value that it can
> support on this end host.
>
> Can we go further, and make MAD an automatic feature of the TCP
> implementation (so the transport implementation hard-wires MAD to "on"
> or "off")? My sense is that we don't want to go that far, and that
> instead we want to still allow apps to decide whether to use the
> "please use MAD" bit. Why? There may be middlebox or remote host
> compatibility issues with MAD. So we want apps (like browsers) to be
> able to do A/B experiments to validate that sending the MAD option on
> SYNs does not cause problems. We don't want to turn on MAD in Linux
> and then find compatibility issues, and have to wait for a client OS
> upgrade to everyone's cell phone to turn off MAD; instead we want to
> only have to wait for an app update.
[BB]: If there are problems, they will be per path, not per app. So 
there could be a cache to record per-path black-holing of packets 
carrying the option (no need to record stripping the option, which would 
be benign). Then no API at all would be needed.
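To make that concrete, here is a minimal sketch of such a
per-destination cache, assuming a fixed-size hash table and an
arbitrary 10-minute retry period (every name and constant below is
illustrative; none of this is from an existing stack):

    #include <stdbool.h>
    #include <string.h>
    #include <time.h>
    #include <netinet/in.h>

    #define MAD_CACHE_SIZE    256
    #define MAD_BLACKHOLE_TTL (10 * 60)   /* seconds before retrying */

    /* One slot: a destination whose path appeared to black-hole SYNs
     * carrying the MAD option, and when to try the option again. */
    struct mad_path_entry {
        struct in6_addr daddr;
        time_t          retry_at;   /* 0 = empty slot */
    };

    static struct mad_path_entry mad_cache[MAD_CACHE_SIZE];

    static unsigned int mad_hash(const struct in6_addr *a)
    {
        unsigned int h = 5381;
        for (int i = 0; i < 16; i++)
            h = h * 33 + a->s6_addr[i];
        return h % MAD_CACHE_SIZE;
    }

    /* Should the next SYN to this destination carry the MAD option? */
    bool mad_option_allowed(const struct in6_addr *daddr)
    {
        struct mad_path_entry *e = &mad_cache[mad_hash(daddr)];
        return !(e->retry_at > time(NULL) &&
                 memcmp(&e->daddr, daddr, sizeof(*daddr)) == 0);
    }

    /* Record a suspected black hole: a SYN carrying the option timed
     * out, but a retried SYN without the option got through. */
    void mad_record_blackhole(const struct in6_addr *daddr)
    {
        struct mad_path_entry *e = &mad_cache[mad_hash(daddr)];
        e->daddr = *daddr;
        e->retry_at = time(NULL) + MAD_BLACKHOLE_TTL;
    }

(Stripping would not trigger this; only a SYN that gets no answer with
the option but succeeds without it.)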

As a fail-safe, you would want a system-wide sysctl to turn on MAD. I 
guess switching that would require an OS upgrade.

Whatever, as you say below, these are not really interop standardization 
issues (but it's still worth airing the possibilities).

>
> So... suppose an app decides it is latency-sensitive and wants to
> reduce ACK delays and negotiate a MAD value. And furthermore, the app
> is either (a) doing A/B experiments, or (b) has already convinced
> itself that MAD will work on this path.
>
> Then the app could enable MAD with a simple API like:
>    int mad = 1; // enable
>    err = setsockopt(fd, SOL_TCP, TCP_MAD, &mad, sizeof(mad));
>
> For better or for worse, that makes the TCP_MAD option much like the
> TCP_NODELAY option. Both in the sense that latency sensitive apps
> should remember to set this bit if they want low-latency behavior. And
> in the sense that the APIs would look very similar. And TCP_NODELAY
> and TCP_MAD would be sort of complementary: TCP_NODELAY is the app
> saying "I want low latency for my sends" and TCP_MAD is the app
> saying "I want low latency for my ACKs". My guess is that most
> low-latency apps will want both.
>
> For the MAD API, I think this might be the "as simple as possible,
> but no simpler" point.
[BB]: Is there an app that wants high-delay loss recovery?
There is no tradeoff here, so pls keep it simple and just enable low 
latency for all connections.

>
> That said, that's an API issue. And I think for TCPM we should focus
> more on the wire protocol issues.
>
>     * Let's consider what advice we would give on what MAD value ought
>     to be configured.
>
>
> I would suggest that the advice be that when an app requests TCP_MAD,
> the transport implementation should use the lowest feasible value
> based on the end-host hardware/OS/app capabilities and workloads.
> Our sense from our deployment at Google is that for many current
> technologies and workloads this is probably in the range of 5-10 ms.
>
> But I don't think we should get bogged down in a discussion of what this
> configured value ought to be.
[BB]: Sorry, perhaps I wasn't clear. I wrote that sentence to ask:
* not "what specific MAD value ought to be configured"
* but rather "what a good MAD value ought to depend on". I pick up on 
this question later...

> I think we should focus on the simplest
> protocol mechanism that can convey to the remote host the minimum
> info needed for the remote transport endpoint to achieve excellent
> performance.
>
> Here I think of the MSS option as a good analogy (and that's why we
> suggested the name "MAD").
>
> For MSS, the point is not to spend time discussing what MSS should be
> used, or to come up with complicated formulas to derive MSS. The point
> is to have a simple but general mechanism so that, no matter what the
> MSS value is (or the underlying hardware constraints are), there is a
> simple option that can convey a hint to the remote host. Then the
> remote host can use that hint to tune its sending behavior to achieve
> good performance.
>
> Now substitute "MAD" in the place of "MSS" in the preceding paragraph. :-)
>
>     * You say that MAD can be smaller in DCs. So I assume your advice
>     would be that MAD should depend on RTT {Note 1} and clock
>     granularity {Note 2}.
>
>
> Personally I do not think that MAD should depend on RTT. And I don't 
> think the draft says that it should (though let me know if there is 
> some spot I didn't notice).
>
> I'd vote for keeping MAD as simple as possible, which means keeping 
> RTT out of it. :-)
>
>     * So why configure one value of MAD for all RTTs? That only makes
>     sense in DC environments where the range of RTTs is small.
>
>
> I'd recommend one value of MAD for all RTTs for the sake of 
> simplicity. If we keep MAD as simple as possible, then it stays just 
> about the practical delay limitations of the end host (OS timers, CPU 
> power, CPU load, app behavior, end host queuing delays, etc). That is 
> what we have found makes sense in our deployment. And note that our 
> deployment of a MAD-like option covers RTTs that span quite a range, 
> from <1 ms up to hundreds of ms.
>
> Most OSes I know already have a constant that defines the maximum 
> interval over which they can delay their ACKs. We are basically just 
> suggesting a simple wire format for transport endpoints to advertise 
> this existing value as a hint.
>
>     * However, for the range of RTTs on the public Internet, why not
>     calculate MAD from RTT and granularity, then standardize the
>     calculation so that both ends arrive at the same result when
>     starting from the same RTT and granularity parameters? (The sender
>     and receiver might measure different smoothed (SRTT) values, but
>     they will converge as the flow progresses.)
>
>     Then the receiver only needs to communicate its clock granularity
>     to the sender, and the fact that it is driving MAD off its SRTT.
>     Then the sender can use a formula for RTO derived from the value
>     of MAD that it calculates the receiver will be using. Then its RTO
>     will be completely tailored to the RTT of the flow.
>
>
> A couple questions here:
>
> - Why should we add the complexity of making MAD dependent on RTT? 
> I'm not clear on what the argument would be for the benefit of 
> introducing this complexity.
>
> - Even if the receiver only communicates its clock granularity to the 
> sender, and the fact that it is driving MAD off its SRTT, then there's 
> the question of *how* it is deriving MAD. Presumably this could 
> change, as we come up with better ideas. So then we would want a 
> version number field to indicate which calculation is being used. It 
> seems much simpler to me to allow the end point to just communicate a 
> numerical delay value, rather than negotiate a version number of a 
> formula that can take a clock granularity and RTT as input and produce 
> a delay as output.
[BB]: Good point.
>
> - Introducing RTT as a dependence also introduces the question of what 
> to do when there is no RTT estimate (because all packets so far have 
> been retransmitted, with no timestamps). And as we discussed in Prague 
> and you mention here, the two sides often have slightly different RTT 
> estimates. There are probably other wrinkles as well.
[BB]: OK. I understood that the pretext for this draft was that the max 
ACK delay is too long for the low RTTs that are often in use these days. 
So I hadn't appreciated that you would advise that MAD would not depend 
on RTT.

Fair enough. I'll go along with this advice for now (but see later). 
However, let's just check that your proposal makes sense in other respects.

Q1. Is there not a risk that a value of MAD solely dependent on the 
receiver's OS parameters will be lower than the typical inter-packet 
arrival time for some flows? E.g. if data packets arrive every 7 ms 
{Note 3} then, even with a del_ack factor of 2, a receiver with MAD = 
5 ms will ACK every packet. In fact, I think it will immediately ACK 
the first packet, then delay the ACK of every subsequent packet by 
5 ms {Note 4} (see the toy simulation after the notes below).

I guess you are saying that would be OK from the point of view of the 
receiver's workload (otherwise it would not have set MAD=5ms). However, 
delayed ACKs are also intended to reduce network workload. {Note 5}.

{Note 3}: With 1500 B packets that implies 1.7 Mb/s, which is more 
than 3x my own ADSL uplink (I live in the developed world, but in a 
rural part of it, where such rates are common and the only alternative 
is 3G, which offers an even slower uplink). :(

{Note 4}: I don't know what implementations do, but RFC 5681 implies 
that a receiver delays the next ACK whenever it sent the previous ACK, 
even if it delayed the previous one. The words are: "MUST be generated 
within <MAD> of the arrival of the first unacknowledged packet,"

{Note 5}: Not to mention that delaying every ACK makes it hard for the 
sender to use the ACKs to monitor queuing delay. However, this might be 
fixed by the separate introduction of a way to measure one-way delay 
using timestamps.
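
To make Q1 concrete, here is a toy simulation of the rule quoted in
{Note 4}, using the numbers above (7 ms arrivals, MAD = 5 ms, del_ack
factor 2). It is only a sketch of my reading of RFC 5681, so in this
model even the first ACK is delayed; a stack that ACKs the first
packet immediately would differ only in that first event:

    #include <stdio.h>

    int main(void)
    {
        const int mad = 5, gap = 7, npkts = 6;  /* ms; 6 data packets */
        int unacked = 0, timer = -1, acks = 0;

        for (int t = 0; t <= npkts * gap + mad; t++) {
            if (t % gap == 0 && t / gap < npkts) {   /* data arrival */
                if (++unacked == 1)
                    timer = t + mad;       /* arm delayed-ACK timer */
                if (unacked == 2) {        /* ACK every 2nd packet */
                    printf("t=%2d ms: immediate ACK\n", t);
                    unacked = 0; timer = -1; acks++;
                }
            }
            if (t == timer) {              /* MAD timer fires first */
                printf("t=%2d ms: delayed ACK\n", t);
                unacked = 0; timer = -1; acks++;
            }
        }
        /* With gap > MAD the timer always wins: one ACK per packet,
         * each delayed by the full 5 ms. */
        printf("%d ACKs for %d packets\n", acks, npkts);
        return 0;
    }

It prints one delayed ACK per packet (at t = 5, 12, 19, ... ms): the
del_ack factor never comes into play, because the timer always expires
before the second packet arrives.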

>
>     Note: There are two different uses for the min RTO that need to be
>     separated:
>         a) Before an initial RTT value has been measured, to determine
>     the RTO during the 3WHS.
>         b) Once either end has measured the RTT for a connection.
>     (a) needs to cope with the whole range of possible RTTs, whereas
>     (b) is the subject of this email, because it can be tailored for
>     the measured RTT.
>
>     2/ The problem, and its prevalence
>     With gradual removal of bufferbloat and more prevalent usage of
>     CDNs, typical base RTTs on the public Internet now make the value
>     of minRTO and of MAD look silly.
>
>     As can be seen above, the problem is indeed that each end only has
>     partial knowledge of the config of the other end.
>     However, the problem is not just that MAD needs to be communicated
>     to the other end so it can be hard-coded to a lower value.
>     The problem is that MAD is hard-coded in the first place.
>
>     The draft needs to say how prevalent the problem is (on the public
>     Internet) where the sender has to wait for the receiver's delayed
>     ACK timer at the end of a flow or between the end of a volley of
>     packets and the start of the next.
>
>     The draft also needs to say what tradeoff is considered acceptable
>     between a residual level of spurious retransmissions and lower
>     timeout delay. Eliminating all spurious retransmissions is not the
>     goal.
>
>     The draft also needs to say that introducing a new TCP Option is
>     itself a problem (on the public Internet), because of middleboxes,
>     particularly proxies. Therefore a solution that does not need a
>     new TCP Option would be preferable....
>
>     Perhaps the solution for communicating timestamp resolution in
>     draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites
>     draft-trammell-tcpm-timestamp-interval-01) could be modified to
>     also communicate:
>     * TCP's clock granularity (closely related to TCP timestamp
>     resolution),
>     *  and the fact that the host is calculating MAD as a function of
>     RTT and granularity.
>     Then the existing timestamp option could be repurposed, which
>     should drastically reduce deployment problems.
>
>     3/ Only DC?
>     All the related work references are solely in the context of a DC.
>     Pls include refs about this problem in a public Internet context.
>     You will find there is a pretty good search engine at
>     www.google.com.
>
>     The only non-DC ref I can find about minRTO is [Psaras07], which
>     is mainly about a proposal to apply minRTO if the sender expects
>     the next ACK to be delayed. Nonetheless, the simulation experiment
>     in Section 5.1 provides good evidence for how RTO latency is
>     dependent on uncertainty about the MAD that the other end is using.
>
>     [Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO
>     Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and
>     Sensor Networks, Wireless Networks, Next Generation Internet
>     (NETWORKING'07), pp.981-991, Springer-Verlag (2007)
>     https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited
>
>
> All great points. Thanks!
>
>
>     4/ Status
>     Normally, I wouldn't want to hold up a draft that has been proven
>     over years of practice, such as the technique in low-latency-opt,
>     which has been proven in Google's DCs over the last few years.
>     Whereas, my ideas are just that: ideas, not proven. However, the
>     technique in low-latency-opt has only been proven in DC
>     environments where the range of RTTs is limited. So, now that you
>     are proposing to transplant it onto the public Internet, it also
>     only has the status of an unproven idea.
>
>     To be clear, as it stands, I do not think low-latency-opt is
>     applicable to the public Internet.
>
>
> Can you please elaborate on this? Is this because you think there 
> ought to be a dependence on RTT?
[BB]: I was trying to judge whether this is a straightforward 
standardization of tried and tested technology, or experimental.

The opinion about inapplicability to the Internet was based on the way 
the config requirements were written: they limited the draft to 
environments covered by a configuration management system, which is not 
typical of the public Internet.

I'm happier now that the focus is moving towards auto-tuning. However, 
this makes my first point about unproven territory even more 
applicable: Google's previous experience becomes less relevant, which 
makes this more experimental/researchy. For instance, the case I 
pointed out above for my own uplink would double the ACK rate, which 
might lead to knock-on problems - perhaps an increase in server 
processing load, or even processor overload on intermediate network 
equipment. We are also likely to discover interactions with ACK-thinning 
middleboxes.

more...
>
>
>
>     5/ Nits
>     These nits depart from my promise not to comment on details that
>     could become irrelevant if you agree with my idea. Hey, whatever,...
>
>     S.3.5:
>
>     	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
>
>     My immediate reaction to this was that G should not appear twice.
>     However, perhaps you meant them to be G_s and G_r (sender and
>     receiver) respectively. {Note 2}
>
>     S.3.5 & S.5. It seems unnecessary to prohibit values of MAD
>     greater than the default (given that some companies are already
>     investing in commercial public space flight programmes, TCP could
>     need to routinely support RTTs that are longer than typical, not
>     just shorter).
>
>
>     Cheers
>
>
>
>     Bob
>
>     {Note 1}: On average, if not app-limited, the time between ACKs
>     will be d_r*R_r/W_s where:
>        R is SRTT
>        d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
>        W is the window in units of segments
>        subscripts X_r or X_s denote receiver or sender for the
>     half-connection.
>
>     So as long as the receiver can estimate the varying value of W at
>     the sender, the receiver's MAD could be
>         MAD_r = max(k*d_r*R_r / W_s, G_r),
>     The factor k (lower case) allows for some bunching of packets e.g.
>     due to link layer aggregation or the residual effects of
>     slow-start, which leaves some bunching even if SS uses pacing.
>     Let's say k=2, but it would need to be checked empirically.
>
>     For example, take R=100us, d=2, W=8 and G = 1us.
>     Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might
>     need to be greater, but there would certainly be no need for MAD
>     to be 5ms, which is perhaps 100 times greater than necessary.
>
>
> With currently popular OS implementations I'm aware of, 50us for a 
> delayed ACK timer is infeasible. Most have a minimum granularity of 
> 1ms, or 10ms, or even larger, for delayed ACKs. And part of the point 
> of delayed ACKs is to wait for applications to respond, so that data 
> can be combined with the ACK. And 50us does not give the app much time 
> to respond.
[BB]: A modern processor can do as much in 50 us as a processor from 
the 1990s could do in about 10 minutes.

The min clock interrupt period has not changed much from the typical 
value of 10ms in 1990 [Dovrolis00]. This minimum is meant to maintain 
performance by keeping a healthy ratio between real work and context 
switching. However, the number of ops that can be processed in this 
duration has increased by about 10^7 over the same period.

I am (genuinely) interested to know: what is the underlying factor that 
limits ACK delay to no less than 1-10 ms? Is it Wirth's Law of software 
bloat (that the same task takes just as long, because increases in 
processing speed are absorbed by increases in code complexity)?

Nonetheless, whatever the clock granularity on a particular OS/machine, 
a stack should still be able to calculate what MAD ought to be. Then the 
final step in the calculation would round MAD up to the interrupt clock 
granularity. At least then code would perform better for OSs that reduce 
their clock granularity.
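
For instance (a sketch with illustrative names; units in
microseconds), the final rounding step could be as simple as:

    /* Round an ideally-calculated MAD up to the next multiple of the
     * OS clock/interrupt granularity. */
    static unsigned int mad_round_up(unsigned int mad_us,
                                     unsigned int granularity_us)
    {
        return (mad_us + granularity_us - 1)
               / granularity_us * granularity_us;
    }

Then an ideal MAD of 50 us stays at 50 us on a machine with 1 us
granularity, but becomes 10 ms on one with 10 ms ticks - and it
improves automatically when the OS improves.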


[Dovrolis00] Dovrolis, C. & Ramanathan, P., "Increasing the Clock 
Interrupt Frequency for Better Support of Real-Time Applications," 
University of Wisconsin-Madison, Dept. of Electrical & Computer 
Engineering, http://www.cc.gatech.edu/~dovrolis/Papers/timers.ps 
(March 2000).


This returns us to the question:
What ought MAD depend on?
I prefer to start by analyzing what the best function and dependencies 
should be, then try to approximate for simplicity. I don't think we 
should start from the other direction ("let's think up a simple way to 
do it, and see if it works"). The former gives insight. The latter risks 
a random stumble across a territory of countless unforeseen problems.

The thought experiment I was conducting in my 'Note 1' above started 
from the idea that MAD ought to be somehow related to the average 
inter-arrival time in a flow. The (unstated) reasoning went like this:
* the retransmission delay after a pause/stop in the data stream ought 
to be similar to the retransmission delay without a pause/stop.
* so the ack delay after a pause/stop in the data stream ought to be 
similar to the ack delay without a pause/stop.

Put more succinctly:
     R + MAD ~= R + d * t_i            (1)
Then by definition:
     t_i = R/W
Which is how I got to
     MAD ~= d * R / W

where the notation is as earlier, plus
     t_i = avg inter-arrival time

Now, moving on to how to simplify this...

I accept your point that MAD has to be above "the practical delay 
limitations of the end host (OS timers, CPU power, CPU load, app 
behavior, end host queuing delays, etc)". Let's wrap that all into a 
variable we'll call g_r (which is itself lower bounded by clock 
granularity G_r).

Eqn (1) shows that the approximation for MAD only has to be within the 
order of the RTT.

Also, setting a lower bound for MAD of the order of an RTT would help to 
prevent the case I raised earlier where MAD is less than the average 
inter-arrival time (because it is abnormal to have <1 packet per RTT). 
Even for ultra-low RTTs, this also protects the network, because network 
processing should be sized to cope with ~1 ACK per RTT.

So, in summary, this is my current preferred approximation (but I'm open 
to others):

     MAD ~= max(c*R_r, g_r)

c is a constant factor determined empirically. I'm not sure whether it 
will be less or greater than 1, so let's assume nominally c=1.
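
In code, the whole receiver-side calculation might look like the
following sketch (microsecond units, nominal c = 1; all names here are
my own assumptions, not from the draft):

    /* MAD ~= max(c*R_r, g_r), rounded up to clock granularity G_r.
     *   srtt_us: receiver's RTT estimate (R_r)
     *   g_r_us:  practical host delay floor (g_r), itself >= G_r
     *   gran_us: clock/interrupt granularity (G_r)
     */
    unsigned int mad_estimate(unsigned int srtt_us,
                              unsigned int g_r_us,
                              unsigned int gran_us)
    {
        const unsigned int c = 1;          /* nominal; tune empirically */
        unsigned int mad = c * srtt_us;

        if (mad < g_r_us)                  /* host delay floor */
            mad = g_r_us;
        /* final step: round up to the clock granularity, as above */
        return (mad + gran_us - 1) / gran_us * gran_us;
    }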

To be clear, I'm accepting your argument that it is simplest for the 
receiver to communicate MAD in the TCP option. So (for now) I'm no 
longer proposing that the sender bases MAD on the RTT it measures 
itself. I.e. the receiver calculates MAD based on its initial estimate 
of the RTT of the connection and other local parameters, then 
communicates it to the sender. It's not important how accurate the RTT 
estimate is. This is just to get a lower bound of roughly the right 
order of magnitude.

You are right that the receiver might not have a good RTT estimate if 
packets within the 3WHS were retransmitted. But, let me assume (for now) 
that we are using TCP timestamps, so a host can get a good RTT estimate 
even with retransmissions (...because I believe it will be easiest to 
deploy this MAD option by repurposing the TCP timestamp, similar to 
draft-scheffenegger-tcpm-timestamp-negotiation-05).
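
For reference, the timestamp arithmetic is just that of RFC 7323. A
one-line sketch, assuming a millisecond timestamp clock:

    /* A segment echoing one of our timestamps (TSecr) yields an RTT
     * sample even when other segments were retransmitted, because the
     * echoed value identifies which transmission is being ACKed. */
    unsigned int rtt_sample_ms(unsigned int now_ts_ms, unsigned int tsecr)
    {
        return now_ts_ms - tsecr;   /* unsigned wrap-around is safe */
    }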

>
> Again, IMHO the MAD needs to incorporate hardware, software, and 
> workload constraints on the receiving end host.

[BB]: As above, delayed ACKs are also about reducing processing load in 
network equipment. If we do not take this into account, we risk networks 
deploying boxes that do it for themselves (e.g. ACK thinning).


>     {Note 2}: Why is there no field in the Low Latency option to
>     communicate receiver clock granularity to the sender?
>
>
> The idea is that the MAD value is a function of many parameters on the 
> end host. The clock granularity is only one of them. The simplest way 
> to convey on the wire a MAD parameter that is a function of many other 
> parameters is just to convey the MAD value itself.
>
> Bob, thanks again for your detailed and insightful feedback!
[BB]: I think we're getting somewhere.

Cheers


Bob

>
> neal
>
>

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/