Re: [tcpm] Review of draft-wang-tcpm-low-latency-opt-00

Bob Briscoe <ietf@bobbriscoe.net> Sun, 06 August 2017 11:27 UTC

To: Wei Wang <weiwan@google.com>
Cc: Eric Dumazet <edumazet@google.com>, Yuchung Cheng <ycheng@google.com>, Neal Cardwell <ncardwell@google.com>, tcpm IETF list <tcpm@ietf.org>
References: <8abadc4d-4165-a5bc-23bb-e4f9258c695b@bobbriscoe.net> <CAK6E8=c4D0QTzMobMQXLZMU5JiBRXXPdYJ0KTqvg08t+G0VDxQ@mail.gmail.com> <CANn89iL+TC6sh=e+keb4Psxz+E6oHV3Mcvsay6UYL2qEKUT6bw@mail.gmail.com> <2131135f-b123-70f0-d464-dac6640d6cd2@bobbriscoe.net> <d2570431-8c01-d7fc-5aa3-581d69836923@bobbriscoe.net> <CAEA6p_CN+w6XH-A=zNEc3SL9gnRF-oH5jKD4Kvkxb3=p_PTBUg@mail.gmail.com>
From: Bob Briscoe <ietf@bobbriscoe.net>
Message-ID: <dbd19f9e-cd80-4ac8-8e88-1c56577d8ad6@bobbriscoe.net>
Date: Sun, 06 Aug 2017 12:27:04 +0100
In-Reply-To: <CAEA6p_CN+w6XH-A=zNEc3SL9gnRF-oH5jKD4Kvkxb3=p_PTBUg@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/UFzOi06RH_qHoRD_dNCb4Fnwf6I>

Wei,

On 04/08/17 17:55, Wei Wang wrote:
> Hi Bob,
>
> Thanks a lot for your review and detailed feedback on the draft.
> Please see my comments inline below:
>
> On Wed, Aug 2, 2017 at 8:54 AM, Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
>     Wei, Yuchung, Neal and Eric, as authors of
>     draft-wang-tcpm-low-latency-opt-00,
>
>     I promised a review. It questions the technical logic behind the
>     draft, so I haven't bothered to give a detailed review of the
>     wording of the draft, because that might be irrelevant if you
>     agree with my arguments.
>
>     *1/ MAD by configuration?*
>
>         o  If the user does not specify a MAD value, then the implementation
>            SHOULD NOT specify a MAD value in the Low Latency option.
>
>     That sentence triggered my "anti-human-intervention" reflex. My
>     train of thought went as follows:
>
>     * Let's consider what advice we would give on what MAD value ought
>     to be configured.
>     * You say that MAD can be smaller in DCs. So I assume your advice
>     would be that MAD should depend on RTT {Note 1} and clock
>     granularity {Note 2}.
>     * So why configure one value of MAD for all RTTs? That only makes
>     sense in DC environments where the range of RTTs is small.
>     * However, for the range of RTTs on the public Internet, why not
>     calculate MAD from RTT and granularity, then standardize the
>     calculation so that both ends arrive at the same result when
>     starting from the same RTT and granularity parameters? (The sender
>     and receiver might measure different smoothed (SRTT) values, but
>     they will converge as the flow progresses.)
>
>     Then the receiver only needs to communicate its clock granularity
>     to the sender, and the fact that it is driving MAD off its SRTT.
>     Then the sender can use a formula for RTO derived from the value
>     of MAD that it calculates the receiver will be using. Then its RTO
>     will be completely tailored to the RTT of the flow.
>
>
> First of all, we recommend that operating system should have a 
> per-route MAD configuration API and a per-connection MAD configuration 
> API. So different connections could have different MAD values 
> configured. It is not one value for all.
[BB]: I prefer Neal's subsequent response agreeing that a MAD API is 
fraught with human-intervention problems, and preferring a binary API 
(use MAD or not).

I saw that per-route config is already possible when I checked the Linux 
code. However, per-route config just makes errors more likely, 
particularly because IETF standardization is primarily for the 
Internet, not just DCs. And on the Internet, a large proportion of 
clients are not controlled by a management system.

>
> And in my opinion, what MAD value should be set to is not only 
> depending on RTT and clock granularity. It also depends on how the 
> application wants the delayed ack behavior to be. Some application 
> might only send data say every 1ms, so it will delay its ack up to 2ms 
> so that it can always piggy back the ack to the data.
> That is why a per-connection MAD configuration makes sense for the 
> application to fine tune MAD according to its own demand.
[BB]: This has to be automated for the Internet.

An app only cares if MAD is too long. An app doesn't care if the ACK 
delay is too short. But 'the network' cares if there are too many 
unnecessary ACKs (and this has a knock-on effect on every other app, 
including the original app). So on the public Internet, the stack, not 
the app, is the appropriate place to determine MAD. The app can only be 
trusted to do this in a managed environment.

See response to Neal for further thoughts.


>
> And when user tries to set a new MAD value, we do boundary check to 
> make sure it is less than the current default MAD value. This is a 
> safety check to make sure user does not configure something that is 
> worse than current default value.
[BB]: That warrants a warning on the UI, not prohibition, and certainly 
not silently ignoring the input (see the point I already made below 
about large RTT environments).

>
> About your question in {Note 2} that why receiver does not communicate 
> its clock granularity to the sender, I don't really see a reason why 
> receiver side clock granularity is needed. Because the MAD value sent 
> by receiver is already a value that is rounded to the clock 
> granularity. Say if a user wants to set MAD to 1ms, and the clock 
> granularity is 10ms, receiver will send MAD value as 10ms. In the 
> draft, we specify that:
>
>       If specified, then the MAD value in the Low Latency option MUST be
>       set, as close as possible, to the implementation's actual delayed
>       ACK timeout for the connection.  Note that the actual maximum
>       delayed ACK timeout of the connection may be larger than the
>       actual user specified value because of implementation constraints
>              (e.g. timer granularity limitations).
[BB]: Understood. I should have made clear that my question was only 
relevant if you accepted my argument that the sender would calculate 
what the receiver would use for MAD (from RTT and granularity).

See my response to Neal for further thoughts.
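
To make that concrete, the calculation in {Note 1} below could be 
sketched roughly as follows. This is only an illustration in Python: the 
function name mad_from_rtt and its parameter names are made up here, the 
defaults d=2 and k=2 are the tentative values from the note (k would 
need checking empirically), and it assumes the receiver already has some 
estimate of the sender's window W_s.

```python
def mad_from_rtt(srtt_us, window_segments, granularity_us, d=2, k=2):
    """Sketch of MAD_r = max(k * d_r * R_r / W_s, G_r) from {Note 1}.

    All times are in microseconds. The name and signature are
    hypothetical; nothing here is taken from the draft or from Linux.
    """
    if window_segments <= 0:
        raise ValueError("window must be at least one segment")
    # Average spacing between ACKs is d*R/W; k allows for packet bunching.
    ack_spacing_us = d * srtt_us / window_segments
    # Never go below the receiver's own clock granularity.
    return max(k * ack_spacing_us, granularity_us)


# Worked example from {Note 1}: R=100us, d=2, W=8, G=1us, k=2
print(mad_from_rtt(srtt_us=100, window_segments=8, granularity_us=1))  # 50.0
```

With a coarse clock the granularity term takes over, e.g. 
mad_from_rtt(100, 8, 1000) gives 1000.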

>
>
>
>     Note: There are two different uses for the min RTO that need to be
>     separated:
>         a) Before an initial RTT value has been measured, to determine
>     the RTO during the 3WHS.
>         b) Once either end has measured the RTT for a connection.
>     (a) needs to cope with the whole range of possible RTTs, whereas
>     (b) is the subject of this email, because it can be tailored for
>     the measured RTT.
>
>
> Again, we don't think MAD value is only a function of RTT and clock 
> granularity.
>
>
>     *2/ The problem, and its prevalence*
>     With gradual removal of bufferbloat and more prevalent usage of
>     CDNs, typical base RTTs on the public Internet now make the value
>     of minRTO and of MAD look silly.
>
>     As can be seen above, the problem is indeed that each end only has
>     partial knowledge of the config of the other end.
>     However, the problem is not just that MAD needs to be communicated
>     to the other end so it can be hard-coded to a lower value.
>     The problem is that MAD is hard-coded in the first place.
>
>     The draft needs to say how prevalent the problem is (on the public
>     Internet) where the sender has to wait for the receiver's delayed
>     ACK timer at the end of a flow or between the end of a volley of
>     packets and the start of the next.
>
>
> Noted. We will add more contexts on how delayed ack works and why long 
> delayed ack time is hurting performance. We are also planning on 
> adding some history about why delayed ack was configured as a constant 
> in the first place and why the current constant value was chosen.
>
>
>     The draft also needs to say what tradeoff is considered acceptable
>     between a residual level of spurious retransmissions and lower
>     timeout delay. Eliminating all spurious retransmissions is not the
>     goal.
>
>
> Noted.
>
>
>     The draft also needs to say that introducing a new TCP Option is
>     itself a problem (on the public Internet), because of middleboxes
>     particularly proxies. Therefore a solution that does not need a
>     new TCP Option would be preferable....
>
>
> There is already a section in the draft that states the middle box issue:
>         5. Middlebox Considerations
> Is that portion a good enough explanation on this?
[BB]: I'm afraid not.

1/ The likelihood that the option is stripped (e.g. by proxies) is not 
mentioned. It only mentions the likelihood the whole SYN is discarded 
because of the option. That was why I pointed out it may be possible to 
redesign this without a new TCP option, by repurposing the timestamp 
option in a similar way to tcpm-timestamp-negotiation (note: 'similar' 
means using the ideas, not necessarily the exact same scheme).

2/ The first bullet relies on data about middleboxes that Michio 
gathered 6 years ago. Google has the ability to verify the current position.

3/ The second bullet would be irrelevant if you accept my point that the 
option needs to support larger RTTs, not just smaller. Nonetheless, there 
is little evidence that middleboxes alter the fields in unknown options.

>     Perhaps the solution for communicating timestamp resolution in
>     draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites
>     draft-trammell-tcpm-timestamp-interval-01) could be modified to
>     also communicate:
>     * TCP's clock granularity (closely related to TCP timestamp
>     resolution),
>     *  and the fact that the host is calculating MAD as a function of
>     RTT and granularity.
>     Then the existing timestamp option could be repurposed, which
>     should drastically reduce deployment problems.
>
>
> I am not sure if this is doable but will look into it.
>
>
>     *3/ Only DC?*
>     All the related work references are solely in the context of a DC.
>     Pls include refs about this problem in a public Internet context.
>     You will find there is a pretty good search engine at
>     www.google.com.
>
>     The only non-DC ref I can find about minRTO is [Psaras07], which
>     is mainly about a proposal to apply minRTO if the sender expects
>     the next ACK to be delayed. Nonetheless, the simulation experiment
>     in Section 5.1 provides good evidence for how RTO latency is
>     dependent on uncertainty about the MAD that the other end is using.
>
>     [Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO
>     Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and
>     Sensor Networks, Wireless Networks, Next Generation Internet
>     NETWORKING'07 pp.981-991 Springer-Verlag (2007)
>     https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited
>
>
> Noted. Thanks a lot for the pointers. Will look into them and add to 
> the draft.
>
>
>
>     *4/ Status*
>     Normally, I wouldn't want to hold up a draft that has been proven
>     over years of practice, such as the technique in low-latency-opt,
>     which has been proven in Google's DCs over the last few years.
>     Whereas, my ideas are just that: ideas, not proven. However, the
>     technique in low-latency-opt has only been proven in DC
>     environments where the range of RTTs is limited. So, now that you
>     are proposing to transplant it onto the public Internet, it also
>     only has the status of an unproven idea.
>
>     To be clear, as it stands, I do not think low-latency-opt is
>     applicable to the public Internet.
>
>
>
> Hmm... I think overall, this approach should not do any harm to the 
> network. It provides an additional feature to let the user configure 
> the MAD if the user cares about it. If not, they can leave it as the 
> default behavior as it is right now.
> To your concerns about the RTT variation in the internet, first, as I 
> explained, this MAD value will be set per connection or per route. 
> Secondly, I would think it is doable to do some bound check or error 
> correction on the MAD value set by the user if we find that it is way 
> below RTT and does not make sense. But again, we don't think MAD value 
> is only a function of RTT. User should be able to configure it to a 
> value suitable for his/her need.
> We want to make it as a standard so that all operating systems could 
> implement this in the same way so that they could understand each 
> other. One use case is that in a cloud environment where different 
> operating systems are running in the same DC, they should be able to 
> interpret this option with no issue.
[BB]: Yes, I guessed that this was probably Google's real motivation 
for standardizing this. With the current config constraint 
text, it is limited to managed environments, which would make it 
uninteresting to many at the IETF.

Fortunately, I think the line of thinking between Neal & me is already 
widening applicability to unmanaged environments.

>
>
>     *5/ Nits*
>     These nits depart from my promise not to comment on details that
>     could become irrelevant if you agree with my idea. Hey, whatever,...
>
>     S.3.5:
>
>     	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
>
>     My immediate reaction to this was that G should not appear twice.
>     However, perhaps you meant them to be G_s and G_r (sender and
>     receiver) respectively. {Note 2}
>
>
> As explained earlier, clock granularity of the receiver is already 
> being considered in the MAD value itself. In the above formula, both G 
> are the clock granularity on the sender side.
[BB]: Then it should not be necessary to round up 2 terms to the same 
granularity. Would it not be correct to use:

	RTO <- SRTT + max(G, (K*RTTVAR + max_ACK_delay) )
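
To show where the two forms differ, here is a rough sketch in Python. 
It is only an illustration: the function names are made up, all times 
are integers in microseconds, and K defaults to 4 as in RFC 6298, which 
I am assuming the draft keeps.

```python
def rto_draft(srtt, rttvar, max_ack_delay, g, k=4):
    """Draft S.3.5: each of the two terms is rounded up to G separately."""
    return srtt + max(g, k * rttvar) + max(g, max_ack_delay)


def rto_single_rounding(srtt, rttvar, max_ack_delay, g, k=4):
    """Alternative above: round the sum of the two terms up to G once."""
    return srtt + max(g, k * rttvar + max_ack_delay)


# Hypothetical inputs: SRTT=10ms, RTTVAR=100us, max_ACK_delay=500us, G=1ms.
# Both K*RTTVAR (400us) and max_ACK_delay (500us) are below G.
print(rto_draft(10000, 100, 500, 1000))            # 12000
print(rto_single_rounding(10000, 100, 500, 1000))  # 11000
```

The two only diverge when at least one of the terms is below G. Once 
both K*RTTVAR and max_ACK_delay exceed G, they give the same RTO.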


>
>     S.3.5 & S.5. It seems unnecessary to prohibit values of MAD
>     greater than the default (given some companies are already
>     investing in commercial public space flight programmes, so TCP
>     could need to routinely support RTTs that are longer than typical
>     not just shorter).
>
>
> Noted. Will take consideration of this.

Regards



Bob
>
>
>     Cheers
>
>
>
>     Bob
>
>
>     *{Note 1}*: On average, if not app-limited, the time between ACKs
>     will be d_r*R_r/W_s where:
>        R is SRTT
>        d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
>        W is the window in units of segments
>        subscripts X_r or X_s denote receiver or sender for the
>     half-connection.
>
>     So as long as the receiver can estimate the varying value of W at
>     the sender, the receiver's MAD could be
>         MAD_r = max(k*d_r*R_r / W_s, G_r),
>     The factor k (lower case) allows for some bunching of packets e.g.
>     due to link layer aggregation or the residual effects of
>     slow-start, which leaves some bunching even if SS uses pacing.
>     Let's say k=2, but it would need to be checked empirically.
>
>     For example, take R=100us, d=2, W=8 and G = 1us.
>     Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might
>     need to be greater, but there would certainly be no need for MAD
>     to be 5ms, which is perhaps 100 times greater than necessary.
>
>     *{Note 2}*: Why is there no field in the Low Latency option to
>     communicate receiver clock granularity to the sender?
>
>
>     Bob
>
>     -- 
>     ________________________________________________________________
>     Bob Briscoe                               http://bobbriscoe.net/
>
>

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/
