Re: [tsvwg] QUIC with L4S

Sebastian Moeller <moeller0@gmx.de> Thu, 07 July 2022 09:24 UTC

From: Sebastian Moeller <moeller0@gmx.de>
In-Reply-To: <9fa10cc8-5634-d170-0431-610b809e4968@huitema.net>
Date: Thu, 07 Jul 2022 11:23:55 +0200
Cc: "tsvwg@ietf.org" <tsvwg@ietf.org>
Message-Id: <A301153A-6128-4396-84A7-3F8B80F55F88@gmx.de>
References: <AM8PR07MB8137710DD707BF2DB4FDEA9DC2B89@AM8PR07MB8137.eurprd07.prod.outlook.com> <AM8PR07MB81378A432301907ABFDFC2B8C2B89@AM8PR07MB8137.eurprd07.prod.outlook.com> <77332295-c7b7-21aa-7661-af5770b4c249@huitema.net> <FD39D53A-0A47-4609-930A-DFD6526CA49B@gmx.de> <9fa10cc8-5634-d170-0431-610b809e4968@huitema.net>
To: Christian Huitema <huitema@huitema.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/QgE0TonncY5rhsVEyCSDq0iiVjs>

Hi Christian,


> On Jul 6, 2022, at 23:19, Christian Huitema <huitema@huitema.net> wrote:
> 
> 
> 
> On 7/6/2022 12:09 AM, Sebastian Moeller wrote:
>>> We may also need to do some work on "slow start". My implementation exits slow start on the first ECN-CE mark, but at that point there is a full flight of packets in transit, which is going to cause either packet losses or spikes in latency.
>>> 
>> 	[SM] Is this actually fixable? This seems to require an oracle, or alternatively a quick side-effect-free method to empirically estimate a given flow's immediate path capacity. Then again, current Prague essentially only works reasonably well on short-RTT paths without plain-FIFO bottlenecks, so the scenarios where L4S signaling has been demonstrated to work (to some degree) will not suffer from too much data in flight, no?
> My preferred fix would use a two-epoch cycle: send data for one epoch at a tentative high rate, then use a slower rate for the next epoch while the feedback from sending packets at the high rate is collected.

	[SM] For that to beat current slow start, the high-rate epochs need to be considerably faster than today's (because we want to increase the average rate over the slow-start period, or rather shorten the "race to the capacity limit"). That will cause problems: on nodes where such flows meet and duke it out, traffic will become more volatile (and potentially more prone to oscillation/"resonance" phenomena). I am not saying that faster-than-current slow start is not desirable or not possible, just that it will likely come with its own side effects.
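The two-epoch cycle Christian describes might be sketched roughly as below. Everything here (function names, the probe gain, the CE-mark exit threshold) is my own illustrative assumption, not an actual Prague or QUIC implementation:

```python
def two_epoch_slow_start(base_rate, sample_feedback, probe_gain=2.0,
                         ce_limit=0.05, max_cycles=10):
    """Ratchet the rate upward in two-epoch cycles until feedback objects.

    sample_feedback(rate) models one cycle: send at `rate` for one epoch,
    fall back to the previous safe rate for the next epoch while feedback
    arrives, and return the fraction of packets that came back CE-marked.
    """
    rate = base_rate
    for _ in range(max_cycles):
        probe_rate = rate * probe_gain
        ce_fraction = sample_feedback(probe_rate)
        if ce_fraction > ce_limit:   # the probe overshot: exit "slow start"
            return rate              # last rate known to be plausible
        rate = probe_rate            # plausible rate found; probe higher
    return rate
```

The point of the structure is "make before you break": each rate increase is only committed to after a feedback epoch at the previous safe rate.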


> Based on feedback, either find a plausible rate and get out of slow start, or repeat the two-epoch cycle with a higher tentative rate. This might be combined with creative use of pacing to bunch lots of packets in a burst, then use the transmission time of the burst to assess the high data rate.

	[SM] Well, packet pairs or packets of variable size can both be used to deduce throughput capacity, but neither is super robust or reliable; e.g., any parallel path (a bonded link that does not hash full flows persistently to the same member) can introduce reordering and remove the inter-packet delay as a reliable throughput correlate (similar problems are introduced by all processes/nodes that introduce packet bunching/bursts*). All of which I believe is old news... In the context of L4S there was an attempt at having another go at this issue, I think called "paced chirping", but all that has led to so far is a thesis and an IEEE conference paper (and maybe a few more). So IMHO it is still unclear how robustly this approach works in general over the existing internet (though I do hope it achieves its goals).


*) Like WiFi, or G.INP retransmissions in DSL, I am sure other physical layers with retransmit will have similar properties.
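For what it is worth, the basic packet-pair estimate itself is simple; the fragility discussed above is exactly that the inter-arrival gap gets distorted. A hedged sketch (my own illustration, not any deployed algorithm) that takes the median over many pairs to dampen, though not remove, that distortion:

```python
def packet_pair_estimate(arrival_times_s, packet_size_bytes):
    """Estimate path capacity in bits/s as packet size / inter-arrival gap.

    Uses the median gap over all pairs; reordering (negative gaps) is
    discarded, but bunching from WiFi aggregation or link-layer retransmit
    still skews the remaining gaps.
    """
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])
            if b > a]
    if not gaps:
        return None
    gaps.sort()
    median_gap = gaps[len(gaps) // 2]
    return packet_size_bytes * 8 / median_gap
```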

> That would result in a "make before you break" approach, avoiding too much queuing and too many losses during initial ramp up, which would be nicer for the competing users of the bottleneck.

	[SM] Yes, that would be fine, but the problem is that the measure it wants to use is already known to be unreliable on short time scales (as so often, averaging will help, but costs time).

> BBR uses a similar approach when "searching for bottleneck bandwidth", but it only seeks a maximum 25% improvement in two epochs. That would probably be too slow, and implementations would demand something faster for slow start. And of course we should remember that most connections manage to send all their data before exiting slow start...

	[SM] I am probably wrong, but I thought both BBR and traditional slow start double the rate/cwnd every round/RTT, so it seems that BBR has not "fixed" the slow-start issue (caveat: doubling the sending rate is unlikely to be exactly equivalent to doubling the cwnd, but I naively? assume they will be somewhat similar in effect).
	There will always be flows that finish before a reliable estimate of the flow's share of capacity is available. So the question IMHO becomes: are the side effects of making start-up more aggressive (it is not that exponential growth is not plenty aggressive already*) worth the benefit of speeding up short flows?

*) and it can easily be scaled up by using another factor, say quadrupling each round.
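To put numbers on that footnote: the round count to reach a given capacity shrinks only logarithmically, so scaling the growth factor buys relatively little per unit of added aggressiveness. A toy calculation (illustrative only):

```python
import math

def rounds_to_capacity(init_cwnd, capacity_cwnd, growth_factor):
    """Rounds of multiplicative growth needed for cwnd to reach capacity."""
    return math.ceil(math.log(capacity_cwnd / init_cwnd, growth_factor))
```

For example, growing a cwnd of 10 packets to 10,000 takes 10 rounds at the standard doubling, and still 5 rounds when quadrupling each round.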



> 
>>> I also had to try a couple of different settings for the threshold value used in the simulation. Too low, and the throughput drops. Too high, and the amount of losses increases too much. In the tests, the router has a queue size of 1 BDP, and setting the threshold to BDP/4 appears to work well.
>>> 
>> 	[SM] I thought both legs of an L4S-AQM engage based on sojourn time (either measured directly or estimated from queue size and known egress rate), why does the maximal length of the queue matter here? What AQM are you using?
> I was simulating a straight L4S AQM, with the binary marking: keep ECT0

	[SM] Confused, should that not be ECT(1) for L4S? For ECT(0) the expected behavior would be an RFC 3168 compliant response, IIUC.


> if no queue, switch to CE if queue longer than threshold.

	[SM] Okay, so that uses queue length (in bytes, I assume) as a proxy for sojourn time (which, for a fixed egress rate, it will be).
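As I understand the described marker, it is just a step function over queue length. A minimal sketch, with the BDP/4 threshold from Christian's simulation plugged in purely for illustration:

```python
def l4s_step_mark(queue_bytes, threshold_bytes):
    """CE-mark an arriving L4S packet iff the instantaneous queue exceeds
    the step threshold (queue length standing in for sojourn time)."""
    return queue_bytes > threshold_bytes
```

With a 1 BDP buffer and a threshold of BDP/4, packets are marked whenever more than a quarter of the buffer is occupied.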

> The overall buffer size is only used when not using Prague, or during slow start. The issue was "where to set the threshold". Most of the complexity comes from the interaction between the L4S AQM and the typical "leaky bucket" pacing used by the transport: we can get lots of CE marks if the leaky bucket size is larger than L4S threshold at the bottleneck. And of course the network admin setting the threshold does not know what parameters the end-to-end transport used for pacing. And vice versa.

	[SM] I interpret that as an example of L4S' intolerance to bursty traffic; it is interesting that the small burstiness inherent in a token-bucket pacer is apparently enough to cause noticeable performance degradation. Though not surprising given Pete's data (https://github.com/heistp/l4s-tests#underutilization-with-bursty-links).
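The interaction Christian describes can be shown with a toy queue: a leaky-bucket pacer releasing a back-to-back burst into an initially empty step-marking queue draws CE marks on every packet whose arrival pushes the backlog past the threshold, regardless of the average rate being fine. All parameters below are invented for illustration:

```python
def ce_marks_from_burst(burst_packets, pkt_bytes, drain_bytes_per_pkt_time,
                        threshold_bytes):
    """Count CE marks when a paced burst hits an empty step-marking queue.

    Each step: one packet arrives, the step marker checks the backlog
    against the threshold, then the bottleneck drains what it can in one
    packet-transmission time.
    """
    queue = 0
    marks = 0
    for _ in range(burst_packets):
        queue += pkt_bytes
        if queue > threshold_bytes:      # step marker fires on backlog
            marks += 1
        queue = max(0, queue - drain_bytes_per_pkt_time)
    return marks
```

With, say, a 10-packet burst of 1500 B packets arriving at twice the drain rate and a 4500 B threshold, the tail half of every burst gets marked even though the queue fully drains between bursts.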

Regards
	Sebastian


> 
> -- Christian Huitema
> 
>