Re: [rtcweb] Congestion Control proposal

Colin Perkins <csp@csperkins.org> Tue, 11 October 2011 15:35 UTC

From: Colin Perkins <csp@csperkins.org>
Date: Tue, 11 Oct 2011 16:35:02 +0100
To: Randell Jesup <randell-ietf@jesup.org>
Cc: rtcweb@ietf.org
Subject: Re: [rtcweb] Congestion Control proposal

On 7 Oct 2011, at 22:37, Randell Jesup wrote:
> On 10/7/2011 3:12 PM, Colin Perkins wrote:
>> [inline]
>> 
>> On 6 Oct 2011, at 16:32, Randell Jesup wrote:
>> ...
>>> Startup:
>>> 
>>> There are issues both with starting too low (quality is poor at the start and takes a while to get better), and with starting too high (you immediately can get into a delay situation and have to go into recovery). In general, starting too low is better than starting too high, as when we ramp up past the bottleneck we will not be too far over the limit and so recovery will be easier.  In general, the bottleneck link is often upstream, though this is moderating as high-bandwidth broadband becomes more common.
>>> 
>>> Options include:
>>>  a) Start at fixed value
>>>  b) Start at N% below the value from the last call with this person
>>>     (probably requires the App to help here)
>>>  c) Start at N% below the value from the last call with anyone
>>>  d) Have the other side tell us the rate they successfully received in the
>>>     last call.  Start the call at the lower of the last-call reception
>>>     rate from the far end or our last-call sending rate, minus n%
>>>  e) probe bandwidth at or before the start of the call using a packet
>>>     train
>>> 
>>> 'a' will be quite sub-optimal for most cases, though if the value is low enough it won't be over-bandwidth.  Some applications (games, etc) may want to limit the starting bandwidth to avoid negative interactions with other dataflows.
>>> 
>>> 'b' has a problem that if the last call had a much higher bottleneck bandwidth (especially on the remote downstream), the new call may be over-bandwidth, perhaps badly.  This won't be the norm, but may happen especially with mobile/wifi at either end.
>>> 
>>> 'c' has a problem that if the last callee had a much higher bottleneck bandwidth (especially on the remote downstream), the new call may be over-bandwidth, perhaps badly.  If the caller's upstream bandwidth is high compared to typical downstream bandwidth, then the likelihood of starting over-bandwidth is high.
>>> 
>>> 'd' has the advantage of selecting an appropriate value regardless of whether the bottleneck is on upstream or downstream.  The downside of 'd' is that the bottlenecks may have changed since the last call in which case we may start over-bandwidth.  The historical data could be transferred via pre-call signalling.
>>> 
>>> 'e' allows direct measurement of the current call bottlenecks.  However 'e' can be misled if a mid-call router that isn't the bottleneck buffers the packet train while housekeeping or doing other operations and then releases them (it can collapse the signal from an earlier bottleneck).  It would need to be combined with starting on the low side until a valid measurement is completed.  Single measurements are imprecise with packet trains.
>> 
>> For a new call, the only safe approach for the network would seem to be option (a), starting at a low rate, followed by a fairly rapid increase in rate to find the safe operating point (i.e., mirroring TCP slow start in concept, if not in detail). The low rate would be chosen to support the voice channel, presumably, so the call is useful immediately, but might then take several RTTs to ramp up to a high enough rate for video to be usable. An initial delay on the video is not an ideal user experience, but I don't see it as being especially problematic provided the voice channel is usable immediately, and we don't seem to have safe options for avoiding it.
> 
> I understand your concern, and agree one should start "low" - but I also feel that for many applications and many users, this would be an over-conservatism that would seriously degrade a very important point in a communication.  That initial "answer" behavior has a big impact on the user experience and, in the end, utility to the user.  In "most" calls (I'd guess for residential desktop browsers >95%, laptops >70% though much more variable) the bottleneck will be the same or similar to the last call - the local upstream, which is fairly constant.
> 
> There are users and use-cases where this consistency will not apply - mobile laptop users (when away from desk/office), tablet/handsets.  I will note that in most of these cases the change in bottleneck will tend to be less than 2x, so major changes in bandwidth will be fairly uncommon.  (You can get a hint on this by checking for external IP address changes!)
> 
> In most videophones and soft clients that I've looked at, they have a configured bitrate set by the user and either start at that bitrate, or start at a fraction of that bitrate.  At WorldGate, we started at a sliding percentage of the configured bandwidth - a higher percentage at low bandwidths, and down to 50% of configured at high bandwidths, combined (as you suggest) with a much higher ramp-up rate (and faster/further cut-back on problems) for the first seconds of the call.
> 
> These things, taken together, make me want to be a bit more aggressive at using history, and perhaps also use the recent N calls' variance in bottleneck rate as a guide: if they're all good at higher rates, start at 50 or 60% of the average; if there's a lot of variance (laptop moving around), use not the last call, but perhaps the lowest quartile or 10th percentile, minus 10 or 20%.
> 
> Also - this is something that I strongly feel should not be normative.  This is guidance and I do feel that there is a place for the application to provide input and/or user preference here.  I think we *should* provide guidance and defaults.

I'm hesitant to accept the argument that, because bottleneck bandwidth is likely to be roughly consistent, we can start at a rate near that bandwidth. There are a growing number of mobile devices that can switch between WiFi and 3G links with potentially very different capacity, and even ADSL lines can be highly variable with time of day and other users on the same link. As a result, if we are to use history, I think we need normative rules for how to use it, so we at least limit the amount and duration of any damage caused when the history-derived bandwidth estimate is wrong.
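
To make the idea of bounded rules concrete, here is a rough Python sketch of a history-derived starting rate that is clamped between a floor and a cap. It is only an illustration, not a proposal: the function name, the 64 kbps floor, the 1000 kbps cap and the 0.25 variability threshold are all assumptions. Only the "50% of the average" and "lowest quartile minus 20%" figures are taken from the suggestion quoted above.

```python
from statistics import mean, pstdev


def initial_rate_kbps(history_kbps, floor_kbps=64.0, cap_kbps=1000.0):
    """Pick a starting send rate from bottleneck rates seen in recent calls.

    All thresholds here are illustrative assumptions.
    """
    if not history_kbps:
        return floor_kbps  # no history: start low and ramp up

    avg = mean(history_kbps)
    # Coefficient of variation as a crude "how stable was this path" signal.
    variation = pstdev(history_kbps) / avg if avg > 0 else float("inf")

    if variation < 0.25:
        # Stable history: start at 50% of the average.
        candidate = 0.5 * avg
    else:
        # Variable history: lowest quartile, minus 20%.
        lowest_quartile = sorted(history_kbps)[len(history_kbps) // 4]
        candidate = 0.8 * lowest_quartile

    # The clamp bounds the damage when the history turns out to be wrong.
    return max(floor_kbps, min(candidate, cap_kbps))
```

The clamp is the part that matters for the network: however optimistic the history, the first packets never go out faster than the cap, and the normal ramp-up takes over from there.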

>> Using a packet pair/train to get an estimate of the available bandwidth is appealing, but I'm unconvinced it's accurate enough to be reliable. We've done packet pair measurements to residential links that show extremely inaccurate (high) bandwidth estimates in many cases, so I wouldn't trust that as an initial sending rate. Using a longer packet train would improve accuracy, but at the cost of loading the path, and potentially disrupting any traffic sharing the path. I'm not sure such a bandwidth estimate is that worthwhile if it delays/disrupts the start of the voice channel.
> 
> 
> I am likewise leery of packet trains, especially short start-of-call trains.  They're interesting, but as you mention they don't seem reliable enough.
> 
>> When restarting a previously application-limited flow after some short pause in transmission (e.g., the centralised video mixer scenario, where only the active speaker sends video to the mixer), then I'm much more comfortable with restarting at or near the previous rate, provided there's been no indication that the path has changed. The rationale here is that you've had some recent indication that the path can support the rate, but with a new connection, even if to someone you've previously spoken with, there's much less of a guarantee that the available bandwidth is consistent with the previous observation.
> 
> 
> Agreed.
> 
>>> Even if the data flows are separately congestion controlled, in many uses
>>> they will be non-continuous flows (discrete messages as opposed to data or
>>> file transfers).  These sorts of channels have different needs and
>>> different impacts on the media channels from continuous flows: they're
>>> bursty; low delay is important (especially on the initial packet for a
>>> burst, which is often the only packet), and without a steady stream
>>> congestion control will have little effect - and will have little impact on
>>> the media flows.  For small bursts of data traffic, the "right" solution
>>> may be to simply ignore them up to some limit, and above that limit start
>>> to subtract the data channel usage from the available bandwidth for
>>> variable-rate encoders.
>>> 
>>> The problem here will be defining "small".  This will need to be tested and
>>> simulated so we know the impacts of different data usage patterns.  I'll
>>> throw a straw-man proposal out there: if the increase in data usage over
>>> the last 1/2 second is below 20% of the total estimated channel bandwidth,
>>> no adjustment to the media channels shall be done, and normal congestion
>>> control mechanisms will be allowed to operate with no interference.  Above
>>> that value, the media channels if possible will be adjusted down in
>>> bandwidth by some amount related to the increase in data channel bandwidth.
>>> 
>>> The reason for using the increase in bandwidth here is that for steady
>>> flows, normal congestion control should operate well; the problem is when
>>> you have a spike in bandwidth use - the latency in reaction may be longer
>>> than the duration of the burst, and hurt end-to-end delay in the meantime.
>>> So when there's a sudden jump in data bandwidth, you temporarily offset
>>> media down to keep delay and buffering under control.  You could also
>>> increase media bitrates if a steady flow suddenly drops, though normal
>>> congestion control mechanisms should use the now-available bandwidth
>>> quickly on their own.
>> Is there some way of measuring the spare queueing capacity of the path, perhaps by periodic probing? To define "small" in a way that's meaningful to the path, we'd be looking at a burst that can be queued in the network without causing excessive delay or queue overflow/loss. Using "1/2 second" or "20%" seems too arbitrary.
> 
> 
> It may be possible to use naturally occurring bursts (or slightly artificially contrived bursts) to probe the channel.  For example, periodic or error-recovery I-frames amount to a fairly large burst of large packets; measuring the dispersion at the receiving end may well give you a reasonable estimate of unused bandwidth. (I'll note that if you don't have access to the raw network interface this might be a little noisier.)

Unused bandwidth might be less important than unused queuing capacity. 
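
As a rough illustration of the difference, the Python sketch below puts the quoted straw-man (ignore data bursts whose increase is under 20% of the estimated channel bandwidth) next to a queue-based definition of "small". Everything about it is an assumption made for the example: the function names, the 50 ms delay budget, the idea that the spare queue size is known at all, and the choice to offset media by exactly the data increase (the straw-man only says "some amount related to the increase").

```python
def media_offset_kbps(data_increase_kbps, estimated_kbps, threshold=0.20):
    """Straw-man rule: how much to cut media for a jump in data traffic."""
    if data_increase_kbps < threshold * estimated_kbps:
        return 0.0  # small burst: leave media alone
    # Assumption: cut media by exactly the increase in data traffic.
    return float(data_increase_kbps)


def burst_is_small(burst_bytes, bottleneck_kbps, spare_queue_bytes,
                   max_extra_delay_ms=50.0):
    """Queue-based rule: can the path absorb this burst without harm?"""
    # Time the bottleneck needs to drain the burst: bits / (kbit/s) = ms.
    drain_ms = burst_bytes * 8 / bottleneck_kbps
    fits_in_queue = burst_bytes <= spare_queue_bytes
    return fits_in_queue and drain_ms <= max_extra_delay_ms
```

The second function depends on the path (bottleneck rate and spare queue) rather than on fixed constants, which is the sense in which "1/2 second" or "20%" looks arbitrary by comparison. How to measure the spare queue is the open question.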

>>> JS Interface:
>>> 
>>> We need to tell the application about bandwidth changes, and give it the
>>> option of making choices, both in the allocation of bits among the various
>>> streams of data, and of the actual streams themselves (adding or shutting
>>> down streams, or switching modes (mono/stereo, codecs possibly, etc)).  If
>>> the application doesn't care to make a choice we'll do it for them.  We'll
>>> also want the JS application to be able to affect the congestion-control
>>> status of the data channels, so it could take bits from the data channels
>>> without engendering a tug-of-war and temporary over-bandwidth/buffering.
>>> 
>>> There are inherent delays to some of these actions by the app, and in an
>>> over-bandwidth case we need to reduce bandwidth *now*, so we may want to
>>> have the "I'm over-bandwidth" case start by automatically reducing
>>> bandwidth if possible, and inform the JS app of the reduction and let it
>>> rebalance or change the details or nature of the call.
>>> 
>>> We could use some good straw-men proposals for an API here to the JS code.
>> The right model might be for the application to communicate policy for how to adapt, which the browser then implements. Putting JavaScript in the congestion control loop is a concern because of the slow response, which runs the risk of causing oscillation. If we let the browser run the rate adaptation mechanism, that's more likely to be stable, and we can leave the slower JavaScript to make the decision about adaptation policy.
> 
> 
> Agreed, though some of the adaptations have to occur at the application level (turning off channels, which may involve UI changes, etc).   That was why I suggested we adapt using defaults (and any hints given ("policy" in your post)), and then inform the JS app of the change and allow it to rebalance or re-allocate the BW usage, or to change the already-made decisions.  This helps guarantee we have working defaults and response speed.
> 
> I fear if we try to push all that logic down into the media code it becomes too complex, because you have to anticipate any way an application might want to react, and provide it (and test it).


Agree. There's clearly a balance to be struck here. The main point is that the JavaScript can't be in the control loop, since it's operating at a slower timescale.
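
A minimal sketch of that split, written in Python only to keep it short: the application installs a policy in advance, the browser applies it immediately whenever the bandwidth estimate changes, and the application is told afterwards so it can rebalance by installing a new policy. The class names, fields and default rates are all hypothetical; nothing here is an agreed or proposed API.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptationPolicy:
    """Hypothetical policy the application hands to the browser in advance."""
    audio_kbps: float = 32.0       # rate reserved for the voice channel
    min_video_kbps: float = 128.0  # below this, video is switched off


@dataclass
class RateController:
    """Stands in for the browser side: reacts at once, reports afterwards."""
    policy: AdaptationPolicy
    pending_events: list = field(default_factory=list)

    def on_bandwidth_estimate(self, total_kbps: float) -> dict:
        # Fast path: apply the stored policy without calling into the app.
        audio = min(self.policy.audio_kbps, total_kbps)
        video = total_kbps - audio
        if video < self.policy.min_video_kbps:
            video = 0.0  # not enough left for usable video
        allocation = {"audio": audio, "video": video}
        # Slow path: queue a notification that the app can read later and
        # answer by installing a new policy.
        self.pending_events.append(allocation)
        return allocation
```

The rate cut never waits on the application. The application only sees entries in pending_events after the fact, and its response (replacing policy) takes effect on the next estimate.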

-- 
Colin Perkins
http://csperkins.org/