Re: [rtcweb] Congestion Control proposal

Colin Perkins <csp@csperkins.org> Fri, 07 October 2011 19:09 UTC

From: Colin Perkins <csp@csperkins.org>
In-Reply-To: <4E8DCA06.5060506@jesup.org>
Date: Fri, 07 Oct 2011 20:12:13 +0100
Message-Id: <B08B5729-09C6-466A-933B-DA71BB487E9D@csperkins.org>
References: <4E8DCA06.5060506@jesup.org>
To: Randell Jesup <randell-ietf@jesup.org>
Cc: "rtcweb@ietf.org" <rtcweb@ietf.org>
Subject: Re: [rtcweb] Congestion Control proposal

[inline]

On 6 Oct 2011, at 16:32, Randell Jesup wrote:
...
> Startup:
> 
> There are issues both with starting too low (quality is poor at the start
> and takes a while to get better) and with starting too high (you can
> immediately get into a delay situation and have to go into recovery).
> In general, starting too low is better than starting too high: when we
> ramp up past the bottleneck we will not be too far over the limit, so
> recovery will be easier.  The bottleneck link is often upstream, though
> this is becoming less of an issue as high-bandwidth broadband becomes
> more common.
> 
> Options include:
>  a) Start at fixed value
>  b) Start at N% below the value from the last call with this person
>     (probably requires the App to help here)
>  c) Start at N% below the value from the last call with anyone
>  d) Have the other side tell us the rate they successfully received in the
>     last call.  Start the call at the lower of the last-call reception
>     rate from the far end or our last-call sending rate, minus n%
>  e) probe bandwidth at or before the start of the call using a packet
>     train
> 
> 'a' will be quite sub-optimal for most cases, though if the value is
> low enough it won't be over-bandwidth.  Some applications (games, etc)
> may want to limit the starting bandwidth to avoid negative interactions
> with other dataflows.
> 
> 'b' has a problem that if the last call had a much higher bottleneck
> bandwidth (especially on the remote downstream), the new call may be
> over-bandwidth, perhaps badly.  This won't be the norm, but may happen
> especially with mobile/wifi at either end.
> 
> 'c' has a problem that if the last callee had a much higher bottleneck
> bandwidth (especially on the remote downstream), the new call may be
> over-bandwidth, perhaps badly.  If the caller's upstream bandwidth is high
> compared to typical downstream bandwidth, then the likelihood of starting
> over-bandwidth is high.
> 
> 'd' has the advantage of selecting an appropriate value regardless of
> whether the bottleneck is upstream or downstream.  The downside of 'd'
> is that the bottlenecks may have changed since the last call, in which
> case we may start over-bandwidth.  The historical data could be transferred
> via pre-call signalling.
> 
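
As a rough sketch of what the (d) computation amounts to (all names
invented; rates in bits per second):

    // Option (d): start at the lower of the far end's last-call reception
    // rate and our own last-call sending rate, minus an n% safety margin.
    function startRateOptionD(remoteLastRecvBps: number,
                              localLastSendBps: number,
                              marginFraction: number): number {
      const base = Math.min(remoteLastRecvBps, localLastSendBps);
      return base * (1 - marginFraction);
    }

    // e.g. far end received 1.5 Mb/s last call, we sent 2 Mb/s, 20% margin:
    // startRateOptionD(1_500_000, 2_000_000, 0.2) -> 1_200_000 (1.2 Mb/s)
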
> 'e' allows direct measurement of the current call's bottlenecks.  However,
> 'e' can be misled if a mid-path router that isn't the bottleneck buffers
> the packet train while doing housekeeping or other operations and then
> releases it, collapsing the signal from an earlier bottleneck.  It would
> need to be combined with starting on the low side until a valid measurement
> is completed.  Single packet-train measurements are also imprecise.

For a new call, the only safe approach for the network would seem to be option (a), starting at a low rate, followed by a fairly rapid increase in rate to find the safe operating point (i.e., mirroring TCP slow start in concept, if not in detail). The low rate would be chosen to support the voice channel, presumably, so the call is useful immediately, but might then take several RTTs to ramp up to a high enough rate for video to be usable. An initial delay on the video is not an ideal user experience, but I don't see it as being especially problematic provided the voice channel is usable immediately, and we don't seem to have safe options for avoiding it.
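
To make the shape of that concrete, a minimal sketch of such a ramp (the
constants are purely illustrative, not proposals):

    // Purely illustrative: start at a rate sufficient for voice, then
    // double every RTT, slow-start style, until congestion is seen or a
    // cap is reached; after that, normal congestion control takes over.
    const VOICE_FLOOR_BPS = 40_000;
    const RAMP_CAP_BPS = 2_000_000;

    function nextRate(currentBps: number, congestionSeen: boolean): number {
      if (congestionSeen) {
        // placeholder back-off; the real algorithm owns this regime
        return Math.max(VOICE_FLOOR_BPS, currentBps / 2);
      }
      return Math.min(RAMP_CAP_BPS, currentBps * 2);
    }

    // Starting at the voice floor, video-capable rates arrive within a
    // few RTTs: 40k -> 80k -> 160k -> 320k -> 640k -> ...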

Using a packet pair/train to get an estimate of the available bandwidth is appealing, but I'm unconvinced it's accurate enough to be reliable. We've done packet pair measurements to residential links that show extremely inaccurate (high) bandwidth estimates in many cases, so I wouldn't trust that as an initial sending rate. Using a longer packet train would improve accuracy, but at the cost of loading the path, and potentially disrupting any traffic sharing the path. I'm not sure such a bandwidth estimate is that worthwhile if it delays/disrupts the start of the voice channel.
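
For reference, the basic packet-pair arithmetic shows why the estimate is so
noise-sensitive: the measured dispersion sits in the denominator.

    // Packet-pair estimate: capacity ~ packet size / receive-time gap
    // between two back-to-back packets.
    function packetPairEstimateBps(packetSizeBytes: number,
                                   dispersionSeconds: number): number {
      return (packetSizeBytes * 8) / dispersionSeconds;
    }

    // 1200-byte packets, 1.0 ms gap -> 9.6 Mb/s
    // 1200-byte packets, 0.5 ms gap -> 19.2 Mb/s
    // Half a millisecond of timer or cross-traffic noise doubles the answer.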

When restarting a previously application-limited flow after some short pause in transmission (e.g., the centralised video mixer scenario, where only the active speaker sends video to the mixer), I'm much more comfortable with restarting at or near the previous rate, provided there's been no indication that the path has changed. The rationale here is that you've had some recent indication that the path can support the rate; with a new connection, even to someone you've previously spoken with, there's much less of a guarantee that the available bandwidth is consistent with the previous observation.
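
In code terms, the restart decision could be as simple as this (a sketch;
how a path change is detected is left abstract):

    // Resume near the previous rate unless the path has visibly changed
    // (e.g. an ICE restart or a new local interface), else start safe.
    function restartRate(previousBps: number,
                         pathChanged: boolean,
                         safeStartBps: number): number {
      return pathChanged ? safeStartBps : previousBps * 0.9;
    }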

> Also, we need to consider the impact on other existing PeerConnections -
> for example, when you add a new person to a full-mesh conference, you'll want
> to start their bandwidth at about 1/Nth of the former total outgoing
> bandwidth, and reduce the outgoing bandwidth of those other channels to
> provide the bandwidth for the new participant.
> 
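
As a sketch of that rebalancing (equal shares; a real implementation might
weight peers differently):

    // Equal-share rebalance when a participant joins a full mesh:
    // every peer's sending budget becomes 1/Nth of the uplink budget.
    function perPeerBudgetBps(totalUpstreamBps: number,
                              peerCount: number): number {
      if (peerCount <= 0) throw new Error("need at least one peer");
      return totalUpstreamBps / peerCount;
    }

    // On a 3 Mb/s uplink, going from 3 peers to 4 drops each existing
    // peer from 1 Mb/s to 750 kb/s, freeing the newcomer's share.
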
> Obviously there is no perfect solution, and combinations of
> solutions may give the best overall experience.  In this situation, it
> makes sense to leave the actual choice up to the implementations or even
> the application, and make sure that they have the information needed to
> make the choice, such as the last-call bandwidth from the other end.
> (I don't believe that information would be a privacy leakage of any note.)
> 
> 
> Data channels:
> 
> If we use SCTP-DTLS-UDP, we can rely on SCTP's congestion control, perhaps
> modified to "play nice" with the media streams, and to have some amount of
> known priority relative to them.  (For example, if the media streams need
> to back off while the data streams are using a consistent amount of
> bandwidth, SCTP could reduce the amount or rate of data sent.)  We could
> also pre-allocate some amount of bandwidth to data; so long as the data
> channels aren't using it, the media could.  Note that SCTP's congestion
> control is supposed to be pluggable in some manner, so we could
> modify or replace it fairly easily.
> 
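
The pre-allocation rule could be as simple as this sketch (names invented):

    // Reserve bandwidth for the data channels, but let media borrow
    // whatever part of the reservation data isn't currently using.
    function mediaBudgetBps(totalBps: number, dataUsageBps: number): number {
      return Math.max(0, totalBps - dataUsageBps);
    }

    function mediaGuaranteedBps(totalBps: number,
                                dataReservedBps: number): number {
      // the floor media can rely on if data claims its full reservation
      return Math.max(0, totalBps - dataReservedBps);
    }
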
> Worst case: it acts as a separate, TCP-friendly flow we compete with.
> 
> If we have to use TCP-over-UDP, then that will be congestion-controlled for
> each individual flow, but the non-reliable channels will need separate
> congestion control.  Simplest would be to add RTP-like timestamps to the
> non-reliable data channels so they could be pulled into the framework of
> the RTP channels as if they were bursty RTP data.
> 
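
For instance, a small per-message header would be enough to feed the
non-reliable messages into the same delay/loss estimators as the RTP flows
(a sketch; field sizes arbitrary):

    interface DatagramHeader {
      seq: number;         // 16-bit sequence number, for loss detection
      timestampMs: number; // sender clock, for one-way delay variation
    }

    function encodeHeader(h: DatagramHeader): Uint8Array {
      const view = new DataView(new ArrayBuffer(6));
      view.setUint16(0, h.seq & 0xffff);
      view.setUint32(2, h.timestampMs >>> 0);
      return new Uint8Array(view.buffer);
    }
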
> Even if the data flows are separately congestion controlled, in many uses
> they will be non-continuous flows (discrete messages as opposed to data or
> file transfers).  These sorts of channels have different needs and
> different impacts on the media channels from continuous flows: they're
> bursty; low delay is important (especially on the initial packet for a
> burst, which is often the only packet), and without a steady stream
> congestion control will have little effect - and will have little impact on
> the media flows.  For small bursts of data traffic, the "right" solution
> may be to simply ignore them up to some limit, and above that limit start
> to subtract the data channel usage from the available bandwidth for
> variable-rate encoders.
> 
> The problem here will be defining "small".  This will need to be tested and
> simulated so we know the impacts of different data usage patterns.  I'll
> throw a straw-man proposal out there: if the increase in data usage over
> the last 1/2 second is below 20% of the total estimated channel bandwidth,
> no adjustment is made to the media channels, and normal congestion
> control mechanisms are allowed to operate without interference.  Above
> that value, the media channels are, if possible, adjusted down in
> bandwidth by some amount related to the increase in data channel bandwidth.
> 
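
Expressed as code, the straw man is roughly the following (the 500 ms window
and 20% threshold are just the values proposed above):

    const WINDOW_MS = 500;
    const THRESHOLD = 0.20;

    // Returns how much to reduce the media budget: zero if the data-usage
    // increase over the window is under the threshold, else the increase.
    function mediaReductionBps(dataNowBps: number,
                               dataWindowAgoBps: number,
                               estimatedChannelBps: number): number {
      const increase = dataNowBps - dataWindowAgoBps;
      if (increase <= THRESHOLD * estimatedChannelBps) {
        return 0; // let normal congestion control operate undisturbed
      }
      return increase; // one option for "some amount related to" the spike
    }
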
> The reason for using the increase in bandwidth here is that for steady
> flows, normal congestion control should operate well; the problem is when
> you have a spike in bandwidth use - the latency of the reaction may be
> longer than the duration of the burst, hurting end-to-end delay in the meantime.
> So when there's a sudden jump in data bandwidth, you temporarily offset
> media down to keep delay and buffering under control.  You could also
> increase media bitrates if a steady flow suddenly drops, though normal
> congestion control mechanisms should use the now-available bandwidth
> quickly on their own.

Is there some way of measuring the spare queueing capacity of the path, perhaps by periodic probing? To define "small" in a way that's meaningful to the path, we'd be looking at a burst that can be queued in the network without causing excessive delay or queue overflow/loss. Using "1/2 second" or "20%" seems too arbitrary. 
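
One way to frame it, purely as a sketch: track the one-way delay of periodic
probes against the path's base delay and treat the growth as queue occupancy
(the clock offset cancels in the difference):

    // Spare queueing capacity, in ms, against a chosen delay budget.
    function queueHeadroomMs(probeOwdMs: number[], budgetMs: number): number {
      const baseOwd = Math.min(...probeOwdMs);         // ~empty-queue delay
      const queuedMs = probeOwdMs[probeOwdMs.length - 1] - baseOwd;
      return Math.max(0, budgetMs - queuedMs);
    }

    // A burst of B bytes through a bottleneck of R bits/s queues for about
    // 8*B/R seconds, so "small" could mean 8000*B/R <= queueHeadroomMs(...).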

> JS Interface:
> 
> We need to tell the application about bandwidth changes, and give it the
> option of making choices, both in the allocation of bits among the various
> streams of data, and of the actual streams themselves (adding or shutting
> down streams, or switching modes such as mono/stereo or codecs).  If
> the application doesn't care to make a choice, we'll do it for them.  We'll
> also want the JS application to be able to affect the congestion-control
> status of the data channels, so it could take bits from the data channels
> without engendering a tug-of-war and temporary over-bandwidth/buffering.
> 
> There are inherent delays to some of these actions by the app, and in an
> over-bandwidth case we need to reduce bandwidth *now*, so we may want to
> have the "I'm over-bandwidth" case start by automatically reducing
> bandwidth if possible, and inform the JS app of the reduction and let it
> rebalance or change the details or nature of the call.
> 
> We could use some good straw-men proposals for an API here to the JS code.

The right model might be for the application to communicate a policy for how to adapt, which the browser then implements. Putting Javascript in the congestion control loop is a concern because of its slow response, which risks causing oscillation. If we let the browser run the rate adaptation mechanism, that's more likely to be stable, leaving the slower Javascript to decide the adaptation policy.
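
As a very rough straw man for what "policy, not mechanism" could look like at
the API level (all names invented):

    // The application declares how bandwidth should be divided; the
    // browser's congestion control enforces it at its own timescale.
    interface StreamPolicy {
      id: string;       // which media stream or data channel this covers
      priority: number; // relative weight when dividing the available rate
      minBps?: number;  // below this, suspend the stream instead
      maxBps?: number;  // never allocate more than this
    }

    interface AdaptationPolicy {
      streams: StreamPolicy[];
      // Notification only, outside the control loop, so slow Javascript
      // can't destabilise the rate adaptation itself.
      onAllocationChanged?: (allocation: Map<string, number>) => void;
    }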


-- 
Colin Perkins
http://csperkins.org/