Re: [rtcweb] Congestion Control proposal

Randell Jesup <randell-ietf@jesup.org> Fri, 07 October 2011 21:38 UTC

Date: Fri, 07 Oct 2011 17:37:31 -0400
From: Randell Jesup <randell-ietf@jesup.org>
To: rtcweb@ietf.org
Subject: Re: [rtcweb] Congestion Control proposal

On 10/7/2011 3:12 PM, Colin Perkins wrote:
> [inline]
>
> On 6 Oct 2011, at 16:32, Randell Jesup wrote:
> ...
>> Startup:
>>
>> There are issues both with starting too low (quality is poor at the start
>> and takes a while to get better), and with starting too high (you
>> immediately can get into a delay situation and have to go into recovery).
>> In general, starting too low is better than starting too high, as when we
>> ramp up past the bottleneck we will not be too far over the limit and so
>> recovery will be easier.  In general, the bottleneck link is often
>> upstream, though this is moderating as high-bandwidth broadband becomes
>> more common.
>>
>> Options include:
>>   a) Start at fixed value
>>   b) Start at N% below the value from the last call with this person
>>      (probably requires the App to help here)
>>   c) Start at N% below the value from the last call with anyone
>>   d) Have the other side tell us the rate they successfully received in the
>>      last call.  Start the call at the lower of the last-call reception
>>      rate from the far end or our last-call sending rate, minus n%
>>   e) probe bandwidth at or before the start of the call using a packet
>>      train
>>
>> 'a' will be quite sub-optimal for most cases, though if the value is
>> low enough it won't be over-bandwidth.  Some applications (games, etc)
>> may want to limit the starting bandwidth to avoid negative interactions
>> with other dataflows.
>>
>> 'b' has a problem that if the last call had a much higher bottleneck
>> bandwidth (especially on the remote downstream), the new call may be
>> over-bandwidth, perhaps badly.  This won't be the norm, but may happen
>> especially with mobile/wifi at either end.
>>
>> 'c' has a problem that if the last callee had a much higher bottleneck
>> bandwidth (especially on the remote downstream), the new call may be
>> over-bandwidth, perhaps badly.  If the caller's upstream bandwidth is high
>> compared to typical downstream bandwidth, then the likelihood of starting
>> over-bandwidth is high.
>>
>> 'd' has the advantage of selecting an appropriate value regardless of
>> whether the bottleneck is on upstream or downstream.  The downside of 'd'
>> is that the bottlenecks may have changed since the last call in which case
>> we may start over-bandwidth.  The historical data could be transferred
>> via pre-call signalling.
>>
>> 'e' allows direct measurement of the current call bottlenecks.  However 'e'
>> can be misled if a mid-call router that isn't the bottleneck buffers the
>> packet train while housekeeping or doing other operations and then releases
>> them (it can collapse the signal from an earlier bottleneck).  It would
>> need to be combined with starting on the low side until a valid measurement
>> is completed.  Single measurements are imprecise with packet trains.
> For a new call, the only safe approach for the network would seem to be option (a), starting at a low rate, followed by a fairly rapid increase in rate to find the safe operating point (i.e., mirroring TCP slow start in concept, if not in detail). The low rate would be chosen to support the voice channel, presumably, so the call is useful immediately, but might then take several RTTs to ramp up to a high enough rate for video to be usable. An initial delay on the video is not an ideal user experience, but I don't see it as being especially problematic provided the voice channel is usable immediately, and we don't seem to have safe options for avoiding it.


I understand your concern, and agree one should start "low" - but I also 
feel that for many applications and many users this would be 
over-conservative, seriously degrading a very important moment in a 
communication.  That initial "answer" behavior has a big impact on the 
user experience and, in the end, on the utility to the user.  In "most" 
calls (I'd guess >95% for residential desktop browsers, >70% for 
laptops, though much more variable) the bottleneck will be the same as 
or similar to the last call's - the local upstream, which is fairly 
constant.

There are users and use-cases where this consistency will not apply - 
mobile laptop users (when away from the desk/office), and 
tablets/handsets.  I will note that in most of these cases the change in 
bottleneck will tend to be less than 2x, so major changes in bandwidth 
should be fairly uncommon.  (You can get a hint on this by checking for 
external IP address changes, as sketched below!)
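
A minimal sketch of using that hint, assuming the call history records 
the external IP in use (how the external IP is learned - e.g. from a 
STUN binding response - is outside this sketch):

    // Sketch: only trust history gathered behind the same external
    // address; a new IP suggests a new bottleneck (moved networks,
    // new uplink).
    interface CallRecord { externalIp: string; achievedKbps: number; }

    function usableHistory(history: CallRecord[],
                           currentIp: string): CallRecord[] {
      return history.filter(r => r.externalIp === currentIp);
    }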

Most videophones and soft clients that I've looked at have a bitrate 
configured by the user, and either start at that bitrate or at a 
fraction of it.  At WorldGate, we started at a sliding percentage of the 
configured bandwidth - a higher percentage at low bandwidths, sliding 
down to 50% of configured at high bandwidths - combined (as you suggest) 
with a much higher ramp-up rate (and faster/deeper cut-back on problems) 
for the first seconds of the call.
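
To illustrate the shape of that (the breakpoints below are invented for 
the example, not WorldGate's actual numbers):

    // Sketch: sliding start fraction - near the configured rate at low
    // bandwidths, down to 50% of configured at high bandwidths.
    function startFraction(configuredKbps: number): number {
      if (configuredKbps <= 128) return 0.9;   // illustrative low breakpoint
      if (configuredKbps >= 1000) return 0.5;  // illustrative high breakpoint
      // Slide linearly between the two breakpoints.
      return 0.9 - 0.4 * (configuredKbps - 128) / (1000 - 128);
    }

    const startKbps = (cfg: number) => cfg * startFraction(cfg);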

These things, taken together, make me want to be a bit more aggressive 
at using history, and perhaps also to use the variance in bottleneck 
rate over the recent N calls as a guide: if they're all good at higher 
rates, start at 50 or 60% of the average; if there's a lot of variance 
(a laptop moving around), use not the last call but perhaps the lowest 
quartile or 10th percentile, minus a further 10 or 20%.
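
Something like this, say (thresholds and quantile choice are purely 
illustrative, and as I say below, none of this should be normative):

    // Sketch: pick a starting rate from the last N calls' bottleneck
    // rates.  Assumes at least one prior call; otherwise fall back to
    // a fixed conservative start.
    function startingRateKbps(recentKbps: number[]): number {
      const sorted = [...recentKbps].sort((a, b) => a - b);
      const mean = sorted.reduce((s, x) => s + x, 0) / sorted.length;
      if (sorted[0] > 0.5 * mean) {
        // Low variance - "all good at higher rates": 50-60% of average.
        return 0.55 * mean;
      }
      // High variance (e.g. a laptop moving around): take the lowest
      // quartile, then back off a further 10-20%.
      const lowQuartile = sorted[Math.floor(sorted.length / 4)];
      return 0.85 * lowQuartile;
    }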

Also - this is something that I strongly feel should not be normative.  
This is guidance, and I do feel that there is a place for the 
application to provide input and/or user preference here.  I think we 
*should* provide guidance and defaults.

> Using a packet pair/train to get an estimate of the available bandwidth is appealing, but I'm unconvinced it's accurate enough to be reliable. We've done packet pair measurements to residential links that show extremely inaccurate (high) bandwidth estimates in many cases, so I wouldn't trust that as an initial sending rate. Using a longer packet train would improve accuracy, but at the cost of loading the path, and potentially disrupting any traffic sharing the path. I'm not sure such a bandwidth estimate is that worthwhile if it delays/disrupts the start of the voice channel.


I am likewise leery of packet trains, especially short start-of-call 
trains.  They're interesting, but as you mention they don't seem 
reliable enough.

> When restarting a previously application limited flow after some short pause in transmission (e.g., the centralised video mixer scenario, where only the active speaker sends video to the mixer), then I'm much more comfortable with restarting at or near the previous rate, provided there's been no indication that the path has changed. The rationale here is that you've had some recent indication that the path can support the rate, but with a new connection, even if to someone you've previously spoken with, there's much less of a guarantee that the available bandwidth is consistent with the previous observation.


Agreed.

>> Even if the data flows are separately congestion controlled, in many uses
>> they will be non-continuous flows (discrete messages as opposed to data or
>> file transfers).  These sorts of channels have different needs and
>> different impacts on the media channels from continuous flows: they're
>> bursty; low delay is important (especially on the initial packet for a
>> burst, which is often the only packet), and without a steady stream
>> congestion control will have little effect - and will have little impact on
>> the media flows.  For small bursts of data traffic, the "right" solution
>> may be to simply ignore them up to some limit, and above that limit start
>> to subtract the data channel usage from the available bandwidth for
>> variable-rate encoders.
>>
>> The problem here will be defining "small".  This will need to be tested and
>> simulated so we know the impacts of different data usage patterns.  I'll
>> throw a straw-man proposal out there: if the increase in data usage over
>> the last 1/2 second is below 20% of the total estimated channel bandwidth,
>> no adjustment to the media channels shall be done, and normal congestion
>> control mechanisms will be allowed to operate with no interference.  Above
>> that value, the media channels if possible will be adjusted down in
>> bandwidth by some amount related to the increase in data channel bandwidth.
>>
>> The reason for using the increase in bandwidth here is that for steady
>> flows, normal congestion control should operate well; the problem is when
>> you have a spike in bandwidth use - the latency in reaction may be longer
>> than the duration of the burst, and hurt end-to-end delay in the meantime.
>> So when there's a sudden jump in data bandwidth, you temporarily offset
>> media down to keep delay and buffering under control.  You could also
>> increase media bitrates if a steady flow suddenly drops, though normal
>> congestion control mechanisms should use the now-available bandwidth
>> quickly on their own.
> Is there some way of measuring the spare queueing capacity of the path, perhaps by periodic probing? To define "small" in a way that's meaningful to the path, we'd be looking at a burst that can be queued in the network without causing excessive delay or queue overflow/loss. Using "1/2 second" or "20%" seems too arbitrary.


It may be possible to use naturally-occurring bursts (or slightly 
artificially-contrived bursts) to probe the channel.  For example, 
periodic or error-recovery I-frames amount to a fairly large burst of 
large packets; measuring the dispersion at the receiving end may well 
give you a reasonable estimate of unused bandwidth.  (I'll note that if 
you don't have access to the raw network interface this might be a 
little noisier.)
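
A sketch of the receive-side arithmetic (packet-train dispersion over 
one burst; with only application-level timestamps, expect the noise 
mentioned above):

    // Sketch: estimate the rate the bottleneck drained a burst at, from
    // the arrival times of a naturally-occurring burst (e.g. the packets
    // of one I-frame).
    interface RxPacket { arrivalMs: number; bytes: number; }

    function dispersionEstimateKbps(burst: RxPacket[]): number | null {
      if (burst.length < 3) return null;  // too few packets to mean much
      const spanMs = burst[burst.length - 1].arrivalMs - burst[0].arrivalMs;
      if (spanMs <= 0) return null;
      // Bytes delivered after the first packet, over the arrival span.
      const bytes = burst.slice(1).reduce((sum, p) => sum + p.bytes, 0);
      return (bytes * 8) / spanMs;        // bits per ms == kbit/s
    }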

>> JS Interface:
>>
>> We need to tell the application about bandwidth changes, and give it the
>> option of making choices, both in the allocation of bits among the various
>> streams of data, and of the actual streams themselves (adding or shutting
>> down streams, or switching modes (mono/stereo, codecs possibly, etc)).  If
>> the application doesn't care to make a choice we'll do it for them.  We'll
>> also want the JS application to be able to affect the congestion-control
>> status of the data channels, so it could take bits from the data channels
>> without engendering a tug-of-war and temporary over-bandwidth/buffering.
>>
>> There are inherent delays to some of these actions by the app, and in an
>> over-bandwidth case we need to reduce bandwidth *now*, so we may want to
>> have the "I'm over-bandwidth" case start by automatically reducing
>> bandwidth if possible, and inform the JS app of the reduction and let it
>> rebalance or change the details or nature of the call.
>>
>> We could use some good straw-men proposals for an API here to the JS code.
> The right model might be for the application to communicate policy for how to adapt, which the browser then implements. Putting Javascript in the congestion control loop is a concern because of the slow response, which runs the risk of causing oscillation. If we can let the browser run the rate adaption mechanism that's more likely to be stable, and leave the slower Javascript to make the decision about adaptation policy.


Agreed, though some of the adaptations have to occur at the application 
level (turning off channels, which may involve UI changes, etc.).  That 
was why I suggested we adapt using defaults (and any hints given - 
"policy" in your post), then inform the JS app of the change and allow 
it to rebalance or re-allocate the bandwidth usage, or to revise the 
already-made decisions.  This helps guarantee both working defaults and 
response speed.
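
As a straw-man only (every name here is hypothetical, not an agreed 
API):

    // Sketch: the app supplies policy hints; the browser adapts first
    // and then notifies the app, which may rebalance or restructure
    // the call.
    interface StreamPolicy { id: string; priority: number; minKbps: number; }

    interface BandwidthChange {
      totalKbps: number;                        // new browser estimate
      applied: { id: string; kbps: number }[];  // cuts already made by defaults
    }

    interface PeerConnectionCC {
      setPolicy(policies: StreamPolicy[]): void;
      onbandwidthchange: (ev: BandwidthChange) => void;
    }

    // The app's reaction is allowed to be slow: drop a stream, switch
    // to mono, update the UI, etc.
    declare const pc: PeerConnectionCC;
    pc.onbandwidthchange = (ev) => {
      if (ev.totalKbps < 200) {
        // e.g. shut off video and adjust the UI accordingly
      }
    };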

I fear that if we try to push all that logic down into the media code it 
becomes too complex, because you have to anticipate every way an 
application might want to react, and provide for it (and test it).


-- 
Randell Jesup
randell-ietf@jesup.org