[rtcweb] Alternative Proposal for Dynamic Codec Parameter Change (Was: Re: Resolution negotiation - a contribution)

Magnus Westerlund <magnus.westerlund@ericsson.com> Tue, 24 April 2012 14:25 UTC

Message-ID: <4F96B7C9.1030609@ericsson.com>
Date: Tue, 24 Apr 2012 16:25:13 +0200
From: Magnus Westerlund <magnus.westerlund@ericsson.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20120327 Thunderbird/11.0.1
MIME-Version: 1.0
To: Harald Alvestrand <harald@alvestrand.no>
References: <4F869648.2020605@alvestrand.no>
In-Reply-To: <4F869648.2020605@alvestrand.no>
Cc: "rtcweb@ietf.org" <rtcweb@ietf.org>
Subject: [rtcweb] Alternative Proposal for Dynamic Codec Parameter Change (Was: Re: Resolution negotiation - a contribution)

Harald, WG,

This is posted as an individual contribution. When it comes to this topic I will not engage in any chair activities, because I have an alternative proposal based on
https://datatracker.ietf.org/doc/draft-westerlund-avtext-codec-operation-point/.


My proposal differs significantly from Harald's in how it works. I will start with a discussion of requirements, in addition to Harald's, then give an overview of my proposal, and end with a discussion of the differences between the two proposals. This is quite long, but I hope you will read it to the end.

Requirements
------------

Let's start with the requirements that Harald has written. I think they should be formulated differently. First of all, I think the parameters should be negotiated, indicated, etc. in the context of a peer connection.

Secondly, when it comes to maximum resolution and maximum frame-rate, there are actually three different limits that I think are important: highest spatial resolution, highest frame-rate, and maximum complexity for a given video stream. The maximum complexity is often expressed as the number of macroblocks a video codec can process per second. This is a well-established complexity measure, used as part of standardized video codecs' "level" definitions since the introduction of H.263 Annex X in 2004. As this is basically a joint limit on the maximum number of pixels per frame times the maximum frame-rate, there are cases where this complexity figure is the constraining factor, forcing a sender or receiver to request either a higher frame rate with a lower resolution, or a higher resolution with a lower frame rate.
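As a back-of-the-envelope sketch of that trade-off (the complexity budget below is an illustrative figure, not taken from any particular codec's level table):

```python
# Illustrative only: the MB/s budget is a made-up figure, not a
# real level limit from any codec specification.
MB_SIZE = 16  # a macroblock covers 16x16 luma pixels

def macroblocks_per_second(width, height, fps):
    """Complexity measure: macroblocks processed per second."""
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    return mbs_per_frame * fps

BUDGET = 40_500  # hypothetical per-stream complexity limit (MB/s)

# 720p at 30 fps blows the budget even if both the resolution and
# the frame rate are individually acceptable ...
assert macroblocks_per_second(1280, 720, 30) > BUDGET
# ... so the end-point must trade: full resolution at a lower frame
# rate, or full frame rate at a lower resolution.
assert macroblocks_per_second(1280, 720, 10) <= BUDGET
assert macroblocks_per_second(640, 360, 30) <= BUDGET
```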

The requirements should also be clearer in that one needs to handle
multiple SSRCs per RTP session, including multiple cameras or audio
streams (microphones) from a single end-point, where each stream can be
encoded using different parameter values.

I also think it is important that we consider which additional encoding parameters it would make sense for WebRTC to be able to dynamically negotiate, request and indicate.

Some additional use cases that should be considered in the requirements:

1) We have use cases including multi-party communication. One way to realize these is to use a central node, and I believe everyone agrees that we should have good support for making this usage work well. Thus, in a basic star topology of different participants, there are going to be different path characteristics between the central node and each participant.

1A) This makes it necessary to consider how one can ensure that the central node can deliver appropriate rates. One way is to de-couple the links by having the central node perform individual transcodings for every participant. A simpler non-transcoding central node, only forwarding streams between end-points, would have to enforce the lowest path characteristics on everyone. If one doesn't want to transcode at the central node, nor apply the lowest path characteristics to all, one needs to consider either simulcast or scalable video coding. Both mean that, at least in the direction from a participant to the central node, multiple codec operation points are needed: either one per peer connection, which is how I see simulcast being realized with today's API and functionality, or an encoding format supporting scalable coding within a single peer connection.
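To illustrate why the central node wants multiple operation points, here is a minimal sketch (labels and bit-rates are invented for the example) of a non-transcoding node picking, per receiver, the highest simulcast layer that fits that receiver's downlink estimate:

```python
# Hypothetical sketch: a non-transcoding central node forwards, to
# each receiver, the highest "operation point" whose bit-rate fits
# that receiver's current downlink estimate. All numbers invented.
OPERATION_POINTS = [       # (label, bit-rate in kbps), high to low
    ("720p30", 1500),
    ("360p30", 600),
    ("180p15", 150),
]

def select_layer(downlink_kbps):
    """Return the best operation point a receiver's path can carry."""
    for label, rate in OPERATION_POINTS:
        if rate <= downlink_kbps:
            return label
    return None  # path too constrained even for the lowest layer

# Without multiple layers from the sender, everyone would be stuck
# with whatever the 100 kbps participant can receive.
assert select_layer(2000) == "720p30"
assert select_layer(700) == "360p30"
assert select_layer(100) is None
```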

1B) In cases where the central node has minimal functionality and is basically a relay plus an ICE and DTLS-SRTP termination point (I assume EKT to avoid having to do re-encryption), there is a need to be able to handle sources from different participants. This puts extra requirements on how the parameters are successfully negotiated. For example, changing the values for one media source should not force a renegotiation with everyone.

2) The non-centralized multiparty use case appears to stress the need for timely dynamic control equally much, or even more. If each sender has a number of peer connections to its peers, it may use local audio levels to determine whether its media stream is to be sent or not. The amount of bit-rate and screen estate needed for display will thus change rapidly as different users speak, so minimal delay when changing preferences is important.

3) We also have speech and audio use cases, including audio-only ones. For audio there could also be a desire to request or indicate changes in the required audio bandwidth, or the use of multiple channels.

4) Adaptation to a legacy node from a central node in a multi-party conference. In some use cases legacy nodes might have special needs that are within the profiles a WebRTC end-point is capable of producing. Thus the central node might request the nodes to constrain themselves to particular payload types, audio bandwidth, etc. to accommodate a joining session participant.

5) There appears to be a need for expressing dynamic requests for target bit-rate as one parameter. This can be supported by TMMBR (RFC 5104), but there are additional transport-related parameters that could help with the adaptation. These include MTU, limits on packet rate, and the amount of aggregation of audio frames in the payload.
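The packet-rate effect of audio frame aggregation is simple arithmetic; a small sketch (the 20 ms frame duration is just a common example, not mandated by anything here):

```python
# Illustrative arithmetic: aggregating several audio frames into one
# RTP packet trades packet rate (and per-packet header overhead) for
# extra sender-side buffering delay.
def packet_rate(frame_ms, frames_per_packet):
    """RTP packets per second for a given aggregation factor."""
    return 1000 / (frame_ms * frames_per_packet)

# 20 ms audio frames, one frame per packet: 50 packets/s.
assert packet_rate(20, 1) == 50
# Aggregating 3 frames per packet cuts the rate to ~16.7 packets/s,
# at the cost of up to 40 ms of added delay for the first frame.
assert round(packet_rate(20, 3), 1) == 16.7
```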

Overview
--------

The basic idea in this proposal is to use JSEP to establish the outer limits for behavior, and then use the Codec Operation Point (COP) proposal, as detailed in draft-westerlund-avtext-codec-operation-point, to handle dynamic changes during the session.

So highest resolution, frame-rate and maximum complexity are expressed in the JSEP SDP. In several video codecs, complexity is expressed by profile and level. I know that VP8 currently doesn't have this, but these parameters are under discussion.

During the session, the browser implementation detects when there is a need to use COP to do any of the following:

A) Request new target values for codec operation, for example because the GUI element displaying a video has been resized.

B) Indicate when the end-point, in its role as sender, changes parameters.

In addition to just spatial resolution and video frame rate, I propose that the following parameters are considered as candidates for dynamic indication and request:

- Spatial resolution (as x and y resolution)
- Frame-rate
- Picture aspect ratio
- Sample aspect ratio
- Payload type
- Bit-rate
- Token bucket size (to control the burstiness of the sender)
- Channels
- Sampling rate
- Maximum RTP packet size
- Maximum RTP packet rate
- Application data unit aggregation (to control the number of audio frames in the same RTP packet)
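To make the parameter set concrete, here is a hypothetical per-SSRC request structure. The field names are purely illustrative; they do NOT reproduce the actual message format defined in draft-westerlund-avtext-codec-operation-point:

```python
# Hypothetical data structure for a per-SSRC COP request. Field
# names are illustrative, not the draft's wire format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CopRequest:
    ssrc: int                                  # the media source being constrained
    width: Optional[int] = None                # spatial resolution (x)
    height: Optional[int] = None               # spatial resolution (y)
    framerate: Optional[float] = None          # frames per second
    bitrate_kbps: Optional[int] = None         # target bit-rate
    token_bucket_bytes: Optional[int] = None   # burstiness control
    max_rtp_packet_size: Optional[int] = None  # bytes
    max_rtp_packet_rate: Optional[int] = None  # packets per second

# Example: ask source 0x1234 for a smaller picture after a window
# resize; fields left at None stay unconstrained.
req = CopRequest(ssrc=0x1234, width=320, height=180, framerate=15)
assert req.bitrate_kbps is None
```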


Differences
-----------

A) Using COP versus using SDP-based signaling for the dynamic changes are two quite different models of how the interaction happens.

With COP this all happens in the browser, normally initiated by the browser's own determination that a COP request or notification is needed. Harald's proposal appears to require that the JS initiate a renegotiation. This puts a requirement on the implementation to listen to the correct callbacks to know when changes happen, such as window resizes. To my knowledge there are not yet any proposals for how the browser can initiate a JSEP renegotiation.

Thus COP has the advantage that no API changes are needed to get browser-triggered parameter changes. W3C can choose to, but is not required to, add API methods that allow the JS to make codec parameter requests.

The next point is that COP does not require the JS application to have code to detect and handle re-negotiation. This makes it simpler for a basic application to get good behavior: such applications are not interrupted, nor do they need to handle JSEP and Offer/Answer state-machine lock-out effects due to dynamic changes.

How big an impact these API issues have is unclear, as W3C currently appears not to have included any discussion of how the browser can initiate an offer/answer exchange towards the JS when it determines a need to change parameters.

But I am worried about using SDP together with an API that requires the application to listen for the triggers that could benefit from a codec parameter renegotiation. This will likely result in good behavior only for JS applications whose implementors are really good and figure out which listeners and which signaling tricks are needed with JSEP to get good performance. I would much prefer good behavior by default in simple applications, i.e. relying on the default behavior that the browser implementor has put in.

B) Using the media plane, i.e. RTCP, for this signaling lets it in most cases go directly between the encoding and the decoding entity in the code. There is no need to involve the JS or the signaling server. One issue with using JSEP and SDP is the state-machine lock-out effects that can occur once an Offer has been sent: that browser may not be able to send a new updated Offer reflecting the latest change until the Answer has been properly processed. COP doesn't have these limitations; it can send a new parameter request immediately, limited only by RTCP bandwidth restrictions. Using the media plane in my view guarantees that COP is never worse than what the signaling plane can do at its best.
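The lock-out effect can be sketched as a tiny state machine (a deliberately minimal model of RFC 3264 offer/answer gating, not of any real browser implementation):

```python
# Minimal model of the Offer/Answer "lock-out": once an offer is in
# flight, a new offer cannot be sent until the answer is processed.
# An RTCP-based COP request has no such gating.
class OfferAnswerState:
    def __init__(self):
        self.awaiting_answer = False

    def send_offer(self):
        if self.awaiting_answer:
            raise RuntimeError("blocked: previous offer still unanswered")
        self.awaiting_answer = True

    def receive_answer(self):
        self.awaiting_answer = False

oa = OfferAnswerState()
oa.send_offer()          # first offer goes out
try:
    oa.send_offer()      # a newer change must wait ...
    blocked = False
except RuntimeError:
    blocked = True
assert blocked
oa.receive_answer()      # ... until the answer arrives
oa.send_offer()          # only now can the update be signaled
```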

C) As the general restrictions are determined in the initial negotiation, COP doesn't have the issue that in-flight media streams can become out of bounds. Thus there is no need for a two-phase change of signaling parameters.

D) Relying on draft-lennox-mmusic-sdp-source-selection-04 has several
additional implications that should be discussed separately. The draft
currently includes the following functionalities.

  D1) It contains a media stream pause proposal. This makes it subject to the ongoing architectural debate in DISPATCH around draft-westerlund-avtext-rtp-stream-pause-00, which is a competing proposal for the same functionality.

  D2) The inclusion of a maximum desired frame rate in an SSRC-specific way.

  D3) Extending the image attribute to an SSRC-specific expression of desired resolutions.

  D4) Expressing relative priority in receiving different SSRCs.

  D5) Providing "information" on a sent SSRC.

  D6) An indication of whether the media sender is actively sending media using the given SSRC.

Is it a correct observation that only D2 and D3 are required for the
functionality of Resolution negotiation?

E) The standardization situation is similar for both proposals: both rely on Internet drafts that are currently individual submissions. Both are partially caught up in the architectural discussion held in Paris in the DISPATCH WG around media pause/resume (draft-westerlund-avtext-rtp-stream-pause-00) and media stream selection (draft-westerlund-dispatch-stream-selection-00), on what the most appropriate level for this functionality is. This discussion will continue on the RAI-Area mailing list.

F) As seen in the discussion on the mailing list, the imageattr definitions may not be a 100% match for what is desired in Harald's proposal. I believe that COP's definitions are more appropriate, especially the possibility of expressing "target" values. In addition, they are still open for adjustment if they don't match WebRTC's requirements.

I would also like to point out that I believe this functionality is highly desirable for CLUE as well, and that their requirements should be taken into account. I think this is one of the aspects where having matching functionality will make WebRTC-to-CLUE interworking much easier.

Thanks for reading all the way here!

Cheers

Magnus Westerlund

----------------------------------------------------------------------
Multimedia Technologies, Ericsson Research EAB/TVM
----------------------------------------------------------------------
Ericsson AB                | Phone  +46 10 7148287
Färögatan 6                | Mobile +46 73 0949079
SE-164 80 Stockholm, Sweden| mailto: magnus.westerlund@ericsson.com
----------------------------------------------------------------------