Re: [rtcweb] Alternative Proposal for Dynamic Codec Parameter Change (Was: Re: Resolution negotiation - a contribution)

Harald Alvestrand <harald@alvestrand.no> Tue, 24 April 2012 16:43 UTC

Return-Path: <harald@alvestrand.no>
X-Original-To: rtcweb@ietfa.amsl.com
Delivered-To: rtcweb@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id AD92821F8709 for <rtcweb@ietfa.amsl.com>; Tue, 24 Apr 2012 09:43:48 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -110.499
X-Spam-Level:
X-Spam-Status: No, score=-110.499 tagged_above=-999 required=5 tests=[AWL=0.100, BAYES_00=-2.599, RCVD_IN_DNSWL_HI=-8, USER_IN_WHITELIST=-100]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id kGwkL0JCf3Hd for <rtcweb@ietfa.amsl.com>; Tue, 24 Apr 2012 09:43:47 -0700 (PDT)
Received: from eikenes.alvestrand.no (eikenes.alvestrand.no [158.38.152.233]) by ietfa.amsl.com (Postfix) with ESMTP id EE39C21F8702 for <rtcweb@ietf.org>; Tue, 24 Apr 2012 09:43:46 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by eikenes.alvestrand.no (Postfix) with ESMTP id C525B39E0CD; Tue, 24 Apr 2012 18:43:45 +0200 (CEST)
X-Virus-Scanned: Debian amavisd-new at eikenes.alvestrand.no
Received: from eikenes.alvestrand.no ([127.0.0.1]) by localhost (eikenes.alvestrand.no [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PqdWbrt4WxhO; Tue, 24 Apr 2012 18:43:44 +0200 (CEST)
Received: from hta-dell.lul.corp.google.com (62-20-124-50.customer.telia.com [62.20.124.50]) by eikenes.alvestrand.no (Postfix) with ESMTPSA id 3F4AE39E089; Tue, 24 Apr 2012 18:43:44 +0200 (CEST)
Message-ID: <4F96D83F.8020705@alvestrand.no>
Date: Tue, 24 Apr 2012 18:43:43 +0200
From: Harald Alvestrand <harald@alvestrand.no>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.28) Gecko/20120313 Thunderbird/3.1.20
MIME-Version: 1.0
To: Magnus Westerlund <magnus.westerlund@ericsson.com>
References: <4F869648.2020605@alvestrand.no> <4F96B7C9.1030609@ericsson.com>
In-Reply-To: <4F96B7C9.1030609@ericsson.com>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 8bit
Cc: "rtcweb@ietf.org" <rtcweb@ietf.org>
Subject: Re: [rtcweb] Alternative Proposal for Dynamic Codec Parameter Change (Was: Re: Resolution negotiation - a contribution)
X-BeenThere: rtcweb@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Real-Time Communication in WEB-browsers working group list <rtcweb.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rtcweb>, <mailto:rtcweb-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/rtcweb>
List-Post: <mailto:rtcweb@ietf.org>
List-Help: <mailto:rtcweb-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rtcweb>, <mailto:rtcweb-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Apr 2012 16:43:48 -0000

I'm happy to see this - there are a number of quibbles I want to make, 
but overall, I like it.

My big worry is the IPR issue - I think that negotiation of video size 
during a call is likely to be "MUST implement", and I don't want to 
require implementation of a protocol against which people have filed 
non-RF IPR claims if there are viable alternatives.

               Harald

On 04/24/2012 04:25 PM, Magnus Westerlund wrote:
> Harald, WG,
>
> This is posted as individual. When it comes to this topic I will not
> engage in any chair activities because I do have an alternative proposal
> based on
> https://datatracker.ietf.org/doc/draft-westerlund-avtext-codec-operation-point/.
>
>
> My proposal has some significant differences in how it works compared to
> Harald's. I will start with a discussion of requirements, in addition to
> Harald's, then an overview of my proposal, and ending with a discussion
> of the differences between the proposals. This is quite long but I do
> hope you will read it to the end.
>
> Requirements
> ------------
>
> Let's start with the requirements that Harald has written. I think they
> should be formulated differently. First of all, I think the parameters
> are negotiated, indicated, etc. in the context of a peer connection.
>
> Secondly, when it comes to maximum resolution and maximum frame-rate,
> there are actually three different limits that I think are important:
> highest spatial resolution, highest frame-rate, and maximum complexity
> for a given video stream. The maximum complexity is often expressed as
> the number of macroblocks a video codec can process per second. This is
> a well-established complexity measure, used as part of standardized
> video codecs' "level" definitions since the introduction of H.263 Annex
> X in 2004. As this is essentially a joint cap on the number of pixels
> per frame times the frame-rate, there exist cases where this complexity
> figure is the constraining factor, forcing a sender or receiver to
> request either a higher frame rate with a lower resolution or a higher
> resolution with a lower frame rate.
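As a rough illustration of this resolution/frame-rate tradeoff under a macroblocks-per-second cap (the cap value and resolutions below are illustrative, not taken from any specific codec level definition):

```python
# Illustrative sketch of a macroblocks-per-second complexity cap.
# The cap value is hypothetical, not from any specific codec level.

MB = 16  # a macroblock covers 16x16 luma pixels

def macroblocks_per_second(width, height, fps):
    """Macroblocks per frame (rounded up per dimension) times frame rate."""
    mbs_per_frame = ((width + MB - 1) // MB) * ((height + MB - 1) // MB)
    return mbs_per_frame * fps

CAP = 216_000  # hypothetical complexity cap in macroblocks/second

# 1280x720 at 30 fps fits under the cap...
assert macroblocks_per_second(1280, 720, 30) <= CAP   # 108,000
# ...but 1920x1080 at 30 fps does not, so a constrained end-point must
# trade resolution against frame rate: 1080p is only reachable at 15 fps.
assert macroblocks_per_second(1920, 1080, 30) > CAP   # 244,800
assert macroblocks_per_second(1920, 1080, 15) <= CAP  # 122,400
```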
>
> The requirements should also be clearer in that one needs to handle
> multiple SSRCs per RTP session, including multiple cameras or audio
> streams (microphones) from a single end-point, where each stream can be
> encoded using different parameter values.
>
> I also think it is important that we consider what additional encoding
> property parameters it would make sense for WebRTC to be able to
> dynamically negotiate, request and indicate.
>
> Some additional use cases that should be considered in the requirements:
>
> 1) We have use cases including multi-party communication. One way to
> realize these is to use a central node and I believe everyone agrees
> that we should have good support for getting this usage to work well.
> Thus, in a basic star topology, there are going to be different path
> characteristics between the central node and each participant.
>
> 1A) This makes it necessary to consider how one can ensure that the
> central node can deliver appropriate rates. One way is to de-couple the
> links by having the central node perform individual transcodings to
> every participant. A simpler non-transcoding central node, only
> forwarding streams between end-points, would have to enforce the lowest
> path characteristics on everyone. If one doesn't want to transcode at
> the central node, nor impose the lowest path characteristics on all,
> one needs to consider either simulcast or scalable video coding. Both
> imply that, at least in the direction from a participant to the central
> node, one needs to use multiple codec operation points: either one per
> peer connection, which is how I see simulcast being realized with
> today's API and functionality, or an encoding format supporting
> scalable coding within a single peer connection.
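The forwarding tradeoff in 1A can be made concrete with a toy layer-selection rule (a sketch only; the layer bitrates and downlink capacities are made up): a non-transcoding central node receiving simulcast picks, per participant, the highest layer that fits that participant's path.

```python
# Toy simulcast forwarding: the central node picks, per participant,
# the highest-bitrate layer that fits that participant's downlink.
# All names and numbers are illustrative.

LAYERS = [  # (name, bits per second), lowest first
    ("180p", 150_000),
    ("360p", 500_000),
    ("720p", 1_500_000),
]

def select_layer(downlink_bps):
    """Return the highest layer not exceeding the participant's capacity."""
    best = LAYERS[0][0]  # always fall back to the lowest layer
    for name, rate in LAYERS:
        if rate <= downlink_bps:
            best = name
    return best

downlinks = {"alice": 2_000_000, "bob": 600_000, "carol": 200_000}
chosen = {who: select_layer(bps) for who, bps in downlinks.items()}
# {'alice': '720p', 'bob': '360p', 'carol': '180p'}
```

Each participant's choice decouples from the others, which is exactly what a single non-simulcast stream cannot offer without transcoding.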
>
> 1B) In cases where the central node has minimal functionality and
> basically is a relay and an ICE plus DTLS-SRTP termination point (I
> assume EKT to avoid having to do re-encryption), there is a need to be
> able to handle sources from different participants. This puts extra
> requirements on how to successfully negotiate the parameters. For
> example changing the values for one media source should not force one to
> renegotiate with everyone.
>
> 2) The non-centralized multiparty use case appears to stress the need
> for timely dynamic control equally much or more. If each sender has a
> number of peer connections to its peers, it may use local audio levels
> to determine whether its media stream is to be sent or not. Thus the
> amount of bit-rate and screen real estate needed for display will
> change rapidly as different users speak, and the need for minimal delay
> when changing preferences is important.
>
> 3) We also have speech and audio, including audio-only, use cases. For
> audio there could also be a desire to request or indicate changes in
> the audio bandwidth required, or in the use of multiple channels.
>
> 4) Adaptation to a legacy node from a central node in a multi-party
> conference. In some use cases legacy nodes might have special needs
> that are within the profiles a WebRTC end-point is capable of
> producing. Thus the central node might request the nodes to constrain
> themselves to particular payload types, audio bandwidth, etc. to
> accommodate a joining session participant.
>
> 5) There appears to be a need for expressing dynamic requests for
> target bit-rate as one parameter. This can be supported by TMMBR
> (RFC 5104), but there are additional transport-related parameters that
> could help with the adaptation. These include MTU, limits on packet
> rate, and the amount of aggregation of audio frames in the payload.
>
> Overview
> --------
>
> The basic idea in this proposal is to use JSEP to establish the outer
> limits for behavior and then use Codec Operation Point (COP) proposal as
> detailed in draft-westerlund-avtext-codec-operation-point to handle
> dynamic changes during the session.
>
> So highest resolution, frame-rate and maximum complexity are expressed
> in JSEP SDP. In several video codecs, complexity is expressed by
> profile and level. I know that VP8 currently doesn't have this, but
> these parameters are under discussion.
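For concreteness, the kind of outer limits meant here could be expressed with SDP attribute lines along these lines (a sketch only: `a=imageattr` comes from RFC 6236 and `a=framerate` from RFC 4566, but the exact attribute set for WebRTC is precisely what is being debated, and the payload type here is arbitrary):

```python
# Sketch: building SDP attribute lines that bound a video stream's
# resolution (a=imageattr, RFC 6236) and frame rate (a=framerate,
# RFC 4566). Payload type and limit values are illustrative.

def video_limit_lines(pt, max_w, max_h, max_fps):
    """Return SDP attribute lines expressing receive-side outer limits."""
    return [
        f"a=imageattr:{pt} recv [x=[16:{max_w}],y=[16:{max_h}]]",
        f"a=framerate:{max_fps}",
    ]

lines = video_limit_lines(100, 1280, 720, 30)
# ['a=imageattr:100 recv [x=[16:1280],y=[16:720]]', 'a=framerate:30']
```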
>
> During the session the browser implementation detects when there is a
> need to use COP to do any of the following things.
>
> A) Request new target values for codec operation, for example because
> the GUI element displaying a video has been resized.
>
> B) Indicate when the end-point, in its role as sender, changes
> parameters.
>
> In addition to spatial resolution and video frame rate, I propose that
> the following parameters be considered for dynamic indication and
> request:
>
> - Spatial resolution (as x and y resolution)
> - Frame-rate
> - Picture Aspect Ratio
> - Sample Aspect Ratio
> - Payload Type
> - Bit-rate
> - Token Bucket Size (to control burstiness of sender)
> - Channels
> - Sampling Rate
> - Maximum RTP Packet Size
> - Maximum RTP Packet Rate
> - Application Data Unit Aggregation (to control the number of audio
>   frames in the same RTP packet)
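One way to picture this parameter set is as a per-SSRC record where unset fields mean "no change requested" (a hypothetical data structure for illustration; the field names are not taken from the COP draft):

```python
# Hypothetical container for the proposed dynamically negotiable
# parameters; field names are illustrative, not from the COP draft.
from dataclasses import dataclass
from typing import Optional

@dataclass
class OperationPoint:
    ssrc: int                                    # which stream this applies to
    x_resolution: Optional[int] = None           # spatial resolution, pixels
    y_resolution: Optional[int] = None
    frame_rate: Optional[float] = None
    picture_aspect_ratio: Optional[str] = None   # e.g. "16:9"
    sample_aspect_ratio: Optional[str] = None
    payload_type: Optional[int] = None
    bit_rate: Optional[int] = None               # bits per second
    token_bucket_size: Optional[int] = None      # controls sender burstiness
    channels: Optional[int] = None               # audio channel count
    sampling_rate: Optional[int] = None          # Hz
    max_rtp_packet_size: Optional[int] = None    # bytes
    max_rtp_packet_rate: Optional[int] = None    # packets per second
    adu_aggregation: Optional[int] = None        # audio frames per RTP packet

# A request touches only the fields it wants changed; the rest stay None.
req = OperationPoint(ssrc=0x1234, x_resolution=640, y_resolution=360,
                     frame_rate=15.0)
```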
>
>
> Differences
> -----------
>
> A) Using COP and using SDP based signaling for the dynamic changes are
> two quite different models in relation to how the interaction happens.
>
> For COP this all happens in the browser, normally initiated by the
> browser's own determination that a COP request or notification is
> needed. Harald's proposal appears to require that the JS initiate a
> renegotiation. This puts a requirement on the implementation to listen
> to the correct callbacks to know when changes happen, such as window
> resize. To my knowledge there are not yet any proposals for how the
> browser can initiate a JSEP renegotiation.
>
> Thus COP has the advantage that no API changes are needed to get
> browser-triggered parameter changes. W3C can choose to add API methods
> allowing JS to make codec parameter requests, but is not required to.
>
> The next thing is that COP does not require the JS application to have
> code to detect and handle re-negotiation. This makes it simpler for the
> basic application to get good behavior: applications are not
> interrupted, nor do they need to handle JSEP & Offer/Answer state
> machine lock-out effects due to dynamic changes.
>
> How big an impact these API issues have is unclear, as W3C currently
> appears not to have included any discussion of how the browser can
> initiate an offer/answer exchange towards the JS when it determines a
> need to change parameters.
>
> But I am worried about using SDP with an API that requires the
> application to listen for triggers indicating that a codec parameter
> renegotiation could be beneficial. This will likely result in good
> behavior only for the JS application implementors that are really good
> and find out what listeners and what signaling tricks are needed with
> JSEP to get good performance. I would much prefer good behavior by
> default in simple applications, i.e. using the default behavior that
> the browser implementor has put in.
>
> B) Using the media plane, i.e. RTCP, for this signaling lets it in
> most cases go directly between the encoding and the decoding entity in
> the code. There is no need to involve the JS nor the signaling server.
> One issue with using JSEP and SDP is the state machine lock-out effects
> that can occur if one has sent an Offer: the browser may not be able to
> send a new updated Offer reflecting the latest change until the answer
> has been properly processed. COP doesn't have these limitations. It can
> send a new parameter request immediately, limited only by RTCP
> bandwidth restrictions. Using the media plane in my view guarantees
> that COP is never worse than what the signaling plane can perform at
> its best.
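The lock-out effect can be sketched as a tiny state machine (an illustration of the Offer/Answer rule, not any real API): while an offer is outstanding, a second offer cannot be sent, whereas a media-plane request has no such gate.

```python
# Sketch of the Offer/Answer lock-out: while an offer is outstanding,
# a second offer cannot be sent; a media-plane (RTCP-style) request
# has no such restriction. Class names are illustrative.

class OfferAnswerEndpoint:
    def __init__(self):
        self.offer_outstanding = False

    def send_offer(self):
        if self.offer_outstanding:
            raise RuntimeError("locked out: previous offer unanswered")
        self.offer_outstanding = True

    def receive_answer(self):
        self.offer_outstanding = False

class MediaPlaneEndpoint:
    def __init__(self):
        self.requests_sent = 0

    def send_request(self):
        # Limited only by RTCP bandwidth, not by signaling state.
        self.requests_sent += 1

oa = OfferAnswerEndpoint()
oa.send_offer()
try:
    oa.send_offer()      # blocked until the answer arrives
except RuntimeError:
    pass
oa.receive_answer()
oa.send_offer()          # now allowed again

mp = MediaPlaneEndpoint()
mp.send_request()
mp.send_request()        # back-to-back requests are fine
```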
>
> C) As the general restrictions are determined in the initial
> negotiation, COP doesn't have the issue that in-flight media streams can
> become out of bounds. Thus there is no need for a two-phase change of
> signaling parameters.
>
> D) Relying on draft-lennox-mmusic-sdp-source-selection-04 has several
> additional implications that should be discussed separately. The draft
> currently includes the following functionalities.
>
>    D1) It contains a media stream pause proposal. This makes it subject
> to the architectural debate currently ongoing in dispatch around
> draft-westerlund-avtext-rtp-stream-pause-00 which is a competing
> proposal for the same functionality.
>
>    D2) The inclusion of max desired frame rate in an SSRC-specific way.
>
>    D3) Extending the image attribute to a per-SSRC expression of
> desired resolutions.
>
>    D4) Expressing relative priority in receiving different SSRCs.
>
>    D5) Providing an "information" field on a sent SSRC.
>
>    D6) Indicating whether the media sender is actively sending media
> using the given SSRC.
>
> Is it a correct observation that only D2 and D3 are required for the
> functionality of Resolution negotiation?
>
> E) The standardization situation is similar for both proposals. They
> are both relying on Internet drafts that are currently individual
> submissions. Both are partially caught up in the architectural
> discussion held in Paris in the DISPATCH WG around Media pause/resume
> (draft-westerlund-avtext-rtp-stream-pause-00) and Media Stream
> Selection (draft-westerlund-dispatch-stream-selection-00) on what the
> most appropriate level of discussion is. This discussion will continue
> on the RAI-Area mailing list.
>
> F) As seen from the discussion on the mailing list, the imageattr
> definitions may not be a 100% match to what is desired with Harald's
> proposal. I believe that COP's are more appropriate, especially the
> possibility of "target" values. In addition, these are still open for
> adjustment if they don't match WebRTC's requirements.
>
> I would also like to point out that I believe this functionality is
> also highly desirable for CLUE, and that their requirements should be
> taken into account. I do think this is one of the aspects where having
> matching functionality will make WebRTC-to-CLUE interworking much
> easier.
>
> Thanks for reading all the way here!
>
> Cheers
>
> Magnus Westerlund
>
> ----------------------------------------------------------------------
> Multimedia Technologies, Ericsson Research EAB/TVM
> ----------------------------------------------------------------------
> Ericsson AB                | Phone  +46 10 7148287
> Färögatan 6                | Mobile +46 73 0949079
> SE-164 80 Stockholm, Sweden| mailto: magnus.westerlund@ericsson.com
> ----------------------------------------------------------------------
>
>
>