[rtcweb] Comments on draft-uberti-rtcweb-plan-00

Bernard Aboba <bernard_aboba@hotmail.com> Sat, 11 May 2013 23:59 UTC

Message-ID: <BLU169-W1158BEB6CD5A0828D7D866293A60@phx.gbl>
From: Bernard Aboba <bernard_aboba@hotmail.com>
To: "rtcweb@ietf.org" <rtcweb@ietf.org>
Date: Sat, 11 May 2013 16:59:49 -0700
Subject: [rtcweb] Comments on draft-uberti-rtcweb-plan-00

First some general comments:
To some extent this document is arguing against a straw man (individual m= lines for RTP streams) that in its extreme case (dozens of RTP streams, possibly with simulcast and layered coding) is quite impractical. AFAIK, many (if not the majority) of video implementations have rejected that purist "Plan A" approach. So to some extent, the document is beating a horse that is quite dead (or should be). To characterize that dead horse as "legacy" is a mistake -- in fact legacy implementations are more sophisticated than that, and several resemble Plan B in their approach, though not closely enough to avoid potential interoperability problems. More later on how to get better interoperability with legacy implementations and how to reduce the exchange to a single O/A in common cases.
Section 2.4
2.4. Interworking with legacy devices

When interacting with a legacy application that only knows how to
deal with a small number of sources, it must be possible to degrade
gracefully to a usable basic experience, where at least a single
audio and video source are active on each side, using a typical
offer/answer exchange.
[BA] I would suggest that only being able to handle a single legacy stream is too limiting. It should be possible to handle multiple undeclared SSRCs, as long as each is assumed to inherit the parameters declared for the RTP session it is part of.
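A minimal sketch of that idea, assuming a receiver that keeps one table per RTP session (m= line): an SSRC that was never announced in a=ssrc is accepted as long as its payload type was negotiated for the session, and it simply inherits that session's codec parameters. The class and method names (SessionParams, RtpSession, on_packet) are hypothetical, not part of any WebRTC API.

```python
from dataclasses import dataclass, field


@dataclass
class SessionParams:
    """Parameters declared once for an RTP session (one m= line)."""
    media: str                      # "audio" or "video"
    payload_types: dict[int, str]   # negotiated PT -> codec name (from a=rtpmap)


@dataclass
class RtpSession:
    params: SessionParams
    declared: dict[int, str] = field(default_factory=dict)    # SSRC -> track, from a=ssrc
    undeclared: dict[int, str] = field(default_factory=dict)  # SSRC -> track, assigned locally

    def on_packet(self, ssrc: int, payload_type: int) -> str:
        """Return the local track label an incoming RTP packet belongs to."""
        if payload_type not in self.params.payload_types:
            raise ValueError(f"PT {payload_type} not negotiated for {self.params.media}")
        if ssrc in self.declared:
            return self.declared[ssrc]
        if ssrc not in self.undeclared:
            # First packet from an SSRC that was never signaled: instead of
            # dropping it, create a receive stream that inherits the codecs
            # negotiated for this m= line.
            self.undeclared[ssrc] = f"{self.params.media}-undeclared-{len(self.undeclared)}"
        return self.undeclared[ssrc]
```

For example, two undeclared video SSRCs arriving on the same m= line each get their own receive stream, both decoded with the codecs already negotiated for that line.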
Section 2.1
These layouts can change dynamically, depending on the conference
content and the preferences of the receiver. As such, there are not
well-defined 'roles', that could be used to group sources into
specific 'large' or 'thumbnail' categories. As such, the requirement
Plan B attempts to satisfy is support for sending and receiving up to
hundreds of simultaneous, heterogeneous sources.

[BA] While I agree that the layouts can change dynamically, I wonder whether the draft implies that the burden of determining the 'roles' is on the mixer. For example, it might be assumed that the mixer allocates an SSRC for the 'large' category and other SSRCs for the 'thumbnails', and that these SSRCs are then statically mapped to MSTs and rendered. However, another approach is for the browser to perform the role assignment, and I would argue that this makes more sense in some cases, particularly since it could make the mixer a lot simpler, or even obviate the need for a mixer entirely (e.g. an RTP translator might work in some cases).
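To illustrate browser-side role assignment, here is a sketch in which the receiver ranks the SSRCs it is receiving by a recent activity score and maps them onto 'large' and 'thumbnail' slots itself, so the mixer (or translator) does not have to. How the activity score is obtained (for example, from audio levels) and the function name assign_roles are assumptions for illustration.

```python
def assign_roles(activity: dict[int, float], thumbnails: int) -> dict[int, str]:
    """Map each received SSRC to a local layout role.

    activity: SSRC -> recent activity score (hypothetical; higher means more active).
    thumbnails: number of thumbnail slots in the current local layout.
    """
    # Most active source first; sorted() is stable, so ties keep insertion order.
    ranked = sorted(activity, key=lambda ssrc: activity[ssrc], reverse=True)
    roles = {}
    for rank, ssrc in enumerate(ranked):
        if rank == 0:
            roles[ssrc] = "large"
        elif rank <= thumbnails:
            roles[ssrc] = "thumbnail"
        else:
            roles[ssrc] = "hidden"   # received but not currently rendered
    return roles
```

In this sketch the mapping is recomputed locally whenever activity or the local layout changes, so a layout change by itself requires neither new SSRCs from the mixer nor new signaling.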

2.6. Simple binding of MediaStreamTrack to SDP

In WebRTC, each media source is identified by a MediaStreamTrack
object. In order to ensure that the MSTs created by the sender show
up at the receiver, each MST's id attribute needs to be reflected in
SDP.
[BA] While I understand why the MST id might be useful to expose in the API, I don't see why it needs to be signaled over the wire. In general, since WebRTC is supposed to be "signaling independent", we should try to avoid signaling things that don't need to be signaled. For the MST id, several approaches seem preferable, including use of an RTP SDES item or (even better) allowing the receiver to determine which MediaStreamTrack an incoming RTP stream belongs to when the SSRC is first received.
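As a sketch of the SDES alternative, the parser below walks the chunks of an RTCP SDES packet (PT 202, laid out as in RFC 3550) and extracts one item per SSRC. Which SDES item type would carry an MST id is not specified here; MST_ID_ITEM = 200 is a placeholder, not an IANA-registered value, and the helper names are hypothetical.

```python
import struct

RTCP_SDES = 202
MST_ID_ITEM = 200  # placeholder SDES item type for an MST id (hypothetical, not registered)


def build_sdes(ssrc: int, items: list[tuple[int, bytes]]) -> bytes:
    """Build a single-chunk SDES packet (used here to exercise the parser)."""
    chunk = struct.pack("!I", ssrc)
    for item_type, value in items:
        chunk += bytes([item_type, len(value)]) + value
    chunk += b"\x00"                          # null item ends the item list
    chunk += b"\x00" * (-len(chunk) % 4)      # pad the chunk to a 32-bit boundary
    length_words = (4 + len(chunk)) // 4 - 1  # RTCP length: 32-bit words minus one
    return struct.pack("!BBH", 0x80 | 1, RTCP_SDES, length_words) + chunk


def parse_sdes(packet: bytes, wanted: int = MST_ID_ITEM) -> dict[int, str]:
    """Return {SSRC: value} for every chunk that carries the wanted item type."""
    first, pt, length_words = struct.unpack_from("!BBH", packet, 0)
    if first >> 6 != 2 or pt != RTCP_SDES:
        raise ValueError("not an RTCP SDES packet")
    source_count = first & 0x1F
    end = 4 * (length_words + 1)
    offset, found = 4, {}
    for _ in range(source_count):
        (ssrc,) = struct.unpack_from("!I", packet, offset)
        offset += 4
        # Items are (type, length, value) triples until a null type octet.
        while offset < end and packet[offset] != 0:
            item_type, item_len = packet[offset], packet[offset + 1]
            if item_type == wanted:
                found[ssrc] = packet[offset + 2:offset + 2 + item_len].decode("utf-8")
            offset += 2 + item_len
        # Skip the null terminator and any padding up to the next 32-bit boundary.
        offset = (offset + 1 + 3) & ~3
    return found
```

One practical consequence of the SDES approach is that the binding only becomes known when the first RTCP packet carrying the item arrives, which may lag the first RTP packet from that SSRC.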

2.7. Support for RTX, FEC, simulcast, layered coding

For robust applications, techniques like RTX and FEC are used to
protect media, and simulcast/layered coding can be used to provide
support to heterogeneous receivers. It needs to be possible to
support these techniques, allow the recipient to optionally use or
not use them on a source-by-source basis, and for simulcast/layered
scenarios, control which simulcast streams or layers are received.
[BA] In practice, control over simulcast streams or layers is often left with the sender. That is, a sender can determine which simulcast stream or set of layers will go to a particular receiver based on RTCP feedback. What the receiver needs to indicate is what it is capable of receiving (e.g. "I can handle up to 4 temporal layers"). While I'm not against allowing a receiver to indicate which simulcast or layered streams it wants to receive, I don't think this is the most important use case. In particular, assuming receiver-side control isn't compatible with the sender-side congestion control most commonly used along with simulcast/layered coding.
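A sketch of the sender-side approach: the receiver only advertises how many temporal layers it can handle, and the sender decides per receiver how many to forward using its own bandwidth estimate (derived from RTCP feedback in some way not shown). The per-layer bitrates and the function name are illustrative assumptions.

```python
def select_temporal_layers(layer_kbps: list[int], receiver_max_layers: int,
                           estimated_kbps: float) -> int:
    """Return how many temporal layers to forward to one receiver.

    layer_kbps[i]: incremental bitrate of temporal layer i (base layer first).
    receiver_max_layers: capability the receiver advertised ("up to N layers").
    estimated_kbps: sender's bandwidth estimate for this receiver.
    """
    chosen, total = 0, 0.0
    for rate in layer_kbps[:receiver_max_layers]:
        # Stop once the next layer would exceed the estimate, but always
        # keep the base layer so the receiver gets something decodable.
        if total + rate > estimated_kbps and chosen > 0:
            break
        total += rate
        chosen += 1
    return chosen
```

With layers of 300, 150, 100 and 50 kbps and an estimate of 500 kbps, the sender forwards two layers even though the receiver said it could handle four; the receiver's capability only caps the choice.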
Section 3.1
By only using a m= line for each media type, as opposed to each media
source, this approach reduces the number of transports required to 2
even in complex audio/video cases.
[BA] I wasn't sure whether the "2" referred to here meant one for audio and one for video, or one for RTP and another for RTCP. To clarify, I might say "2 (one for audio and one for video, assuming RTCP mux)".
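The arithmetic behind the "2", assuming no BUNDLE: each m= line needs one transport with RTCP mux, or two (RTP and RTCP) without it. The helper below is purely illustrative and compares one m= line per media type with one per source.

```python
def transports_needed(source_media: list[str], m_line_per_source: bool, rtcp_mux: bool) -> int:
    """Count transport flows for unbundled m= lines (illustrative only)."""
    # One m= line per source, or one per distinct media type ("audio", "video").
    m_lines = len(source_media) if m_line_per_source else len(set(source_media))
    # With RTCP mux, RTP and RTCP share a transport; otherwise each needs its own.
    return m_lines * (1 if rtcp_mux else 2)


sources = ["audio"] + ["video"] * 5
print(transports_needed(sources, m_line_per_source=False, rtcp_mux=True))   # 2
print(transports_needed(sources, m_line_per_source=False, rtcp_mux=False))  # 4
print(transports_needed(sources, m_line_per_source=True, rtcp_mux=True))    # 6
```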
4.1. Negotiation of new or legacy behavior
In order to know whether a given application supports Plan B, an
attribute in the offer is needed. There are various options that
could be used for this:

o a=ssrc isn't enough, since you might not have any send streams,
and therefore no a=ssrc attributes.

o a=max-*-ssrc could work, but has additional semantics

o a=msid-semantic indicates that you understand MSIDs.

Because understanding MSID is a prerequisite to using plan B, the
third option (presence of a=msid-semantic) is recommended.

[BA] I would suggest that max-*-ssrc is a better choice because there are legacy scenarios where msid might not be present.
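To make the comparison concrete, this sketch scans an SDP blob for the three candidate indicators listed in the draft. It matches any attribute of the form max-*-ssrc rather than assuming particular attribute names or value syntax; the function name is hypothetical.

```python
def offer_signals(sdp: str) -> dict[str, bool]:
    """Report which of the candidate Plan B indicators appear in an SDP blob."""
    names = []
    for line in sdp.splitlines():
        if line.startswith("a="):
            # Attribute name is everything before the first ':' (or the whole attribute).
            names.append(line[2:].split(":", 1)[0])
    return {
        "ssrc": "ssrc" in names,
        "max_ssrc": any(n.startswith("max-") and n.endswith("-ssrc") for n in names),
        "msid_semantic": "msid-semantic" in names,
    }
```

Under the suggestion above, an endpoint would key its multi-source behavior on the max_ssrc result, which can be true for an offer that carries no a=msid-semantic line at all.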
4.2. New signaling flow

When both sides support Plan B, to properly allow both sides to
indicate which MSTs they have, and allow the remote side to select
the desired MSTs to receive, a 3-way handshake is needed (this is
just math; the offer can't select the answerer's MSTs until they know
about them).
[BA] While I understand the argument for why you need two O/As for both sides to select the desired streams, I think it is possible to design the exchange so that only one O/A is needed most of the time. The key concept is for the offer to contain information on what the offerer is capable of receiving in addition to what it is capable of sending. Yes, another O/A might be needed if it turns out that the Offerer wants something different from what the Answerer chose, but at least the first Answer is guaranteed to be acceptable to the Offerer.
That is, I believe we should think of the second O/A as an optional exchange that hopefully won't be needed much of the time, rather than as part of a "3-way" or "4-way" handshake that will execute every time.
The expected flow for this would be for the caller to
send an offer with its sources, then the callee would send back an
answer with the sources it wants the caller to send, followed
immediately by an offer with the sources that the callee has
available to send. Finally, the answerer will reply back with the
sources that it wants to request from the callee. The entire
sequence can be done in 1.5 RTT.
[BA] Why not add the information on what sources the callee has available to send to the first Answer? If the Offer also contains the maximum number of receive SSRCs, the Offerer should be prepared to receive that many SSRCs, and the Answer could enable up to that many sources and start sending. That way, if the sources sent are acceptable to the Offerer, we don't need another Offer/Answer exchange, because the Answerer has already indicated which sources it wants from the ones the Offerer said it could send.
This assumes that the Offerer can handle incoming RTP streams, up to the maximum number of receive SSRCs, before it receives the Answer, which can explicitly declare the SSRCs.
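A sketch of the single-exchange flow proposed above, under the stated assumption that the Offerer is ready for up to max_recv_ssrcs incoming SSRCs before the Answer arrives. The Answer both requests sources from the Offerer and enables up to that many of the Answerer's own sources; a second O/A is only needed if the Offerer wants a different set. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    send_sources: list[str]   # sources the Offerer can send
    max_recv_ssrcs: int       # incoming SSRCs the Offerer is prepared for


@dataclass
class Answer:
    requested: list[str]      # Offerer sources the Answerer wants
    available: list[str]      # all sources the Answerer could send
    sending: list[str]        # Answerer sources enabled immediately


def make_answer(offer: Offer, answerer_sources: list[str], wanted: set[str]) -> Answer:
    requested = [s for s in offer.send_sources if s in wanted]
    # Enabling no more sources than the Offerer can receive keeps the
    # Answer acceptable to the Offerer without a further exchange.
    sending = answerer_sources[: offer.max_recv_ssrcs]
    return Answer(requested, list(answerer_sources), sending)


def needs_second_exchange(answer: Answer, offerer_prefers: list[str]) -> bool:
    # Renegotiate only if the Offerer wants a different set than the one already flowing.
    return set(offerer_prefers) != set(answer.sending)
```

In the common case needs_second_exchange returns False and the session is set up with one Offer and one Answer; the second O/A becomes the optional exchange described above rather than a fixed step.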
In addition, since the
sources are known ahead of time by the recipient of said sources, it
is prepared to demux them by SSRC without any signaling/media race.

[BA] If you don't make this a hard requirement, it seems like you could get by with a single O/A exchange much of the time.
4.3. Legacy signaling flow

In the legacy case, Plan B degrades gracefully back to a single
offer-answer sequence. Since there's no brokering of which sources
should be sent, the "new" endpoint picks a default media source for
each m= line, and upon receiving an answer indicating lack of support
for Plan B, it sends just the default sources to the legacy endpoint.
When receiving media from the legacy endpoint, the new endpoint
creates a "default" MediaStream (containing a single
MediaStreamTrack) for each m= line, just as when talking to any other
legacy endpoint, as specified in the MSID draft.
[BA] I'd claim that not only the "legacy" case but other cases as well can be handled with a single O/A if you are prepared to handle more than one undeclared SSRC. Seems like a win to me.
