Re: [AVTCORE] Comments on draft-westerlund-avtcore-rtp-simulcast-00

"Brandenburg, R. (Ray) van" <> Thu, 03 November 2011 15:27 UTC

From: "Brandenburg, R. (Ray) van" <>
To: Magnus Westerlund <>
Date: Thu, 3 Nov 2011 15:27:42 +0000
List-Id: Audio/Video Transport Core Maintenance <>

Hi Magnus, 

See comments inline. 


-----Original Message-----
From: Magnus Westerlund [] 
Sent: Thursday 3 November 2011 15:32
To: Brandenburg, R. (Ray) van
Subject: Re: [AVTCORE] Comments on draft-westerlund-avtcore-rtp-simulcast-00


Thanks for the review. Some replies inline.

On 2011-10-26 09:27, Brandenburg, R. (Ray) van wrote:
> Hi Magnus,
> I reviewed your draft draft-westerlund-avtcore-rtp-simulcast-00 and 
> have some questions/comments:
> -          Section 3: This section lacks a summary/conclusion. A number
> of different scenarios are introduced, some of which are explicitly 
> stated to be out-of-scope (e.g. 3.5), and some are stated to be 
> in-scope (e.g. 3.3). Some scenarios however (e.g. 3.2) are not 
> specified to be either in- or out-of-scope.

We consider Multicast in scope, but of less importance than the mixer-based scenarios.

[RAY: Ok. Maybe you could add the reasoning to the document?]

> -          Section 3: This section talks quite a bit about the relation
> between Simulcast and Scalable Coding and in which situations one of 
> the two is more suitable. I'm not sure why this is relevant. Simulcast 
> and Scalable Coding might be two techniques that in some cases can be 
> used to solve the same problem, the layer on which they do this is 
> completely different (codec-level versus transport level). The fact 
> that Scalable Coding might, in some situations, be a better solution 
> than Simulcast does not mean that Simulcast should not be standardized 
> to handle those situations. I understand that you've written this 
> mainly from a video conferencing perspective, in which Scalable Coding 
> might be used more often, but in a lot of situations, such as IPTV, 
> Scalable Coding is just not an option due to its inherent complexity. 
> In these situations Simulcast might be able to solve some real issues, 
> despite the fact that Scalable Coding might be able to solve these issues more efficiently.

I think layered encoding is both a codec level and a transport level thing. Just as in simulcast you need transport level methods for selecting the media streams/layers that you want to receive.

I fully agree that there are use cases where one might be considered much more appropriate than the other. That is part of our argument for why simulcast should also be defined for usage, even though we already have layered coding specified.

[RAY: I don't agree with you that layered coding is a transport-level thing (or maybe I don't understand your definition of layered coding; by layered coding I mean e.g. SVC). It might be that layered coding requires some specific transport-level properties (such as alignment), but these are not unique to layered coding. Your draft (or specifically section 3) seems to imply that Simulcast can be considered an alternative to Scalable Coding and that you therefore have to argue why Simulcast is better in some situations. In my opinion, Scalable Coding is an alternative to creating multiple bitrates/resolutions (i.e. Adaptive Streaming) and not to Simulcast. My understanding is that Simulcast is a method to deliver those multiple bitrates, the same way it could be used to deliver Scalable Coding streams.]

> -          Section 3.1.1 See comment above. Why is this section relevant
> to the rest of the document?

See above. But will attempt to clarify this.


> -          Section 3.2.1: This scenario is the only one of the scenarios
> in section 3 to make a clear pro/con of SSRC versus Session based 
> multiplexing. Furthermore, it does so before the actual discussion of 
> SSRC vs. Session-based multiplexing is introduced in section 4.

Ok, for consistency we should probably move the pros and cons to the analysis section.


> -          Section 3.3: For the tiled streaming use case described
> below, this scenario is especially relevant. Is there any technical 
> reason why it has less emphasis than the RTP-Mixer scenario?

If my understanding of tiled is correct, I don't see how this use case maps onto tiling. Please elaborate.

[RAY: see my comment below]

> -          Section 5.5: Your conclusion here is that Session Based
> Multiplexing is the best choice, since SSRC multiplexing seems to 
> require a large amount of extensions in order to work. How does this 
> conclusion relate to the draft-lennox-rtcweb-rtp-media-type-mux-00 
> which seems to suggest that SSRC multiplexing is not that difficult? 
> (I should note that I have no experience with sending multiple streams 
> inside a single RTP session, so it could be that I don't understand 
> the issues
> correctly)

If you review RTP Multiplexing architecture you are likely going to realize that there are a number of cases where I consider SSRC multiplexed media streams to be the best and most appropriate choice. However, it is important that one considers the use case. So this document argues that for Simulcast the best choice is to have the simulcast versions in different RTP sessions. However, if one has multiple media sources one should in fact use multiple SSRCs in each of these sessions to carry the different media sources.

And in sessions where all media streams that are going to be sent are of the same type and same configuration, using SSRC multiplexing becomes easy.
The added twist in Lennox's draft is the multiple media types, and that does bring a few issues, but not that many.
has discussion on that proposal.

The main point of the multiplexing architecture document is to make clear that one size fits all does not apply to using multiple SSRCs vs using several RTP sessions.

[RAY: Thanks for clearing that up]

> -          Section 6.2: How do you suggest the sequence numbers and RTP
> timestamps are handled between multiple alternative streams? Would it 
> make things easier if these were aligned across multiple streams?

No, I don't believe in alignment. First of all, the sequence numbers need to progress with the packets actually sent for a media stream.
Encoding alternatives are quite likely to have different packet rates, especially for video, as soon as one reaches a media bit-rate where some alternative requires multiple packets per video frame.
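
[Editorial illustration, not part of the original message: a rough sketch of why sequence numbers of simulcast alternatives drift apart. The bit-rates, the 25 fps frame rate and the 1200-byte payload size are assumptions chosen only for the arithmetic, and the helper name is hypothetical.]

```python
import math


def packets_per_frame(bitrate_bps, frame_rate, payload_bytes=1200):
    """Average number of RTP packets needed to carry one video frame.

    Assumes a constant bit-rate and equally sized frames, which real
    encoders do not produce; this only illustrates the order of magnitude.
    """
    frame_bytes = bitrate_bps / 8 / frame_rate
    return max(1, math.ceil(frame_bytes / payload_bytes))


# Two hypothetical encoding alternatives of the same source.
low = packets_per_frame(200_000, 25)     # 1000 bytes per frame
high = packets_per_frame(2_000_000, 25)  # 10000 bytes per frame

# Sequence number advance after one second (25 frames), modulo 2**16.
low_seq_advance = (low * 25) % 65536
high_seq_advance = (high * 25) % 65536
print(low, high, low_seq_advance, high_seq_advance)
```

With these assumed numbers the low-rate alternative fits each frame in one packet while the high-rate one needs nine, so after one second the two sequence number spaces have already diverged by 200.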

When it comes to timestamps I don't see any need for alignment either. For the best robustness and simplicity one should simply skip alignment and instead use the existing mechanisms for synchronizing media clocks with each other, i.e. RTCP SR and Rapid Synchronization of RTP flows (RFC 6051).
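
[Editorial illustration, not part of the original message: a minimal sketch of the RTCP SR based mapping referred to above. Each sender report pairs an RTP timestamp with a wallclock (NTP) time, so streams with unrelated RTP timestamp offsets can still be placed on a common timeline. The function name, the 90 kHz clock rate and all sample values are assumptions.]

```python
def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds, clock_rate=90_000):
    """Map an RTP timestamp to wallclock seconds using one RTCP SR pair."""
    # Signed 32-bit difference so that RTP timestamp wraparound is handled.
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr_ntp_seconds + diff / clock_rate


# Two alternatives of one source, each with its own random RTP offset.
# Both sender reports were (hypothetically) generated at wallclock 100.0 s.
sr_a = {"rtp": 1_000, "ntp": 100.0}
sr_b = {"rtp": 4_000_000_000, "ntp": 100.0}

# A frame captured 0.5 s later carries different RTP timestamps per stream.
ts_a = sr_a["rtp"] + 45_000
ts_b = sr_b["rtp"] + 45_000

wall_a = rtp_to_wallclock(ts_a, sr_a["rtp"], sr_a["ntp"])
wall_b = rtp_to_wallclock(ts_b, sr_b["rtp"], sr_b["ntp"])
print(wall_a, wall_b)
```

The accuracy of this mapping is bounded by the accuracy of the wallclock values in the sender reports, which is the concern raised in the reply that follows.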

[RAY: For the sequence numbers I agree, but about the timestamps I'm not so sure. In some cases it might be particularly useful to align timestamps. One example is areas where you need bit-level alignment (such as SVC and spatially segmented streams). For these situations NTP-based solutions such as RTCP SR are not accurate enough.]

> And one more remark:
> -          In the introduction section you describe three different ways
> in which the encoding of a media content can differ: bit-rate, codec 
> and sampling. Recent work in the area of immersive media has proposed 
> a fourth method: tiled video. With tiled video (or spatial 
> segmentation), the video resulting from a single camera is split into 
> a number of areas, each focusing on a particular spatial area of the 
> video (e.g. a single video source could be tiled into four separate 
> video streams, one describing the topleft quarter of the video and 
> three more describing the topright, bottomleft and bottomright quarters respectively).

If I understand this correctly, what you are really talking about is breaking up the output of one camera, or of several aligned cameras that can be considered to produce a single video image, into several separate streams. On the receiver side there is then commonly one device per media stream. So a practical example would be to split the output of a 4k camera into four 1080p media streams and on the receiver side use four 1080p projectors that are aligned so that the right edges match.

If I am correct in my understanding, that isn't simulcast. There are not two or more different alternatives of the media source. From my perspective it looks like you're splitting your single media source into multiple synchronized sources. However, it is clear that if you are doing tiled video then you would like to consider the transport options for it, because it depends on whether the receiver is actually a single end-point with 4 displays connected to it, or whether one actually has four different end-points and would like each stream to be delivered only to its own end-point, not to all of the others.

[RAY: Sorry for not being clearer. We have a different understanding of tiled streaming (or spatial segmentation, which is probably more descriptive). The use case is not having four network-connected projectors. The use case is having a mobile device with a relatively low resolution (e.g. 480x320 pixels) being able to navigate through high resolution (4k/8k) content. By splitting the output of a single 4k camera into 'tiles' of, for example, 240x160 pixels, and storing each of these as a separate stream, a client can request multiple streams simultaneously and reconstruct part of the original video (in this case it would request 4 'tiles'). When a user wants to navigate through the content (e.g. pan/zoom), the client requests a different subset of tiles.
As you can see, what we have here is a large number of tiles (streams), coming from the same source, with different clients requesting different subsets of these tiles. In that aspect, it is not that dissimilar from some of the use cases you describe in your draft.] 
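
[Editorial illustration, not part of the original message: a small sketch of the tile selection described above. It assumes a 3840x2160 frame for "4k", tiles addressed as (row, column), and a viewport given by its top-left pixel; none of these details are specified in the thread, and the function name is hypothetical.]

```python
def tiles_for_viewport(x, y, view_w=480, view_h=320,
                       tile_w=240, tile_h=160,
                       frame_w=3840, frame_h=2160):
    """Return the (row, col) indices of all tiles overlapping the viewport."""
    # Clamp the viewport to the frame so indices stay inside the tile grid.
    x0 = max(0, min(x, frame_w - 1))
    y0 = max(0, min(y, frame_h - 1))
    x1 = min(x + view_w, frame_w) - 1
    y1 = min(y + view_h, frame_h) - 1
    rows = range(y0 // tile_h, y1 // tile_h + 1)
    cols = range(x0 // tile_w, x1 // tile_w + 1)
    return {(r, c) for r in rows for c in cols}


aligned = tiles_for_viewport(0, 0)        # viewport aligned to the tile grid
unaligned = tiles_for_viewport(100, 50)   # viewport straddling tile borders
panned = tiles_for_viewport(240, 0)       # user pans one tile to the right

# Streams to start and stop receiving when panning from `aligned` to `panned`.
to_join = panned - aligned
to_leave = aligned - panned
print(len(aligned), len(unaligned), sorted(to_join), sorted(to_leave))
```

Under these assumptions a grid-aligned 480x320 viewport needs the four tiles mentioned above, while a viewport that straddles tile borders can need up to nine, and each pan changes only part of the requested set.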


Magnus Westerlund

Multimedia Technologies, Ericsson Research EAB/TVM
Ericsson AB                | Phone  +46 10 7148287
Färögatan 6                | Mobile +46 73 0949079
SE-164 80 Stockholm, Sweden| mailto:
