Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt
worley@ariadne.com (Dale R. Worley) Wed, 03 April 2019 11:32 UTC
From: worley@ariadne.com
To: Magnus Westerlund <magnus.westerlund@ericsson.com>
Cc: mzanaty@cisco.com, draft-ietf-avtext-framemarking@ietf.org, avt@ietf.org
In-Reply-To: <HE1PR0701MB2522C3B9D045627496CDE91A955A0@HE1PR0701MB2522.eurprd07.prod.outlook.com> (magnus.westerlund@ericsson.com)
Sender: worley@ariadne.com
Date: Wed, 03 Apr 2019 07:32:15 -0400
Message-ID: <877ecbiatc.fsf@hobgoblin.ariadne.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/OoM69ujWui7X1r8nW9NdekTgv9k>
Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

I'm no expert in this field, but as the frame marking extension is intended to be used broadly over many different video encodings, I think it can be usefully critiqued relative to its ambition to be a *generalized* frame marking mechanism. In particular, a number of its features seem to reference ideas which generally apply to multiple encodings. But this leaves a great deal of room for lack of alignment as to the exact semantics of the features, which could easily lead to a lot of subtle interoperation problems. So I am here pushing for a clearer definition of what is and is not meant by the features.

There is some oddity in how the sections are structured. The short form is defined in 3.1, and the long form is defined in 3.2. The four mappings for specific codecs are listed in 3.2.1.1, 3.2.1.2, 3.2.1.3, and 3.2.1.4. It would be better to group the two definitional sections together and group the four example sections together.

Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping) don't specify how the S, E, I, D, and B bits are determined from the codec's output packets.

Regarding the multiple (four, actually) formats of the extension, it helps in specifying them if they can all be mapped into the same semantic data structure. For example,

    TID is the temporal layer index. It is implicitly 0 if the short format is used.

    LID is the (spatial) layer index. It is implicitly 0 if the short format is used or the L=1 form of the long format is used.

    TL0PICIDX:
        When TID is 0: If present, it is a cyclic counter labeling the frames. If not present, the frames have no such labels.
        When TID is not 0, it indicates that this frame in this layer depends on the frame with this label in the layer with TID 0.

Notice that a missing TL0PICIDX has different semantics than a missing TID or LID.

Given the similarity of "temporal layer index" and "layer ID", it seems like you want a more distinctive phrase for the latter.
Could it be changed to "spatial layer" or "resolution layer"?

There seems to be no way to signal whether the short form is used vs. the L=0 version of the long form (if B=0 and TID=0).

-- The ID value for both is signaled in SDP by

       a=extmap:3 urn:ietf:params:rtp-hdrext:framemarking

   This doesn't cause a problem, as the semantics of the two alternatives are the same, but it prevents the 4 reserved bits in the short form from being defined in the future for any purpose other than B and LID.

-- Alternatively, is the short form simply what the extension is reduced to when using non-scalable streams, as those must necessarily have B=0 and LID=0? (Also see my query below regarding the "default" value of B.)

It seems that the intention is that the video stream can be divided into substreams of RTP packets, called "layers", each of which is identified by a particular TID and LID, that is, TID/LID defines a *two-dimensional* hierarchy.

    "They convey a layer hierarchy with [the layer with] TID=0 and LID=0 identifying the base layer."

My guess is that the special case for interpreting the TL0PICIDX value is actually when both TID and LID = 0, that is, the base layer, not just TID = 0 as stated in the text. (If I'm wrong, the structure here is more complicated than I'm describing, with the TL0PICIDX labels of an upper layer referring to the label with the *same* LID but TID = 0.)

The idea seems to be that one can "efficiently" discard layers from the RTP stream, as long as: if one keeps a layer with a particular TID and LID, one keeps all layers with lesser or equal TID and LID. I can't quite see how best to define "efficiently" here, but it seems to be the central reason for labeling the layers -- that a receiver can successfully decode all of the data in all of the layers that remain present.

Things are more interesting in regard to what packets can be discarded from a layer "efficiently".
The S, E, I, D, and B bits seem to be intended to guide a device that needs to discard packets.

The use of D bits is specified:

    When an RTP switch needs to discard a received video frame due to congestion control considerations, it is RECOMMENDED that it preferably drop frames marked with the D (Discardable) bit set [...]

And I suspect that it is implied that if packets are dropped from one frame, further packets from the same frame are preferred to be dropped. The S and E bits are intended to help with this process.

But dropping whole frames to some degree conflicts with the fact that small losses from video layers can often be recovered from, either due to redundancy in the layer, or by loss-reconstruction strategies in the receiver. However, if one drops a *lot* of packets from one frame, one might as well discard the remainder of them.

The I bit suggests that there are provisions for dependency between the frames in a single layer, and that the dependency is not the same for all frames. It appears that if one frame of a layer is dropped, the following frames are preferred to be dropped until a frame with I = 1 is seen.

I am less clear on what the B bit means -- since presumably all layers with lower TID and LID than the layer containing the frame in question are retained, B doesn't seem to carry useful information.

And there seems to be a problem with "defaulting" the value of B to 0 when there is no scalability. As stated:

    o  B: Base Layer Sync (1 bit) - MUST be 1 if the sender knows this frame only depends on the base temporal layer; otherwise MUST be 0.

This can be stated equivalently: MUST be 1 if the sender knows this frame does not depend on any frames that do not have TID=0. Now if the frame itself has TID=0, then it cannot (by the ordering of the layers) depend on any frame that does not also have TID=0. The consequence is that the "natural" value of B in TID=0 layers is 1. And when there is no scalability, the only layer has TID=0.
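
To make the points above concrete, here is a rough Python sketch of the "same semantic data structure" idea, the layer-discard rule, and the B-bit observation. It models semantics only, not the wire format, and every name in it (FrameMark, keep_packet, b_forced_by_layer_ordering) is hypothetical, invented for illustration; none of it comes from the draft. The defaults follow my reading above (missing TID/LID mean 0, missing TL0PICIDX means "no label", B defaults to 0) and may not match what the authors intend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMark:
    """Hypothetical format-independent view of one frame marking extension."""
    # S, E, I, D are assumed to be carried by every format.
    s: bool
    e: bool
    i: bool
    d: bool
    # Fields a shorter format omits take these implicit values (assumption).
    b: bool = False                   # the "default" B=0 questioned above
    tid: int = 0                      # temporal layer index
    lid: int = 0                      # (spatial) layer index
    tl0picidx: Optional[int] = None   # None means the frame has no label


def keep_packet(mark: FrameMark, max_tid: int, max_lid: int) -> bool:
    """Layer-discard rule: keeping layer (max_tid, max_lid) means keeping
    every layer whose TID and LID are both less than or equal to it."""
    return mark.tid <= max_tid and mark.lid <= max_lid


def b_forced_by_layer_ordering(mark: FrameMark) -> Optional[bool]:
    """Value of B implied by the restated definition, when it is implied.

    A TID=0 frame cannot depend on any frame with TID != 0, so its
    "natural" B is 1. For TID > 0 only the sender knows, so return None.
    """
    if mark.tid == 0:
        return True
    return None


# A non-scalable frame in the short format: all optional fields defaulted.
short = FrameMark(s=True, e=True, i=False, d=False)
upper = FrameMark(s=True, e=False, i=False, d=True, tid=2, lid=1)
```

With these definitions, `keep_packet(upper, 1, 1)` is False while `keep_packet(short, 0, 0)` is True, and `b_forced_by_layer_ordering(short)` is True even though `short.b` is False, which is exactly the mismatch in the default value of B described above.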

I think what is going on is that there's an implicit structure of dependencies between the frames of a layer (a frame depends on earlier frames), and between the frames of different layers (a frame can depend on frames with lower TID/LID and no later in time), and the various bits are used to signal the *lack* of certain possible dependencies, but how the bits do this needs to be clarified. (The meaning of TL0PICIDX particularly needs to be specified.)

But the implicit dependency structure isn't spelled out. That makes things harder in two ways: (1) it is not clear what sorts of dependencies future codecs are *not* allowed to introduce, and (2) it is difficult to state exactly what dependencies are *removed* by particular signaling.

3.2.1. Layer ID Mappings for Scalable Streams

All of the descriptions for specific codecs contain "ID=2", whereas the generic descriptions of the extension formats show "ID=?". The latter is correct, since the ID value is negotiated for every RTP stream.

3.4. Usage Considerations

The switching of video streams is recommended to be done this way:

    When an RTP switch wants to forward a new video stream to a receiver, it is RECOMMENDED to select the new video stream from the first switching point with the I (Independent) bit set in all spatial layers and forward the same. An RTP switch can request a media source to generate a switching point by sending Full Intra Request (RTCP FIR) as defined in [RFC5104], for example.

This is difficult to implement in general, as it requires the switch to keep track of all the layer IDs that have been seen, then look ahead in the stream to see if, over a narrow range of time, all of the layers that have been seen have packets with I set.

If the fundamental purpose of I is to signal the best points to switch streams, it would be better to define its semantics to be that.
E.g., "If a switch intends to start forwarding a video stream, and within that stream, transmitting all frames with TID and LID less than or equal to certain values, it should start forwarding the stream beginning with a packet within that layer that has I set." That is, I signals that at this point, the coming frames of this layer and all layers with lesser TID/LID can be decoded without dependency on any previous frames.

3.4.2. Scalability Structures

It would be more effective to state that for "complex or irregular scalability structures", subdivision by TID and LID is not effective and so such structures should mark all packets with TID=0 and LID=0. The current text suggests that the switch is required to know whether such a structure is in use, and if so, ignore the TID and LID fields, which suggests that the sender can put various values in those fields. This would lead to requiring the switch to know what encoding is in use, and avoiding that is the point of this document.

Dale