Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

Authors — even though we’re past the nominal WGLC, please respond to this as a last call comment.

I’ll hold off on writing up the publication request until this is resolved.

> From: worley@ariadne.com (Dale R. Worley)
> Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt
> Date: April 3, 2019 at 7:32:15 AM EDT
> To: Magnus Westerlund <magnus.westerlund@ericsson.com>
> Cc: draft-ietf-avtext-framemarking@ietf.org, avt@ietf.org
> 
> 
> I'm no expert on this field, but as the frame marking extension is
> intended to be used broadly over many different video encodings, I think
> it can be usefully critiqued relative to its ambition to be a
> *generalized* frame marking mechanism.  In particular, a number of its
> features seem to reference ideas which generally apply to multiple
> encodings.  But this leaves a great deal of room for lack of alignment
> as to the exact semantics of the features, which could easily lead to a
> lot of subtle interoperation problems.  So I am here pushing for a
> clearer definition of what is and is not meant by the features.
> 
> There is some oddity in how the sections are structured.  The short form
> is defined in 3.1, and the long form is defined in 3.2.  The four
> mappings for specific codecs are listed in 3.2.1.1, 3.2.1.2, 3.2.1.3,
> and 3.2.1.4.  It would be better to group the two definitional sections
> together and group the four example sections together.
> 
> Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping)
> don't specify how the S, E, I, D, and B bits are determined from the
> codec's output packets.
> 
> Regarding the multiple (four, actually) formats of the extension,
> specifying them is easier if they can all be mapped into the same
> semantic data structure.  For example,
> 
>    TID is the temporal layer index.  It is implicitly 0 if the short
>    format is used.
> 
>    LID is the (spatial) layer index.  It is implicitly 0 if the short
>    format is used or the L=1 form of the long format is used.
> 
>    TL0PICIDX:  When TID is 0:  If present, it is a cyclic counter
>    labeling the frames.  If not present, the frames have no such labels. 
>    When TID is not 0, it indicates that this frame in this layer
>    depends on the frame with this label in the layer with TID 0.
> 
> Notice that a missing TL0PICIDX has different semantics than a missing
> TID or LID.
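
A minimal Python sketch of the common data structure suggested above, to make the different defaults concrete. The names FrameMarking, from_short, and from_long are hypothetical and do not come from the draft; the sketch models only the semantics described in this message, not the wire format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMarking:
    """Semantic view shared by all formats (hypothetical structure)."""
    tid: int = 0                      # temporal layer index, implicitly 0
    lid: int = 0                      # spatial layer index, implicitly 0
    tl0picidx: Optional[int] = None   # None means "frames carry no label"


def from_short() -> FrameMarking:
    # The short format carries neither TID, LID, nor TL0PICIDX.
    return FrameMarking()


def from_long(tid: int,
              lid: Optional[int] = None,
              tl0picidx: Optional[int] = None) -> FrameMarking:
    # A field absent from the long format is passed as None.
    # A missing LID defaults to 0; a missing TL0PICIDX stays "no label".
    return FrameMarking(tid=tid,
                        lid=0 if lid is None else lid,
                        tl0picidx=tl0picidx)
```

The asymmetry is visible in the types: tid and lid always have a value, while tl0picidx is optional.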
> 
> Given the similarity of "temporal layer index" and "layer ID", it seems
> like you want a more distinctive phrase for the latter.  Could it be
> changed to "spatial layer" or "resolution layer"?
> 
> There seems to be no way to signal whether the short form is used
> vs. the L=0 version of the long form (if B=0 and TID=0) -- The ID value
> for both is signaled in SDP by
> 
>      a=extmap:3 urn:ietf:params:rtp-hdrext:framemarking
> 
> This doesn't cause a problem, as the semantics of the two alternatives
> are the same, but it prevents the 4 reserved bits in the short form from
> being defined in the future for any purpose other than B and LID. --
> Alternatively, is the short form simply what the extension is reduced to
> when using non-scalable streams, as those must necessarily have B=0 and
> LID=0?  (Also see my query below regarding the "default" value of B.)
> 
> It seems that the intention is that the video stream can be divided into
> substreams of RTP packets, called "layers", each of which is identified
> by a particular TID and LID, that is, TID/LID defines a
> *two-dimensional* hierarchy.  "They convey a layer hierarchy with [the
> layer with] TID=0 and LID=0 identifying the base layer."
> 
> My guess is that the special case for interpreting the TL0PICIDX value
> is actually when both TID and LID = 0, that is, the base layer, not just
> TID = 0 as stated in the text.  (If I'm wrong, the structure here is
> more complicated than I'm describing, with the TL0PICIDX labels of an
> upper layer referring to the label with the *same* LID but TID = 0.)
> 
> The idea seems to be that one can "efficiently" discard layers from the
> RTP stream, as long as: if one keeps a layer with a particular TID and
> LID, one keeps all layers with lesser or equal TID and LID.  I can't
> quite see how best to define "efficiently" here, but it seems to be the
> central reason for labeling the layers -- that a receiver can
> successfully decode all of the data in all of the layers that remain
> present.
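
The discard rule described above can be sketched as a simple threshold filter. This is an illustration of my reading of the rule, not text from the draft; the function names and the (tid, lid, payload) tuple layout are assumptions.

```python
def keep_packet(tid, lid, max_tid, max_lid):
    """Keep a packet only if its layer is at or below both thresholds.

    Choosing thresholds this way guarantees that whenever a layer is
    kept, every layer with lesser or equal TID and LID is kept too.
    """
    return tid <= max_tid and lid <= max_lid


def thin_stream(packets, max_tid, max_lid):
    # packets: iterable of (tid, lid, payload) tuples (hypothetical layout)
    return [p for p in packets if keep_packet(p[0], p[1], max_tid, max_lid)]
```

For example, thin_stream(packets, max_tid=1, max_lid=0) forwards only the two lowest temporal layers of the lowest spatial layer.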
> 
> Things are more interesting in regard to what packets can be discarded
> from a layer "efficiently".  The S, E, I, D, and B bits seem to be
> intended to guide a device that needs to discard packets.  The use of D
> bits is specified:
> 
>   When an RTP switch needs to discard a received video frame due to
>   congestion control considerations, it is RECOMMENDED that it
>   preferably drop frames marked with the D (Discardable) bit set [...]
> 
> And I suspect it is implied that if some packets of a frame are
> dropped, it is preferable to drop the remaining packets of that frame
> as well.  The S and E bits are intended to help with this process.
> 
> But dropping whole frames to some degree conflicts with the fact that
> small losses from video layers can often be recovered from, either due
> to redundancy in the layer, or by loss-reconstruction strategies in the
> receiver.  However, if one drops a *lot* of packets from one frame, one
> might as well discard the remainder of them.
> 
> The I bit suggests that there are provisions for dependency between the
> frames in a single layer, and that dependency between frames is not the
> same for all frames.  It appears that if one frame of a layer is
> dropped, it is preferable to drop the following frames as well until a
> frame with I = 1 is seen.
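
A small Python sketch of the behavior inferred above for a single layer: once a frame is lost, later frames are dropped until one with I = 1 arrives. This encodes the reviewer's interpretation, not a rule stated in the draft; the function name and the (frame_id, i_bit) representation are hypothetical.

```python
def frames_to_forward(frames, lost):
    """Return the ids of frames worth forwarding within one layer.

    frames: list of (frame_id, i_bit) in decode order
    lost:   set of frame ids that were already dropped or lost
    """
    forwarded = []
    chain_broken = False
    for frame_id, i_bit in frames:
        if i_bit:
            # An independent frame restarts the dependency chain.
            chain_broken = False
        if frame_id in lost:
            chain_broken = True
            continue
        if not chain_broken:
            forwarded.append(frame_id)
    return forwarded
```

With frames 1 (I=1), 2, 3, 4 (I=1), 5 and frame 2 lost, this forwards 1, 4, and 5.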
> 
> I am less clear on what the B bit means.  If, as I presume, all layers
> with lower TID and LID than the layer containing the frame in question
> are retained, B doesn't seem to carry useful information.
> 
> And there seems to be a problem with "defaulting" the value of B to 0
> when there is no scalability.
> 
> As stated:
> 
>   o  B: Base Layer Sync (1 bit) - MUST be 1 if the sender knows this
>      frame only depends on the base temporal layer; otherwise MUST be 0.
> 
> This can be stated equivalently:
> 
>      MUST be 1 if the sender knows this frame does not depend on any
>      frames that do not have TID=0.
> 
> Now if the frame itself has TID=0, then it cannot (by the ordering of
> the layers) depend on any frame that does not also have TID=0.  The
> consequence is that the "natural" value of B in TID=0 layers is 1.  And
> when there is no scalability, the only layer has TID=0.
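
The argument above can be checked with a few lines of Python. The function natural_b is a hypothetical helper that computes the restated definition of B from the TIDs of the frames a given frame depends on; it assumes, as the message does, that a frame never depends on a higher temporal layer.

```python
def natural_b(frame_tid, dependency_tids):
    """B per the restated definition: 1 if no dependency has TID != 0."""
    if any(t > frame_tid for t in dependency_tids):
        # Assumed layer ordering: no dependency on a higher temporal layer.
        raise ValueError("frame depends on a higher temporal layer")
    return all(t == 0 for t in dependency_tids)
```

For any frame with frame_tid == 0, every permitted dependency has TID 0, so natural_b returns True regardless of how many dependencies there are.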
> 
> I think what is going on is that there's an implicit structure of
> dependencies between the frames of a layer (a frame depends on earlier
> frames), and between the frames of different layers (a frame can depend
> on frames with lower TID/LID and no later in time), and the various bits
> are used to signal the *lack* of certain possible dependencies, but how
> the bits do this needs to be clarified.  (The meaning of TL0PICIDX
> particularly needs to be specified.)  But the implicit dependency
> structure isn't spelled out.  That makes things harder in two ways:  (1)
> it is not clear what sorts of dependencies future codecs are *not*
> allowed to introduce, and (2) it is difficult to state exactly what
> dependencies are *removed* by particular signaling.
> 
> 3.2.1.  Layer ID Mappings for Scalable Streams
> 
> All of the descriptions for specific codecs contain "ID=2", whereas the
> generic descriptions of the extension formats show "ID=?".  The latter
> is correct, since the ID value is negotiated for every RTP stream.
> 
> 3.4.  Usage Considerations
> 
> The switching of video streams is recommended to be done this way:
> 
>   When an RTP switch wants to forward a new video stream to a receiver,
>   it is RECOMMENDED to select the new video stream from the first
>   switching point with the I (Independent) bit set in all spatial
>   layers and forward the same.  An RTP switch can request a media
>   source to generate a switching point by sending Full Intra Request
>   (RTCP FIR) as defined in [RFC5104], for example.
> 
> This is difficult to implement in general, as it requires the switch to
> keep track of all the layer IDs that have been seen, then look ahead in
> the stream to see if, over a narrow range of time, all of the layers
> that have been seen have packets with I set.  If the fundamental purpose
> of I is to signal the best points to switch streams, it would be better
> to define its semantics to be that.  E.g., "If a switch intends to start
> forwarding a video stream, and within that stream, transmitting all
> frames with TID and LID less than or equal to certain values, it should
> start forwarding the stream beginning with a packet within that layer
> that has I set."  That is, I signals that at this point, the coming
> frames of this layer and all layers with lesser TID/LID can be decoded
> without dependency on any previous frames.
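
Under the semantics proposed above, a switch would need no lookahead across layers. The sketch below shows one possible reading: scan for the first packet in the selected layer set with I set, and forward from there. The function name and the (tid, lid, i_bit) tuple layout are hypothetical, and this is not the behavior the draft currently specifies.

```python
def first_switch_index(packets, max_tid, max_lid):
    """Index of the first packet at which forwarding may begin, or None.

    packets: list of (tid, lid, i_bit) in arrival order
    """
    for index, (tid, lid, i_bit) in enumerate(packets):
        if i_bit and tid <= max_tid and lid <= max_lid:
            return index
    return None
```

The switch keeps no per-layer state; it only compares each packet against the two thresholds it already uses for layer selection.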
> 
> 3.4.2.  Scalability Structures
> 
> It would be more effective to state that for "complex or irregular
> scalability structures", subdivision by TID and LID is not effective and
> so such structures should mark all packets with TID=0 and LID=0.  The
> current text suggests that the switch is required to know whether such a
> structure is in use, and if so, ignore the TID and LID fields, which
> suggests that the sender can put various values in those fields.  This
> would lead to requiring the switch to know what encoding is in use, and
> avoiding that is the point of this document.
> 
> Dale
> 