Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

worley@ariadne.com (Dale R. Worley) Wed, 03 April 2019 11:32 UTC

From: worley@ariadne.com
To: Magnus Westerlund <magnus.westerlund@ericsson.com>
Cc: mzanaty@cisco.com, draft-ietf-avtext-framemarking@ietf.org, avt@ietf.org
In-Reply-To: <HE1PR0701MB2522C3B9D045627496CDE91A955A0@HE1PR0701MB2522.eurprd07.prod.outlook.com> (magnus.westerlund@ericsson.com)
Sender: worley@ariadne.com
Date: Wed, 03 Apr 2019 07:32:15 -0400
Message-ID: <877ecbiatc.fsf@hobgoblin.ariadne.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/OoM69ujWui7X1r8nW9NdekTgv9k>
Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

I'm no expert on this field, but as the frame marking extension is
intended to be used broadly over many different video encodings, I think
it can be usefully critiqued relative to its ambition to be a
*generalized* frame marking mechanism.  In particular, a number of its
features seem to reference ideas which generally apply to multiple
encodings.  But this leaves a great deal of room for lack of alignment
as to the exact semantics of the features, which could easily lead to a
lot of subtle interoperation problems.  So I am here pushing for a
clearer definition of what is and is not meant by the features.

There is some oddity in how the sections are structured.  The short form
is defined in 3.1, and the long form is defined in 3.2.  The four
mappings for specific codecs are listed in 3.2.1.1, 3.2.1.2, 3.2.1.3,
and 3.2.1.4.  It would be better to group the two definitional sections
together and group the four example sections together.

Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping)
don't specify how the S, E, I, D, and B bits are determined from the
codec's output packets.

Regarding the multiple (four, actually) formats of the extension, it
helps in specifying them if they can all be mapped into the same
semantic data structure.  For example,

    TID is the temporal layer index.  It is implicitly 0 if the short
    format is used.

    LID is the (spatial) layer index.  It is implicitly 0 if the short
    format is used or the L=1 form of the long format is used.

    TL0PICIDX:  When TID is 0:  If present, it is a cyclic counter
    labeling the frames.  If not present, the frames have no such labels. 
    When TID is not 0, it indicates that this frame in this layer
    depends on the frame with this label in the layer with TID 0.

Notice that a missing TL0PICIDX has different semantics than a missing
TID or LID.
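
As a rough sketch of what such a common structure might look like
(Python; the class, field, and function names are my own illustration
and do not come from the draft, and the B=0 default for the short form
is only what the draft appears to imply):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMarking:
    """Hypothetical common structure for every format of the extension."""
    start: bool                      # S bit
    end: bool                        # E bit
    independent: bool                # I bit
    discardable: bool                # D bit
    base_layer_sync: bool = False    # B bit (assumed 0 when absent)
    tid: int = 0                     # implicitly 0 when absent
    lid: int = 0                     # implicitly 0 when absent
    tl0picidx: Optional[int] = None  # None means "no label", not 0


def from_short(s: bool, e: bool, i: bool, d: bool) -> FrameMarking:
    # Short format: only S, E, I, D are carried; the rest take defaults.
    return FrameMarking(s, e, i, d)


def from_long(s: bool, e: bool, i: bool, d: bool, b: bool, tid: int,
              lid: int = 0, tl0picidx: Optional[int] = None) -> FrameMarking:
    # Long format: fields a given variant omits keep their implicit values.
    return FrameMarking(s, e, i, d, b, tid, lid, tl0picidx)
```

The point of the sketch is that an absent TID or LID collapses to 0,
while an absent TL0PICIDX stays distinguishable from every real value.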

Given the similarity of "temporal layer index" and "layer ID", it seems
like you want a more distinctive phrase for the latter.  Could it be
changed to "spatial layer" or "resolution layer"?

There seems to be no way to signal whether the short form is used
vs. the L=0 version of the long form (if B=0 and TID=0) -- The ID value
for both is signaled in SDP by

      a=extmap:3 urn:ietf:params:rtp-hdrext:framemarking

This doesn't cause a problem, as the semantics of the two alternatives
are the same, but it prevents the 4 reserved bits in the short form from
being defined in the future for any purpose other than B and LID. --
Alternatively, is the short form simply what the extension is reduced to
when using non-scalable streams, as those must necessarily have B=0 and
LID=0?  (Also see my query below regarding the "default" value of B.)

It seems that the intention is that the video stream can be divided into
substreams of RTP packets, called "layers", each of which is identified
by a particular TID and LID, that is, TID/LID defines a
*two-dimensional* hierarchy.  "They convey a layer hierarchy with [the
layer with] TID=0 and LID=0 identifying the base layer."

My guess is that the special case for interpreting the TL0PICIDX value
is actually when both TID and LID = 0, that is, the base layer, not just
TID = 0 as stated in the text.  (If I'm wrong, the structure here is
more complicated than I'm describing, with the TL0PICIDX labels of an
upper layer referring to the label with the *same* LID but TID = 0.)

The idea seems to be that one can "efficiently" discard layers from the
RTP stream, as long as: if one keeps a layer with a particular TID and
LID, one keeps all layers with lesser or equal TID and LID.  I can't
quite see how best to define "efficiently" here, but it seems to be the
central reason for labeling the layers -- that a receiver can
successfully decode all of the data in all of the layers that remain
present.
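
If I have understood the rule correctly, the layer selection amounts to
something like the following sketch (Python; the function names and the
tuple layout of a packet are hypothetical, chosen only for
illustration):

```python
from typing import Iterable, List, Tuple

# A packet is modelled here as (tid, lid, payload); this layout is an
# assumption made for the example, not anything the draft defines.
Packet = Tuple[int, int, bytes]


def keep_packet(tid: int, lid: int, max_tid: int, max_lid: int) -> bool:
    # Keeping layer (max_tid, max_lid) implies keeping every layer whose
    # TID and LID are both less than or equal to those values.
    return tid <= max_tid and lid <= max_lid


def filter_layers(packets: Iterable[Packet],
                  max_tid: int, max_lid: int) -> List[Packet]:
    # The decision needs only the two indices, never the codec payload.
    return [p for p in packets if keep_packet(p[0], p[1], max_tid, max_lid)]
```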

Things are more interesting in regard to what packets can be discarded
from a layer "efficiently".  The S, E, I, D, and B bits seem to be
intended to guide a device that needs to discard packets.  The use of D
bits is specified:

   When an RTP switch needs to discard a received video frame due to
   congestion control considerations, it is RECOMMENDED that it
   preferably drop frames marked with the D (Discardable) bit set [...]

And I suspect that it is implied that if packets are dropped from one
frame, further packets from the same frame are preferred to be dropped.
The S and E bits are intended to help with this process.

But dropping whole frames to some degree conflicts with the fact that
small losses from video layers can often be recovered from, either due
to redundancy in the layer, or by loss-reconstruction strategies in the
receiver.  However, if one drops a *lot* of packets from one frame, one
might as well discard the remainder of them.

The I bit suggests that there are provisions for dependency between the
frames in a single layer, and dependency between frames is not the same
for all frames.  It appears that if one frame of a layer is dropped, the
following frames are preferred to be dropped until a frame with I = 1 is
seen.
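
My reading of that behavior, as a sketch (Python; the function name and
the representation of a frame as a (name, I bit) pair are assumptions
for illustration, and this is my interpretation rather than anything
the draft states):

```python
from typing import List, Tuple


def forward_after_loss(frames: List[Tuple[str, bool]],
                       lost_index: int) -> List[str]:
    """Return the names of the frames of one layer that are forwarded.

    frames     -- (name, i_bit) pairs in transmission order
    lost_index -- index of the frame that was dropped
    """
    forwarded = []
    dropping = False
    for idx, (name, i_bit) in enumerate(frames):
        if idx == lost_index:
            dropping = True      # the dropped frame starts the gap
            continue
        if dropping and i_bit:
            dropping = False     # a frame with I = 1 ends the gap
        if not dropping:
            forwarded.append(name)
    return forwarded
```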

I am less clear on what the B bit means.  Since presumably all layers
with lower TID and LID than the layer containing the frame in question
are retained, B doesn't seem to carry useful information.

And there seems to be a problem with "defaulting" the value of B to 0
when there is no scalability.

As stated:

   o  B: Base Layer Sync (1 bit) - MUST be 1 if the sender knows this
      frame only depends on the base temporal layer; otherwise MUST be 0.

This can be stated equivalently:

      MUST be 1 if the sender knows this frame does not depend on any
      frames that do not have TID=0.

Now if the frame itself has TID=0, then it cannot (by the ordering of
the layers) depend on any frame that does not also have TID=0.  The
consequence is that the "natural" value of B in TID=0 layers is 1.  And
when there is no scalability, the only layer has TID=0.
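
The restated rule can be written as a small predicate (Python; the
function name is hypothetical, and the sketch assumes the sender
actually knows the TIDs of all frames the current frame depends on):

```python
from typing import Iterable


def natural_b(dependency_tids: Iterable[int]) -> bool:
    # B is 1 exactly when no dependency lies outside temporal layer 0.
    return all(tid == 0 for tid in dependency_tids)
```

A frame with TID=0 can only have dependencies with TID=0, so the
predicate is true for it, and likewise for a frame with no dependencies
at all.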

I think what is going on is that there's an implicit structure of
dependencies between the frames of a layer (a frame depends on earlier
frames), and between the frames of different layers (a frame can depend
on frames with lower TID/LID and no later in time), and the various bits
are used to signal the *lack* of certain possible dependencies, but how
the bits do this needs to be clarified.  (The meaning of TL0PICIDX
particularly needs to be specified.)  But the implicit dependency
structure isn't spelled out.  That makes things harder in two ways:  (1)
it is not clear what sorts of dependencies future codecs are *not*
allowed to introduce, and (2) it is difficult to state exactly what
dependencies are *removed* by particular signaling.

3.2.1.  Layer ID Mappings for Scalable Streams

All of the descriptions for specific codecs contain "ID=2", whereas the
generic descriptions of the extension formats show "ID=?".  The latter
is correct, since the ID value is negotiated for every RTP stream.

3.4.  Usage Considerations

The switching of video streams is recommended to be done this way:

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in all spatial
   layers and forward the same.  An RTP switch can request a media
   source to generate a switching point by sending Full Intra Request
   (RTCP FIR) as defined in [RFC5104], for example.

This is difficult to implement in general, as it requires the switch to
keep track of all the layer IDs that have been seen, then look ahead in
the stream to see if, over a narrow range of time, all of the layers
that have been seen have packets with I set.  If the fundamental purpose
of I is to signal the best points to switch streams, it would be better
to define its semantics to be that.  E.g., "If a switch intends to start
forwarding a video stream, and within that stream, transmitting all
frames with TID and LID less than or equal to certain values, it should
start forwarding the stream beginning with a packet within that layer
that has I set."  That is, I signals that at this point, the coming
frames of this layer and all layers with lesser TID/LID can be decoded
without dependency on any previous frames.
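
Under that proposed semantics, the switch's job reduces to roughly the
following sketch (Python; the function names and the (tid, lid, i_bit)
packet tuple are hypothetical, and this illustrates my suggested
wording, not the behavior the draft currently specifies):

```python
from typing import List, Optional, Tuple

# A packet is modelled as (tid, lid, i_bit); an assumption for the example.
Packet = Tuple[int, int, bool]


def first_switch_index(packets: List[Packet],
                       max_tid: int, max_lid: int) -> Optional[int]:
    # Find the first packet in the target layer that has I set.
    for idx, (tid, lid, i_bit) in enumerate(packets):
        if tid == max_tid and lid == max_lid and i_bit:
            return idx
    return None


def forward_from_switch(packets: List[Packet],
                        max_tid: int, max_lid: int) -> List[Packet]:
    # Forward nothing until a switching point is seen, then forward every
    # packet whose TID and LID do not exceed the chosen values.
    start = first_switch_index(packets, max_tid, max_lid)
    if start is None:
        return []
    return [p for p in packets[start:]
            if p[0] <= max_tid and p[1] <= max_lid]
```

The switch then needs to watch only one layer for the I bit, with no
look-ahead across all the layers it has seen.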

3.4.2.  Scalability Structures

It would be more effective to state that for "complex or irregular
scalability structures", subdivision by TID and LID is not effective and
so such structures should mark all packets with TID=0 and LID=0.  The
current text suggests that the switch is required to know whether such a
structure is in use, and if so, ignore the TID and LID fields, which
suggests that the sender can put various values in those fields.  This
would lead to requiring the switch to know what encoding is in use, and
avoiding that is the point of this document.

Dale