Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

worley@ariadne.com (Dale R. Worley) Sun, 08 December 2019 19:48 UTC

From: worley@ariadne.com
To: "Mo Zanaty (mzanaty)" <mzanaty@cisco.com>
Cc: draft-ietf-avtext-framemarking@ietf.org, avt@ietf.org, magnus.westerlund@ericsson.com
In-Reply-To: <D9FC4088.91915%mzanaty@cisco.com>
Sender: worley@ariadne.com
Date: Sun, 08 Dec 2019 14:48:25 -0500
Message-ID: <87tv6abnue.fsf@hobgoblin.ariadne.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/S8KkX9MSySJ4_YydyO7uPmDO6dI>
Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

My apologies for my delay in responding to this.

draft-ietf-avtext-framemarking-10 is definitely better than -08 on a lot
of points, but there are some points that (in my opinion) need to be
more fully addressed before the document is fully ready for approval.

"Mo Zanaty (mzanaty)" <mzanaty@cisco.com> writes:

[Moving this item first, as it is a design question which informs the
entire proposal:]

> Mo: It is impossible to capture and signal all the dependencies in a
> modern video stream efficiently.

That makes sense, but OTOH the *purpose* of this extension is to capture
enough of the dependency structure that e.g. a router can act on the
extension values when deciding how to drop packets from the video stream
without unduly messing up the video or wasting bandwidth -- all without
being able to understand the video packets themselves.  That pretty
much means that the extension can contain false-positive dependencies
(signaled dependencies that do not really exist) but should avoid
false-negative dependencies (real dependencies that are not signaled).
It also implies that the extension values model a particular dependency
system, which can be tailored to be a reasonable approximation of the
dependency systems of the video formats now in use.

(The exception is the case described as "Complex structures also use
TIDs and LIDs but not necessarily in a clean nested hierarchy. The most
complex structures are total anarchy (dynamic, unpredictable) but could
still use TIDs and LIDs (within the codec payloads, but not in this
header extension)."  In that case, the video stream's structure is
encoded in the trivial way (with many false-positive and no
false-negative dependencies), where all packets have TID=0 and LID=0.)
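
To make the false-positive/false-negative point concrete, here is a
minimal Python sketch of the forwarding decision a payload-blind router
could make.  It assumes the clean nested model, in which a packet never
depends on a packet with a larger TID or LID.  The names Marking and
forward_subset are invented for this illustration and do not come from
the draft.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Marking:
    """Frame-marking values of one RTP packet (hypothetical container)."""
    seq: int   # RTP sequence number
    tid: int   # temporal layer ID
    lid: int   # spatial/quality layer ID


def forward_subset(packets: Iterable[Marking], max_tid: int, max_lid: int) -> List[Marking]:
    """Keep only packets at or below the chosen layer limits.

    Assumed model: a packet never depends on a packet with a larger TID
    or LID, so dropping everything above the limits cannot remove
    something a kept packet needs (no false-negative dependencies).
    """
    return [p for p in packets if p.tid <= max_tid and p.lid <= max_lid]


# A stream with real layering can be thinned down to its base layer ...
layered = [Marking(1, 0, 0), Marking(2, 1, 0), Marking(3, 0, 1), Marking(4, 1, 1)]
thinned = forward_subset(layered, max_tid=0, max_lid=0)

# ... while the "anarchy" encoding (everything TID=0, LID=0) cannot:
# every packet is declared a possible dependency, so nothing is dropped.
anarchy = [Marking(seq, 0, 0) for seq in range(1, 5)]
kept = forward_subset(anarchy, max_tid=0, max_lid=0)
```

The trivial encoding is always safe but gives the router nothing to
work with, which is the price of its many false-positive dependencies.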

-----

>>There is some oddity in how the sections are structured.  [...]
>
> Mo: This structure was intentional to show that Layer ID mappings only
> apply to Scalable Streams.  [...]

What I didn't make clear was that when I read -10, I didn't realize that
the short form is simply what the long form reduces to when the stream
is not scalable (and thus the layer IDs are all 0).  The logical
structure of the actual proposal is that the extension is defined in
section 3.2, but if the stream is not scalable, it simplifies to the
short form described in section 3.1.  However, the text is organized as
if you are defining two logically independent extensions (which look a
lot alike), with the choice between them for a particular RTP stream
depending on whether the particular video format is scalable.  (And as
you've noted, we want this extension to work as an annotation for
encrypted video streams, where the device acting on the extension, e.g.
a router, may have no information about the underlying video encoding.)

In this regard, I think the document needs to explicitly state that the
description in section 3.2 is the *definition* of the extension, and the
description in section 3.1 is just what the definition *reduces to* when
TID and LID are both 0 (i.e., the stream is not scalable).  It would
help a great deal if section 3.2 were moved to be first (as it is
logically antecedent), and section 3.1 were annotated as the simplified
case.
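
One way to see the "reduces to" relationship is that a single decoder
can handle both forms.  The Python sketch below is illustrative only:
the bit order (S, E, I, D, B, then a 3-bit TID in the first byte,
followed by optional LID and TL0PICIDX bytes) is an assumption about
the draft's layout, and FrameMarking and parse_frame_marking are
hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMarking:
    """Decoded extension values; defaults are the non-scalable case."""
    s: bool
    e: bool
    i: bool
    d: bool
    b: bool = False
    tid: int = 0
    lid: int = 0
    tl0picidx: Optional[int] = None


def parse_frame_marking(data: bytes) -> FrameMarking:
    """Decode the extension payload (the bytes after the ID/length header).

    ASSUMED layout, most significant bit first: S E I D B TID(3),
    optionally followed by one LID byte and one TL0PICIDX byte.
    """
    if not 1 <= len(data) <= 3:
        raise ValueError("expected 1 to 3 bytes of extension data")
    first = data[0]
    return FrameMarking(
        s=bool(first & 0x80),
        e=bool(first & 0x40),
        i=bool(first & 0x20),
        d=bool(first & 0x10),
        b=bool(first & 0x08),
        tid=first & 0x07,
        # An omitted LID is implicitly 0; an omitted TL0PICIDX stays unset.
        lid=data[1] if len(data) >= 2 else 0,
        tl0picidx=data[2] if len(data) == 3 else None,
    )


short_form = parse_frame_marking(bytes([0xA0]))        # S=1, I=1, one byte
long_form_zero = parse_frame_marking(bytes([0xA0, 0x00]))  # same bits, LID=0
```

Under these assumptions the one-byte form and the long form with
TID=0 and LID=0 decode to the same values, so a consumer never needs to
know which section of the document the sender was following.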

-----

>>Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping)
>>don't specify how the S, E, I, D, and B bits are determined from the
>>codec's output packets.
>
> Mo: This is specified in section 3.2.

What I am hoping for is that sections 3.2.1.1 to 3.2.1.4 give explicit
rules for computing the various flag bits from specific bits of the
video protocols.  Of course, the computation is *implied* by the rules
of section 3.2, and in some cases, the current text states explicit
rules, but I think it would help if all of the computations were stated
in ready-to-code form.

Also, text like this appears in several places:

   The S, E, I and D bits MUST match the corresponding bits in PACSI
   payload structures.

Does "the corresponding bits" mean "the bits named S, E, I, and D in
PACSI ..."?  If so, I think the latter phrasing is clearer, and if the
bits with those meanings have different names in PACSI, the names should
be stated.

-----

   o  TID: Temporal ID (3 bits) - The base temporal layer starts with 0,
      and increases with 1 for each higher temporal layer/sub-layer.  If
      no scalability is used, this MUST be 0.  It is implicitly 0 in the
      short extension format.
   o  LID: Layer ID (8 bits) - Identifies the spatial and quality layer
      encoded, starting with 0 and increasing with higher fidelity.  If
      no scalability is used, this MUST be 0 or omitted to reduce
      length.  When omitted, TL0PICIDX MUST also be omitted.  It is
      implicitly 0 in the short extension format or when omitted in the
      long extension format.

I notice that while TID has the restriction "increases with 1 for each
higher temporal layer", LID does not.  Is there a reason that LID
numbers aren't required to be sequential?

-----

> Mo: I fixed this unfortunate circumstance by forcing B=0 when TID=0. B is
> useless when TID=0, so we can force it to 0 or 1 arbitrarily. The
> "natural" value for something to ignore is 0, so I just enforced this in
> the following text.
> o B: Base Layer Sync (1 bit) - When TID is not 0, this MUST be 1 if
> the sender knows this frame only depends on the base temporal
> layer; otherwise MUST be 0. When TID is 0 or if no scalability is
> used, this MUST be 0.

Hmmm, it's not important enough to write out my full argument, but I
would have voted for B=1 when TID=0.

-----

The following item has a number of ramifications.  I include the prior
discussion for reference:

>>The switching of video streams is recommended to be done this way:
>>
>>   When an RTP switch wants to forward a new video stream to a receiver,
>>   it is RECOMMENDED to select the new video stream from the first
>>   switching point with the I (Independent) bit set in all spatial
>>   layers and forward the same.  An RTP switch can request a media
>>   source to generate a switching point by sending Full Intra Request
>>   (RTCP FIR) as defined in [RFC5104], for example.
>>
>>This is difficult to implement in general, as it requires the switch to
>>keep track of all the layer IDs that have been seen, then look ahead in
>>the stream to see if, over a narrow range of time, all of the layers
>>that have been seen have packets with I set.  If the fundamental purpose
>>of I is to signal the best points to switch streams, it would be better
>>to define its semantics to be that.  E.g., "If a switch intends to start
>>forwarding a video stream, and within that stream, transmitting all
>>frames with TID and LID less than or equal to certain values, it should
>>start forwarding the stream beginning with a packet within that layer
>>that has I set."  That is, I signals that at this point, the coming
>>frames of this layer and all layers with lesser TID/LID can be decoded
>>without dependency on any previous frames.
>
> Mo: Media switches already do this, but by deep inspection of the payload
> rather than simple inspection of a header extension. The I bit does not
> signal anything about lower layers, only this layer. That is what media
> switches want and expect.

The first point is that with this extension, a media switch should be
able to switch streams without inspecting the payload *at all*, so what
switches now do by deep inspection of the video stream is not
particularly relevant.

Another point is to look at the bit definitions in section 3.2:

   o  S: Start of Frame (1 bit) - MUST be 1 in the first packet in a
      frame within a layer; otherwise MUST be 0.
   o  E: End of Frame (1 bit) - MUST be 1 in the last packet in a frame
      within a layer; otherwise MUST be 0.  Note that the RTP header
      marker bit MAY be used to infer the last packet of the highest
      enhancement layer, in payload formats with such semantics.
   o  I: Independent Frame (1 bit) - MUST be 1 for frames that can be
      decoded independent of temporally prior frames, e.g. intra-frame,
      VPX keyframe, H.264 IDR [RFC6184], H.265 IDR/CRA/BLA/RAP
      [RFC7798]; otherwise MUST be 0.  Note that this bit only signals
      temporal independence, so it can be 1 in spatial or quality
      enhancement layers that depend on temporally co-located layers but
      not temporally prior frames.
   o  D: Discardable Frame (1 bit) - MUST be 1 for frames the sender
      knows can be discarded, and still provide a decodable media
      stream; otherwise MUST be 0.

Note that bits S and E are defined in regard to "a frame within a
layer".  Section 3 defines "frame":

   A frame, in the context of this specification, is the set of RTP
   packets with the same RTP timestamp from a specific RTP
   synchronization source (SSRC).

I don't think it's stated explicitly, but clearly "layer" means "the
packets with a particular TID and LID value".

However, bits I and D are defined in terms of a frame alone, not "frame
within a layer", so they are *necessarily* the same for all packets
within a frame (i.e. with a particular timestamp) regardless of TID and
LID values.
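
The per-frame reading can be expressed as a check that a receiver or
test tool might run.  This is a sketch under the literal reading above
(a frame is all packets with one SSRC and one RTP timestamp, and I and
D are properties of the frame).  Packet and frames_with_mixed_i_or_d
are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Packet:
    ssrc: int
    timestamp: int
    tid: int
    lid: int
    i: bool
    d: bool


def frames_with_mixed_i_or_d(packets: Iterable[Packet]) -> List[Tuple[int, int]]:
    """Return the (ssrc, timestamp) keys of frames whose packets disagree on I or D."""
    by_frame: Dict[Tuple[int, int], Set[Tuple[bool, bool]]] = defaultdict(set)
    for p in packets:
        # A frame is identified by SSRC and RTP timestamp only; TID and LID
        # play no part in the grouping.
        by_frame[(p.ssrc, p.timestamp)].add((p.i, p.d))
    return [key for key, values in by_frame.items() if len(values) > 1]


stream = [
    Packet(ssrc=1, timestamp=9000, tid=0, lid=0, i=True, d=False),
    Packet(ssrc=1, timestamp=9000, tid=0, lid=1, i=False, d=False),  # I differs by layer
    Packet(ssrc=1, timestamp=12000, tid=1, lid=0, i=False, d=True),
    Packet(ssrc=1, timestamp=12000, tid=1, lid=1, i=False, d=True),
]
mixed = frames_with_mixed_i_or_d(stream)
```

A sender that sets I per layer would trip this check on the first
frame, which is exactly the ambiguity in question.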

Is this intended, or is the intention that I and D are defined for
"frame within a layer"?

This question interacts in a complicated way with this part of section
3.4:

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in all spatial
   layers and forward the same.

Given that the I bit is the same for all layer IDs, this text is
extra-strict, and could be phrased "from the first switching point with
the I bit set".

I suspect you mean for the I bit to be defined in regard to "a frame
within a layer", in which case this text is phrased the way one would
expect it to be.

However, if you want to make life easier for RTP switches, I think you
want to define the I bit as shown below.  RTP switching seems to be the
only use of I bits, and I think this definition would not burden RTP
senders:

   o  I: Independent Frame (1 bit) - MUST be 1 for frames within layers
      that can be decoded independently of temporally prior frames,
      e.g. intra-frame, VPX keyframe, H.264 IDR [RFC6184], H.265
      IDR/CRA/BLA/RAP [RFC7798] and for which all layers within the
      frame with no larger TID or LID are similarly independent;
      otherwise MUST be 0.  Note that this bit only signals temporal
      independence, so it can be 1 in spatial or quality enhancement
      layers that depend on temporally co-located layers but not
      temporally prior frames.

That is, the I bit summarizes not only the independence of a packet of a
particular frame/layer, but of all the lower LID/TID layers in the frame
as well.  Then video switching becomes far simpler:

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in the layer with
   the highest TID/LID that is being passed.
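
The difference in switch-side effort can be sketched in Python.  This
is only an illustration: it ignores TID, treats LID as the sole layer
axis, and uses hypothetical names (Packet, switch_point_per_layer_i,
switch_point_cumulative_i).  Each function returns the RTP timestamp of
the frame from which forwarding could start.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Packet:
    timestamp: int
    lid: int
    i: bool


def switch_point_per_layer_i(packets: Iterable[Packet], expected_lids: Set[int]) -> Optional[int]:
    """Current text: wait for a frame with I set in *all* spatial layers.

    The switch has to know in advance which LIDs exist (expected_lids)
    and keep per-frame state until every one of them has shown I=1.
    """
    independent: Dict[int, Set[int]] = {}
    for p in packets:
        if p.i:
            seen = independent.setdefault(p.timestamp, set())
            seen.add(p.lid)
            if expected_lids <= seen:
                return p.timestamp
    return None


def switch_point_cumulative_i(packets: Iterable[Packet], top_lid: int) -> Optional[int]:
    """Proposed text: I on the highest forwarded layer vouches for the lower ones."""
    for p in packets:
        if p.lid == top_lid and p.i:
            return p.timestamp
    return None


# The same two frames, marked under each definition of I.
per_layer_marked = [Packet(100, 0, False), Packet(100, 1, True),
                    Packet(200, 0, True), Packet(200, 1, True)]
cumulative_marked = [Packet(100, 0, False), Packet(100, 1, False),
                     Packet(200, 0, True), Packet(200, 1, True)]
```

Both functions pick the frame at timestamp 200 here, but the second one
needs no knowledge of the layer set and no per-frame bookkeeping.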

In a somewhat similar way, I think you want to modify the definition of
the D bit to:

   o  D: Discardable Frame (1 bit) - MUST be 1 for frames in layers the
      sender knows can be discarded, and still provide a decodable media
      stream for the layer; otherwise MUST be 0.

The advantage of this definition is that it allows, e.g., the packets
with LID=1 of a frame to be discardable even if the packets with LID=0
of the same frame are not.  (I do not know if any existing video format
generates frames with this property.)

(The reverse case cannot happen in practice, since the LID=1 packets of
a frame are implicitly dependent on the LID=0 packets of the same frame,
and if the LID=0 packets are discarded, the LID=1 packets are expected
to not be decodable.)
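
As a small illustration of what the per-layer definition buys, the
Python sketch below marks the same frame under both readings of D and
applies the same drop rule.  The Packet type and the drop_discardable
function are hypothetical, and the example frame is invented rather
than taken from any particular codec.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Packet:
    seq: int
    timestamp: int
    lid: int
    d: bool


def drop_discardable(packets: Iterable[Packet]) -> List[Packet]:
    """Forward only the packets that are not marked discardable."""
    return [p for p in packets if not p.d]


# Per-frame D: the frame as a whole is needed, so no packet may set D.
per_frame = [Packet(1, 9000, 0, False), Packet(2, 9000, 1, False)]

# Per-layer D: the LID=1 part of the same frame can be marked on its own.
per_layer = [Packet(1, 9000, 0, False), Packet(2, 9000, 1, True)]

forwarded_per_frame = drop_discardable(per_frame)
forwarded_per_layer = drop_discardable(per_layer)
```

With the per-layer marking, a congested switch can shed the LID=1
packet while still forwarding the LID=0 packet of the same frame.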

Dale