Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

"Mo Zanaty (mzanaty)" <mzanaty@cisco.com> Wed, 05 August 2020 06:59 UTC

From: "Mo Zanaty (mzanaty)" <mzanaty@cisco.com>
To: "Dale R. Worley" <worley@ariadne.com>
CC: "draft-ietf-avtext-framemarking@ietf.org" <draft-ietf-avtext-framemarking@ietf.org>, "avt@ietf.org" <avt@ietf.org>
Date: Wed, 05 Aug 2020 06:59:16 +0000
Message-ID: <DB4CFB4D.9B4C4%mzanaty@cisco.com>
References: <D9FC4088.91915%mzanaty@cisco.com> <87tv6abnue.fsf@hobgoblin.ariadne.com>
In-Reply-To: <87tv6abnue.fsf@hobgoblin.ariadne.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/s73blnFdut2oYj7Z1dcatTKTNno>

Hi Dale,

Thanks very much for your continued review despite the delays.
Version -11 has updates for all your feedback below.

https://tools.ietf.org/html/draft-ietf-avtext-framemarking-11

See MZ: inline below for new responses (Mo: for old ones),
and please confirm if they adequately address all your comments.

Thanks,
Mo


On 12/17/19, 10:57 PM, "Dale R. Worley" <worley@ariadne.com> wrote:

In an attempt to clarify a particular point:

The whole point of draft-ietf-avtext-framemarking is to provide
information to packet-handling devices on how to manipulate streams of
RTP packets containing encoded video, even when the device cannot
understand the payload of the RTP packets, either because they are
encrypted or they are in a video format that the device does not
understand.

MZ: Correct. Is this not clear in the Abstract?
 
Three typical operations are:
1) Routers dropping packets due to congestion, trying to determine the
least "costly" packets to drop.
2) Routers trying to "shape" the bandwidth demand of a video stream by
removing one or more highest-resolution layers from the video encoding.
3) RTP switches wanting to splice from one video stream to another,
looking for an "efficient" place to switch to the new stream.

MZ: Correct. Is this not clear in the Introduction? I added this bullet in
section 1 to make 2) clearer.
"For scalable streams with dependent layers, the switch may need to
selectively forward specific layers to specific recipients due to
recipient bandwidth or decoder limits or preferences."
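
The selective forwarding described in this bullet can be sketched as a
simple per-recipient filter. This is only an illustration: the names
Marking and should_forward, and the idea of a per-recipient (max_tid,
max_lid) cap, are assumptions made for the example and do not come from
the draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marking:
    """Layer identifiers carried in the header extension (hypothetical type)."""
    tid: int  # Temporal ID
    lid: int  # Layer ID (spatial and/or quality layer)


def should_forward(marking: Marking, max_tid: int, max_lid: int) -> bool:
    """Forward a packet only if its layer is within the recipient's caps.

    Assumption: the switch has already chosen max_tid and max_lid for this
    recipient from its bandwidth or decoder limits.
    """
    return marking.tid <= max_tid and marking.lid <= max_lid


# A recipient limited to the base spatial layer and two temporal layers.
print(should_forward(Marking(tid=1, lid=0), max_tid=1, max_lid=0))
print(should_forward(Marking(tid=2, lid=0), max_tid=1, max_lid=0))
```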

In order for draft-ietf-avtext-framemarking to work well, the
significance of the extension data must be well-defined, so devices know
what the extension data tells about the RTP packets.  Inevitably, this
means that the extension data is interpreted within a model about the
packets and how they are related.

The fundamental structure seems to be the "frame-in-layer", the set of
packets that have the same SSRC, RTP timestamp, LID, and TID values.  A
frame-in-layer is assumed to encode a particular image at a particular
temporal and spatial resolution.

MZ: Correct. I added this "frame within a layer" definition in section 3,
and used it later where the distinction is relevant.
"A frame, in the context of this specification, is the set of RTP
   packets with the same RTP timestamp from a specific RTP
   synchronization source (SSRC).  A frame within a layer is the set of
   RTP packets with the same RTP timestamp, SSRC, Temporal ID (TID), and
   Layer ID (LID)."
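
As a rough illustration of the two definitions quoted above, packets can
be grouped under two different keys. The Packet type and group_frames
function are hypothetical names used only for this sketch.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Minimal view of an RTP packet plus its frame marking (hypothetical)."""
    ssrc: int
    timestamp: int
    tid: int
    lid: int
    seq: int


def group_frames(packets):
    """Group packets into frames and into frames within a layer.

    A frame is keyed by (SSRC, RTP timestamp); a frame within a layer is
    keyed by (SSRC, RTP timestamp, TID, LID), as in the quoted definition.
    """
    frames = defaultdict(list)
    frames_in_layer = defaultdict(list)
    for p in packets:
        frames[(p.ssrc, p.timestamp)].append(p)
        frames_in_layer[(p.ssrc, p.timestamp, p.tid, p.lid)].append(p)
    return dict(frames), dict(frames_in_layer)


packets = [
    Packet(ssrc=1, timestamp=9000, tid=0, lid=0, seq=10),
    Packet(ssrc=1, timestamp=9000, tid=0, lid=0, seq=11),
    Packet(ssrc=1, timestamp=9000, tid=0, lid=1, seq=12),
]
frames, frames_in_layer = group_frames(packets)
print(len(frames), len(frames_in_layer))
```

Here the three packets form one frame but two frames within a layer,
because the third packet carries a different LID.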

The remaining extension data seems to always encode dependencies between
frames-in-layer, that is, if the receiver is to successfully decode one
particular frame-in-layer, it needs all or most of the packets in
certain other frames-in-layer.

MZ: Almost correct. Scalable streams with layer dependencies are prime
candidates for this extension, but it is also useful for non-scalable
streams.

E.g., the D bit says that this frame-in-layer is not depended upon by
any other frame-in-layer.  The implication is that dropping these
packets is "low cost" compared to dropping packets within a
frame-in-layer that is depended on by another frame-in-layer.

MZ: Correct. Is this not clear as worded in section 3.5?
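
The "low cost" reasoning about the D bit could be sketched as a ranking of
drop candidates. The tie-breaking by higher TID and then higher LID is an
assumption made for this example (the thread only says that TID/LID can be
used for relative priority); Marked and drop_candidates are hypothetical
names.

```python
from typing import NamedTuple


class Marked(NamedTuple):
    """A packet's sequence number and the marking fields used for ranking."""
    seq: int
    discardable: bool  # D bit
    tid: int
    lid: int


def drop_candidates(packets, count):
    """Return up to `count` packets, cheapest to drop first.

    Packets with D=1 come first. Among the rest, higher TID and then higher
    LID are preferred for dropping, which is an assumed policy, not a rule
    from the draft.
    """
    ranked = sorted(packets, key=lambda p: (not p.discardable, -p.tid, -p.lid))
    return ranked[:count]


queue = [
    Marked(seq=1, discardable=False, tid=0, lid=0),
    Marked(seq=2, discardable=True, tid=2, lid=0),
    Marked(seq=3, discardable=False, tid=1, lid=0),
    Marked(seq=4, discardable=True, tid=1, lid=0),
]
print([p.seq for p in drop_candidates(queue, 2)])
```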

There is an implication that if one frame-in-layer has the same SSRC,
same timestamp, no higher LID, and no higher TID than another
frame-in-layer, then the latter frame-in-layer depends on the former
frame-in-layer.

MZ: Almost correct. Higher TID/LID typically depend on same/lower TID/LID,
but not always. They can never depend on higher TID/LID. I added this
clarification in section 3.1:
"The layer information contained in TID and LID convey useful aspects
   of the layer structure that can be utilized in selective forwarding.
   Without further information about the layer structure, these TID/LID
   identifiers can only be used for relative priority of layers and
   implicit dependencies between layers.  They convey a layer hierarchy
   with TID=0 and LID=0 identifying the base layer.  Higher values of
   TID identify higher temporal layers with higher frame rates.  Higher
   values of LID identify higher spatial and/or quality layers with
   higher resolutions and/or bitrates.  Implicit dependencies between
   layers assume that a layer with a given TID/LID MAY depend on
   layer(s) with the same or lower TID/LID, but MUST NOT depend on
   layer(s) with higher TID/LID."
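
The implicit dependency rule in the quoted text can be written as a small
predicate. This sketch assumes the rule applies to TID and LID separately,
so a candidate layer is a possible dependency only if neither identifier
is higher; the quoted text does not spell out mixed cases, and
may_depend_on is a hypothetical name.

```python
def may_depend_on(dependent, candidate):
    """Return True if layer `dependent` may depend on layer `candidate`.

    Both arguments are (tid, lid) tuples. Assumption: a layer may depend
    only on layers whose TID and LID are both the same or lower.
    """
    d_tid, d_lid = dependent
    c_tid, c_lid = candidate
    return c_tid <= d_tid and c_lid <= d_lid


print(may_depend_on((1, 1), (0, 0)))  # enhancement layer on base layer
print(may_depend_on((0, 0), (1, 0)))  # base layer on a higher temporal layer
```

The predicate only says which dependencies are possible. As noted above,
a higher layer does not always depend on every same-or-lower layer.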

Within this framework, the algorithm for applying the extension to any
particular video encoding attempts to capture the actual dependency
structure of the video packets within the model that the extension data
can express.  There can be two sorts of mismatch:  "false positive",
where the extensions express a dependency not present in the video
encoding, and "false negative", where the extensions do not express a
dependency which is present in the video encoding.  The extreme case is
"complex, irregular scalability structures that do not conform to
common, fixed patterns of inter-layer dependencies and referencing
structures."  In that case, using TID and LID is likely to not be
beneficial, and the extension data will tend to express a lot of "false
positive" dependencies.

What I'm pushing for is that all of this machinery be clearly stated,
especially exactly what dependencies are signaled by the extension data.
If those are left to the common intuitive understanding, we're likely to
have a lot of edge cases implemented differently by different devices,
leading to poor user experience (although probably not outright
non-interoperability).

Dale


MZ: Hopefully version -11 adequately clarifies this machinery.



On 12/8/19, 2:48 PM, "Dale R. Worley" <worley@ariadne.com> wrote:

My apologies for my delay in responding to this.

draft-ietf-avtext-framemarking-10 is definitely better than -08 on a lot
of points, but there are some points that (in my opinion) need to be
more fully addressed before the document is fully ready for approval.

"Mo Zanaty (mzanaty)" <mzanaty@cisco.com> writes:

[Moving this item first, as it is a design question which informs the
entire proposal:]

Mo: It is impossible to capture and signal all the dependencies in a
modern video stream efficiently.

That makes sense, but OTOH the *purpose* of this extension is to capture
enough of the dependency structure that e.g. a router can act on the
extension values when deciding how to drop packets from the video stream
without unduly messing up the video or wasting bandwidth -- all without
being able to understand the video packets themselves.  Which pretty
much means that the extension can contain false-positive dependencies
but should avoid false-negative dependencies.  Which implies that the
extension values model a particular dependency system, which can be
tailored to be a reasonable approximation of the dependency systems of
the video formats now in use.

(As an exception, the case "Complex structures also use TIDs and LIDs
but not necessarily in a clean nested hierarchy. The most complex
structures are total anarchy (dynamic, unpredictable) but could still
use TIDs and LIDs (within the codec payloads, but not in this header
extension)." -- In that case, the video stream's structure is encoded in
the trivial way (with many false-positive and no false-negative
dependencies), where all packets have TID=0 and LID=0.)

MZ: Section 3.1 now clarifies the implicit dependencies assumed.
-----

There is some oddity in how the sections are structured.  [...]

Mo: This structure was intentional to show that Layer ID mappings only
apply to Scalable Streams.  [...]

What I didn't make clear was that when I read -10, I didn't realize that
the short form is simply what the long form reduces to when the stream
is not scalable (and thus the layer IDs are all 0).  The logical
structure of the actual proposal is that the extension is defined in
section 3.2, but if the stream is not scalable, it simplifies to the
short form described in section 3.1.  However, the text is organized as
if you are defining two logically independent extensions (which look a
lot alike), and which one was actually in use for a particular RTP
stream is dependent on whether the particular video format is scalable
or not.  (And as you've noted, we want this extension to work as an
annotation for encrypted video streams, where the e.g. router acting on
the extension may have no information about the underlying video
encoding.)

In this regard, I think the document needs to explicitly state that the
description in section 3.2 is the *definition* of the extension, and the
description in section 3.1 is just what the definition *reduces to* when
TID and LID are both 0 (i.e., the stream is not scalable).  It would
help a great deal if section 3.2 was moved to be first (as it is
logically antecedent), and section 3.1 was annotated as the simplified
case.

MZ: Sections 3.1 and 3.2 were swapped in version -11, so the long
extension for scalable streams now comes first, followed by the short
extension for non-scalable streams. The LID mappings were pulled out
into their own separate section. Hopefully this flows better.
Section 3.2 now also clarifies that the short extension is the long
extension reduced to its smallest size:
"It is identical to the shortest form of the extension for
   scalable streams, except the last four bits (B and TID) are replaced
   with zeros."
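
To make the relationship between the two forms concrete, here is a minimal
parsing sketch of the extension data bytes. The bit positions assume the
fields appear in the order S, E, I, D, B, TID from the most significant
bit of the first byte, followed by an optional LID byte and an optional
TL0PICIDX byte. That layout, the acceptance of a 2-byte form, and the
names FrameMarking and parse_frame_marking are assumptions for
illustration and should be checked against the draft.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMarking:
    start: bool
    end: bool
    independent: bool
    discardable: bool
    base_layer_sync: bool
    tid: int
    lid: int                   # implicitly 0 when omitted
    tl0picidx: Optional[int]   # None when omitted


def parse_frame_marking(data: bytes) -> FrameMarking:
    """Parse 1 to 3 bytes of extension data (assumed layout, see lead-in)."""
    if not 1 <= len(data) <= 3:
        raise ValueError("expected 1 to 3 bytes of extension data")
    first = data[0]
    return FrameMarking(
        start=bool(first & 0x80),
        end=bool(first & 0x40),
        independent=bool(first & 0x20),
        discardable=bool(first & 0x10),
        # In the short form the last four bits are zero, so B and TID
        # naturally parse as 0 without a separate code path.
        base_layer_sync=bool(first & 0x08),
        tid=first & 0x07,
        lid=data[1] if len(data) >= 2 else 0,
        tl0picidx=data[2] if len(data) == 3 else None,
    )


print(parse_frame_marking(bytes([0xA0])))
print(parse_frame_marking(bytes([0x4A, 0x01, 0x07])))
```

Because the short form is just the one-byte long form with B and TID
zeroed, a single parser covers both, which is the point of the
clarification quoted above.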

-----

Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping)
don't specify how the S, E, I, D, and B bits are determined from the
codec's output packets.

Mo: This is specified in section 3.2.

What I am hoping for is that sections 3.2.1.1 to 3.2.1.4 give explicit
rules for computing the various flag bits from specific bits of the
video protocols.  Of course, the computation is *implied* by the rules
of section 3.2, and in some cases, the current text states explicit
rules, but I think it would help if all of the computations were stated
in ready-to-code form.

MZ: More explicit rules were added for all bits. These may be useful as
test cases, but I don't expect implementations to generate this extension
by parsing the RTP payload. More likely they will get the information
directly from the internal interfaces where they also get or generate the
RTP payload in the first place.

Also, text like this appears in several places:

   The S, E, I and D bits MUST match the corresponding bits in PACSI
   payload structures.

Does "the corresponding bits" mean "the bits named S, E, I, and D in
PACSI ..."?  If so, I think the latter phrasing is clearer, and if the
bits with those meanings have different names in PACSI, the names should
be stated.

MZ: "corresponding bits" was changed to "correspondingly named bits".
They have the same names/letters in all specs.

-----

   o  TID: Temporal ID (3 bits) - The base temporal layer starts with 0,
      and increases with 1 for each higher temporal layer/sub-layer.  If
      no scalability is used, this MUST be 0.  It is implicitly 0 in the
      short extension format.
   o  LID: Layer ID (8 bits) - Identifies the spatial and quality layer
      encoded, starting with 0 and increasing with higher fidelity.  If
      no scalability is used, this MUST be 0 or omitted to reduce
      length.  When omitted, TL0PICIDX MUST also be omitted.  It is
      implicitly 0 in the short extension format or when omitted in the
      long extension format.

I notice that while TID has the restriction "increases with 1 for each
higher temporal layer", LID does not.  Is there a reason that LID
numbers aren't required to be sequential?

MZ: TID description was changed to match LID. While all layers typically
increase by 1, there may be gaps if layers are dropped.

-----

Mo: I fixed this unfortunate circumstance by forcing B=0 when TID=0. B is
useless when TID=0, so we can force it to 0 or 1 arbitrarily. The
"natural" value for something to ignore is 0, so I just enforced this in
the following text.
o B: Base Layer Sync (1 bit) - When TID is not 0, this MUST be 1 if
the sender knows this frame only depends on the base temporal
layer; otherwise MUST be 0. When TID is 0 or if no scalability is
used, this MUST be 0.

Hmmm, it's not important enough to write out my full argument, but I
would have voted for B=1 when TID=0.
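
For reference, the B bit rule as quoted from Mo above (B=0 whenever TID=0)
reduces to a two-branch function. The name base_layer_sync_bit and the
depends_only_on_base parameter are hypothetical; the second stands for
whatever knowledge the sender has about the frame's dependencies.

```python
def base_layer_sync_bit(tid: int, depends_only_on_base: bool) -> int:
    """Compute the B bit for a frame, following the quoted rule.

    When TID is 0 (or no scalability is used) the bit is forced to 0.
    Otherwise it is 1 only if the sender knows the frame depends solely
    on the base temporal layer.
    """
    if tid == 0:
        return 0
    return 1 if depends_only_on_base else 0


print(base_layer_sync_bit(0, True))
print(base_layer_sync_bit(2, True))
print(base_layer_sync_bit(2, False))
```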

-----

The following item has a number of ramifications.  I include the prior
discussion for reference:

The switching of video streams is recommended to be done this way:

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in all spatial
   layers and forward the same.  An RTP switch can request a media
   source to generate a switching point by sending Full Intra Request
   (RTCP FIR) as defined in [RFC5104], for example.

This is difficult to implement in general, as it requires the switch to
keep track of all the layer IDs that have been seen, then look ahead in
the stream to see if, over a narrow range of time, all of the layers
that have been seen have packets with I set.  If the fundamental purpose
of I is to signal the best points to switch streams, it would be better
to define its semantics to be that.  E.g., "If a switch intends to start
forwarding a video stream, and within that stream, transmitting all
frames with TID and LID less than or equal to certain values, it should
start forwarding the stream beginning with a packet within that layer
that has I set."  That is, I signals that at this point, the coming
frames of this layer and all layers with lesser TID/LID can be decoded
without dependency on any previous frames.

Mo: Media switches already do this, but by deep inspection of the payload
rather than simple inspection of a header extension. The I bit does not
signal anything about lower layers, only this layer. That is what media
switches want and expect.

The first point is that with this extension, a media switch should be
able to switch streams without inspecting the payload *at all*, so what
they now do by deep inspection of the video stream is not particularly
relevant.

Another point is to look at the bit definitions in section 3.2:

   o  S: Start of Frame (1 bit) - MUST be 1 in the first packet in a
      frame within a layer; otherwise MUST be 0.
   o  E: End of Frame (1 bit) - MUST be 1 in the last packet in a frame
      within a layer; otherwise MUST be 0.  Note that the RTP header
      marker bit MAY be used to infer the last packet of the highest
      enhancement layer, in payload formats with such semantics.
   o  I: Independent Frame (1 bit) - MUST be 1 for frames that can be
      decoded independent of temporally prior frames, e.g. intra-frame,
      VPX keyframe, H.264 IDR [RFC6184], H.265 IDR/CRA/BLA/RAP
      [RFC7798]; otherwise MUST be 0.  Note that this bit only signals
      temporal independence, so it can be 1 in spatial or quality
      enhancement layers that depend on temporally co-located layers but
      not temporally prior frames.
   o  D: Discardable Frame (1 bit) - MUST be 1 for frames the sender
      knows can be discarded, and still provide a decodable media
      stream; otherwise MUST be 0.

Note that bits S and E are defined in regard to "a frame within a
layer".  Section 3 defines "frame":

   A frame, in the context of this specification, is the set of RTP
   packets with the same RTP timestamp from a specific RTP
   synchronization source (SSRC).

I don't think it's stated explicitly, but clearly "layer" means "the
packets with a particular TID and LID value".

However, bits I and D are defined in terms of a frame alone, not "frame
within a layer", so they are *necessarily* the same for all packets
within a frame (i.e. with a particular timestamp) regardless of TID and
LID values.

Is this intended, or is the intention that I and D are defined for
"frame within a layer"?

MZ: I added this "frame within a layer" definition in section 3,
and used it later where the distinction is relevant like I/D bits.
"A frame, in the context of this specification, is the set of RTP
packets with the same RTP timestamp from a specific RTP
synchronization source (SSRC). A frame within a layer is the set of
RTP packets with the same RTP timestamp, SSRC, Temporal ID (TID), and
Layer ID (LID)."

This question interacts in a complicated way with this part of section
3.4:

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in all spatial
   layers and forward the same.

Given that the I bit is the same for all layer IDs, this text is
extra-strict, and could be phrased "from the first switching point with
the I bit set".

I suspect you mean for the I bit to be defined in regard to "a frame
within a layer", in which case this text is phrased the way one would
expect it to be.

However, if you want to make life easier for RTP switches, I think you
want to define the I bit this way (and RTP switching seems to be the
only use of I bits) (and I think this would not burden RTP senders):

MZ: The I bit is also used to detect spatial layer refresh as noted
in Section 3.5.1 on LRR:
"Other refreshes can be detected based on the I bit
being set for the specific spatial layers."

   o  I: Independent Frame (1 bit) - MUST be 1 for frames within layers
      that can be decoded independently of temporally prior frames,
      e.g. intra-frame, VPX keyframe, H.264 IDR [RFC6184], H.265
      IDR/CRA/BLA/RAP [RFC7798] and for which all layers within the
      frame with no larger TID or LID are similarly independent;
      otherwise MUST be 0.  Note that this bit only signals temporal
      independence, so it can be 1 in spatial or quality enhancement
      layers that depend on temporally co-located layers but not
      temporally prior frames.

That is, the I bit summarizes not only the independence of a packet of a
particular frame/layer, but of all the lower LID/TID layers in the frame
as well.  Then video switching becomes far simpler:

MZ: RTP switches need better (not simpler) control of layer refresh, hence
LRR, and the I bit semantics in this extension, which match the semantics
of codec specs. What you think is simpler would actually cause more
confusion and complication to RTP sources and switches, which never expect
a high layer refresh to signal all lower layers also refresh.

   When an RTP switch wants to forward a new video stream to a receiver,
   it is RECOMMENDED to select the new video stream from the first
   switching point with the I (Independent) bit set in the layer with
   the highest TID/LID that is being passed.

MZ: Note that this defeats your original objective, which was to avoid
keeping track of layer IDs and looking ahead to which have the I bit set.
The highest layer comes last, so waiting for it or looking ahead to its I
bit does not simplify or optimize anything.
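
The recommended switching rule (start from the first point where the I bit
is set in all spatial layers) might be sketched as follows. It assumes the
switch already knows which LIDs it intends to forward and sees one
(timestamp, LID, I bit) tuple per frame within a layer in arrival order;
first_switching_point is a hypothetical name.

```python
from collections import defaultdict


def first_switching_point(markings, wanted_lids):
    """Return the RTP timestamp of the first usable switching point.

    `markings` is an iterable of (timestamp, lid, independent) tuples in
    arrival order. A frame qualifies once every LID in `wanted_lids` has
    been seen at that timestamp with the I bit set. Returns None if no
    such frame is found.
    """
    wanted = set(wanted_lids)
    independent_lids = defaultdict(set)
    for timestamp, lid, independent in markings:
        if independent:
            independent_lids[timestamp].add(lid)
            if wanted <= independent_lids[timestamp]:
                return timestamp
    return None


stream = [
    (100, 0, False), (100, 1, False),
    (200, 0, True), (200, 1, False),
    (300, 0, True), (300, 1, True),
]
print(first_switching_point(stream, {0, 1}))
print(first_switching_point(stream, {0}))
```

The sketch still has to wait for the highest wanted layer of a frame
before it can decide, which matches the point made above that looking at
the highest layer's I bit does not remove the need to track layers.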

In a somewhat similar way, I think you want to modify the definition of
the D bit to:

   o  D: Discardable Frame (1 bit) - MUST be 1 for frames in layers the
      sender knows can be discarded, and still provide a decodable media
      stream for the layer; otherwise MUST be 0.

MZ: Agreed. The D bit now uses the "frame within a layer" definition.

The advantage of this definition is that it allows, e.g., the packets
with LID=1 of a frame to be discardable even if the packets with LID=0
of the same frame are not.  (I do not know if any existing video format
generates frames with this property.)

MZ: Almost correct. The highest TID>0 (not LID) is often discardable (D=1).

(The reverse case cannot happen in practice, since the LID=1 packets of
a frame are implicitly dependent on the LID=0 packets of the same frame,
and if the LID=0 packets are discarded, the LID=1 packets are expected
to not be decodable.)

MZ: Almost correct. Some LID=0 packets may be discardable (D=1) like
metadata/filler while the important packets needed for correct decoding of
this and higher layers are not discardable (D=0).

Dale

MZ: Thanks again for the thorough reviews.
