Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

Jonathan Lennox <jonathan@vidyo.com> Thu, 04 April 2019 15:31 UTC

From: Jonathan Lennox <jonathan@vidyo.com>
Date: Thu, 04 Apr 2019 11:31:40 -0400
References: <23718.8872.229678.4132@paris.clic.cs.columbia.edu>
To: IETF AVTCore WG <avt@ietf.org>
In-Reply-To: <23718.8872.229678.4132@paris.clic.cs.columbia.edu>
Message-Id: <F845B05F-7761-47C2-AF21-22536D0882EA@vidyo.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/JFKwrOxv0etcLOg-qwijx_L50hI>
Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt

Authors — even though we’re past the nominal WGLC, please respond to this as a last call comment.

I’ll hold off on writing up the publication request until this is resolved.

> From: worley@ariadne.com (Dale R. Worley)
> Subject: Re: [AVTCORE] I-D Action: draft-ietf-avtext-framemarking-08.txt
> Date: April 3, 2019 at 7:32:15 AM EDT
> To: Magnus Westerlund <magnus.westerlund@ericsson.com>
> Cc: draft-ietf-avtext-framemarking@ietf.org, avt@ietf.org
> 
> 
> I'm no expert on this field, but as the frame marking extension is
> intended to be used broadly over many different video encodings, I think
> it can be usefully critiqued relative to its ambition to be a
> *generalized* frame marking mechanism.  In particular, a number of its
> features seem to reference ideas which generally apply to multiple
> encodings.  But this leaves a great deal of room for lack of alignment
> as to the exact semantics of the features, which could easily lead to a
> lot of subtle interoperation problems.  So I am here pushing for a
> clearer definition of what is and is not meant by the features.
> 
> There is some oddity in how the sections are structured.  The short form
> is defined in 3.1, and the long form is defined in 3.2.  The four
> mappings for specific codecs are listed in 3.2.1.1, 3.2.1.2, 3.2.1.3,
> and 3.2.1.4.  It would be better to group the two definitional sections
> together and group the four example sections together.
> 
> Also, 3.2.1.3 (H264 (AVC) LID Mapping) and 3.2.1.4 (VP8 LID Mapping)
> don't specify how the S, E, I, D, and B bits are determined from the
> codec's output packets.
> 
> Regarding the multiple (four, actually) formats of the extension, it
> helps in specifying them if they can all be mapped into the same semantic
> data structure.  For example,
> 
>    TID is the temporal layer index.  It is implicitly 0 if the short
>    format is used.
> 
>    LID is the (spatial) layer index.  It is implicitly 0 if the short
>    format is used or the L=1 form of the long format is used.
> 
>    TL0PICIDX:  When TID is 0:  If present, it is a cyclic counter
>    labeling the frames.  If not present, the frames have no such labels. 
>    When TID is not 0, it indicates that this frame in this layer
>    depends on the frame with this label in the layer with TID 0.
> 
> Notice that a missing TL0PICIDX has different semantics than a missing
> TID or LID.
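
To illustrate the common structure Dale describes, here is a minimal Python sketch. The names FrameMark and normalize are hypothetical and appear nowhere in the draft; the sketch deliberately says nothing about the wire layout of any format. It only captures the point that an absent TID or LID defaults to 0, while an absent TL0PICIDX means "no label".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameMark:
    """Format-independent view of one frame marking extension (hypothetical)."""
    tid: int = 0                      # temporal layer index; 0 when not carried
    lid: int = 0                      # spatial layer index; 0 when not carried
    tl0picidx: Optional[int] = None   # None means "frame has no label", not 0


def normalize(tid=None, lid=None, tl0picidx=None) -> FrameMark:
    """Map whichever fields a given format carried onto one FrameMark."""
    return FrameMark(
        tid=0 if tid is None else tid,
        lid=0 if lid is None else lid,
        tl0picidx=tl0picidx,          # left as None when missing
    )


short_form = normalize()                           # nothing but the flag bits
full_form = normalize(tid=2, lid=1, tl0picidx=57)  # all three fields present
```

Keeping tl0picidx optional rather than defaulting it to 0 is what preserves the asymmetry noted above.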
> 
> Given the similarity of "temporal layer index" and "layer ID", it seems
> like you want a more distinctive phrase for the latter.  Could it be
> changed to "spatial layer" or "resolution layer"?
> 
> There seems to be no way to signal whether the short form is used
> vs. the L=0 version of the long form (if B=0 and TID=0) -- The ID value
> for both is signaled in SDP by
> 
>      a=extmap:3 urn:ietf:params:rtp-hdrext:framemarking
> 
> This doesn't cause a problem, as the semantics of the two alternatives
> are the same, but it prevents the 4 reserved bits in the short form from
> being defined in the future for any purpose other than B and LID. --
> Alternatively, is the short form simply what the extension is reduced to
> when using non-scalable streams, as those must necessarily have B=0 and
> LID=0?  (Also see my query below regarding the "default" value of B.)
> 
> It seems that the intention is that the video stream can be divided into
> substreams of RTP packets, called "layers", each of which is identified
> by a particular TID and LID, that is, TID/LID defines a
> *two-dimensional* hierarchy.  "They convey a layer hierarchy with [the
> layer with] TID=0 and LID=0 identifying the base layer."
> 
> My guess is that the special case for interpreting the TL0PICIDX value
> is actually when both TID and LID = 0, that is, the base layer, not just
> TID = 0 as stated in the text.  (If I'm wrong, the structure here is
> more complicated than I'm describing, with the TL0PICIDX labels of an
> upper layer referring to the label with the *same* LID but TID = 0.)
> 
> The idea seems to be that one can "efficiently" discard layers from the
> RTP stream, as long as: if one keeps a layer with a particular TID and
> LID, one keeps all layers with lesser or equal TID and LID.  I can't
> quite see how best to define "efficiently" here, but it seems to be the
> central reason for labeling the layers -- that a receiver can
> successfully decode all of the data in all of the layers that remain
> present.
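
A small Python sketch of that discard rule, assuming packets are represented as (tid, lid, payload) tuples, which is a shape invented here for illustration. Keeping everything at or below a chosen (max_tid, max_lid) automatically satisfies the "keep all lesser or equal layers" condition.

```python
def keep_packet(tid: int, lid: int, max_tid: int, max_lid: int) -> bool:
    """True if the packet's layer lies at or below the chosen operating point."""
    return tid <= max_tid and lid <= max_lid


def filter_stream(packets, max_tid: int, max_lid: int):
    """packets: iterable of (tid, lid, payload) tuples (hypothetical shape)."""
    return [p for p in packets if keep_packet(p[0], p[1], max_tid, max_lid)]


stream = [(0, 0, "a"), (1, 0, "b"), (0, 1, "c"), (2, 1, "d")]
# Keep temporal layers 0 and 1 of spatial layer 0 only.
kept = filter_stream(stream, max_tid=1, max_lid=0)
```

The filter needs only the two indices from the extension, not any knowledge of the codec.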
> 
> Things are more interesting in regard to what packets can be discarded
> from a layer "efficiently".  The S, E, I, D, and B bits seem to be
> intended to guide a device that needs to discard packets.  The use of D
> bits is specified:
> 
>   When an RTP switch needs to discard a received video frame due to
>   congestion control considerations, it is RECOMMENDED that it
>   preferably drop frames marked with the D (Discardable) bit set [...]
> 
> And I suspect that it is implied that if packets are dropped from one
> frame, further packets from the same frame are preferred to be dropped.
> The S and E bits are intended to help with this process.
> 
> But dropping whole frames to some degree conflicts with the fact that
> small losses from video layers can often be recovered from, either due
> to redundancy in the layer, or by loss-reconstruction strategies in the
> receiver.  However, if one drops a *lot* of packets from one frame, one
> might as well discard the remainder of them.
> 
> The I bit suggests that there are provisions for dependency between the
> frames in a single layer, and dependency between frames is not the same
> for all frames.  It appears that if one frame of a layer is dropped, the
> following frames are preferred to be dropped until a frame with I = 1 is
> seen.
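
One reading of that behavior, sketched in Python for a single layer. The frame representation (a dict with an "i" flag) and the function name are assumptions made for this example, and the draft does not state this rule explicitly.

```python
def forward_after_loss(frames, dropped_index: int):
    """Return the indices of frames still forwarded after one frame is dropped.

    frames: list of dicts with an 'i' flag, all from one layer (hypothetical).
    After the dropped frame, frames are skipped until one with I = 1 arrives.
    """
    forwarded = []
    dropping = False
    for idx, frame in enumerate(frames):
        if idx == dropped_index:
            dropping = True          # start of the dependent run to discard
            continue
        if dropping and frame["i"]:
            dropping = False         # an independent frame ends the run
        if not dropping:
            forwarded.append(idx)
    return forwarded


layer = [{"i": True}, {"i": False}, {"i": False}, {"i": True}, {"i": False}]
result = forward_after_loss(layer, dropped_index=1)
```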
> 
> I am less clear on what the B bit means.  Presumably all layers with
> lower TID and LID than the layer containing the frame in question are
> retained, in which case B doesn't seem to carry useful information.
> 
> And there seems to be a problem with "defaulting" the value of B to 0
> when there is no scalability.
> 
> As stated:
> 
>   o  B: Base Layer Sync (1 bit) - MUST be 1 if the sender knows this
>      frame only depends on the base temporal layer; otherwise MUST be 0.
> 
> This can be stated equivalently:
> 
>      MUST be 1 if the sender knows this frame does not depend on any
>      frames that do not have TID=0.
> 
> Now if the frame itself has TID=0, then it cannot (by the ordering of
> the layers) depend on any frame that does not also have TID=0.  The
> consequence is that the "natural" value of B in TID=0 layers is 1.  And
> when there is no scalability, the only layer has TID=0.
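
The argument can be checked with a short Python sketch. Both function names are hypothetical, TIDs are assumed to be non-negative integers, and the sender is assumed to know the full dependency list of each frame.

```python
def natural_b(dependency_tids) -> int:
    """B as restated above: 1 when no dependency has a nonzero TID."""
    return 1 if all(t == 0 for t in dependency_tids) else 0


def valid_dependencies(frame_tid: int, dependency_tids) -> bool:
    """Layer ordering: a frame may depend only on frames with TID <= its own."""
    return all(t <= frame_tid for t in dependency_tids)


# A TID=0 frame: every dependency list allowed by the ordering contains only
# zeros, so the restated rule always yields B = 1.
base_deps = [0, 0, 0]
base_ok = valid_dependencies(0, base_deps)
base_b = natural_b(base_deps)

# A TID=2 frame that also depends on a TID=1 frame gets B = 0.
upper_b = natural_b([0, 1])
```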
> 
> I think what is going on is that there's an implicit structure of
> dependencies between the frames of a layer (a frame depends on earlier
> frames), and between the frames of different layers (a frame can depend
> on frames with lower TID/LID and no later in time), and the various bits
> are used to signal the *lack* of certain possible dependencies, but how
> the bits do this needs to be clarified.  (The meaning of TL0PICIDX
> particularly needs to be specified.)  But the implicit dependency
> structure isn't spelled out.  That makes things harder in two ways:  (1)
> it is not clear what sorts of dependencies future codecs are *not*
> allowed to introduce, and (2) it is difficult to state exactly what
> dependencies are *removed* by particular signaling.
> 
> 3.2.1.  Layer ID Mappings for Scalable Streams
> 
> All of the descriptions for specific codecs contain "ID=2", whereas the
> generic descriptions of the extension formats show "ID=?".  The latter
> is correct, since the ID value is negotiated for every RTP stream.
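
Since the ID comes from the negotiated a=extmap line, an implementation would look it up rather than hard-code it. A minimal Python sketch, with a hypothetical helper name and only the basic "a=extmap:<id>[/<direction>] <URI>" shape handled:

```python
FRAMEMARKING_URN = "urn:ietf:params:rtp-hdrext:framemarking"


def framemarking_id(sdp_lines):
    """Return the negotiated extension ID for frame marking, or None."""
    prefix = "a=extmap:"
    for line in sdp_lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split()
        if len(fields) >= 2 and fields[1] == FRAMEMARKING_URN:
            # The ID may carry a direction suffix such as "3/sendonly".
            return int(fields[0].split("/")[0])
    return None


ext_id = framemarking_id(
    ["a=extmap:3 urn:ietf:params:rtp-hdrext:framemarking"]
)
```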
> 
> 3.4.  Usage Considerations
> 
> The switching of video streams is recommended to be done this way:
> 
>   When an RTP switch wants to forward a new video stream to a receiver,
>   it is RECOMMENDED to select the new video stream from the first
>   switching point with the I (Independent) bit set in all spatial
>   layers and forward the same.  An RTP switch can request a media
>   source to generate a switching point by sending Full Intra Request
>   (RTCP FIR) as defined in [RFC5104], for example.
> 
> This is difficult to implement in general, as it requires the switch to
> keep track of all the layer IDs that have been seen, then look ahead in
> the stream to see if, over a narrow range of time, all of the layers
> that have been seen have packets with I set.  If the fundamental purpose
> of I is to signal the best points to switch streams, it would be better
> to define its semantics to be that.  E.g., "If a switch intends to start
> forwarding a video stream, and within that stream, transmitting all
> frames with TID and LID less than or equal to certain values, it should
> start forwarding the stream beginning with a packet within that layer
> that has I set."  That is, I signals that at this point, the coming
> frames of this layer and all layers with lesser TID/LID can be decoded
> without dependency on any previous frames.
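
Under the semantics Dale proposes, the switch no longer needs to track every layer it has seen. A rough Python sketch, assuming packets arrive as (tid, lid, i_bit) tuples (an invented representation) and reading "within that layer" as any layer at or below the target:

```python
def first_switch_point(packets, max_tid: int, max_lid: int):
    """Index of the first packet in the target layers with I set, else None.

    packets: sequence of (tid, lid, i_bit) tuples in arrival order
    (hypothetical shape).
    """
    for idx, (tid, lid, i_bit) in enumerate(packets):
        if tid <= max_tid and lid <= max_lid and i_bit:
            return idx
    return None


arrivals = [(0, 0, 0), (2, 1, 1), (1, 0, 1), (0, 0, 0)]
start = first_switch_point(arrivals, max_tid=1, max_lid=0)
```

A None result corresponds to the case where the switch has seen no usable switching point yet and might request one, for example with RTCP FIR as the quoted text mentions.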
> 
> 3.4.2.  Scalability Structures
> 
> It would be more effective to state that for "complex or irregular
> scalability structures", subdivision by TID and LID is not effective and
> so such structures should mark all packets with TID=0 and LID=0.  The
> current text suggests that the switch is required to know whether such a
> structure is in use, and if so, ignore the TID and LID fields, which
> suggests that the sender can put various values in those fields.  This
> would lead to requiring the switch to know what encoding is in use, and
> avoiding that is the point of this document.
> 
> Dale