Re: [AVTCORE] WG Last Call: "Frame Marking RTP Header Extension"

Bernard Aboba <bernard.aboba@gmail.com> Sat, 05 December 2020 07:30 UTC

Message-ID: <CAOW+2ds+pgpG8cd+iZJpvhMsu5Q77zAmNf9C3Dycx4TpnVfpiA@mail.gmail.com>
To: IETF AVTCore WG <avt@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/SznfLrr7YorwYjPEYXdlH5AU4VA>

Here are my comments.

Overall, I think the document needs to be clearer about its goals. For
example, even handling temporal scalability in a codec-agnostic way may not
be easily achieved; implementers have indicated that peculiarities of the
VP8 RTP payload format, described in RFC 7741 Section 4.2, require parsing
(and rewriting) the VP8 payload descriptor.

Section 1

The goal is
to provide a set of streams back to the participants which enable
them to render the right media content. In a simple video
configuration, for example, the goal will be that each participant
sees and hears just the active speaker. In that case, the goal of
the switch is to receive the voice and video streams from each
participant, determine the active speaker based on energy in the
voice packets, possibly using the client-to-mixer audio level RTP
header extension [RFC6464 <https://tools.ietf.org/html/rfc6464>],
and select the corresponding video stream
for transmission to participants; see Figure 1.

[BA] Is the goal only to switch to the active speaker? Most SFUs now
attempt to do more than this, such as to select an operating point
based on the available bandwidth of each participant.
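
To make "selecting an operating point" concrete, here is a minimal sketch
of what an SFU might do per receiver. The OperatingPoint type, its fields,
and the cumulative bitrate figures are hypothetical and come from neither
the draft nor any particular SFU.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical description of one forwardable layer combination."""
    tid: int          # temporal layer ID
    lid: int          # spatial/quality layer ID
    bitrate_bps: int  # assumed cumulative bitrate needed up to this point


def select_operating_point(points: Sequence[OperatingPoint],
                           available_bps: int) -> OperatingPoint:
    """Pick the highest-bitrate point that fits the receiver's bandwidth.

    Falls back to the cheapest point when nothing fits, so the receiver
    still gets some video.
    """
    if not points:
        raise ValueError("at least one operating point is required")
    fitting = [p for p in points if p.bitrate_bps <= available_bps]
    if fitting:
        return max(fitting, key=lambda p: p.bitrate_bps)
    return min(points, key=lambda p: p.bitrate_bps)
```

A real SFU would also apply hysteresis and wait for a suitable switch
point before changing what it forwards; the sketch covers the selection
step only.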

o Because of inter-frame dependencies, it should ideally switch
video streams at a point where the first frame from the new
speaker can be decoded by recipients without prior frames, e.g
switch on an intra-frame.

[BA] Rather than "switching video streams", it seems to me that we are
really talking about "switching operating points".

If so, it should be noted that upswitch points can exist outside of an
intra-frame.
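
As an illustration of an upswitch point that is not an intra-frame, the
sketch below assumes the SFU has already extracted the I (independent) and
B (base layer sync) bits and the TID from the header extension. The
FrameInfo type and the function name are hypothetical, and the rule
reflects my reading of those bits rather than normative text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameInfo:
    """Hypothetical per-frame view of the frame marking fields."""
    independent: bool      # I bit: decodable without prior frames
    base_layer_sync: bool  # B bit: depends only on the base temporal layer
    tid: int               # temporal layer ID of this frame


def is_upswitch_point(frame: FrameInfo, current_tid: int) -> bool:
    """Return True if forwarding may move to a higher layer at this frame.

    Assumes the receiver has been getting every frame with TID 0, which
    holds whenever it is receiving the stream at all.
    """
    if frame.independent:
        return True  # an intra-frame always works
    # A higher-layer frame that depends only on TID 0 is decodable by a
    # receiver that has so far been sent a lower temporal layer.
    return frame.tid > current_tid and frame.base_layer_sync
```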

o Furthermore, it is highly desirable to do this in a payload
format-agnostic way which is not specific to each different video
codec. Most modern video codecs share common concepts around
frame types and other critical information to make this codec-
agnostic handling possible.

[BA] Are we sure that this goal is achievable, with framemarking or a
successor RTP header extension?

Perhaps the goal should be reset.

By providing meta-information about the RTP streams outside the
encrypted media payload, an RTP switch can do codec-agnostic
selective forwarding without decrypting the payload.

[BA] Based on some of the peculiarities of codecs such as VP8, it appears
that "codec-agnostic forwarding" is difficult.

Overall, it seems to me that Section 1 needs to contain an
applicability statement.

Section 3.3.4 VP8 LID mapping

[BA] Implementers have reported that framemarking is not suitable for
dealing with VP8 temporal scalability. The problem is due to the
following peculiarity noted in RFC 7741 Section 4.2:

PictureID: 7 or 15 bits (shown left and right, respectively, in
Figure 2) not including the M bit. This is a running index of
the frames, which MAY start at a random value, MUST increase by
1 for each subsequent frame, and MUST wrap to 0 after reaching
the maximum ID (all bits set). The 7 or 15 bits of the
PictureID go from most significant to least significant,
beginning with the first bit after the M bit. The sender
chooses a 7- or 15-bit index and sets the M bit accordingly.
The receiver MUST NOT assume that the number of bits in
PictureID stays the same through the session. Having sent a
7-bit PictureID with all bits set to 1, the sender may either
wrap the PictureID to 0 or extend to 15 bits and continue
incrementing.

The problem is that the PictureID "MUST increase by 1 for each subsequent
frame". This means that an SFU may need to rewrite the PictureID field, so
as to compensate for the frames that it does not forward.
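
A rough sketch of the rewriting an SFU ends up doing is shown below. The
byte offsets follow the payload descriptor layout in RFC 7741 Section 4.2
as I understand it (X bit in the first octet, I bit in the extension
octet, M bit ahead of the PictureID). The class and function names are
hypothetical, and the sketch assumes in-order delivery and ignores a
sender that changes PictureID width mid-session.

```python
from typing import Optional


class PictureIdRewriter:
    """Maps the PictureIDs of forwarded frames onto a gap-free sequence.

    Call map() only for packets the SFU actually forwards. Packets of the
    same frame carry the same incoming ID and get the same outgoing ID.
    """

    def __init__(self) -> None:
        self._last_in: Optional[int] = None
        self._last_out: Optional[int] = None

    def map(self, incoming_id: int) -> int:
        if incoming_id == self._last_in:
            return self._last_out  # another packet of the same frame
        if self._last_out is None:
            out = incoming_id  # first forwarded frame keeps its ID
        else:
            out = (self._last_out + 1) & 0x7FFF  # increase by 1, wrap at 15 bits
        self._last_in, self._last_out = incoming_id, out
        return out


def rewrite_picture_id(payload: bytearray, new_id: int) -> bool:
    """Overwrite the PictureID in place; return False if none is present."""
    if len(payload) < 3 or not payload[0] & 0x80:  # X bit: extension present
        return False
    if not payload[1] & 0x80:                      # I bit: PictureID present
        return False
    if payload[2] & 0x80:                          # M bit: 15-bit PictureID
        if len(payload) < 4:
            return False
        payload[2] = 0x80 | ((new_id >> 8) & 0x7F)
        payload[3] = new_id & 0xFF
    else:                                          # 7-bit PictureID
        payload[2] = new_id & 0x7F
    return True
```

For example, if frames 10 through 14 arrive and the SFU drops 11 and 13,
mapping the forwarded frames 10, 12, and 14 yields 10, 11, and 12. Even
this small amount of work requires the SFU to read the payload, which is
unavailable when the payload is encrypted end to end.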

Note that this issue is *not* unique to this specification, but will
also occur with other frame forwarding RTP header extensions such as
the Dependency Descriptor (DD)
<https://aomediacodec.github.io/av1-rtp-spec/#dependency-descriptor-rtp-header-extension>.

If the goal is to be able to handle VP8 temporal scalability without
requiring the SFU to parse the VP8 Payload Descriptor, it seems that
you would need to include the PictureID in this (or another) RTP
header extension, so as to allow the SFU to modify it.

This is somewhat ugly because it implies that the receiver will need
to trust the modified PictureID instead of the PictureID that it
receives in the VP8 payload descriptor.
