Re: [AVTCORE] WG Last Call: "Frame Marking RTP Header Extension"

Bernard Aboba, Sat, 05 December 2020 07:30 UTC

From: Bernard Aboba
Date: Fri, 4 Dec 2020 23:30:39 -0800
To: IETF AVTCore WG

Here are my comments.

Overall, I think the document needs to be clearer about its goals. For
example, even handling temporal scalability in a codec-agnostic way may not
be easily achieved; implementers have indicated that peculiarities of the
VP8 RTP Payload, described in RFC 7741 Section 4.2, require parsing (and
rewriting) the VP8 payload descriptor.

Section 1

   The goal is
   to provide a set of streams back to the participants which enable
   them to render the right media content.  In a simple video
   configuration, for example, the goal will be that each participant
   sees and hears just the active speaker.  In that case, the goal of
   the switch is to receive the voice and video streams from each
   participant, determine the active speaker based on energy in the
   voice packets, possibly using the client-to-mixer audio level RTP
   header extension [RFC6464], and select the corresponding video stream
   for transmission to participants; see Figure 1.

[BA] Is the goal only to switch to the active speaker? Most SFUs now
attempt to do more than this, such as to select an operating point
based on the available bandwidth of each participant.
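
To illustrate what "selecting an operating point based on available
bandwidth" could look like, here is a minimal sketch. The function name,
the per-layer cumulative bitrates, and the fallback to the lowest
operating point are all assumptions made for illustration; they are not
taken from the draft or from any SFU implementation.

```python
def select_operating_point(cumulative_kbps, available_kbps):
    """Pick the highest operating point whose cumulative bitrate fits.

    cumulative_kbps: hypothetical list of bitrates, sorted ascending, where
    entry i is the total bitrate needed to forward operating point i
    (that layer plus everything it depends on).
    Returns index 0 when nothing fits, i.e. the base layer is always sent.
    """
    chosen = 0
    for index, rate in enumerate(cumulative_kbps):
        if rate <= available_kbps:
            chosen = index
    return chosen
```

An SFU doing this per participant would forward different subsets of the
same sender's frames to different receivers, which goes beyond
active-speaker switching.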

   o  Because of inter-frame dependencies, it should ideally switch
      video streams at a point where the first frame from the new
      speaker can be decoded by recipients without prior frames, e.g
      switch on an intra-frame.

[BA] Rather than "switching video streams", it seems to me that we are
really talking about "switching operating points".

If so, it should be noted that upswitch points can exist outside of an

   o  Furthermore, it is highly desirable to do this in a payload
      format-agnostic way which is not specific to each different video
      codec.  Most modern video codecs share common concepts around
      frame types and other critical information to make this codec-
      agnostic handling possible.

[BA] Are we sure that this goal is achievable, with framemarking or a
successor RTP header extension?

Perhaps the goal should be reset.

   By providing meta-information about the RTP streams outside the
   encrypted media payload, an RTP switch can do codec-agnostic
   selective forwarding without decrypting the payload.

[BA] Based on some of the peculiarities of codecs such as VP8, it appears
that "codec-agnostic forwarding" is difficult.

Overall, it seems to me that Section 1 needs to contain an
applicability statement.

Section 3.3.4 VP8 LID mapping

[BA] Implementers have reported that framemarking is not suitable for
dealing with VP8 temporal scalability. The problem is due to the
following peculiarity noted in RFC 7741 Section 4.2:

      PictureID:  7 or 15 bits (shown left and right, respectively, in
         Figure 2) not including the M bit.  This is a running index of
         the frames, which MAY start at a random value, MUST increase by
         1 for each subsequent frame, and MUST wrap to 0 after reaching
         the maximum ID (all bits set).  The 7 or 15 bits of the
         PictureID go from most significant to least significant,
         beginning with the first bit after the M bit.  The sender
         chooses a 7- or 15-bit index and sets the M bit accordingly.
         The receiver MUST NOT assume that the number of bits in
         PictureID stays the same through the session.  Having sent a
         7-bit PictureID with all bits set to 1, the sender may either
         wrap the PictureID to 0 or extend to 15 bits and continue
         incrementing.

The problem is that the PictureID "MUST increase by 1 for each subsequent
frame". This means that an SFU may need to rewrite the PictureID field, so
as to compensate for the frames that it does not forward.
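
The compensation described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not how any particular SFU
does it: the class name, the `forward` flag (the SFU's own decision about
whether a frame is sent on), and the choice to keep the first forwarded
PictureID unchanged are all hypothetical.

```python
class PictureIdRewriter:
    """Renumber PictureIDs so the frames actually forwarded stay consecutive.

    Hypothetical helper for illustration only.
    """

    def __init__(self, bits=15):
        # 7 or 15 bits, matching the M bit used on the outgoing stream.
        self.modulus = 1 << bits
        self.cur_in = None   # incoming PictureID of the frame being forwarded
        self.cur_out = None  # PictureID written for that frame

    def rewrite(self, picture_id, forward):
        """Return the outgoing PictureID, or None if the frame is dropped."""
        if not forward:
            return None  # dropped frames consume no outgoing PictureID
        if picture_id == self.cur_in:
            return self.cur_out  # another packet of the same frame
        self.cur_in = picture_id
        if self.cur_out is None:
            self.cur_out = picture_id % self.modulus  # first forwarded frame
        else:
            self.cur_out = (self.cur_out + 1) % self.modulus  # +1, wrap to 0
        return self.cur_out
```

For example, with a 7-bit PictureID and incoming frames 126, 127, 0, 1, 2
of which 127 and 1 are dropped, the forwarded frames would carry 126, 127
and 0.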

Note that this issue is *not* unique to this specification, but will
also occur with other frame forwarding RTP header extensions such as
the Dependency Descriptor (DD).

If the goal is to be able to handle VP8 temporal scalability without
requiring the SFU to parse the VP8 Payload Descriptor, it seems that
you would need to include the PictureID in this (or another) RTP
header extension, so as to allow the SFU to modify it.
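
For reference, this is roughly the parsing and rewriting an SFU has to do
today. The sketch follows the payload descriptor layout in RFC 7741
Section 4.2 (X bit in the first octet, I bit in the extension octet, M bit
in front of the PictureID); the function names are hypothetical, and it
assumes the SFU has access to the unencrypted RTP payload.

```python
def read_picture_id(payload):
    """Return (picture_id, bits), or None if the descriptor has no PictureID."""
    if len(payload) < 1:
        raise ValueError("empty VP8 payload")
    if not payload[0] & 0x80:        # X bit clear: no extension octet
        return None
    if len(payload) < 2:
        raise ValueError("truncated VP8 payload descriptor")
    if not payload[1] & 0x80:        # I bit clear: no PictureID
        return None
    if len(payload) < 3:
        raise ValueError("truncated VP8 payload descriptor")
    if payload[2] & 0x80:            # M bit set: 15-bit PictureID
        if len(payload) < 4:
            raise ValueError("truncated VP8 payload descriptor")
        return ((payload[2] & 0x7F) << 8) | payload[3], 15
    return payload[2] & 0x7F, 7      # M bit clear: 7-bit PictureID


def write_picture_id(payload, new_id):
    """Return a copy of payload with the PictureID replaced, width unchanged."""
    parsed = read_picture_id(payload)
    if parsed is None:
        return bytes(payload)
    _, bits = parsed
    out = bytearray(payload)
    if bits == 15:
        out[2] = 0x80 | ((new_id >> 8) & 0x7F)  # keep M bit set
        out[3] = new_id & 0xFF
    else:
        out[2] = new_id & 0x7F
    return bytes(out)
```

Carrying the PictureID in a header extension would move this logic out of
the codec-specific payload, at the cost noted in the next paragraph.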

This is somewhat ugly because it implies that the receiver will need
to trust the modified PictureID instead of the PictureID that it
receives in the VP8 payload descriptor.