[AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-avtcore-multi-party-rtt-mix-18: (with DISCUSS and COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Wed, 19 May 2021 02:53 UTC

Benjamin Kaduk has entered the following ballot position for
draft-ietf-avtcore-multi-party-rtt-mix-18: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)

Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:


I'm not sure I understand how the examples are consistent with the main
specification, so let's please discuss it to either un-confuse me or fix
the document.

Section 3.9 seems to say that the oldest (source or redundant) text at
the mixer takes priority when there is text from more than one source
waiting to be sent, but the examples in Section 3.21 seem to show (e.g.)
text received from A at time 20400 that is to be sent as redundancy,
being sent after text from B received at time 20500 (sent as primary).
Is the intent that if there is any primary text, the oldest primary text
is sent first, and only if there is no outstanding primary text do we
consider the redundant text?
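
As a sketch of the two readings in question (not text from the draft), the
following Python compares a literal "oldest text wins" rule with a "primary
first" rule. The `Pending` type and both function names are hypothetical; only
the times 20400 and 20500 come from the Section 3.21 example.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    source: str
    received_at: int  # time the text was received at the mixer
    redundant: bool   # True if the text has already been sent once


def oldest_overall(pending):
    """Literal reading of Section 3.9: oldest text wins, primary or redundant."""
    return min(pending, key=lambda p: p.received_at).source


def primary_first(pending):
    """Alternative reading: oldest primary text wins; redundancy only if no primary."""
    primary = [p for p in pending if not p.redundant]
    pool = primary or pending
    return min(pool, key=lambda p: p.received_at).source


pending = [Pending("A", 20400, True), Pending("B", 20500, False)]
print(oldest_overall(pending))  # A
print(primary_first(pending))   # B
```

Under the literal reading A's redundant text goes first; the Section 3.21
example appears to match the second function instead.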

In a related vein, Section 3.10 says that a packet is sent when (among
other things) "330 ms has passed since already transmitted text was
queued for transmission as redundant text".  But that doesn't say
anything about the timer being reset by subsequent transmission or
queuing of redundant text, so I'm not sure how in the Section 3.21
example, we say that transmitting B1 and B2 as redundancy was planned as
330 ms after packet 105 -- the original B2 was sent in packet 104, so
shouldn't the 330ms start from packet 104's transmission?  (The stated
time for this seems to match 330ms after 104, so maybe the "105" is just
a typo?)
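
To make the two timer interpretations concrete, here is a minimal sketch. The
330 ms value is from the quoted Section 3.10 text; the transmission times for
packets 104 and 105 are made-up illustrative numbers, not values from the
draft, and the function names are hypothetical.

```python
REDUNDANCY_INTERVAL_MS = 330  # from the Section 3.10 text quoted above


def deadline_without_reset(queued_at_ms):
    """Timer runs from the first time text was queued as redundancy."""
    return min(queued_at_ms) + REDUNDANCY_INTERVAL_MS


def deadline_with_reset(queued_at_ms):
    """Timer restarts on every later transmission or queuing event."""
    return max(queued_at_ms) + REDUNDANCY_INTERVAL_MS


# Hypothetical transmission times in ms, for illustration only.
sent_at = {"packet 104": 1000, "packet 105": 1100}
times = list(sent_at.values())

print(deadline_without_reset(times))  # 1330, i.e. 330 ms after packet 104
print(deadline_with_reset(times))     # 1430, i.e. 330 ms after packet 105
```

The Section 3.10 wording reads like the first function, while the "330 ms
after packet 105" remark in the example reads like the second.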

I also left a note in the comment that there's a remark about "lower
security level" in Section 3.19 that's not really accurate; we should
resolve that in some manner before the document proceeds.


The abstract is perhaps pushing the boundary of length reasonable for an
abstract.

There were a couple interesting remarks in the shepherd writeup:

% The specification has not been implemented yet, so it is possible that
% issues could arise in implementation. This is more of a concern than
% for typical AVTCORE documents, since this specification is likely to
% become a regulatory requirement prior to advancing beyond Proposed
% Standard.

Are there still no implementations?  Are we happy with publishing the
specification at this time in the absence of implementations?

% During review, the question was raised as to whether the specification
% will require development of an RTT mixer, or whether it could be made
% compatible with existing conferencing servers implementing Selective
% Forwarding.

What was the outcome of the discussion?  Should that be reflected in the
document?


   mixer model.  The possibility to implement the solution in a wide
   range of existing RTP implementations made the RTP-mixer model be
   selected to be fully specified in this document.

It's a little surprising to see this claim given the absence (per the
shepherd writeup) of any actual implementations.

Section 1.2

   Multiple sources per packet
      A new "text" media subtype would be specified with up to 15
      sources in each packet.  The mechanism would make use of the RTP
      mixer model specified in RTP [RFC3550].  Text from up to 15
      sources can be included in each packet.  [...]

(How was the "15" number determined?)

Section 2.3.2

   A party receiving an offer containing the "rtt-mixer" SDP attribute
   and being willing to use the RTP-mixer-based method of this
   specification for sending or receiving or both sending and receiving
   SHALL include the "rtt-mixer" SDP attribute in the corresponding
   "text" media section in the answer.

This requirement doesn't quite seem to match up with what I expect -- an
answerer that's willing to use rtt-mixer and also willing to use
something else seems to still be bound by the "SHALL include" in the
first paragraph, which makes the willingness to use something else a bit
irrelevant and precludes choosing the other option.  Perhaps we want to
say only "chooses to use the RTP-mixer-based method of this

Section 3.2

What purpose does the initial BOM serve?  I note that, e.g., RFC 5198
has an explicit BOM "MUST NOT appear at the beginning of these text
strings" and that RFC 4103 specifies UTF-8 encoding of the text.
I see in Section 3.17.4 (and 4.2.1) we mention that it might be used for
keepalive, but in rtt-mix don't we have lots of non-BOM keepalive

Section 3.4

   If the "CPS" value is reached, longer transmission intervals SHALL be
   applied and only as much of the text queued for transmission SHALL be
   sent at the end of each transmission interval that can be allowed
   without exceeding the "CPS" value, until the transmission rate falls
   under the "CPS" value again.  [...]

This doesn't seem as precisely specified as it could be, given that the
CPS rate is supposed to be enforced over "any 10-second interval".  As
written, this seems to suggest that the entire 10-second history of
packet size/spacing needs to be retained, so that at each transmission
the earliest time for next transmission can be computed that retains the
CPS limit.  It's not clear that there's real need for such a complicated
solution vs something that more bluntly backs off the transmit rate and
uses bucketed averages for tracking the transmission rate.
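
For illustration, a bucketed tracker of the kind alluded to above could look
roughly like the following. This is a sketch under assumptions, not something
the draft specifies: the class and method names are hypothetical, one-second
buckets are an arbitrary choice, and 30 is used as the "cps" value because it
is the RFC 4103 default.

```python
class BucketedCpsLimiter:
    """Approximate a CPS limit over a 10-second window with 1-second buckets."""

    def __init__(self, cps=30, window_s=10):
        self.budget = cps * window_s  # characters allowed per window
        self.window_s = window_s
        self.buckets = {}             # whole second -> characters sent

    def _total(self, now_s):
        sec = int(now_s)
        # Forget buckets that have fallen out of the window.
        self.buckets = {s: n for s, n in self.buckets.items()
                        if s > sec - self.window_s}
        return sum(self.buckets.values())

    def allowance(self, now_s):
        """Characters that may still be sent at time now_s."""
        return max(0, self.budget - self._total(now_s))

    def record(self, now_s, chars):
        """Account for chars characters transmitted at time now_s."""
        sec = int(now_s)
        self.buckets[sec] = self.buckets.get(sec, 0) + chars


limiter = BucketedCpsLimiter(cps=30)
limiter.record(0.0, 250)
print(limiter.allowance(0.5))   # 50
limiter.record(5.2, 50)
print(limiter.allowance(9.9))   # 0
print(limiter.allowance(10.0))  # 250
```

Only ten counters are kept instead of a per-packet history, at the cost of
enforcing the limit on bucket boundaries rather than over every possible
10-second interval.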

(I have no idea why it's 330ms for a mixer and 300ms for a non-mixer,
but assume there is some reason for the difference.)

Section 3.6

   Text received by a mixer from a participant SHOULD NOT be included in
   transmission from the mixer to that participant, because the normal
   behavior of the endpoint is to present locally-produced text locally.

When would the SHOULD NOT be ignored?  (How might a mixer know that the
other endpoint is not using the "normal behavior" of presenting
locally-produced text locally?)

Section 3.7

   A mixer SHALL handle reception, recovery from packet loss, deletion
   of superfluous redundancy, marking of possible text loss and deletion
   of 'BOM' characters from each participant before queueing received
   text for transmission to receiving participants.

Are there specific references available for each of these operations?

Section 3.9

   The source with the oldest text received in the mixer or oldest
   redundant text SHALL be next in turn to get all its available unsent
   text transmitted.  Any redundant repetitions of earlier transmitted

Just to confirm: this is really *all* its available unsent text, not
just however much will fit in one packet/flight/etc.?  Can a participant
"hog the mic" by continuing to append to that list even as transmission
has commenced?

Section 3.13

It took me a bit of searching to realize that it is RFC 2198 that
specifies the additional header that includes the "timestamp offset"
field.  A specific reference here (or maybe from an earlier section?)
would have helped me out.

Section 3.14

   The SSRC of the mixer for the RTT session SHALL be inserted in the
   SSRC field of the RTP header.

As written, this could be taken to say that the non-mixer endpoint
should also use the SSRC of the mixer.

Section 3.16

   Confidentiality SHALL be considered when composing these fields.

I think "privacy considerations" would be more relevant than
"confidentiality".

   Similar considerations SHALL be taken as for other media.

This seems rather vague and it's not really clear how the implementor is
supposed to take action based on it.  (Note that media are typically
straight over (S)RTP, but these reports are (S)RTCP, which admittedly is
also over (S)RTP, but is still different.)

Section 3.17.2

   If it is known that only one source is active in the RTP session,
   then it is likely that a gap equal to or larger than the agreed
   number of redundancy generations (including the primary) causes text
   loss.  [...]

Some more care in description may be needed here, as the gap in RTP
sequence numbers is measured in the RTP sequence units (e.g., time), but
the redundancy generation number is just a dimensionless generation
count.  We need to assume the max inter-packet spacing in order to
convert that into a time value that is suitable for assuming loss.

   evaluate if three or more packets were lost within one second.  If
   this simple method is used, then a t140block SHOULD be created with a
   marker for possible text loss [T140ad1] and associated with the SSRC
   of the transmitter as a general input from the mixer.

Does "input from the mixer" mean that it uses the mixer's SSRC value?
Or is this injected by the mixer (in contrast to the previous paragraph,
where it was the receiver that injects the marker for possible text
loss)?

Section 3.17.3

   If the packet is not the first packet from a source, then if the
   second generation redundant data is available, its timestamp SHALL be
   created by subtracting its timestamp offset from the RTP timestamp.
   If the resulting timestamp is later than the latest retrieved data
   from the same source, then the redundant data SHALL be retrieved and
   appended to the receive buffer.  The process SHALL be continued in
   the same way for the first generation redundant data.  After that,
   the primary data SHALL be retrieved from the packet and appended to
   the receive buffer for the source.

I think I can come up with reordering scenarios that cause this
procedure to discard data that would otherwise have recovered from loss.

Also, this procedure as written says that the primary data shall always
be appended to the receive buffer (with no time check), which could
result in doubled content in the case of reordering.
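
The doubled-content case can be sketched as follows. The code is a simplified
model of the quoted procedure for a single source, not the draft's algorithm:
the packet tuples, timestamps and the `check_primary` switch are all
hypothetical, and since first-packet handling is not quoted above, the sketch
simply starts from an empty buffer.

```python
def receive(packets, check_primary=False):
    """Rebuild one source's text from packets given in arrival order.

    Each packet is (rtp_timestamp, [(offset, text), ...], primary_text),
    with the redundant generations listed oldest first.
    """
    buffer = []
    latest = -1  # timestamp of the latest text retrieved so far
    for ts, redundancy, primary in packets:
        for offset, text in redundancy:
            red_ts = ts - offset
            if red_ts > latest:  # time check from the quoted procedure
                buffer.append(text)
                latest = red_ts
        # As quoted, the primary is appended with no time check at all.
        if not check_primary or ts > latest:
            buffer.append(primary)
            latest = max(latest, ts)
    return "".join(buffer)


p1 = (1000, [], "a")
p2 = (1300, [(300, "a")], "b")
p3 = (1600, [(600, "a"), (300, "b")], "c")

print(receive([p1, p2, p3]))                      # abc
print(receive([p1, p3, p2]))                      # abcb
print(receive([p1, p3, p2], check_primary=True))  # abc
```

When packet 2 arrives after packet 3, "b" has already been recovered from
redundancy, so appending packet 2's primary unconditionally repeats it.
Applying the same time check to the primary is one possible remedy, shown here
only as an illustration.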

Section 3.19

   Security SHOULD be applied when possible regarding the capabilities
   of the participating devices by use of SIP over TLS by default

"Security" is not some all-encompassing attribute that can be
generically applied; there are specific security properties that may or
may not be achieved by any given mechanism, and it's generally worth
being precise about what properties are (or are not) achieved.  So here
we might say "security mechanisms to provide confidentiality and
integrity protection and peer authentication SHOULD be applied".  We
cannot in general achieve source authenticity with just SRTP when a
mixer is involved, though RFC 8723 does specify a double-encryption
mechanism that applies in some cases when there is a central media
distributor.

   applications where legacy endpoints without security may exist, a
   negotiation SHOULD be performed to decide if encryption on the media
   level will be applied.  [...]

How would endpoints know if legacy endpoints might exist?

How would this negotiation be performed?

   security levels.  The mixing for conference-unaware endpoints has
   lower security level than the mixing method for conference-aware
   endpoints, because there may be an opportunity for a malicious mixer
   or a middleman to masquerade the source labels accompanying the text
   streams in text format.  This is especially true if support of un-
   encrypted SIP and media is supported because of lack of such support
   in the target endpoints.  However, the mixing for conference-aware
   endpoints as specified here also requires that the mixer can be
   trusted.  [...]

As the last sentence indicates, the provided reasoning in the first
sentence is not really accurate, since the mixer could just as easily
adjust the CSRC value in the header as change the label in the in-band
text stream.  This does not inherently invalidate the claim that there
are different security levels, though, as the correct behavior of the
mixer seems easier to independently validate in the conference-aware
endpoint case (with the well-formed RTP payloads providing information
that can be validated out-of-band with other participants).  But I don't
think this description should be left in the document as-is; it doesn't
seem accurate.

In the case of unencrypted media, it does seem technically true that it
is easier for a non-mixer middleman to masquerade the source labels,
since in that case it only adjusts the payload directly without needing
to keep state on the RTP sender information and produce well-formed RTP
headers after its adjustment.  But this is only a modest level of
additional difficulty and does not reflect any kind of effective
security control, so it may not be worth mentioning at all.

   End-to-end encryption would require further work and could be based
   on WebRTC as specified in Section 1.2.

Is RFC 8723 not applicable to these scenarios at all?  I do not think it
is WebRTC-specific.

Section 3.21

          Transmission of A2 and A3 as redundancy is planned for 330 ms
   after packet 101 if no new text from A is ready to be sent before

I thought new text from *anyone* would trigger sending A2 and A3 as
redundancy, per §3.9.

Is there any reason why the dummy offsets in 104 are 300/600 but the
dummy offsets in 103 are 330/660?

Section 4.2

               In order to expedite source switching, a user can, for
   example, end its turn with a new line.

How would a user know that there is a legacy endpoint in the conversation
so as to choose to end its turn in this way?

Section 4.2.2

   *  A long pause (e.g., > 10 seconds) in received text from the
      currently transmitted source

   *  If text from one participant has been transmitted with text from
      other sources waiting for transmission for a long time (e.g., > 1
      minute) and none of the other suitable points for switching has

I think I'm confused how we could hit the 1 minute timer before the 10
seconds timer has triggered.

Section 11

I think that the obvious attacks involving control characters are
addressed in one way or another, but it might be worth a reminder to
implementors that control characters should not be allowed to let one
user's content affect the display of other users' content, or the
presentation-format's label of the sender, etc.
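
As an example of the kind of filtering such a reminder might point at, the
sketch below strips control and format characters from a sender label before
it is placed in the presentation. It is an assumption about one reasonable
approach, not behavior the draft mandates; `sanitize_label` is a hypothetical
name, and dropping the Unicode categories Cc and Cf is a deliberately blunt
policy.

```python
import unicodedata


def sanitize_label(label):
    """Drop control (Cc) and format (Cf) characters from a sender label."""
    return "".join(ch for ch in label
                   if unicodedata.category(ch) not in ("Cc", "Cf"))


# A label that tries to start a new line and fake another sender's label.
print(sanitize_label("Bob\r\n[Alice] hi"))  # Bob[Alice] hi
# A label ending in RIGHT-TO-LEFT OVERRIDE (U+202E).
print(sanitize_label("Eve\u202e"))          # Eve
```

Text content itself needs a more careful policy, since T.140 legitimately
uses some control functions such as new line and erasure.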

It might be appropriate to have yet another reminder here that SRTP is
the recommended mode of operation.

   The requirement to transfer information about the user in RTCP
   reports in SDES, CNAME, and NAME fields, and in conference
   notifications, for creation of labels may have privacy concerns as
   already stated in RFC 3550 [RFC3550], [...]

Could you point me to where this is stated in RFC 3550?  I looked in the
security considerations (section 14) and searched for all instances of
"CNAME", but didn't see discussion about SDES/CNAME/NAME being privacy

Section 13.1

RFC 8825 is not currently referenced in any context that specifically
requires it to be listed as a normative reference.  This may suggest
that it should be referenced in more places (e.g., in the discussion of
gateway considerations with WebRTC).


Section 1

   Use of RTT is increasing, and specifically, use in emergency calls is
   increasing.  Emergency call use requires multiparty mixing.  [...]

I expect the conclusion that emergency-call use requires mixing to be
non-intuitive to many readers, so additional explanation might be
helpful.

                                                                RFC 4103
   "RTP Payload for Text Conversation" mixer implementations can use
   traditional RTP functions for source identification, but the

The word "mixer" (or even "mix") does not appear in RFC 4103, so I'm not
sure how to interpret "RFC 4103 mixer implementations".  Perhaps it is
an RFC 3550 mixer acting on RFC 4103 payloads?

   The document updates [RFC4103] by introducing an attribute for
   indicating capability for the RTP-mixer-based multiparty mixing case
   and rules for source indications and interleaving of text from
   different sources.

I think "indicating capability for the RTP-mixer-based multiparty mixing
case" needs another verb.

Section 2.2

   A party acting as a mixer, which has not negotiated any method for
   true multiparty RTT handling, but negotiated a "text/red" or "text/
   t140" format in a session with a participant SHOULD in order to
   maintain interoperability, if nothing else is specified for the
   application, format transmitted text to that participant to be
   suitable to present on a multiparty-unaware endpoint as further
   specified in Section 4.2.

comma after "SHOULD".
The whole sentence is a bit long, though, and the parenthetical "if
nothing else is specified for the application" is somewhat in the way in
its current location.  Further reworking may be in order.

I'd also consider s/format transmitted text/transmit formatted text/ or
/format text transmitted/.

Section 2.3.4

   If the modified offer deletes indication of support for multiparty
   real-time text by excluding the "rtt-mixer" SDP attribute, the answer
   MUST NOT contain the "rtt-mixer" attribute, and both parties SHALL
   after processing the SDP exchange NOT send real-time text formatted
   for multiparty-aware parties according to this specification.

The BCP 14 keyword "SHALL NOT" is supposed to appear as the specific
phrase, so something like "SHALL NOT, after processing the SDP exchange,
send" seems more appropriate.

Section 3.4

   transmission MUST then be made at T140block borders.  See also
   Section 8

full stop at end of sentence.

Section 3.10

   The mixer SHALL compose and transmit an RTP packet to a receiver when
   one of the following conditions has occurred:

Maybe "one or more" (or "one or both")?

Section 3.16

   Confidentiality SHALL be considered when composing these fields.
   They contain name and address information that may be sensitive to
   transmit in its entirety e.g., to unauthenticated participants.

comma before "e.g." (as well as after)

Section 3.17

   presentation areas for each source.  Other receiver roles, such as
   gateways or chained mixers are also feasible, and requires
   consideration if the stream shall just be forwarded, or distributed

"such as gateways or chained mixers" seems like a parenthetical phrase
that should be offset by commas on both sides.

Also, "require consideration" seems to match up better with the plural
"roles".

Section 3.17.3

   When a packet is received in an RTP session using the packetization
   for multiparty-aware endpoints, its T140blocks SHALL be extracted in
   the following way.  The description is adapted to the default
   redundancy case using the original and two redundant generations.

Is this supposed to imply that the extension to the generic case with
other levels of redundancy is trivial for the reader to perform?

Section 3.18

   This solution has good performance with low text delays as long as
   the sum of characters per second during any 10-second interval sent
   from a number of simultaneously sending participants to a receiving
   participant does not reach the 'CPS' value.  [...]

"sum of characters per second during any 10-second interval" seems to
mean "compute CPS for each second, then add them up".  I don't think
that's the intended meaning.
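
A small worked example of the difference, with made-up per-second character
counts (the numbers are hypothetical and chosen only so that the arithmetic is
easy to follow):

```python
# Characters sent in each second of one 10-second interval (hypothetical).
per_second = [40, 20, 35, 25, 30, 30, 10, 50, 30, 30]

# Literal reading: add up the per-second rates.
summed_rates = sum(per_second)

# Presumably intended reading: average rate over the whole interval.
average_rate = sum(per_second) / len(per_second)

print(summed_rates)  # 300
print(average_rate)  # 30.0
```

Compared against a "cps" value of 30, the summed figure would always look far
over the limit, while the interval average sits exactly at it.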

                                   Only in large unmanaged conferences
   with a high number of participants there may on very rare occasions
   appear situations when many participants happen to send text
   simultaneously, resulting in unpleasantly jerky presentation of text
   from each sending participant.  [...]

This sentence seems a bit long and winding.

Section 3.20

      Offer example for "text/red" format including multiparty
      and security:
            a=fingerprint: (fingerprint1)

I think it would be preferred to make up some random fingerprints and
use them instead of the placeholder string (throughout).
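
One way to produce such made-up values is sketched below. It assumes the
"sha-256" hash function and the colon-separated uppercase hex format of the
SDP fingerprint attribute; the helper name is hypothetical, and the output is
random bytes for documentation purposes, not the hash of any real certificate.

```python
import secrets


def example_fingerprint_line():
    """Return an a=fingerprint line built from 32 random bytes."""
    digest = secrets.token_bytes(32)  # same length as a SHA-256 digest
    hex_pairs = ":".join(f"{b:02X}" for b in digest)
    return f"a=fingerprint:sha-256 {hex_pairs}"


print(example_fingerprint_line())
```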

Section 3.21

   offset 660 ms.  The timestamp of packet 106 minus 660 is 20500 which
   is the timestamp of packet 102 THAT was received.  So B1 does not

I don't see why "THAT" needs to be in all majuscule letters.

Section 3.22

   to transmission to a receiver.  The value MAY be modified in the
   "CPS" parameter of the FMTP attribute in the media section for the
   "text/t140" media.  [...]

RFC 4103 seems to show the lowercase "cps" parameter name (there are
subsequent occurrences that I do not quote).

Section 4.2

   one presentation area.  The mixer SHALL group text in suitable groups
   and prepare for presentation of them by inserting a new line between
   them if the transmitted text did not already end with a new line.  A

Is "new line" specified somewhere?  Up in toplevel §4 we cover the
Unicode line separator and CRLF sequences but don't use the phrase "new
line".

Section 4.2.2

   Information available to the mixer for composing the label may
   contain sensitive personal information that SHOULD not be revealed in

"SHOULD NOT" in all caps

   sessions not securely authenticated and protected.  Integrity

"confidentiality protected"

   considerations regarding how much personal information is included in
   the label SHOULD therefore be taken when composing the label.

I don't think "integrity" is the right word here.

Section 11

   Therefore, the mixer needs to be trusted to achieve security in
   confidentiality and integrity.  [...]

s/trusted to achieve security in confidentiality and integrity/trusted
to maintain confidentiality and integrity of the RTT data/

   The requirement to transfer information about the user in RTCP
   reports in SDES, CNAME, and NAME fields, and in conference
   notifications, for creation of labels may have privacy concerns as

There's something awry with the commas around "for creation of labels".