Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-avtcore-multi-party-rtt-mix-18: Answer 4.

Gunnar Hellström <> Tue, 25 May 2021 10:30 UTC

To: Benjamin Kaduk <>, The IESG <>


Continuing the answers with the last part.

On 2021-05-19 at 04:53, Benjamin Kaduk via Datatracker wrote:
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-avtcore-multi-party-rtt-mix-18: Discuss
> The document, along with other ballot positions, can be found here:

> Section 3.14
>     The SSRC of the mixer for the RTT session SHALL be inserted in the
>     SSRC field of the RTP header.
> As written, this could be taken to say that the non-mixer endpoint
> should also use the SSRC of the mixer.
[GH] I suggest changing this to:
" The SSRC header field SHALL contain the SSRC of the RTP session where 
the packet will be transmitted."
> Section 3.16
>     Confidentiality SHALL be considered when composing these fields.
> I think "privacy considerations" would be more relevant than
> "confidentiality".
[GH] Accepted. Changed to "Privacy considerations SHALL be taken..."
>     Similar considerations SHALL be taken as for other media.
> This seems rather vague and it's not really clear how the implementor is
> supposed to take action based on it.  (Note that media are typically
> straight over (S)RTP, but these reports are (S)RTCP, which admittedly is
> also over (S)RTP, but is still different.)
[GH] I propose deleting the sentence.

> Section 3.17.2
>     If it is known that only one source is active in the RTP session,
>     then it is likely that a gap equal to or larger than the agreed
>     number of redundancy generations (including the primary) causes text
>     loss.  [...]
> Some more care in description may be needed here, as the gap in RTP
> sequence numbers is measured in the RTP sequence units (e.g., time), but
> the redundancy generation number is just a dimensionless generation
> count.  We need to assume the max inter-packet spacing in order to
> convert that into a time value that is suitable for assuming loss.
[GH] The RTP sequence number steps by one for each new packet sent in 
the RTP session. It is the timestamp that increases with time.  
Real-time text is also unusual in that it can transmit with varying 
intervals and be silent when there is no new text to send. No change 
>     evaluate if three or more packets were lost within one second.  If
>     this simple method is used, then a t140block SHOULD be created with a
>     marker for possible text loss [T140ad1] and associated with the SSRC
>     of the transmitter as a general input from the mixer.
> Does "input from the mixer" mean that it uses the mixer's SSRC value?
> Or is this injected by the mixer (in contrast to the previous paragraph,
> where it was the receiver that injects the marker for possible text
> loss)?
[GH] I suggest changing "transmitter" to "RTP session" to make it clear 
that it is the mixer.
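
As an illustration of the Section 3.17.2 discussion above (not text from the draft), the following minimal sketch counts missing packets from RTP sequence numbers and compares the count with the number of generations. The function names are hypothetical, and the sketch assumes packets arrive in order from a single active source.

```python
def seq_gap(prev_seq: int, new_seq: int) -> int:
    """Number of packets missing between two received RTP sequence numbers.

    RTP sequence numbers are 16 bits wide and wrap around, so the
    difference is taken modulo 2**16. Assumes new_seq is newer than prev_seq.
    """
    return (new_seq - prev_seq - 1) % 65536


def text_loss_likely(prev_seq: int, new_seq: int, generations: int = 3) -> bool:
    """True if the gap is too large for the redundancy to cover.

    `generations` counts the primary plus the redundant generations
    (3 for the default case of primary and two redundant generations).
    """
    return seq_gap(prev_seq, new_seq) >= generations
```

With the default of three generations, two consecutive lost packets can still be recovered from redundancy, while three cannot. The count is in packets, not in time units.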
> Section 3.17.3
>     If the packet is not the first packet from a source, then if the
>     second generation redundant data is available, its timestamp SHALL be
>     created by subtracting its timestamp offset from the RTP timestamp.
>     If the resulting timestamp is later than the latest retrieved data
>     from the same source, then the redundant data SHALL be retrieved and
>     appended to the receive buffer.  The process SHALL be continued in
>     the same way for the first generation redundant data.  After that,
>     the primary data SHALL be retrieved from the packet and appended to
>     the receive buffer for the source.
> I think I can come up with reordering scenarios that cause this
> procedure to discard data that would otherwise have recovered from loss.
[GH] Would it be sufficient to insert the words "and reordering" in the 
first two sentences in the paragraph to read:
"The receiver SHALL monitor the RTP sequence numbers of the received 
packets for gaps and packets out of order.
If a sequence number gap appears and still exists after some defined 
short time for jitter and reordering resolution, the packets in the gap 
SHALL be regarded as lost."
> Also, this procedure as written says that the primary data shall always
> be appended to the receive buffer (with no time check), which could
> result in doubled content in the case of reordering.
[GH] I suggest changing the last sentence in the paragraph to "After 
that, the timestamp of the packet SHALL be compared with the timestamp 
of the latest retrieved data from the same source and if it is later, 
then the primary data SHALL be retrieved from the packet and appended to 
the receive buffer for the source."
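
The extraction procedure with the proposed timestamp check can be sketched as follows. This is only an illustration under stated assumptions: the `Block` and `SourceState` types and the `receive` function are hypothetical names, 32-bit timestamp wraparound is ignored, and the first packet from a source is assumed to be handled by the same loop.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    offset: int  # timestamp offset from the RTP timestamp (0 for primary)
    text: str    # T140block content, possibly empty


@dataclass
class SourceState:
    latest_ts: Optional[int] = None           # timestamp of latest retrieved data
    buffer: List[str] = field(default_factory=list)


def receive(state: SourceState, rtp_timestamp: int, blocks: List[Block]) -> None:
    """Process one packet; blocks are ordered oldest first:
    second generation redundancy, first generation redundancy, primary."""
    for block in blocks:
        ts = rtp_timestamp - block.offset
        # Retrieve only data that is later than what was already retrieved.
        # The same check is applied to the primary data, so a reordered
        # packet cannot cause doubled content.
        if state.latest_ts is None or ts > state.latest_ts:
            if block.text:
                state.buffer.append(block.text)
            state.latest_ts = ts
```

In this sketch, a packet that arrives late after its content was already recovered from redundancy adds nothing to the buffer.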
> Section 3.19
>     Security SHOULD be applied when possible regarding the capabilities
>     of the participating devices by use of SIP over TLS by default
> "Security" is not some all-encompassing attribute that can be
> generically applied; there are specific security properties that may or
> may not be achieved by any given mechanism, and it's generally worth
> being precise about what properties are (or are not) achieved.  So here
> we might say "security mechanisms to provide confidentiality and
> integrity protection and peer authentication SHOULD be applied".  We
> cannot in general achieve source authenticity with just SRTP when a
> mixer is involved, though RFC 8723 does specify a double-encryption
> mechanism that applies in some cases when there is a central media
> distributor.
[GH] Answered in the DISCUSS part.
>                                                               In
>     applications where legacy endpoints without security may exist, a
>     negotiation SHOULD be performed to decide if encryption on the media
>     level will be applied.  [...]
> How would endpoints know if legacy endpoints might exist?
[GH] Answered in the DISCUSS part.
> How would this negotiation be performed?
[GH] Answered in the DISCUSS part.
>     security levels.  The mixing for conference-unaware endpoints has
>     lower security level than the mixing method for conference-aware
>     endpoints, because there may be an opportunity for a malicious mixer
>     or a middleman to masquerade the source labels accompanying the text
>     streams in text format.  This is especially true if support of un-
>     encrypted SIP and media is supported because of lack of such support
>     in the target endpoints.  However, the mixing for conference-aware
>     endpoints as specified here also requires that the mixer can be
>     trusted.  [...]
> As the last sentence indicates, the provided reasoning in the first
> sentence is not really accurate, since the mixer could just as easily
> adjust the CSRC value in the header as change the label in the in-band
> text stream.  This does not inherently invalidate the claim that there
> are different security levels, though, as the correct behavior of the
> mixer seems easier to independently validate in the conference-aware
> endpoint case (with the well-formed RTP payloads providing information
> that can be validated out-of-band with other participants).  But I don't
> think this description should be left in the document as-is; it doesn't
> seem accurate.
> In the case of unencrypted media, it does seem technically true that it
> is easier for a non-mixer middleman to masquerade the source labels,
> since in that case it only adjusts the payload directly without needing
> to keep state on the RTP sender information and produce well-formed RTP
> headers after its adjustment.  But this is only a modest level of
> additional difficulty and does not reflect any kind of effective
> security control, so it may not be worth mentioning at all.
[GH] Answered in the DISCUSS part.
>     End-to-end encryption would require further work and could be based
>     on WebRTC as specified in Section 1.2.
> Is RFC 8723 not applicable to these scenarios at all?  I do not think it
> is WebRTC-specific.
[GH] Answered in the DISCUSS part.
> Section 3.21
>            Transmission of A2 and A3 as redundancy is planned for 330 ms
>     after packet 101 if no new text from A is ready to be sent before
>     that.
> I thought new text from *anyone* would trigger sending A2 and A3 as
> redundancy, per §3.9.
[GH] No, since A2 and A3 sent as redundancy can only be sent alone or 
together with new text from A, they should only be triggered to be sent 
when new text from A has arrived or the current transmission interval 
since last transmission from A has passed. I hope that is clear from 
"3.5 Only one source per packet."
> Is there any reason why the dummy offsets in 104 are 300/600 but the
> dummy offsets in 103 are 330/660?
[GH] There are dummy offsets in 102, 104 and 105. The ones which can be 
analyzed for the values are the ones in 104 and 105. They have the same 
values as they would have if there was redundancy on that level 
available for transmission at that time. Since the value is irrelevant, 
that is just one way to assign a "realistic" value.
> Section 4.2
>                 In order to expedite source switching, a user can, for
>     example, end its turn with a new line.
> How would a user know that there is a legacy endpoint in the conversation
> so as to choose to end its turn in this way?
[GH] It can be from experience or education. The first implementations 
are expected to be in emergency services, where users calling in an 
emergency may get multiparty-aware clients later than the emergency 
service call-takers. The call-takers can then be educated from the 
beginning to end their turns with a new line to expedite the switch so 
that new text from the caller is rapidly presented to another person in 
the service observing, or taking over the call. It is also quite normal 
in modern text communication, both real-time and messaging, to end a turn 
with a new line.
> Section 4.2.2
>     *  A long pause (e.g., > 10 seconds) in received text from the
>        currently transmitted source
>     *  If text from one participant has been transmitted with text from
>        other sources waiting for transmission for a long time (e.g., > 1
>        minute) and none of the other suitable points for switching has
> I think I'm confused how we could hit the 1 minute timer before the 10
> seconds timer has triggered.
[GH] If one user is generating text continuously, e.g. making a live 
transcription of someone making a speech, then it may happen that the 
text contains neither a full stop nor any pause longer than 10 seconds 
for one minute or more. In that situation it is fair that waiting text 
is allowed to break in and be given a turn.
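
To illustrate how the two timers quoted above interact, here is a minimal sketch. It covers only the pause and maximum-wait conditions, not the other switching points in the draft's list. The function name, parameters, and use of seconds are assumptions made for illustration.

```python
from typing import Optional


def should_switch(now: float,
                  last_text_from_current: float,
                  oldest_waiting_since: Optional[float],
                  pause_limit: float = 10.0,
                  max_wait: float = 60.0) -> bool:
    """Decide whether the mixer should switch to another source.

    now                     current time in seconds
    last_text_from_current  when text last arrived from the current source
    oldest_waiting_since    when the oldest waiting text from another source
                            arrived, or None if nothing is waiting
    """
    if oldest_waiting_since is None:
        return False  # nothing is waiting, no reason to switch
    if now - last_text_from_current > pause_limit:
        return True   # long pause from the currently transmitted source
    if now - oldest_waiting_since > max_wait:
        return True   # continuous typing, but others have waited too long
    return False
```

The second condition is what lets waiting text break in when a source types continuously without any pause longer than 10 seconds.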
> Section 11
> I think that the obvious attacks involving control characters are
> addressed in one way or another, but it might be worth a reminder to
> implementors that control characters should not be allowed to let one
> user's content affect the display of other users' content, or the
> presentation-format's label of the sender, etc.

[GH] I suggest inserting in section 11:

"      Participants with malicious intentions may also try to disturb 
the presentation by sending incomplete or malformed control codes. 
Handling of text from the different sources by the receivers MUST 
therefore be well separated so that the effects of such actions only 
affect text from the source causing the action. "
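
One possible way to keep sources well separated is sketched below, with one buffer per source so that an erasure from one source can never reach another source's text. This is an illustration only: the `Receiver` class is hypothetical and handles just the backspace erasure, not the full set of T.140 control codes.

```python
from collections import defaultdict

BACKSPACE = "\u0008"  # erases the last character from the same source


class Receiver:
    def __init__(self):
        # One independent character buffer per source identifier.
        self.buffers = defaultdict(list)

    def handle(self, source: str, text: str) -> None:
        buf = self.buffers[source]
        for ch in text:
            if ch == BACKSPACE:
                if buf:          # excess erasures are ignored
                    buf.pop()
            else:
                buf.append(ch)

    def text(self, source: str) -> str:
        return "".join(self.buffers[source])
```

Here a flood of erasures from source B leaves the text from source A untouched.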

> It might be appropriate to have yet another reminder here that SRTP is
> the recommended mode of operation.
[GH] I suggest inserting the following in section 11:
"As already stated in <xref target="security2" format="default"/>, 
security in media SHOULD be applied by using DTLS-SRTP <xref 
target="RFC5764" format="default"/> on the media level."
>     The requirement to transfer information about the user in RTCP
>     reports in SDES, CNAME, and NAME fields, and in conference
>     notifications, for creation of labels may have privacy concerns as
>     already stated in RFC 3550 [RFC3550], [...]
> Could you point me to where this is stated in RFC 3550?  I looked in the
> security considerations (section 14) and searched for all instances of
> "CNAME", but didn't see discussion about SDES/CNAME/NAME being privacy
> sensitive.
[GH] In RFC 3550, sentences 2 and 3, it is said:

"For example, an impostor can fake source or destination
    network addresses, or change the header or payload.  Within RTCP, the
    CNAME and NAME information may be used to impersonate another
    participant."

> Section 13.1
> RFC 8825 is not currently referenced in any context that specifically
> requires it to be listed as a normative reference.  This may suggest
> that it should be referenced in more places (e.g., in the discussion of
> gateway considerations with WebRTC).
[GH] I think it is sufficient to reference RFC 8865 in the WebRTC 
gateway section. Therefore my proposal is to move the reference to RFC 
8825 to the informative references.
> Section 1
>     Use of RTT is increasing, and specifically, use in emergency calls is
>     increasing.  Emergency call use requires multiparty mixing.  [...]
> I expect the conclusion that emergency-call use requires mixing to be
> non-intuitive to many readers, so additional explanation might be
> helpful.
[GH] I propose extending the sentence to: " Emergency call use requires 
multiparty mixing because it is common that one agent needs to transfer 
the call to another specialized agent but is obliged to stay on the call 
at least to verify that the transfer was successful. "
>                                                                  RFC 4103
>     "RTP Payload for Text Conversation" mixer implementations can use
>     traditional RTP functions for source identification, but the
> The word "mixer" (or even "mix") does not appear in RFC 4103, so I'm not
> sure how to interpret "RFC 4103 mixer implementations".  Perhaps it is
> an RFC 3550 mixer acting on RFC 4103 payloads?
[GH] I do not think that other RTP payload type specifications specify 
how RTP mixing shall be applied either. It is just assumed that RFC 3550 
RTP mixing is used. Anyway, I suggest changing the sentence to: "Mixer 
implementations for RFC 4103 "RTP Payload for Text Conversation" can use 
traditional RFC 3550 RTP functions for mixing and source identification."
>     The document updates [RFC4103] by introducing an attribute for
>     indicating capability for the RTP-mixer-based multiparty mixing case
>     and rules for source indications and interleaving of text from
>     different sources.
> I think "indicating capability for the RTP-mixer-based multiparty mixing
> case" needs another verb.
[GH] As a non-native English speaker, I do not see the problem. Would it 
be better with "declaring support of" instead of "indicating capability 
for"? If so, there may be a couple of other places where similar wording 
should be changed.
> Section 2.2
>     A party acting as a mixer, which has not negotiated any method for
>     true multiparty RTT handling, but negotiated a "text/red" or "text/
>     t140" format in a session with a participant SHOULD in order to
>     maintain interoperability, if nothing else is specified for the
>     application, format transmitted text to that participant to be
>     suitable to present on a multiparty-unaware endpoint as further
>     specified in Section 4.2.
> comma after "SHOULD".
> The whole sentence is a bit long, though, and the parenthetical "if
> nothing else is specified for the application" is somewhat in the way in
> its current location.  Further reworking may be in order.
[GH] I propose changing this to:
"A mixer SHOULD by default format and transmit text to a call 
participant to be suitable to present on a multiparty-unaware endpoint 
which has not negotiated any method for true multiparty RTT handling, 
but negotiated a "text/red" or "text/t140" format in a session. This 
SHOULD be done if nothing else is specified for the application in order 
to maintain interoperability. Section 4.2 specifies how this mixing is 
> I'd also consider s/format transmitted text/transmit formatted text/ or
> /format text transmitted/.
[GH] Included above
> Section 2.3.4
>     If the modified offer deletes indication of support for multiparty
>     real-time text by excluding the "rtt-mixer" SDP attribute, the answer
>     MUST NOT contain the "rtt-mixer" attribute, and both parties SHALL
>     after processing the SDP exchange NOT send real-time text formatted
>     for multiparty-aware parties according to this specification.
> The BCP 14 keyword "SHALL NOT" is supposed to appear as the specific
> phrase, so something like "SHALL NOT, after processing the SDP exchange,
> send" seems more appropriate.
[GH] I propose changing this to:
"If the modified offer deletes indication of support for multiparty 
real-time text by excluding the "rtt-mixer" SDP attribute, the answer 
MUST NOT contain the "rtt-mixer" attribute. After processing this SDP 
exchange, the parties MUST NOT send real-time text formatted for 
multiparty-aware parties according to this specification."
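
The rule in the proposed text can be illustrated with a small check on an offer and answer pair. This is a simplified sketch: the function names are hypothetical, and the SDP is scanned as a whole instead of per media section, which a real implementation would need to do.

```python
def has_rtt_mixer(sdp: str) -> bool:
    """True if the SDP contains the "rtt-mixer" attribute line."""
    return any(line.strip() == "a=rtt-mixer" for line in sdp.splitlines())


def may_send_multiparty_format(offer: str, answer: str) -> bool:
    """Multiparty-aware formatting is allowed only if both sides declared it."""
    return has_rtt_mixer(offer) and has_rtt_mixer(answer)


def answer_is_valid(offer: str, answer: str) -> bool:
    """The answer must not contain the attribute if the offer excluded it."""
    return has_rtt_mixer(offer) or not has_rtt_mixer(answer)
```

After a modified offer without the attribute, `may_send_multiparty_format` returns False for any answer, so both parties fall back to the format for multiparty-unaware endpoints.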
> Section 3.4
>     transmission MUST then be made at T140block borders.  See also
>     Section 8
> full stop at end of sentence.
[GH] Accepted.
> Section 3.10
>     The mixer SHALL compose and transmit an RTP packet to a receiver when
>     one of the following conditions has occurred:
> Maybe "one or more" (or "one or both")?
[GH] Accepted.
> Section 3.16
>     Confidentiality SHALL be considered when composing these fields.
>     They contain name and address information that may be sensitive to
>     transmit in its entirety e.g., to unauthenticated participants.
> comma before "e.g." (as well as after)
[GH] Accepted.
> Section 3.17
>     presentation areas for each source.  Other receiver roles, such as
>     gateways or chained mixers are also feasible, and requires
>     consideration if the stream shall just be forwarded, or distributed
> "such as gateways or chained mixers" seems like a parenthetical phrase
> that should be offset by commas on both sides.
[GH] Accepted.
> Also, "require consideration" seems to match up better with the plural
> "roles".
[GH] I suggest starting a new sentence "They require considerations..."
> Section 3.17.3
>     When a packet is received in an RTP session using the packetization
>     for multiparty-aware endpoints, its T140blocks SHALL be extracted in
>     the following way.  The description is adapted to the default
>     redundancy case using the original and two redundant generations.
> Is this supposed to imply that the extension to the generic case with
> other levels of redundancy is trivial for the reader to perform?
[GH] Yes. Should that be indicated?
> Section 3.18
>     This solution has good performance with low text delays as long as
>     the sum of characters per second during any 10-second interval sent
>     from a number of simultaneously sending participants to a receiving
>     participant does not reach the 'CPS' value.  [...]
> "sum of characters per second during any 10-second interval" seems to
> mean "compute CPS for each second, then add them up".  I don't think
> that's the intended meaning.
[GH] I suggest: "as long as the mean number of characters per second 
sent during any 10-second interval from a number of simultaneously 
sending participants to a receiving participant does not reach the 
'CPS' value."
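
The difference between the two readings can be made concrete with a sliding-window sketch that computes the mean characters per second over the last 10 seconds, summed over all sending participants. The `CpsMonitor` class is hypothetical, and the default limit of 30 is assumed from the RFC 4103 default for the parameter.

```python
from collections import deque


class CpsMonitor:
    def __init__(self, cps_limit: float = 30, window: float = 10.0):
        self.cps_limit = cps_limit
        self.window = window      # interval length in seconds
        self.events = deque()     # (time, character count), all sources
        self.total = 0            # characters currently inside the window

    def add(self, now: float, n_chars: int) -> None:
        """Record n_chars sent at time `now` from any source."""
        self.events.append((now, n_chars))
        self.total += n_chars
        self._expire(now)

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, n = self.events.popleft()
            self.total -= n

    def mean_cps(self, now: float) -> float:
        self._expire(now)
        return self.total / self.window

    def within_limit(self, now: float) -> bool:
        return self.mean_cps(now) < self.cps_limit
```

A short burst above the limit in a single second does not matter as long as the mean over the whole interval stays below the value.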
>                                     Only in large unmanaged conferences
>     with a high number of participants there may on very rare occasions
>     appear situations when many participants happen to send text
>     simultaneously, resulting in unpleasantly jerky presentation of text
>     from each sending participant.  [...]
> This sentence seems a bit long and winding.
[GH] I suggest splitting it into: "Only in large unmanaged conferences with a 
high number of participants there may on very rare occasions appear 
situations when many participants happen to send text simultaneously. In 
such circumstances, the result may be unpleasantly jerky presentation of 
text from each sending participant."
> Section 3.20
>        Offer example for "text/red" format including multiparty
>        and security:
>              a=fingerprint: (fingerprint1)
> I think it would be preferred to make up some random fingerprints and
> use them instead of the placeholder string (throughout).
[GH] I had the opposite reaction in an earlier review, when I had random 
fingerprints. I suggest keeping it as is.
> Section 3.21
>     offset 660 ms.  The timestamp of packet 106 minus 660 is 20500 which
>     is the timestamp of packet 102 THAT was received.  So B1 does not
> I don't see why "THAT" needs to be in all majuscule letters.
[GH] Right. Accepted.
> Section 3.22
>     to transmission to a receiver.  The value MAY be modified in the
>     "CPS" parameter of the FMTP attribute in the media section for the
>     "text/t140" media.  [...]
> RFC 4103 seems to show the lowercase "cps" parameter name (there are
> subsequent occurrences that I do not quote).
[GH] Thanks, all changed to lower case.
> Section 4.2
>     one presentation area.  The mixer SHALL group text in suitable groups
>     and prepare for presentation of them by inserting a new line between
>     them if the transmitted text did not already end with a new line.  A
> Is "new line" specified somewhere?  Up in toplevel §4 we cover the
> unicode line separator and CRLF sequences but don't use the phrase "new
> line".
[GH] "new line (line separator or CRLF)" is used in a number of places. 
I suggest modifying this one to:
"the mixer SHALL insert a line separator if the already transmitted text 
did not end with a new line (line separator or CRLF)"
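
The proposed rule amounts to a simple check on the end of the already transmitted text, sketched below. The function names are hypothetical. U+2028 is the Unicode line separator.

```python
LINE_SEPARATOR = "\u2028"
CRLF = "\r\n"


def ends_with_new_line(text: str) -> bool:
    """True if text ends with a new line (line separator or CRLF)."""
    return text.endswith(LINE_SEPARATOR) or text.endswith(CRLF)


def separator_before_switch(already_transmitted: str) -> str:
    """Text the mixer inserts before starting the next group of text."""
    return "" if ends_with_new_line(already_transmitted) else LINE_SEPARATOR
```

For example, `separator_before_switch("hello")` yields a line separator, while text that already ends with CRLF yields an empty string.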
> Section 4.2.2
>     Information available to the mixer for composing the label may
>     contain sensitive personal information that SHOULD not be revealed in
> "SHOULD NOT" in all caps
>     sessions not securely authenticated and protected.  Integrity
> "confidentiality protected"
[GH] Accepted
>     considerations regarding how much personal information is included in
>     the label SHOULD therefore be taken when composing the label.
> I don't think "integrity" is the right word here.
[GH] Change to "Privacy" is proposed.
> Section 11
>     Therefore, the mixer needs to be trusted to achieve security in
>     confidentiality and integrity.  [...]
> s/trusted to achieve security in confidentiality and integrity/trusted
> to maintain confidentiality and integrity of the RTT data/
[GH] Accepted
>     The requirement to transfer information about the user in RTCP
>     reports in SDES, CNAME, and NAME fields, and in conference
>     notifications, for creation of labels may have privacy concerns as
> There's something awry with the commas around "for creation of labels".
[GH] I suggest splitting it into two sentences: "When used for creation of 
readable labels in the presentation, the receiving user will then get a 
more symbolic label for the source."

[GH] Again, thanks for a thorough review and many good proposals for changes.



Gunnar Hellström