Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-avtcore-multi-party-rtt-mix-18: Answer 3 -

Gunnar Hellström <> Mon, 24 May 2021 21:15 UTC

To:, Benjamin Kaduk <>, The IESG <>,, "" <>, Bernard Aboba <>
From: Gunnar Hellström <>
Date: Mon, 24 May 2021 23:14:59 +0200

Continuing the answers:

Den 2021-05-19 kl. 04:53, skrev Benjamin Kaduk via Datatracker:
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-avtcore-multi-party-rtt-mix-18: Discuss
> The abstract is perhaps pushing the boundary of length reasonable for an
> abstract.
> There were a couple interesting remarks in the shepherd writeup:
> % The specification has not been implemented yet, so it is possible that
> % issues could arise in implementation. This is more of a concern than
> % for typical AVTCORE documents, since this specification is likely to
> % become a regulatory requirement prior to advancing beyond Proposed
> % Standard.
> Are there still no implementations?  Are we happy with publishing the
> specification at this time in the absence of implementations?

[GH] There is an implementation of section 4.2, "Multiparty mixing for
multiparty-unaware endpoints". It works well.

It is true that there are no known implementations of the method for
multiparty-aware endpoints. However, the basic building blocks are all
commonly implemented: the RTP-mixer, the centralized SIP conference
model, and the base RTP transport for real-time text (RFC 4103).

> % During review, the question was raised as to whether the specification
> % will require development of an RTT mixer, or whether it could be made
> % compatible with existing conferencing servers implementing Selective
> % Forwarding.
> What was the outcome of the discussion?  Should that be reflected in the
> document?
[GH] The discussion is concluded in section 1.2, with a note in 3.20.
The first expected implementation environment is the traditional SIP
centralized conference server and endpoints (for emergency services),
where traditional RTP-mixers are the dominant basis for
implementations. It seems to me that we are closer to getting
implementations done by keeping to that well-known technology than by
trying to apply selective forwarding for this case.

> Abstract
>     mixer model.  The possibility to implement the solution in a wide
>     range of existing RTP implementations made the RTP-mixer model be
>     selected to be fully specified in this document.
> It's a little surprising to see this claim given the absence (per the
> shepherd writeup) of any actual implementations.

[GH] The RTP-mixer is a general technology, specified in RFC 3550. It has
long been very commonly applied in conference bridges, but for the
other media: video and audio. More important here is that the
RTP-mixer uses just one RTP media stream to each receiver. Many
implementations of RTP seem to have the limitation that they cannot
handle more than one media stream per RTP media session. Requiring a
multi-stream solution would have excluded many current endpoints with
two-party RTT implementations from being upgraded to multiparty-aware
RTT endpoints.

> Section 1.2
>     Multiple sources per packet
>        A new "text" media subtype would be specified with up to 15
>        sources in each packet.  The mechanism would make use of the RTP
>        mixer model specified in RTP [RFC3550].  Text from up to 15
>        sources can be included in each packet.  [...]
> (How was the "15" number determined?)
[GH] It is a limitation from the maximum size of the CSRC list in the
RTP header defined in RFC 3550:

"CSRC list: 0 to 15 items, 32 bits each
The CSRC list identifies the contributing sources for the payload
contained in this packet. The number of identifiers is given by the CC
field. If there are more than 15 contributing sources, only 15 can be
identified."

In our case it is essential to have the sources identified, and it would
be complicated to exceed this limit. I suggest adding to the text in 1.2
so that it reads: "The sources are indicated in strict order in the CSRC
list of the RTP packets. The CSRC list can have up to 15 members.
Therefore, text from up to 15 sources can be included in each packet."
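
To illustrate where the limit of 15 comes from, here is a minimal Python sketch that packs the fixed RTP header of RFC 3550 followed by a CSRC list. The CC field is only 4 bits wide, so it cannot count more than 15 entries. The function name and its parameters are hypothetical and only serve this illustration; padding and header extensions are assumed to be unused.

```python
import struct

MAX_CSRC = 15  # CC is a 4-bit field in the RTP header (RFC 3550)


def build_rtp_header(payload_type, seq, timestamp, ssrc, csrcs, marker=False):
    """Pack the 12-byte fixed RTP header followed by the CSRC list.

    Hypothetical helper for illustration; P and X bits are left at 0.
    """
    if len(csrcs) > MAX_CSRC:
        raise ValueError("at most 15 contributing sources fit in one packet")
    version = 2
    first = (version << 6) | len(csrcs)  # V=2, P=0, X=0, CC=len(csrcs)
    second = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", first, second, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    # Each CSRC identifier is 32 bits, in the order of the sources.
    return header + b"".join(struct.pack("!I", c) for c in csrcs)
```

A mixer with more than 15 sources holding unsent text would therefore have to spread them over several packets.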

> Section 2.3.2
>     A party receiving an offer containing the "rtt-mixer" SDP attribute
>     and being willing to use the RTP-mixer-based method of this
>     specification for sending or receiving or both sending and receiving
>     SHALL include the "rtt-mixer" SDP attribute in the corresponding
>     "text" media section in the answer.
> This requirement doesn't quite seem to match up with what I expect -- an
> answerer that's willing to use rtt-mixer and also willing to use
> something else seems to still be bound by the "SHALL include" in the
> first paragraph, which makes the willingness to use something else a bit
> irrelevant and precludes choosing the other option.  Perhaps we want to
> say only "chooses to use the RTP-mixer-based method of this
> specification"?

[GH] It says "willing" because two-party coded real-time text is also
still allowed. The multi-party coding is expected to be applied when
more than two parties participate in the call.

So, the answerer is not willing to use another multiparty method.

In order to make that clear, I suggest inserting this sentence before
the discussion of other multiparty methods in 2.3.2:

"  Even when the "rtt-mixer" attribute is successfully negotiated, the 
parties MAY send and receive two-party coded real-time text."
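
As a rough sketch of the answerer side of this negotiation: the attribute is echoed in the answer only when it was offered and the answerer is willing to use the method. The function name is hypothetical, and the SDP is assumed to be already split into the lines of one "text" media section, with the attribute written in the usual "a=" form.

```python
def build_answer_attributes(offer_lines, willing_to_use_mixer):
    """Return the attribute lines to add to the "text" media section
    of the answer. Hypothetical helper for illustration only."""
    offered = any(line.strip() == "a=rtt-mixer" for line in offer_lines)
    answer = []
    if offered and willing_to_use_mixer:
        # Successful negotiation; two-party coded text remains allowed.
        answer.append("a=rtt-mixer")
    return answer
```

Without the attribute in the answer, the parties fall back to two-party coded real-time text.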

> Section 3.2
> What purpose does the initial BOM serve?  I note that, e.g., RFC 5198
> has an explicit BOM "MUST NOT appear at the beginning of these text
> strings" and that RFC 4103 specifies UTF-8 encoding of the text.
> I see in Section 3.17.4 (and 4.2.1) we mention that it might be used for
> keepalive, but in rtt-mix don't we have lots of non-BOM keepalive
> options?

[GH] The initial BOM is for the initial opening of ports and firewall
paths, so that traffic can rapidly start to flow both ways. In real-time
text, even the first real media packet with text is important, while for
most other media (video and audio) many hundreds of RTP packets usually
flow before any packet carrying important information is sent. Unicode is
also said to require a BOM at the beginning of Unicode streams, a
requirement that RFC 5198 negates. RFC 4103 is an update of RFC 2793,
published in May 2000. At that time RFC 2279 was the UTF-8 RFC. RFC 2279
did not mention any restrictions on the use of BOM, so RFC 2793 specified
its use as in Unicode. For interoperability reasons we did not change
that in RFC 4103, nor in the current draft. So we just have to accept
that we are not fully RFC 5198 compliant, for historical and
interoperability reasons.

BOM is not recommended as keep-alive even if some implementations may 
use it.

I suggest adding this sentence in 3.2: " This is useful in many
configurations to open ports and firewalls and setting up the connection
between the application and the network."
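
For reference, a small Python sketch of what this means on the wire: the BOM is the single character U+FEFF, which is three bytes in UTF-8, and the receiver deletes it before presenting or forwarding the text. The function names are hypothetical, and the payload is assumed to be a plain UTF-8 text block without redundancy.

```python
BOM = "\ufeff"  # U+FEFF, encoded as EF BB BF in UTF-8


def initial_payload() -> bytes:
    """Text payload of the first packet: just a BOM, no user text."""
    return BOM.encode("utf-8")


def strip_bom(payload: bytes) -> str:
    """Decode received text and delete any BOM characters from it."""
    return payload.decode("utf-8").replace(BOM, "")
```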

> Section 3.4
>     If the "CPS" value is reached, longer transmission intervals SHALL be
>     applied and only as much of the text queued for transmission SHALL be
>     sent at the end of each transmission interval that can be allowed
>     without exceeding the "CPS" value, until the transmission rate falls
>     under the "CPS" value again.  [...]
> This doesn't seem as precisely specified as it could be, given that the
> CPS rate is supposed to be enforced over "any 10-second interval".  As
> written, this seems to suggest that the entire 10-second history of
> packet size/spacing needs to be retained, so that at each transmission
> the earliest time for next transmission can be computed that retains the
> CPS limit.  It's not clear that there's real need for such a complicated
> solution vs something that more bluntly backs off the transmit rate and
> uses bucketed averages for tracking the transmission rate.

[GH] I suggest changing the calculation of the mean character rate in
the first paragraph of 3.4 to: " as long as the mean character rate of new
text to the receiver calculated over the last 10 one-second intervals
does not exceed the "CPS" value of the receiver."

That method of calculating the character rate will then also apply
wherever else "CPS" is referenced.
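
A minimal sketch of how such a bucketed calculation could look, so that no per-packet history needs to be kept. The class and method names are hypothetical, the default of 30 for the CPS value is an assumption for the example, and time is passed in explicitly to keep the behaviour easy to follow.

```python
import time
from collections import deque


class CpsTracker:
    """Mean character rate over the last `window` one-second intervals.

    Hypothetical helper; cps_limit=30 is an assumed default.
    """

    def __init__(self, cps_limit=30, window=10):
        self.cps_limit = cps_limit
        self.window = window
        self.buckets = deque()  # (second, chars) pairs, oldest first

    def _expire(self, now_sec):
        # Drop one-second buckets that fell out of the window.
        while self.buckets and self.buckets[0][0] <= now_sec - self.window:
            self.buckets.popleft()

    def allowance(self, now=None):
        """Characters of new text that may be sent now within the limit."""
        now_sec = int(time.monotonic() if now is None else now)
        self._expire(now_sec)
        sent = sum(chars for _, chars in self.buckets)
        return max(0, self.cps_limit * self.window - sent)

    def record(self, chars, now=None):
        """Account for `chars` characters of new text sent at `now`."""
        now_sec = int(time.monotonic() if now is None else now)
        self._expire(now_sec)
        if self.buckets and self.buckets[-1][0] == now_sec:
            sec, total = self.buckets.pop()
            self.buckets.append((sec, total + chars))
        else:
            self.buckets.append((now_sec, chars))
```

At the end of each transmission interval the sender would send at most `allowance()` characters of the queued new text and keep the rest queued.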

> (I have no idea why it's 330ms for a mixer and 300ms for a non-mixer,
> but assume there is some reason for the difference.)

[GH] I have now inserted an explanation in 3.4 as proposed in the 
DISCUSS resolution:

"    The reason for the value 330 ms is that many sources of text will
    transmit new text with 300 ms intervals during periods of
    continuous user typing, and then reception in the mixer of such new
    text will cause a combined transmission of the new text and the
    unsent redundancy from the previous transmission. Only when the user
    stops typing, the 330 ms interval will be applied to send the

> Section 3.6
>     Text received by a mixer from a participant SHOULD NOT be included in
>     transmission from the mixer to that participant, because the normal
>     behavior of the endpoint is to present locally-produced text locally.
> When would the SHOULD NOT be ignored?  (How might a mixer know that the
> other endpoint is not using the "normal behavior" of presenting
> locally-produced text locally?)
[GH] It is a SHOULD NOT just to allow specific applications to deviate.
I have no reason to think that deviating would be useful, but I do not
want to forbid it. In that case, all mixers and endpoints in that
application environment would do it the same way, or have a setting or
control mechanism to select which solution to use. I suggest no change.
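
In code terms, the default and the possible deviation could look like the sketch below. The function name and the `echo_to_source` setting are hypothetical, standing in for whatever setting or control mechanism an application environment would choose.

```python
def recipients_for(source_id, participants, echo_to_source=False):
    """Participants that should receive text originating from source_id.

    By default the source is excluded, since the endpoint already
    presents its locally produced text locally.
    """
    return [p for p in participants if echo_to_source or p != source_id]
```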
> Section 3.7
>     A mixer SHALL handle reception, recovery from packet loss, deletion
>     of superfluous redundancy, marking of possible text loss and deletion
>     of 'BOM' characters from each participant before queueing received
>     text for transmission to receiving participants.
> Are there specific references available for each of these operations?

[GH] Yes. I suggest adding at the end: " as specified in [RFC4103] for
single-party sources and section 3.17 for multiparty sources (chained

> Section 3.9
>     The source with the oldest text received in the mixer or oldest
>     redundant text SHALL be next in turn to get all its available unsent
>     text transmitted.  Any redundant repetitions of earlier transmitted
> Just to confirm: this is really *all* its available unsent text, not
> just however much will fit in one packet/flight/etc.?  Can a participant
> "hog the mic" by continuing to append to that list even as transmission
> has commenced?
[GH] In the answers to the DISCUSS, section 3.9 is proposed to be
replaced by wording in 3.4, which addresses the problem indicated here.
> Section 3.13
> It took me a bit of searching to realize that it is RFC 2198 that
> specifies the additional header that includes the "timestamp offset"
> field.  A specific reference here (or maybe from an earlier section?)
> would have helped me out.
[GH] I suggest including this sentence after the bullet list in section
3.10: "The principles from [RFC4103] apply for populating the
header, the redundancy header and the data in the packet with specifics
specified here and in the following sections."

I hope it is OK to take it in pieces, with next piece coming soon.




Gunnar Hellström