Re: [AVTCORE] Question on multi-party RTT handling (draft-hellstrom-avtcore-multi-party-rtt-source-03)

Gunnar Hellström <> Fri, 22 May 2020 09:40 UTC

Hi Yong,

Thanks for the good questions.

On 2020-05-22 at 02:53, Yong Xin wrote:
> Dear Hellstrom,
> Thanks for the quick response. The latest spec does address my 
> concern. I have some follow-up questions:
>   * The new payload format “text/rex” can be used with or without
>     redundancy. When redundancy is used, mixer has to use the same
>     redundancy level when transmitting texts from multiple sources. If
>     the different party in the same conference has negotiated a
>     different redundancy level, the mixer has to pick the lowest level
>     to use, right?
No. There are two sides of this answer:

The mixer should do separate mixing for each recipient, using the 
redundancy level agreed with that recipient. Separate mixing is also 
needed because users do not want to receive their own transmitted text 
back from the mixer; each endpoint displays its own text locally. If a 
recipient does not support "text/rex", the mixer also needs to do the 
mixing for multi-party unaware endpoints using the "text/red" format 
described in section 13.2.

And the mixer must recover from loss in reception from each source and 
create a queue of clean text from each source before composing the 
packets for transmission. The mixer cannot just resend received packet 
contents with redundancy, because the recovery mechanism requires the 
sequence number gaps for loss detection, and the mixer must create its 
own sequence number series in the transmission.
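
As a rough illustration of the two points above, here is a minimal 
Python sketch of per-recipient mixing. It is not taken from the draft: 
the class and method names are hypothetical, participants are assumed 
to be identified by a simple id that is used both as source and as 
recipient, and packet composition is reduced to a dictionary.

```python
from collections import defaultdict, deque


class MixerSketch:
    """Hypothetical per-recipient mixer state (illustration only)."""

    def __init__(self):
        # recipient -> number of redundant generations agreed with it
        self.redundancy = {}
        # recipient -> source -> queue of clean (loss-recovered) text
        self.queues = defaultdict(lambda: defaultdict(deque))
        # recipient -> next outgoing sequence number (the mixer's own series)
        self.next_seq = defaultdict(int)

    def add_recipient(self, recipient, redundancy_level):
        self.redundancy[recipient] = redundancy_level

    def on_clean_text(self, source, text):
        """Queue recovered text for every recipient except its author."""
        for recipient in self.redundancy:
            if recipient != source:
                self.queues[recipient][source].append(text)

    def compose(self, recipient):
        """Drain this recipient's queues into a packet description."""
        per_source = self.queues[recipient]
        pending = {src: "".join(q) for src, q in per_source.items() if q}
        for q in per_source.values():
            q.clear()
        seq = self.next_seq[recipient]
        self.next_seq[recipient] = (seq + 1) % 65536  # 16-bit RTP sequence
        return {"seq": seq,
                "redundancy": self.redundancy[recipient],
                "text": pending}
```

Each recipient gets its own queues, redundancy level and sequence 
number series, so a change in one leg does not affect the others.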

>   * But in case one party has negotiated “text/rex” without
>     any redundancy level, does that mean the mixer has to turn off the
>     redundancy for this conference? Does the mixer need to change the
>     redundancy level up and down dynamically as users join or leave
>     the conference? Does the mixer need to send a re-INVITE to
>     re-negotiate the redundancy level with the other parties when such
>     a change happens?
From the logic above, the answer to each of these questions is no. I 
realize that an explanation should be inserted at the beginning of 
section 3, "Actions at transmission by the mixer", to clarify that the 
source for transmission from the mixer is clean text in separate queues, 
regardless of which format or protocol was used in the individual 
receptions.
>   * In section 12, I noticed the 150cps recommendation is still there
>     and has been made as default value for the new packet format, but
>     the transmission interval is back to 300ms (the recommended
>     interval was 100ms in the old spec). I guess with the new packet
>     format, it is not required to use the shorter transmission
>     interval any more.
The transmission interval is mentioned in two paragraphs in section 3. 
One says that the default is 300 ms; the other says:

"For multi-party operation, it is RECOMMENDED that the mixer sends a 
packet to each receiver as soon as text has been received from a source 
as long as the maximum number of characters per second indicated by the 
recipient is not exceeded, and also the number of packets sent per second 
to a recipient is kept under a specified number. This number SHALL be 10 
if no other limit is applied for the application. The intention is to 
keep the latency introduced by the mixer low."

This is intended to strike a balance between low latency and protection 
against bursty packet loss. Even though real-time text users have much 
less strict latency requirements than audio and video users, low 
latency is appreciated, and an end-to-end latency of over 2 seconds 
creates conversation problems. Under the rule in this paragraph, the 
packet interval self-regulates to about 100 ms once about 3 sources 
are typing simultaneously.

The 300 ms default assures that the remaining redundancy transmissions 
will be sent even shortly after all sources have stopped typing.
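
A minimal Python sketch of this sending rule for one recipient follows. 
The function name, its parameters and the one-second sliding window are 
my own assumptions for illustration; the 10 packets per second, the 
150 cps default and the 300 ms interval are the figures discussed above.

```python
def should_send(now, history, new_chars, redundancy_pending,
                max_cps=150, max_pps=10, idle_interval=0.300):
    """Decide whether the mixer may send to one recipient at time `now`.

    history: list of (send_time_seconds, chars_in_packet) for this
             recipient, oldest first (hypothetical bookkeeping).
    new_chars: number of new characters waiting for this recipient.
    redundancy_pending: True if redundant generations remain to be sent.
    """
    # Assumption: both limits are checked over the last second.
    recent = [(t, n) for t, n in history if now - t < 1.0]
    if new_chars > 0:
        under_pps = len(recent) < max_pps
        under_cps = sum(n for _, n in recent) + new_chars <= max_cps
        return under_pps and under_cps
    if redundancy_pending:
        # No new text: fall back to the default interval so that the
        # remaining redundancy is still sent after typing has stopped.
        last = history[-1][0] if history else float("-inf")
        return now - last >= idle_interval
    return False
```

With new text waiting, the packet goes out at once unless one of the 
two limits would be exceeded; with only redundancy left, the 300 ms 
default applies.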

However, this algorithm may make the protection against bursty loss 
weaker than with a steady 300 ms interval. With between 17 and 32 
users typing simultaneously, the latency caused by the mixer will be 
around 300 ms, which still meets both regulatory and human requirements.

Now, even if it passes the requirements, 32 is a very unrealistic number 
of simultaneously typing users. In audio conferences it is only possible 
to perceive one source at a time well; the benefit of enabling more is 
just to notice that someone else wants to say something. Requirements 
for this work are collected in 
draft-hellstrom-avtcore-multi-party-rtt-solutions-00, where the 
performance requirements are set to be valid for up to 5 simultaneously 
transmitting users, with the delay caused by the mixer less than 500 
ms. I think we should design for these figures.

>   * And you mentioned these characteristics provide for smooth flow of
>     text with acceptable latency from at least 32 sources
>     simultaneously. Since the new packet format can support up to 16
>     sources per packet, the text from 32 sources will have to be
>     transmitted in turn. If my calculation is correct, with 300ms
>     transmission interval and redundancy level 2, it will take 900ms
>     (one primary + 2 redundant) for mixer to switch from first 16
>     sources to next 16 sources, so the delay is about 900ms. Is this
>     the acceptable latency in your mind?
See discussions above. Maybe regulators need to say how many 
simultaneous users the requirements are for. I think 5 is a high and 
good figure even if the discussion above indicates 32 to be possible.
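
For reference, the estimate in the question can be reproduced with the 
short Python sketch below. It follows the assumption made in the 
question, not a statement from the draft: the mixer finishes the 
primary and all redundant transmissions for one group of sources before 
switching to the next group. The function names are hypothetical.

```python
import math


def groups_needed(sources, max_sources_per_packet=16):
    """Number of packets needed to carry new text from all sources once."""
    return math.ceil(sources / max_sources_per_packet)


def switch_delay_ms(interval_ms, redundancy_level):
    """Time spent on one group: one primary plus the redundant sends."""
    return interval_ms * (1 + redundancy_level)


print(groups_needed(32))        # 2 groups of up to 16 sources
print(switch_delay_ms(300, 2))  # 900 ms, the figure in the question
print(groups_needed(5))         # 1: five sources fit in a single packet
```

With the 5 simultaneous users suggested as the design target, all 
sources fit in one packet and no switching between groups is needed.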
>   * There’re quite a few updates in the spec in the last couple of
>     months. When do you expect this IETF draft will get finalized and
>     approved?
I hope for a short period of discussions like yours, which I 
appreciate, to sort out the main principles, followed by the various 
levels of last calls and editorial refinement. Although one would 
hope for more rapid progress, a realistic milestone for sending it to 
the IESG has been set for February 2021. It would be good if the various 
organisations that would reference it in their specifications (such as 
3GPP, ATIS, NENA, ETSI etc.) took a look now and assessed whether it is 
agreeable and on the right track.



> Regards,
> Yong
> *From:* Gunnar Hellström <>
> *Sent:* Thursday, May 21, 2020 12:39 AM
> *To:* Yong Xin <>
> *Subject:* Re: Question on multi-party RTT handling 
> (draft-hellstrom-avtcore-multi-party-rtt-source-03)
> Dear  Yong,
> Thanks for a good question.
> The draft you are asking about has been replaced by this one:
> and it is modified at the point of your question, and partly because 
> of the issue you saw with the draft you looked in. More follows inline,
> On 2020-05-21 at 00:28, Yong Xin wrote:
>     Dear Mr. Hellstrom,
>     I have a question about how to use RTT mixer (rtt-mix) method with
>     “text/red” format for multi-party call handling, as defined in
>     your IETF draft
>
>     4. Use of fields in the RTP packets
>
>        RFC 4103 [RFC4103] specifies use of RFC 3550 RTP [RFC3550], and a
>        redundancy format "text/red" for increased robustness.  This
>        specification updates RFC 4102 [RFC4102] and RFC 4103 [RFC4103] by
>        introducing a rule for populating and using the CSRC-list in the
>        RTP packet in order to enhance the performance in multi-party RTT
>        sessions.
>
>        When transmitted from a mixer, the first member in the CSRC-list
>        SHALL contain the SSRC of the source of the primary T140block in
>        the packet.  The second and further members in the CSRC-list SHALL
>        contain the SSRC of the source of the first, second, etc redundant
>        generations of T140blocks included in the packet.  (The recommended
>        level of redundancy is to use one primary and two redundant
>        generations of T140blocks.)  In some cases, a primary or redundant
>        T140block is empty, but is still represented by a member in the
>        redundancy header.  For such cases, the corresponding CSRC-list
>        member MUST also be included.
>
>        The CC field SHALL show the number of members in the CSRC list.
>
>        Note: This specification departs from section 4 of RFC 2198
>        [RFC2198] which associates the whole of the CSRC-list with the
>        primary data and assumes that the same list applies to
>        reconstructed redundant data.  In the present specification a
>        T140block is associated with exactly one CSRC list member as
>        described above.  Also RFC 2198 [RFC2198] anticipates infrequent
>        change to CSRCs; implementers should be aware that the order of
>        the CSRC-list according to this specification will vary during
>        transitions between transmission from the mixer of text originated
>        by different participants.
>
>        The picture below shows a typical RTP packet with multi-party RTT
>        contents and coding according to the present specification.
>
>         0                   1                   2                   3
>         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |V=2|P|X| CC=3  |M|  "RED" PT   |   sequence number of primary  |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |               timestamp of primary encoding "P"               |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |           synchronization source (SSRC) identifier            |
>        +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
>        |  CSRC list member 1 = SSRC of source of "P"                   |
>        |  CSRC list member 2 = SSRC of source of "R1"                  |
>        |  CSRC list member 3 = SSRC of source of "R2"                  |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |1|   T140 PT   |  timestamp offset of "R2" | "R2" block length |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |1|   T140 PT   |  timestamp offset of "R1" | "R1" block length |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>        |0|   T140 PT   | "R2" T.140 encoded redundant data             |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------+
>        |   |  "R1" T.140 encoded redundant data        |               |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         +-+-+-+
>        |              "P" T.140 encoded primary data             |
>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>        Figure 1: text/red packet with sources indicated in the CSRC-list.
>
>     At every transmission time, the mixer can use the primary data
>     block to send new texts from one source, but new texts from other
>     sources will have to wait in their queue for their turn. I assume
>     it is a round-robin fashion to determine the next source. The
>     default text transmission interval is 300ms, which means the texts
>     from other sources have to wait in the queue for at least 300ms
>     before they can be transmitted. I can see you have recommended to
>     reduce the transmission interval from 300ms to 100ms to reduce
>     this delay, but in the case of large conference and assuming every
>     participant is typing the text simultaneously, the waiting time in
>     the queue will become longer. For example, in a 10-party
>     conference, even with 100ms transmission interval, the new texts
>     from last participant will wait for 9x100ms = 900ms to send. This
>     delay will be too long for some emergency service. Increasing the
>     redundancy level will only help to recover from more consecutive
>     packet loss, but it does not help to reduce this delay. So it
>     looks to me this method is not ideal for large conference, is my
>     understanding correct? Has this issue been discussed in the IETF
>     meeting before? Do you have any recommendation to solve this problem?
> Your understanding is correct. There is discussion about various ways 
> to arrange the mixing in another draft: 
> <>
> It is slightly outdated now, but it contains reasoning about 
> performance and other aspects of different solutions.
> The current draft replacing the one you read, specifies a packet 
> format that enables new text from up to 16 simultaneous text sources. 
> It is possible to send text from more simultaneous sending users, but 
> then there will be a short delay for some. The delay for 1-16 
> simultaneous texters will vary between 0 and 300 milliseconds.
> Even in a large conference, it will in most cases be only one 
> participant sending real-time text, but occasionally two or three. It 
> will be as for voice or for sign language in video: It will be 
> unmanageable for the participants to perceive media from many sources 
> simultaneously. I agree that for text, the opportunities are a bit 
> better than for audio and video. The text at least stays and is 
> readable in a well arranged display where the participants can catch 
> up reading if there were many sending simultaneously.
> You mention the emergency call with 10 participants as an example 
> where a delay of 900 ms would be a risk. In the type of emergency call 
> I think of, where one person in an emergency calls the emergency 
> number and gets connected to an emergency call taker, I can only 
> imagine up to maybe 5 participants, and that in very unusual cases, 
> most often taking turns nicely in sending text.
> It can for example be the user, the call taker, a language translator, 
> a first responder and an expert in chemical danger. The simultaneous 
> typing that may occur will be e.g. the user coming with more 
> information while the first responder types some instructions for how 
> to handle the case. The others will in most cases wait for their turn.
> The more common emergency call will have three participants: The 
> calling user, the call taker and a first responder or other agent. And 
> then the two people in the service know how to take turns. So it will 
> be a maximum of two participants typing simultaneously.
> I can imagine a completely different kind of emergency conference, where 
> people call in and report accidents they have seen to check if they 
> are already handled, and they get reports about ongoing emergencies. 
> If it is at all realistic to set up such service as a conference call, 
> there would indeed be small delays before some text is presented. 
> However, the 900 ms in your example is the time that a person normally 
> types a word, and the person supposed to act on all these text streams 
> may need to switch from reading another source to the end and then 
> move to respond or look at the new source. That will always take more 
> than one second. So even here, the replaced draft would result in good 
> performance. And this is not what is meant with an emergency call.
> Maybe there will be some other applications with unmanaged conferences 
> with real-time text where a lot of simultaneous typing will occur.  
> Therefore I moved to specifying for up to 16 simultaneous sources.
> There are also both human and regulatory requirements saying that 
> real-time text MUST not be delayed more than 500 ms or 1 second 
> (depending on what document you read, and where the delay is 
> measured.) So that should be obeyed for normal cases.
> In the replaced draft you refer to, the format is called "text/red" 
> just as for RFC 4103, and negotiated by an sdp attribute. I got 
> indications off-list that that would not be allowed. The change in the 
> use of the CSRC list from what is stated in RFC 2198 would be too big. 
> Therefore I needed to move to call it a new format "text/rex", and 
> negotiate it by payload types in the m-line. When I realized that I 
> needed to take that step, it was also natural to improve the format to 
> be able to carry more text without introducing delays.
> Do you agree that the current draft 
> draft-ietf-avtcore-multi-party-rtt-mix-01 solves your concerns?
> Thanks,
> Gunnar
>     Thanks,
>     Yong
> -- 
> Gunnar Hellström
> GHAccess

Gunnar Hellström