Re: [AVTCORE] Question on multi-party RTT handling (draft-hellstrom-avtcore-multi-party-rtt-source-03)

Gunnar Hellström <> Thu, 21 May 2020 07:39 UTC


Dear Yong,

Thanks for a good question.

The draft you are asking about has been replaced by this one:

It has been modified at exactly the point you ask about, partly because
of the issue you noticed in the draft you looked at. More follows inline.

On 2020-05-21 at 00:28, Yong Xin wrote:
> Dear Mr. Hellstrom,
> I have a question about how to use the RTT mixer (rtt-mix) method with
> the “text/red” format for multi-party call handling, as defined in your
> IETF draft:
> 4. Use of fields in the RTP packets
>
>    RFC 4103 [RFC4103] specifies use of RFC 3550 RTP [RFC3550], and a
>    redundancy format "text/red" for increased robustness.  This
>    specification updates RFC 4102 [RFC4102] and RFC 4103 [RFC4103] by
>    introducing a rule for populating and using the CSRC-list in the RTP
>    packet in order to enhance the performance in multi-party RTT
>    sessions.
>
>    When transmitted from a mixer, the first member in the CSRC-list
>    SHALL contain the SSRC of the source of the primary T140block in the
>    packet.  The second and further members in the CSRC-list SHALL
>    contain the SSRC of the source of the first, second, etc. redundant
>    generations of T140blocks included in the packet.  (The recommended
>    level of redundancy is to use one primary and two redundant
>    generations of T140blocks.)  In some cases, a primary or redundant
>    T140block is empty, but is still represented by a member in the
>    redundancy header.  For such cases, the corresponding CSRC-list
>    member MUST also be included.
>
>    The CC field SHALL show the number of members in the CSRC list.
>
>    Note: This specification departs from section 4 of RFC 2198 [RFC2198]
>    which associates the whole of the CSRC-list with the primary data and
>    assumes that the same list applies to reconstructed redundant data.
>    In the present specification a T140block is associated with exactly
>    one CSRC list member as described above.  Also RFC 2198 [RFC2198]
>    anticipates infrequent change to CSRCs; implementers should be aware
>    that the order of the CSRC-list according to this specification will
>    vary during transitions between transmission from the mixer of text
>    originated by different participants.
>
>    The picture below shows a typical RTP packet with multi-party RTT
>    contents and coding according to the present specification.
>
>     0                   1                   2                   3
>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |V=2|P|X| CC=3  |M|  "RED" PT   |   sequence number of primary  |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |               timestamp of primary encoding "P"               |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |           synchronization source (SSRC) identifier            |
>    +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
>    |  CSRC list member 1 = SSRC of source of "P"                   |
>    |  CSRC list member 2 = SSRC of source of "R1"                  |
>    |  CSRC list member 3 = SSRC of source of "R2"                  |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |1|   T140 PT   |  timestamp offset of "R2" | "R2" block length |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |1|   T140 PT   |  timestamp offset of "R1" | "R1" block length |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>    |0|   T140 PT   | "R2" T.140 encoded redundant data             |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------+
>    |   |  "R1" T.140 encoded redundant data        |               |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+
>    |              "P" T.140 encoded primary data             |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>    Figure 1: text/red packet with sources indicated in the CSRC-list.
>
> At every transmission time, the mixer can use the primary data block
> to send new text from one source, but new text from other sources
> will have to wait in a queue for its turn. I assume the next source
> is chosen in round-robin fashion. The default text transmission
> interval is 300 ms, which means that text from other sources has to
> wait in the queue for at least 300 ms before it can be transmitted.
> I can see you have recommended reducing the transmission interval
> from 300 ms to 100 ms to reduce this delay, but in a large conference
> where every participant is typing simultaneously, the waiting time in
> the queue will become longer. For example, in a 10-party conference,
> even with a 100 ms transmission interval, new text from the last
> participant will wait 9 x 100 ms = 900 ms before it is sent. This
> delay will be too long for some emergency services. Increasing the
> redundancy level only helps to recover from more consecutive packet
> losses; it does not help to reduce this delay. So it looks to me as
> if this method is not ideal for large conferences. Is my
> understanding correct? Has this issue been discussed in an IETF
> meeting before? Do you have any recommendation for solving this
> problem?
Your understanding is correct. There is discussion about various ways to 
arrange the mixing in another draft: 

It is slightly outdated now, but it contains reasoning about performance 
and other aspects of different solutions.

The current draft, which replaces the one you read, specifies a packet
format that can carry new text from up to 16 simultaneous text sources.
Text from more simultaneous senders can also be sent, but then there
will be a short delay for some of them. The delay for 1-16 simultaneous
texters will vary between 0 and 300 milliseconds.
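
As a rough illustration of the queueing arithmetic discussed above, here
is a minimal Python sketch. The function name and its parameters are
hypothetical and not taken from either draft. It assumes a strict
round-robin mixer and counts only the extra packets a source has to wait
for; the ordinary wait for the next transmission instant (between 0 and
one interval) comes on top of that.

```python
import math


def worst_case_wait_ms(active_sources, sources_per_packet, interval_ms):
    """Extra queueing delay for the last source in a round-robin round.

    Assumption: every active source has new text, and the mixer serves
    up to `sources_per_packet` sources in each transmitted packet.
    """
    if active_sources < 1 or sources_per_packet < 1:
        raise ValueError("counts must be at least 1")
    # Packets needed to give every active source one slot for new text.
    packets_per_round = math.ceil(active_sources / sources_per_packet)
    # The last source served waits for all earlier packets in the round.
    return (packets_per_round - 1) * interval_ms


# One source per packet, 10 simultaneous typists, 100 ms interval.
print(worst_case_wait_ms(10, 1, 100))   # 900
# Up to 16 sources per packet, 300 ms interval.
print(worst_case_wait_ms(10, 16, 300))  # 0
print(worst_case_wait_ms(20, 16, 300))  # 300
```

With one source per packet the result reproduces the 9 x 100 ms = 900 ms
from the question. With up to 16 sources per packet, extra queueing only
appears once more than 16 participants type at the same time.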

Even in a large conference, in most cases only one participant will be
sending real-time text at a time, occasionally two or three. It is the
same as for voice or for sign language in video: it is unmanageable for
the participants to perceive media from many sources simultaneously. I
agree that for text, the opportunities are a bit better than for audio
and video. The text at least stays on screen and is readable in a
well-arranged display, where the participants can catch up on reading
if many were sending simultaneously.

You mention an emergency call with 10 participants as an example where
a delay of 900 ms would be a risk. In the type of emergency call I have
in mind, where one person in an emergency calls the emergency number and
gets connected to an emergency call taker, I can only imagine up to
maybe 5 participants, and that only in very unusual cases, most often
taking turns nicely in sending text.

The participants can for example be the user, the call taker, a language
translator, a first responder and an expert in chemical hazards. The
simultaneous typing that may occur will be e.g. the user adding more
information while the first responder types some instructions for how to
handle the case. The others will in most cases wait for their turn.

The more common emergency call will have three participants: The calling 
user, the call taker and a first responder or other agent. And then the 
two people in the service know how to take turns. So it will be a 
maximum of two participants typing simultaneously.

I can imagine a completely different kind of emergency conference, where
people call in to report accidents they have seen and to check whether
they are already being handled, and get reports about ongoing
emergencies. If it is at all realistic to set up such a service as a
conference call, there would indeed be small delays before some text is
presented. However, the 900 ms in your example is about the time a
person normally takes to type a word, and the person supposed to act on
all these text streams may need to finish reading another source before
moving on to respond or look at the new source. That will always take
more than one second. So even here, the replaced draft would result in
good performance. And this is not what is meant by an emergency call.

Maybe there will be some other applications with unmanaged conferences 
with real-time text where a lot of simultaneous typing will occur.  
Therefore I moved to specifying for up to 16 simultaneous sources.

There are also both human and regulatory requirements saying that
real-time text MUST NOT be delayed more than 500 ms or 1 second
(depending on which document you read, and where the delay is measured).
So that should be obeyed for normal cases.

In the replaced draft you refer to, the format is called "text/red" just
as in RFC 4103, and its use is negotiated by an SDP attribute. I got
indications off-list that this would not be allowed, because the change
in the use of the CSRC list from what is stated in RFC 2198 would be too
big. Therefore I needed to define a new format, "text/rex", negotiated
by payload types in the m-line. When I realized that I needed to take
that step, it was also natural to improve the format so that it can
carry more text without introducing delays.
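
To make the departure from RFC 2198 concrete, the per-block use of the
CSRC list quoted above from the replaced draft can be sketched in Python
as follows. This is only an illustration of that quoted rule, not of the
newer "text/rex" format. The function name is hypothetical, and the
sketch assumes a packet without RTP header extension or padding and with
one CSRC member per T140block.

```python
import struct


def map_blocks_to_sources(packet: bytes):
    """Return [(generation, csrc, data)] for a text/red packet from a mixer."""
    first, _pt, _seq, _ts, _ssrc = struct.unpack_from("!BBHII", packet, 0)
    cc = first & 0x0F                      # CC field: number of CSRC members
    csrcs = list(struct.unpack_from(f"!{cc}I", packet, 12))
    pos = 12 + 4 * cc

    # RFC 2198 redundancy headers: 4 bytes each while F=1, then a 1-byte
    # final header (F=0) for the primary block, which has no length field.
    lengths = []
    while packet[pos] & 0x80:
        (word,) = struct.unpack_from("!I", packet, pos)
        lengths.append(word & 0x3FF)       # 10-bit block length
        pos += 4
    pos += 1                               # skip the primary header

    # Blocks are carried oldest first (R2, R1, then P), while the CSRC
    # list is ordered P, R1, R2, so the list is indexed by generation.
    blocks = []
    generations = len(lengths)
    for i, length in enumerate(lengths):
        gen = generations - i              # 2 for R2, 1 for R1
        blocks.append((f"R{gen}", csrcs[gen], packet[pos:pos + length]))
        pos += length
    blocks.append(("P", csrcs[0], packet[pos:]))
    return blocks
```

An empty redundant block still consumes its own CSRC member here, which
matches the requirement that the corresponding CSRC-list member MUST be
included even when a T140block is empty.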

Do you agree that the current draft 
draft-ietf-avtcore-multi-party-rtt-mix-01 solves your concerns?



> Thanks,
> Yong
Gunnar Hellström