Re: [AVTCORE] Question on multi-party RTT handling (draft-ietf-avtcore-multi-party-rtt-mix-02)

Gunnar Hellström <> Sat, 23 May 2020 13:24 UTC

Hi Yong, please see inline,

On 2020-05-23 at 01:31, Yong Xin wrote:
>     Thanks for the quick response. The latest spec does address my
>     concern. I have some follow-up questions:
>       * The new payload format “text/rex” can be used with or without
>         redundancy. When redundancy is used, the mixer has to use the
>         same redundancy level when transmitting text from multiple
>         sources. If different parties in the same conference have
>         negotiated different redundancy levels, the mixer has to pick
>         the lowest level to use, right?
> No. There are two sides of this answer:
> The mixer should do separate mixing for each recipient, using the 
> redundancy level agreed with each recipient. This is also because 
> users do not want to see their own transmitted text being received 
> from the mixer. Their own text is displayed locally by the endpoints. 
> If the recipient does not support "text/rex", the mixer also needs to 
> do the mixing for multi-party unaware endpoints using the "text/red" 
> format described in section 13.2.
> [YX] Understood, the text mixer is providing N-1 mixing, similar to 
> audio mixing, so users never receive their own transmitted text 
> from the mixer.
[GH] Yes. I see that I have this specified for the multi-party unaware 
mixing in 13.2.2 by this sentence: "Text received from a participant 
SHOULD NOT be included in transmission to that participant." I suggest I 
include a similar sentence in 13.1 for the multi-party aware case.
> And the mixer must recover from loss in reception from each source and 
> create a queue of clean text from each source before composing the 
> packets for transmission. The mixer cannot just resend received packet 
> contents with redundancy, because the recovery mechanism requires the 
> sequence number gaps for loss detection, and the mixer must create its 
> own sequence number series in the transmission.
> [YX] I agree with what you said here. I think I'm a little confused 
> when reading the following paragraph in section 3 of the spec. Let me 
> give an example: there is a 3-party conference, and all participants 
> (A, B, C) are conference-aware RTT terminals that support the text/rex 
> packet format. Users A, B, and C negotiate different redundancy levels 
> of 2, 3, and 1 respectively. When the mixer transmits text (sourced 
> from B and C) to user A, what number of redundant generations should 
> the mixer use in the transmitted packet? Is it 2 or 1?
[GH] It is 1. I see my wording is confusing. I suggest improving the 
yellow sentence from the paragraph below to read: "It SHOULD be set to 
the minimum of the number declared by the two parties in the SDP 

It can be discussed whether the minimum of what the parties declared is 
the best choice. It is quite common that SDP parameters declare what the 
party wants to receive. In this case the number can be decided by the 
party that knows the network conditions, or by a party that knows it 
has decoding limitations and does not want to receive more than a 
specific number of generations in the packets. A party may also have 
coding limitations, so that it cannot create more generations than it 
can itself receive. That made me think that the resulting number of 
generations sent should be the minimum of what the two parties in each 
connection declared. I can change that to say that "It SHOULD be set to 
what the other party declared". What is actually used in the stream is 
found by dividing the number of data header entries by the number of 
members in the CSRC list.
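
The two candidate rules and the division mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names, the "rule" parameter, and the example numbers are assumptions made for the sketch and do not come from the draft.

```python
def generations_to_send(declared_by_sender: int, declared_by_receiver: int,
                        rule: str = "minimum") -> int:
    """Number of generations to use on one connection (illustrative only).

    rule="minimum":     the minimum of what the two parties declared in SDP.
    rule="other_party": simply what the other (receiving) party declared.
    """
    if rule == "minimum":
        return min(declared_by_sender, declared_by_receiver)
    if rule == "other_party":
        return declared_by_receiver
    raise ValueError(f"unknown rule: {rule}")


def generations_in_packet(header_entries: int, csrc_count: int) -> int:
    """Generations actually used in a received packet:
    number of data header entries divided by number of CSRC members."""
    if csrc_count <= 0 or header_entries % csrc_count != 0:
        raise ValueError("header entries must be a whole multiple "
                         "of a non-zero CSRC count")
    return header_entries // csrc_count


# Hypothetical values: the sender declared 3, the receiver declared 2.
print(generations_to_send(3, 2))                      # minimum rule
print(generations_to_send(3, 2, rule="other_party"))  # other-party rule
# Hypothetical packet: 6 data header entries, 2 members in the CSRC list.
print(generations_in_packet(6, 2))
```

The two rules only differ when the sending side declared fewer generations than the receiving side.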

This discussion is a bit theoretical, because the default is one primary 
and two redundant generations, and there is rarely any reason to 
deviate from that.

>    The number of redundant generations of T140blocks to include in
>    transmitted packets SHALL be deducted from the SDP negotiation. It
>    SHOULD be set to the minimum of the number declared by the receiver
>    and the transmitter.  The same number of redundant generations MUST
>    be used for all sources in the transmissions.  The number of
>    generations sent to a receiver SHALL be the same during the whole
>    session unless it is modified by session renegotiation.
>       * But in case one party has negotiated “text/rex” without any
>         redundancy level, does that mean the mixer has to turn off
>         the redundancy for this conference? Does the mixer need to
>         change the redundancy level up and down dynamically as users
>         join or leave the conference? Does the mixer need to send a
>         re-INVITE to re-negotiate the redundancy level with the other
>         parties when such a change happens?
> From the logic above, the answers on these questions are: no. I 
> realize that an explanation should be inserted in the beginning of 
> section 3. "Actions at transmission by the mixer" to clarify that the 
> source for transmission from the mixer is clean text in separate 
> queues regardless of which format or protocol they used in the 
> individual receptions.
> [YX] This is related to the above question. Some clarification in 
> the spec would be helpful.
>       * In section 12, I noticed the 150cps recommendation is still
>         there and has been made as default value for the new packet
>         format, but the transmission interval is back to 300ms (the
>         recommended interval was 100ms in the old spec). I guess with
>         the new packet format, it is not required to use the shorter
>         transmission interval any more.
> The transmission interval is mentioned in two paragraphs in section 3. 
> One saying that the default is 300 ms, the other saying:
> "For multi-party operation, it is RECOMMENDED that the mixer sends a 
> packet to each receiver as soon as text has been received from a 
> source as long as the maximum number of characters per second 
> indicated by the recipient is not exceeded, and also the number of 
> packets sent per second to a recipient is kept under a specified 
> number.  This number SHALL be 10 if no other limit is applied for the 
> application.  The intention is to keep the latency introduced by the 
> mixer low."
> This is intended to create a balance between low latency and 
> protection against bursty packet loss. Even if the latency 
> requirements from real-time text users are much lower than from audio 
> and video users, a low latency is appreciated, and latency of over 2 
> seconds end-to-end creates conversation problems.  Therefore, this 
> paragraph about when to transmit will self-regulate to about 100 ms 
> packet interval from about 3 simultaneous typing sources.
> The 300 ms default assures that the remaining redundancy transmissions 
> will be sent even shortly after all sources have stopped typing.
> [YX] So are you saying that the recommended transmission interval in 
> the new spec is still 100 ms, and that the 300 ms is actually the time 
> that covers 3 transmissions (one primary transmission plus 2 
> redundant transmissions, assuming redundancy level 2)? Is my 
> understanding correct? I think this would be better clarified in the 
> new spec. The old spec is much clearer on this definition.
[GH] The intention now is not to have a fixed interval, but to let it 
vary between 100 and 300 ms. This is in order not to add delay in 
the mixer when it is possible to avoid it.

So, when the mixer receives a packet, it checks, for each party it is 
sending to, whether it has already sent 10 packets to that party during 
the last second. If not, it immediately sends the just-received text on 
to that party together with whatever redundant text is due to be sent. 
If it has already sent 10 packets, the text needs to wait in the queue 
until 300 ms have passed since the latest transmission, and is then sent 
together with any other new text and the redundancy.

Whoops, there is a logic error in that algorithm. It can allow a burst 
of 10 packets sent to one party at very short intervals, and then three 
more packets during the same second at 300 ms intervals. That creates a 
risk of text loss if a short burst loss happens while the 10 packets are 
being sent. And it results in 13 packets being sent when the limit was 
intended to be 10.
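
The 13-packet case can be reproduced with a small simulation. The following Python sketch models only the rule as described in this message, for a single receiver, with times in milliseconds; the function name and the assumed input (new text arriving every 10 ms for one second) are made up for the illustration.

```python
def flawed_send_times(arrivals_ms, max_per_second=10, fallback_ms=300):
    """Send times (ms) to one receiver under the rule with the logic error.

    arrivals_ms: sorted times at which new text for this receiver arrives.
    """
    sent = []
    for t in arrivals_ms:
        if sent and t <= sent[-1]:
            # Text arriving before an already scheduled packet is
            # bundled into that packet, so no extra packet is needed.
            continue
        # Packets sent during the last second, seen from time t.
        recent = [s for s in sent if s > t - 1000]
        if len(recent) < max_per_second:
            sent.append(t)  # under the limit: forward immediately
        else:
            # Over the limit: wait until 300 ms after the latest packet.
            sent.append(max(t, sent[-1] + fallback_ms))
    return sent


# Assumed input: new text every 10 ms during one second.
times = flawed_send_times(list(range(0, 1000, 10)))
print(times)       # ten packets at 0..90 ms, then 390, 690, 990
print(len(times))  # 13
```

The first ten packets all fall within 90 ms, which is the burst that a short loss period could wipe out, and the three deferred packets bring the total for that second to 13.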

So, it is likely better to say in chapter 3:

" A "text/rex" transmitter SHOULD send packets distributed in time as 
long as there is something (new or redundant T140blocks) to transmit. The 
maximum transmission interval SHOULD then be 300 ms. It is RECOMMENDED to 
send a packet to a receiver as soon as new text to that receiver is 
available, as long as the time after the latest sent packet to the same 
receiver is more than 100 ms, and also the maximum character rate to the 
receiver is not exceeded. The intention is to keep the latency low while 
keeping a good protection against text loss in bursty packet loss 

The introduced delay for up to 16 simultaneously sending users will be 
between 0 and 100 ms. I think it is good that a middlebox like this does 
not introduce more latency. I hope we can discuss whether we instead 
get too high a risk of text loss in case of bursty packet loss. With 
intensive participants the coverage for such loss is only 200-300 ms, 
while at low traffic it is 600-900 ms.
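
A rough Python sketch of the proposed pacing rule and of the coverage figures follows. It rests on several assumptions that are not in the draft: the function names are invented, "more than 100 ms" is simplified to "at least 100 ms", the 300 ms maximum interval for pending redundancy is not modelled, and the coverage formula (number of redundant generations times the packet interval, up to one interval more) is just one reading that reproduces the figures quoted above for two redundant generations.

```python
def paced_send_times(arrivals_ms, min_gap_ms=100):
    """Send times (ms) to one receiver: send new text as soon as possible,
    but keep at least min_gap_ms between packets (simplifying assumption)."""
    sent = []
    for t in arrivals_ms:
        if sent and t <= sent[-1]:
            # Bundled into the packet already scheduled at sent[-1].
            continue
        earliest = sent[-1] + min_gap_ms if sent else t
        sent.append(max(t, earliest))
    return sent


def burst_coverage_ms(interval_ms, redundant_generations=2):
    """Assumed range of burst-loss length that redundancy can cover."""
    return (redundant_generations * interval_ms,
            (redundant_generations + 1) * interval_ms)


# Assumed input: new text every 10 ms during one second.
times = paced_send_times(list(range(0, 1000, 10)))
print(sum(1 for s in times if s < 1000))  # 10 packets in the first second
print(burst_coverage_ms(100))  # (200, 300): intensive traffic
print(burst_coverage_ms(300))  # (600, 900): low traffic
```

With the same dense input that produced 13 packets under the earlier rule, the packets are now spread evenly at 100 ms intervals and no text waits longer than about 100 ms.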

> However, this algorithm may make the protection against bursty loss 
> weaker than with a steady 300 ms interval. With between 17 and 32 
> simultaneous typing users, the latency caused by the mixer will be 
> around 300 ms and then passes both regulatory and human requirements.
> Now, even if it passes the requirements, 32 is a very unrealistic 
> number of simultaneous typing users. In audio conferences it is only 
> possible to perceive one source at a time well. The benefit of 
> enabling more is just for noticing that someone else wants to say 
> something. Requirements for this work are collected in 
> draft-hellstrom-avtcore-multi-party-rtt-solutions-00, and there the 
> performance requirements are set to be valid for up to 5 
> simultaneously transmitting users and the delay caused by the mixer to 
> be less than 500 ms. I think we should design for these figures.
>       * And you mentioned these characteristics provide for smooth
>         flow of text with acceptable latency from at least 32 sources
>         simultaneously. Since the new packet format can support up to
>         16 sources per packet, the text from 32 sources will have to be
>         transmitted in turn. If my calculation is correct, with 300ms
>         transmission interval and redundancy level 2, it will take
>         900ms (one primary + 2 redundant) for mixer to switch from
>         first 16 sources to next 16 sources, so the delay is about
>         900ms. Is this the acceptable latency in your mind?
> See discussions above. Maybe regulators need to say how many 
> simultaneous users the requirements are for. I think 5 is a high and 
> good figure even if the discussion above indicates 32 to be possible.
> [YX] Yes, I agree. At least for emergency types of service, I don’t 
> see a multi-party use case that requires more than 5 parties.
[GH] I want to create a specification that is regarded as sufficient for 
all realistic use cases. The most commonly foreseen emergency service 
use case has only two occasionally simultaneously transmitting users.

It is hard to imagine a use case with 5 simultaneously sending 
participants, where it would be important for a user to be able to read 
text from one source with less than one second delay.

It does not happen for audio. As soon as two people talk, the result is 
usually not perceivable. (Except for the efforts you experience now in 
Corona times, with a choir of individual conference participants trying 
to sing together.)

With video, you cannot concentrate on what more than one participant 
does. But the mixer can usually present up to 9 or 16 or so with 
maintained temporal resolution but less spatial resolution. The receiver 
selects one to 

One application that could cause many simultaneous sources of text would 
be a conference with voice-to-text translations in many languages for a 
large audience. That will require a user interface at the receiving end 
that shows one selected text stream and hides everything else. I imagine 
that there are other mechanisms for that already. It cannot be the 
application we design for, but I don't mind if it will be possible.

This reasoning again indicates that a goal of 5 simultaneous text 
sources with less than 500 ms delay is high but maybe realistic in some 
situations, and our solution, which provides for 16 simultaneously 
transmitting users and 0-100 ms delay, meets the requirement well. I do 
not think we need to mention that 32 simultaneously transmitting 
users would still meet the requirements.

I will adjust the performance chapter and the introduction for this.

> [GH] Thanks.
Gunnar Hellström