Re: [AVTCORE] Question on multi-party RTT handling (draft-ietf-avtcore-multi-party-rtt-mix-02)

Gunnar Hellström <> Wed, 27 May 2020 06:01 UTC

To: Yong Xin <>
From: Gunnar Hellström <>
Message-ID: <>
Date: Wed, 27 May 2020 08:00:58 +0200

Thanks, Yong, for the feedback.

We discussed performance requirements initially. I have checked the
requirements from regulation outside of the emergency services. There,
the only requirement is to identify the source and to be able to switch
source smoothly, in an orderly way, between users taking turns properly.

In practice I know that there will be some simultaneous typing, at least
for very brief indications like "I want to say something" or "that
suits me". So good switching performance, without much extra delay,
between 5 simultaneously typing users, which is the documented goal in
draft-hellstrom-avtcore-multi-party-rtt-solutions-00, seems to be a safe
and maybe slightly high requirement.

For the intended use in emergency services, the situation is similar.
I expect eager users in an emergency to sometimes type longer
explanations while they get instructions, but fewer extra
persons to jump in with comments. So there too, switching without too
much extra delay between a maximum of 5 simultaneous senders seems to be
a safe and slightly high requirement.



On 2020-05-26 at 19:12, Yong Xin wrote:
> Thanks Gunnar for the clarification and suggested change in next 
> revision – they all look good to me
> Regards,
> Yong
> *From:* Gunnar Hellström <>
> *Sent:* Saturday, May 23, 2020 6:24 AM
> *To:* Yong Xin <>
> *Cc:*
> *Subject:* Re: Question on multi-party RTT handling 
> (draft-ietf-avtcore-multi-party-rtt-mix-02)
> Hi Yong, please see inline,
> On 2020-05-23 at 01:31, Yong Xin wrote:
>         Thanks for the quick response. The latest spec does address my
>         concern. I have some follow-up questions:
>           * The new payload format “text/rex” can be used with or
>             without redundancy. When redundancy is used, mixer has to
>             use the same redundancy level when transmitting texts from
>             multiple sources. If the different party in the same
>             conference has negotiated a different redundancy level,
>             the mixer has to pick the lowest level to use, right?
>     No. There are two sides to this answer:
>     The mixer should do separate mixing for each recipient, using the
>     redundancy level agreed with each recipient. This is also because
>     the users do not want to see their own transmitted text being
>     received from the mixer. Their own text is displayed locally by the
>     endpoints. If the recipient does not support "text/rex", the mixer
>     also needs to do the mixing for multi-party unaware endpoints using
>     the "text/red" format described in section 13.2.
>     [YX] Understood. The text mixer provides N-1 mixing, similar
>     to audio mixing, so users never receive their own transmitted
>     text from the mixer.
> [GH] Yes. I see I have that specified for the multi-party unaware
> mixing in 13.2.2 by this sentence: "Text received from a participant
> SHOULD NOT be included in transmission to that participant." I suggest
> I include a similar sentence in 13.1 for the multi-party aware case.
>     And the mixer must recover from loss in reception from each source
>     and create a queue of clean text from each source before composing
>     the packets for transmission. The mixer cannot just resend
>     received packet contents with redundancy, because the recovery
>     mechanism requires the sequence number gaps for loss detection,
>     and the mixer must create its own sequence number series in the
>     transmission.
>     [YX] I agree with what you said here. I think I’m a little confused
>     by the following paragraph in section 3 of the spec. As an
>     example, take a 3-party conference where all
>     participants (A, B, C) are conference-aware RTT terminals that
>     support the text/rex packet format. Users A, B and C negotiate
>     redundancy levels 2, 3 and 1 respectively. When the mixer transmits
>     text (from sources B and C) to user A, what number of
>     redundant generations should the mixer use in the
>     transmitted packet? Is it 2 or 1?
> [GH] It is 1. I see my wording is confusing. I suggest improving the
> yellow-marked sentence from the paragraph below to read: "It SHOULD be set
> to the minimum of the number declared by the two parties in the SDP
> exchange."
> It can be discussed whether the minimum of what the parties declared is
> the best choice. It is quite common that SDP parameters declare what the
> party wants to receive. In this case the number can be decided by the
> party that knows the network conditions, or a party can know that it
> has decoding limitations and does not want to receive more than a
> specific number of generations in the packets. It may also have coding
> limitations, so that it cannot create more generations than it can itself
> receive. That made me think that the resulting number of generations
> sent should be the minimum of what the two parties in each connection
> declared. I can change that to say that "It SHOULD be set to what the
> other party declared". What is actually used in the stream is found by
> dividing the number of data header entries by the number of members
> in the CSRC list.
> This discussion is a bit theoretical, because the default is one
> primary and two redundant generations, and there is rarely any reason
> to deviate from that.
>        The number of redundant generations of T140blocks to include in
>        transmitted packets SHALL be deducted from the SDP negotiation.  It
>        SHOULD be set to the minimum of the number declared by the receiver
>        and the transmitter.  The same number of redundant generations MUST
>        be used for all sources in the transmissions.  The number of
>        generations sent to a receiver SHALL be the same during the whole
>        session unless it is modified by session renegotiation.
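
As a rough illustration of the two calculations discussed above, here is a minimal Python sketch. The function names and the example numbers are hypothetical and not taken from the draft; the first function follows the "minimum of what both parties declared" reading, and the second follows the remark that the number actually used in a stream is the number of data header entries divided by the number of members in the CSRC list.

```python
def generations_to_send(declared_by_receiver: int, declared_by_transmitter: int) -> int:
    """Generations to use towards one receiver: the minimum both sides declared."""
    if declared_by_receiver < 0 or declared_by_transmitter < 0:
        raise ValueError("declared generation counts cannot be negative")
    return min(declared_by_receiver, declared_by_transmitter)


def generations_in_packet(header_entries: int, csrc_count: int) -> int:
    """Generations carried per source in a received packet.

    Assumes every source in the CSRC list has the same number of header
    entries, as required by "the same number of redundant generations MUST
    be used for all sources".
    """
    if csrc_count <= 0:
        raise ValueError("the CSRC list must have at least one member")
    quotient, remainder = divmod(header_entries, csrc_count)
    if remainder:
        raise ValueError("header entries are not a multiple of the CSRC count")
    return quotient


# Hypothetical values: receiver declared 2, transmitter declared 3.
print(generations_to_send(2, 3))      # 2
# Hypothetical packet: 6 header entries shared by 2 sources.
print(generations_in_packet(6, 2))    # 3
```

If the alternative wording ("what the other party declared") were chosen instead, only the first function would change.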
>           * But if one party has negotiated “text/rex”
>             without any redundancy level, does that mean the mixer has
>             to turn off the redundancy for this conference? Does the
>             mixer need to change the redundancy level up and down
>             dynamically as users join or leave the conference? Does
>             the mixer need to send a re-INVITE to re-negotiate the
>             redundancy level with the other parties when such a change
>             happens?
>     From the logic above, the answer to all of these questions is no. I
>     realize that an explanation should be inserted at the beginning of
>     section 3, "Actions at transmission by the mixer", to clarify that
>     the source for transmission from the mixer is clean text in
>     separate queues, regardless of which format or protocol was used
>     in the individual receptions.
>     [YX] This is related to the above question. Some clarification
>     in the spec would be helpful.
>           * In section 12, I noticed the 150cps recommendation is
>             still there and has been made as default value for the new
>             packet format, but the transmission interval is back to
>             300ms (the recommended interval was 100ms in the old
>             spec). I guess with the new packet format, it is not
>             required to use the shorter transmission interval any more.
>     The transmission interval is mentioned in two paragraphs in
>     section 3. One says that the default is 300 ms; the other says:
>     "For multi-party operation, it is RECOMMENDED that the mixer sends
>     a packet to each receiver as soon as text has been received from a
>     source as long as the maximum number of characters per second
>     indicated by the recipient is not exceeded, and also the number of
>     packets sent per second to a recipient is kept under a specified
>     number.  This number SHALL be 10 if no other limit is applied for
>     the application.  The intention is to keep the latency introduced
>     by the mixer low."
>     This is intended to create a balance between low latency and
>     protection against bursty packet loss. Even if the latency
>     requirements of real-time text users are much less strict than
>     those of audio and video users, a low latency is appreciated, and
>     an end-to-end latency of over 2 seconds creates conversation
>     problems. With the rule in this paragraph about when to transmit,
>     the packet interval will self-regulate to about 100 ms from about 3
>     simultaneously typing sources.
>     The 300 ms default assures that the remaining redundancy
>     transmissions will be sent even shortly after all sources have
>     stopped typing.
>     [YX] So are you saying the recommended transmission interval in
>     the new spec is still 100ms, and the 300ms is actually the time
>     that covers 3 transmissions (one primary transmission plus 2
>     redundancy transmissions, assuming redundancy level 2)? Is my
>     understanding correct? I guess this would be better clarified in
>     the new spec. The old spec is much clearer in this definition.
> [GH] The intention now is to not have a fixed interval, but to have it
> variable between 100 and 300 ms. This is in order to not add delay in
> the mixer when it is possible to avoid it.
> So, when the mixer receives a packet, it checks, for each party it is
> sending to, whether it has already sent 10 packets to that party during
> the last second. If not, it sends the just received text immediately on
> to that party together with whatever redundant text there was to be sent.
> If it has already sent 10 packets, then this text needs to wait in
> the queue until 300 ms has passed from the latest transmission, and is
> then sent together with any other new text and the redundancy.
> Whoops, there was a logic error in that algorithm. It can allow a burst
> of 10 packets sent to one party with very short intervals, and then three
> more packets during the same second with 300 ms interval. That causes
> a risk for loss of text if a short burst loss happens when the 10
> packets are sent. And it results in 13 packets sent when the limit was
> intended to be 10.
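
The burst described above can be reproduced with a small Python sketch. This is only an illustration under stated assumptions: the function name, the millisecond tick loop and the arrival pattern (new text every 10 ms) are hypothetical and not from the draft. The rule simulated is the one just described: send immediately if fewer than 10 packets went out in the last second, otherwise wait until 300 ms after the latest transmission.

```python
def simulate_original_rule(arrival_times_ms, horizon_ms=1000,
                           max_per_second=10, wait_ms=300):
    """Return the send times (ms) towards one receiver under the original rule."""
    arrivals = set(arrival_times_ms)
    sent = []          # times at which a packet was sent
    pending = False    # True while received text is waiting to be sent
    for t in range(horizon_ms):
        if t in arrivals:
            pending = True
        if not pending:
            continue
        # Packets sent during the last second, counted at time t.
        recent = [s for s in sent if t - s < 1000]
        if len(recent) < max_per_second or t - sent[-1] >= wait_ms:
            sent.append(t)
            pending = False
    return sent


# Hypothetical load: new text arrives every 10 ms during one second.
send_times = simulate_original_rule(range(0, 1000, 10))
print(len(send_times))   # 13
print(send_times[10:])   # [390, 690, 990]
```

The first ten packets go out within 90 ms, which is the short burst that is vulnerable to loss, and three more follow at 300 ms spacing within the same second.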
> So, it is likely better to say in chapter 3:
> "A "text/rex" transmitter SHOULD send packets distributed in time as
> long as there is something (new or redundant T140blocks) to transmit.
> The maximum transmission interval SHOULD then be 300 ms. It is
> RECOMMENDED to send a packet to a receiver as soon as new text to
> that receiver is available, as long as the time after the latest sent
> packet to the same receiver is more than 100 ms, and also the maximum
> character rate to the receiver is not exceeded. The intention is to
> keep the latency low while keeping a good protection against text loss
> in bursty packet loss conditions."
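
One possible reading of the proposed wording, sketched in Python. The names and the per-receiver state are hypothetical, and the character rate limit is deliberately left out. New text goes out as soon as 100 ms have passed since the latest packet to that receiver, and pending redundancy alone is sent at the 300 ms maximum interval.

```python
from typing import Optional

MIN_INTERVAL_MS = 100   # shortest gap between packets to one receiver
MAX_INTERVAL_MS = 300   # longest gap while something remains to be sent


def next_send_time_ms(now_ms: int, last_sent_ms: Optional[int],
                      has_new_text: bool, has_redundancy: bool) -> Optional[int]:
    """Earliest time at which the next packet to this receiver should be sent.

    Returns None when there is nothing (new or redundant) to transmit.
    The maximum character rate check from the proposed text is not modelled.
    """
    if not (has_new_text or has_redundancy):
        return None
    if last_sent_ms is None:          # nothing sent to this receiver yet
        return now_ms
    gap = MIN_INTERVAL_MS if has_new_text else MAX_INTERVAL_MS
    return max(now_ms, last_sent_ms + gap)


# New text arrives 50 ms after the latest packet: wait another 50 ms.
print(next_send_time_ms(1000, 950, True, False))    # 1050
# Only redundancy is left, latest packet 100 ms ago: send at the 300 ms mark.
print(next_send_time_ms(1000, 900, False, True))    # 1200
```

With this reading, the delay added to new text is at most about 100 ms, and a short burst of back-to-back packets cannot occur.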
> The introduced delay for up to 16 simultaneously sending users will be
> between 0 and 100 ms. I think it is good that a middlebox like this
> does not introduce more latency. I hope we can have a discussion on
> whether we instead get too high a risk for text loss in case of bursty
> packet loss. With intensively typing participants the coverage for such
> loss is only 200-300 ms, while at low traffic it is 600-900 ms.
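
The coverage figures above can be checked with a short calculation. This Python sketch assumes the default of two redundant generations mentioned earlier in the thread. The function name and the interpretation are mine: a loss burst shorter than two packet intervals can hit at most two consecutive packets, so it is always recoverable, while a burst approaching three intervals can take out all three copies.

```python
from typing import Tuple


def burst_coverage_ms(interval_ms: int, redundant_generations: int = 2) -> Tuple[int, int]:
    """Range of loss-burst lengths (ms) around the limit of what redundancy recovers.

    The lower bound is the burst length that is always survivable; beyond the
    upper bound every copy of a text block is lost.
    """
    low = redundant_generations * interval_ms
    high = (redundant_generations + 1) * interval_ms
    return low, high


print(burst_coverage_ms(100))   # (200, 300), many simultaneous typists
print(burst_coverage_ms(300))   # (600, 900), low traffic
```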
>     However, this algorithm may make the protection against bursty
>     loss weaker than with a steady 300 ms interval. With between 17
>     and 32 simultaneously typing users, the latency caused by the mixer
>     will be around 300 ms, which still meets both regulatory and human
>     requirements.
>     Now, even if it meets the requirements, 32 is a very unrealistic
>     number of simultaneously typing users. In audio conferences it is
>     only possible to perceive one source at a time well. The benefit
>     of enabling more is just for noticing that someone else wants to
>     say something. Requirements for this work are collected in
>     draft-hellstrom-avtcore-multi-party-rtt-solutions-00, and there
>     the performance requirements are set to be valid for up to 5
>     simultaneously transmitting users, with the delay caused by the
>     mixer being less than 500 ms. I think we should design for these
>     figures.
>           * And you mentioned that these characteristics provide for a
>             smooth flow of text with acceptable latency from at least
>             32 sources simultaneously. Since the new packet format can
>             support up to 16 sources per packet, the text from 32
>             sources will have to be transmitted in turns. If my
>             calculation is correct, with a 300ms transmission interval
>             and redundancy level 2, it will take 900ms (one primary +
>             2 redundant) for the mixer to switch from the first 16
>             sources to the next 16 sources, so the delay is about
>             900ms. Is this an acceptable latency in your mind?
>     See discussions above. Maybe regulators need to say how many
>     simultaneous users the requirements are for. I think 5 is a high
>     and good figure even if the discussion above indicates 32 to be
>     possible.
>     [YX] Yes, I agree. At least for emergency types of service, I
>     don’t see a multi-party use case that requires more than 5 parties.
> [GH] I want to create a specification that is regarded as sufficient for
> all realistic use cases. The most commonly foreseen emergency service
> use case has only two occasionally simultaneously transmitting users.
> It is hard to imagine a use case with 5 simultaneously sending
> participants, where it would be important for a user to be able to
> read text from one source with less than one second delay.
> It does not happen for audio. As soon as two people talk, the result is
> usually not perceivable. (Except in the efforts you experience now in
> Corona times, with a choir of individual conference participants
> trying to sing together.)
> With video, you cannot concentrate on what more than one person does.
> But the mixer can usually present up to 9 or 16 or so with maintained
> temporal resolution but less spatial resolution. The receiver selects
> one to perceive.
> One application that could cause many simultaneous sources of text
> would be a conference with voice-to-text translations in many
> languages for a large audience. That will require a user interface at
> the receiving end that shows one selected text stream and hides
> everything else. I imagine that there are other mechanisms for that
> already. It cannot be the application we design for, but I don't mind
> if it turns out to be possible.
> This reasoning again suggests that a goal of 5 simultaneous text sources
> with less than 500 ms delay is high but maybe realistic in some
> situations, and that our solution, which provides for 16 simultaneously
> transmitting users and 0-100 ms delay, meets the requirement well. I do
> not think we will need to mention that 32 simultaneously transmitting
> users still meet the requirements.
> I will adjust the performance chapter and the introduction for this.
>     [GH] Thanks.
> -- 
> Gunnar Hellström
> GHAccess

Gunnar Hellström