Re: [codec] #16: Multicast?

"Raymond (Juin-Hwey) Chen" <rchen@broadcom.com> Fri, 23 April 2010 01:04 UTC

Return-Path: <rchen@broadcom.com>
X-Original-To: codec@core3.amsl.com
Delivered-To: codec@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 70AEF3A67F7 for <codec@core3.amsl.com>; Thu, 22 Apr 2010 18:04:40 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -0.648
X-Spam-Level:
X-Spam-Status: No, score=-0.648 tagged_above=-999 required=5 tests=[AWL=-0.650, BAYES_50=0.001, HTML_MESSAGE=0.001]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id UnVW9tNcRW3H for <codec@core3.amsl.com>; Thu, 22 Apr 2010 18:04:22 -0700 (PDT)
Received: from MMS3.broadcom.com (mms3.broadcom.com [216.31.210.19]) by core3.amsl.com (Postfix) with ESMTP id F0F593A67F3 for <codec@ietf.org>; Thu, 22 Apr 2010 18:04:21 -0700 (PDT)
Received: from [10.9.200.131] by MMS3.broadcom.com with ESMTP (Broadcom SMTP Relay (Email Firewall v6.3.2)); Thu, 22 Apr 2010 18:03:47 -0700
X-Server-Uuid: B55A25B1-5D7D-41F8-BC53-C57E7AD3C201
Received: from IRVEXCHCCR01.corp.ad.broadcom.com ([10.252.49.30]) by IRVEXCHHUB01.corp.ad.broadcom.com ([10.9.200.131]) with mapi; Thu, 22 Apr 2010 18:03:47 -0700
From: "Raymond (Juin-Hwey) Chen" <rchen@broadcom.com>
To: Christian Hoene <hoene@uni-tuebingen.de>, 'stephen botzko' <stephen.botzko@gmail.com>
Date: Thu, 22 Apr 2010 18:03:37 -0700
Thread-Topic: [codec] #16: Multicast?
Thread-Index: AcrhfziX4JGoQYrXTDa3JX/8e0fOvAABOaOgADB221A=
Message-ID: <CB68DF4CFBEF4942881AD37AE1A7E8C74AB3F4A017@IRVEXCHCCR01.corp.ad.broadcom.com>
References: <062.7439ee5d5fd36480e73548f37cb10207@tools.ietf.org> <3E1D8AD1-B28F-41C5-81C6-478A15432224@csperkins.org> <D6C2F445-BE4A-4571-A56D-8712C16887F1@americafree.tv> <C0347188-A2A1-4681-9F1E-0D2ECC4BDB3B@csperkins.org> <u2x6e9223711004210733g823b4777y404b02330c49dec1@mail.gmail.com> <000001cae173$dba012f0$92e038d0$@de> <r2q6e9223711004211010gfdee1a70q972e8239fef10435@mail.gmail.com> <001101cae177$e8aa6780$b9ff3680$@de> <t2t6e9223711004211119i6b107798pa01fc4b1d33debf1@mail.gmail.com> <002d01cae188$a330b2c0$e9921840$@de>
In-Reply-To: <002d01cae188$a330b2c0$e9921840$@de>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-cr-hashedpuzzle: CPE3 DCAO IeZT K//M O3Ww Pq5i RS6i RtlL TICS V/yK WcvW YVRv enaH eruJ f7LT hvQM; 3; YwBvAGQAZQBjAEAAaQBlAHQAZgAuAG8AcgBnADsAaABvAGUAbgBlAEAAdQBuAGkALQB0AHUAZQBiAGkAbgBnAGUAbgAuAGQAZQA7AHMAdABlAHAAaABlAG4ALgBiAG8AdAB6AGsAbwBAAGcAbQBhAGkAbAAuAGMAbwBtAA==; Sosha1_v1; 7; {07308C3E-C940-4EE4-83D5-86951B3FEEF3}; cgBjAGgAZQBuAEAAYgByAG8AYQBkAGMAbwBtAC4AYwBvAG0A; Fri, 23 Apr 2010 01:03:37 GMT; UgBFADoAIABbAGMAbwBkAGUAYwBdACAAIwAxADYAOgAgAE0AdQBsAHQAaQBjAGEAcwB0AD8A
x-cr-puzzleid: {07308C3E-C940-4EE4-83D5-86951B3FEEF3}
acceptlanguage: en-US
MIME-Version: 1.0
X-WSS-ID: 67CE2E7931G100486658-01-01
Content-Type: multipart/alternative; boundary="_000_CB68DF4CFBEF4942881AD37AE1A7E8C74AB3F4A017IRVEXCHCCR01c_"
Cc: "codec@ietf.org" <codec@ietf.org>
Subject: Re: [codec] #16: Multicast?
X-BeenThere: codec@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: Codec WG <codec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/codec>
List-Post: <mailto:codec@ietf.org>
List-Help: <mailto:codec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 23 Apr 2010 01:04:40 -0000

Hi Christian,

My comments on your question about the CODEC requirements are in-line.

Raymond

From: codec-bounces@ietf.org [mailto:codec-bounces@ietf.org] On Behalf Of Christian Hoene
Sent: Wednesday, April 21, 2010 12:27 PM
To: 'stephen botzko'
Cc: codec@ietf.org
Subject: Re: [codec] #16: Multicast?

Hi,

if we take those two scenarios (high quality and scalable teleconferencing), what then are the CODEC requirements?

High quality:

-          Much the same requirements as for an end-to-end audio transmission: high quality and low latency.
[Raymond]: High quality is a given, but I would like to emphasize the importance of low latency.
(1) It is well-known that the longer the latency, the lower the perceived quality of the communication link.  The E-model in the ITU-T Recommendation G.107 models such communication quality in MOS_cqe, which among other things depends on the so-called "delay impairment factor" Id.  Basically, MOS_cqe is a monotonically decreasing function of increasing latency, and beyond about 150 ms one-way delay, the perceived quality of the communication link drops rapidly with further delay increase.
(2) The lower the latency, the less audible the echo, and thus the lower the required echo return loss.  Hence, lower latency means easier echo control and a simpler echo canceller, and as people mentioned previously, below a certain delay an echo is simply perceived as a harmless side-tone and no echo canceller is needed. It seems to me that echo control in conference calls is more difficult than in point-to-point calls.  While I hardly ever hear echoes in domestic point-to-point calls, in my experience with conference calls at work, even with the G.711 codec (which has almost no delay), I sometimes still hear echoes (I just heard another one this afternoon).  If a relatively long-delay IETF codec is used, echo control will be even more problematic.
(3) In normal phone calls or conference calls, people routinely have a need to interrupt each other, but beyond a certain point, long latency makes it very difficult for people to interrupt each other on the call.  This is because when you try to interrupt another person, that person doesn't hear your interruption until a certain time later, so he keeps talking, but when you hear that he did not stop talking when you interrupted, you stop; then, he hears your interruption, so he stops. When you hear he stops, you start talking again, but then he also hears you stopped (due to the long delay), so he also starts talking again.  The net result is that with a long latency, when you try to interrupt him, you and he end up stopping and starting at roughly the same time for a few cycles, making it difficult to interrupt each other.
(4) We need to keep in mind that the IETF codec may not be the only codec involved in a phone call or a conference call.  We cannot assume that all conference call participants will be using a computer to conduct the call. Not only do people use cell phones for point-to-point phone calls, they also often use cell phones to call in to conference calls.  The one-way delay for a cell phone call through one carrier's network is typically around 80 to 110 ms.  A call from a cell phone in one carrier network to another cell phone in a different type of carrier network can easily double this delay to 160 ~ 220 ms, making the total one-way delay far exceed the 150 ms mentioned in (1) above.  Any coding delay added by the IETF codec will be on top of that long delay, and such coding delay will be applied twice when both cell phones call through the IETF codec to a conference bridge.  Even without the IETF codec delay, when I previously called from a Verizon cell phone to an AT&T cell phone, I sometimes experienced the problem mentioned in (3).  If the IETF codec has a relatively long delay, adding two times the IETF codec one-way delay to the already long delay of 160 ~ 220 ms will make the situation much worse.  Even if just one cell phone is involved in a conference call, adding twice the one-way delay of a relatively long-delay IETF codec can still easily push the total one-way delay beyond 150 ms.
To summarize, my point is that to help reduce potential echo problems and to ensure a high-quality experience in such a conference call, the IETF codec should have a delay as low as possible while maintaining good enough speech quality and a reasonable bit-rate.
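The delay dependence described in (1) can be sketched with the widely cited simplified E-model formulas (a rough approximation; the full ITU-T G.107 computation involves many more impairment terms, and the baseline R of 93.2 is the usual default assumption):

```python
def delay_impairment(ta_ms: float) -> float:
    """Simplified delay impairment Id as a function of one-way delay Ta (ms),
    per the common Cole/Rosenbluth approximation of the G.107 E-model."""
    step = 1.0 if ta_ms > 177.3 else 0.0
    return 0.024 * ta_ms + 0.11 * (ta_ms - 177.3) * step

def r_to_mos(r: float) -> float:
    """Map an E-model R factor to an estimated conversational MOS."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Assume the default R of 93.2 with delay as the only impairment:
# quality degrades slowly at first, then rapidly once delay grows large.
for ta in (50, 150, 250, 400):
    mos = r_to_mos(93.2 - delay_impairment(ta))
    print(f"{ta:4d} ms one-way -> MOS ~ {mos:.2f}")
```

Running this shows MOS falling monotonically with delay, with the drop accelerating well past the 150 ms region, which is the behavior the argument above relies on.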

-          Maybe additionally: variable bit rate encoding to achieve a multiplexing gain at the receiver

-          and thus, a fast control loop to cope with variable bitrates on transmission paths.

-          Maybe stereo/multichannel support to send the spatial audio to the headphone or loudspeakers.

Scalable:

-          Efficient encoding/transcoding for multiple different qualities (at the conference bridge)
[Raymond]: I am not sure whether by "efficient", you meant coding efficiency or computational efficiency.  In any case, I would like to take this opportunity to express my view that although codec complexity isn't much of an issue for PC-to-PC calls where there are GHz of processing power available, the codec complexity is an important issue in certain application scenarios.  The following are just some examples.
1) If a conference bridge has to decode a large number of voice channels, mix, and re-encode, and if compressed-domain mixing cannot be done (which is usually the case), then it is important to keep the decoder complexity low.
2) In topology b) of your other email (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or VoIP gateway, often has to encode and decode thousands of voice channels in a single box, so not only the computational complexity, but also the per-instance RAM size requirement of the codec become very important for achieving high channel density in the gateway.
3) Many telephone terminal devices at the edge of the Internet use embedded processors with limited processing power, and the processors also have to handle many tasks other than speech coding.  If the IETF codec complexity is too high, some such devices may not have sufficient processing power to run it.  Even if the codec can fit, some battery-powered mobile devices may prefer to run a lower-complexity codec to reduce power consumption and battery drain.  For example, even if you make an Internet phone call from a computer, you may like the convenience of using a Bluetooth headset that allows you to walk around a bit and have hands-free operation.  Currently most Bluetooth headsets have small form factors with a tiny battery.  This puts a severe constraint on power consumption.  Bluetooth headset chips typically have very limited processing capability, and they also have to handle many other tasks such as echo cancellation and noise reduction.  There is just not enough processing power to handle a relatively high-complexity codec.  Most BT headsets today rely on the extremely low-complexity, hardware-based CVSD codec at 64 kb/s to transmit narrowband voice, but CVSD has audible coding noise, so it degrades the overall audio quality.  If the IETF codec has low enough complexity, it would be possible to directly encode and decode the IETF codec bit-stream at the BT headset, thus avoiding the quality degradation of CVSD transcoding.
In summary, my point is that the IETF codec should attempt to achieve a codec complexity as low as possible in both MHz consumption and RAM size requirement while maintaining good enough speech quality.
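The channel-density argument in 2) can be made concrete with a back-of-the-envelope budget. All of the numbers below are hypothetical, chosen only to illustrate that either MHz or per-instance RAM can be the binding constraint on a gateway DSP:

```python
def max_channels(cpu_mhz_avail: int, ram_kb_avail: int,
                 codec_mhz_per_ch: int, codec_ram_kb_per_ch: int) -> int:
    """Channel density is capped by whichever resource runs out first."""
    by_cpu = cpu_mhz_avail // codec_mhz_per_ch
    by_ram = ram_kb_avail // codec_ram_kb_per_ch
    return int(min(by_cpu, by_ram))

# Hypothetical gateway DSP: 4000 MHz of headroom and 8 MB of on-chip RAM.
print(max_channels(4000, 8192, 15, 32))   # a light 15 MHz / 32 KB codec
print(max_channels(4000, 8192, 40, 160))  # a heavier 40 MHz / 160 KB codec
```

With the lighter codec the RAM budget is the limit (256 channels); with the heavier one, density collapses to 51 channels, which is why both MHz and per-instance RAM matter for gateways.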

-          The control loop need not react fast, because (multicast) group communication requires encoding at low quality anyhow.

-          Low-complexity receiver-side activity detection for music and voice (for the conference bridge)

-          Efficient mixing of two to four(?) active flows (is this achievable without the complete process of decoding and encoding again?)
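On the mixing question in the last bullet: unless compressed-domain mixing is possible, the usual approach is to decode, sum, and re-encode per participant, subtracting each participant's own signal from the sum ("mix-minus"). A minimal sketch of that step in the decoded linear domain, assuming equal-length 16-bit PCM frames per participant:

```python
def mix_minus(decoded_frames: dict) -> dict:
    """Given decoded PCM frames {participant: [samples]}, return for each
    participant the mix of everyone else, clipped to the 16-bit range.
    Sketch only: real mixers also handle gain control and timing alignment."""
    total = [sum(col) for col in zip(*decoded_frames.values())]
    out = {}
    for who, frame in decoded_frames.items():
        out[who] = [max(-32768, min(32767, t - s))
                    for t, s in zip(total, frame)]
    return out

frames = {"a": [100, -200], "b": [10, 20], "c": [-5, 5]}
print(mix_minus(frames)["a"])  # a hears only b + c
```

Each output would then be re-encoded once per participant, which is exactly the per-channel cost the "efficient encoding" bullet is concerned about.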

Are any teleconferencing requirements missing?

 Christian




---------------------------------------------------------------
Dr.-Ing. Christian Hoene
Interactive Communication Systems (ICS), University of Tübingen
Sand 13, 72076 Tübingen, Germany, Phone +49 7071 2970532
http://www.net.uni-tuebingen.de/

From: stephen botzko [mailto:stephen.botzko@gmail.com]
Sent: Wednesday, April 21, 2010 8:19 PM
To: Christian Hoene
Cc: codec@ietf.org
Subject: Re: [codec] #16: Multicast?

Inline
On Wed, Apr 21, 2010 at 1:27 PM, Christian Hoene <hoene@uni-tuebingen.de<mailto:hoene@uni-tuebingen.de>> wrote:
Hi Stephen,

not too bad. You answered faster than the mailing list distributes...
Not sure how that happened!

Comments inline:


From: stephen botzko [mailto:stephen.botzko@gmail.com<mailto:stephen.botzko@gmail.com>]
Sent: Wednesday, April 21, 2010 7:10 PM
To: Christian Hoene
Cc: codec@ietf.org<mailto:codec@ietf.org>

Subject: Re: [codec] #16: Multicast?

I agree there are lots of use cases.


Though I don't see why high quality has to be given up in order to be scalable.
CH: These are just experiences from our lab. A spatial audio conference server including the acoustic 3D sound rendering needs a LOT of processing power. In the end, we have to remain realistic. Processing power is always limited; thus, if we need a lot of it, we cannot serve many clients.
Also, I am not sure why you think central mixing is more scalable than multicast (or why you think it is lower quality either).
CH: With multicast, you need N 1:N multicast distribution trees (somewhat smaller than O(N²)).  With central mixing, you need two transmission paths per participant (O(N)). Also, with distributed mixing you need N-way mixing at each client. With centralized mixing, you can live with one mix for all (plus some tricks for serving the talkers).
I agree you need more distribution trees for multicast if you allow every site to talk. There is a corresponding benefit, since there is no central choke point and also less bandwidth on shared WAN links.

In the distributed case,  you don't need an N-way mixer at each client, and you also don't need to continuously receive payload on all N streams at each client either.  In practice you can cap N at a relatively small number (in the 3-8 range) no matter how large the conference gets.  In a large conference, you can even choose to drop your comfort noise if you are receiving two or more streams, and just send enough to keep your firewall pinhole open.  This is all assuming a suitable voice activity measure in the RTP packet.  Of course in the worst case, you will receive all N streams.
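The scaling contrast in this exchange is easy to quantify. A fully meshed multicast conference where every site may talk carries on the order of N·(N-1) flows, while a central bridge carries 2·N (one up, one down per client); capping the number of simultaneously forwarded talkers at K, as Stephen suggests, bounds what each client must receive regardless of conference size:

```python
def multicast_streams(n: int) -> int:
    # Each of the N senders feeds its own 1-to-(N-1) distribution tree.
    return n * (n - 1)

def central_streams(n: int) -> int:
    # With a central bridge, each client has one upstream and one downstream.
    return 2 * n

def capped_receive(n: int, k: int = 8) -> int:
    # Per-client receive load when only the K most active talkers are forwarded.
    return min(n - 1, k)

for n in (4, 10, 50):
    print(n, multicast_streams(n), central_streams(n), capped_receive(n))
```

The worst case (all N streams arriving at a client) remains, as noted above; the cap only bounds the typical case.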

Cheers,
 Christian

Stephen Botzko
On Wed, Apr 21, 2010 at 12:58 PM, Christian Hoene <hoene@uni-tuebingen.de<mailto:hoene@uni-tuebingen.de>> wrote:
Hi,

the teleconferencing issue gets complex. I am trying to compile the different requirements that have been mentioned on this list.

-          low complexity (with just one active speaker) vs. multiple speaker mixing vs. spatial audio/stereo mixing

-          centralized vs. distributed

-          few participants vs. hundreds of listeners and talkers

-          individual distribution of audio streams vs. IP multicast or RTP group communication

-          efficient encoding of multiple streams having the same content (but different quality).

-           I bet I missed some.

To make things easier, why not split the teleconferencing scenario in two: High quality and Scalable?

The high quality scenario, intended for a low number of users, could have features like

-          Distributed processing and mixing

-          High computational resources to support spatial audio mixing (at the receiver) and multiple encodings of the same audio stream at different qualities (at the sender)

-          Enough bandwidth to allow direct N to N transmissions of audio streams (no multicast or group communication). This would be good for the latency, too.

The scalable scenario is the opposite:

-          Central processing and mixing for many participants.

-          N to 1 and 1 to N communication using efficient distribution mechanisms (RTP group communication and IP multicast).

-          Low complexity mixing of many using tricks like VAD, encoding at lowest rate to support many receivers having different paths, you name it...

Then we need not compare apples with oranges all the time.

Christian

---------------------------------------------------------------
Dr.-Ing. Christian Hoene
Interactive Communication Systems (ICS), University of Tübingen
Sand 13, 72076 Tübingen, Germany, Phone +49 7071 2970532
http://www.net.uni-tuebingen.de/

From: codec-bounces@ietf.org<mailto:codec-bounces@ietf.org> [mailto:codec-bounces@ietf.org<mailto:codec-bounces@ietf.org>] On Behalf Of stephen botzko
Sent: Wednesday, April 21, 2010 4:34 PM
To: Colin Perkins
Cc: trac@tools.ietf.org<mailto:trac@tools.ietf.org>; codec@ietf.org<mailto:codec@ietf.org>
Subject: Re: [codec] #16: Multicast?

in-line

Stephen Botzko
On Wed, Apr 21, 2010 at 8:17 AM, Colin Perkins <csp@csperkins.org<mailto:csp@csperkins.org>> wrote:
On 21 Apr 2010, at 12:20, Marshall Eubanks wrote:
On Apr 21, 2010, at 6:48 AM, Colin Perkins wrote:
On 21 Apr 2010, at 10:42, codec issue tracker wrote:
#16: Multicast?
------------------------------------+----------------------------------
Reporter:  hoene@...                 |       Owner:
 Type:  enhancement             |      Status:  new
Priority:  trivial                 |   Milestone:
Component:  requirements            |     Version:
Severity:  Active WG Document      |    Keywords:
------------------------------------+----------------------------------
The question arose whether the interactive CODEC MUST support multicast in addition to teleconferencing.

On 04/13/2010 11:35 AM, Christian Hoene wrote:
P.S. On the same note, does anybody here care about using this CODEC with multicast? Is there a single commercial multicast voice deployment? From what I've seen, all multicast does is make IETF voice standards harder to understand or implement.

I think that would be a mistake to ignore multicast - not because of multicast itself, but because of Xcast (RFC 5058) which is a promising technology to replace centralized conference bridges.

Regarding multicast:

I think we shall start at user requirements and scenarios. Teleconferencing (including mono or spatial audio) might be a good starting point. Virtual environments like Second Life would require multicast communication, too. If the requirements of these scenarios are well understood, we can start to talk about potential solutions like IP multicast, Xcast, or conference bridges.


RTP is inherently a group communication protocol, and any codec designed for use with RTP should consider operation in various different types of group communication scenario (not just multicast). RFC 5117 is a good place to start when considering the different types of topology in which RTP is used, and the possible placement of mixing and switching functions which the codec will need to work with.

It is not clear to me what supporting multicast would entail here. If this is a codec over RTP, then what is to stop it from being multicast ?

Nothing. However group conferences implemented using multicast require end system mixing of potentially large numbers of active audio streams, whereas those implemented using conference bridges do the mixing in a single central location, and generally suppress all but one speaker. The differences in mixing and the number of simultaneous active streams that might be received potentially affect the design of the codec.

Conference bridges with central mixing almost always mix multiple speakers.  As you add more streams into the mix, you reduce the chance of missing onset speech and interruptions, but raise the noise floor. So even if complexity is not a consideration, there is value in gating the mixer (instead of always doing a full mix-minus).
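One simple form of the gating described above is to rank incoming streams by short-term energy and mix only the few most active ones. This is purely an illustrative sketch (real bridges typically use the voice activity measure carried in RTP rather than raw energy, and apply hangover logic):

```python
def gate_streams(frames: dict, k: int = 3) -> dict:
    """Keep only the k most active streams, ranked by short-term energy,
    so the mixer never sums more than k inputs."""
    def energy(samples):
        return sum(s * s for s in samples) / max(1, len(samples))
    ranked = sorted(frames.items(), key=lambda kv: energy(kv[1]), reverse=True)
    return dict(ranked[:k])

frames = {
    "talker1": [900, -850, 910],  # loud, active speech
    "talker2": [400, -390, 380],  # an interrupter
    "idle1":   [3, -2, 4],        # background noise only
    "idle2":   [1, 0, -1],
}
print(sorted(gate_streams(frames, k=2)))
```

Raising k reduces the chance of clipping onset speech and interruptions, but, as noted above, also raises the mixed noise floor.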

More on point, compressed domain mixing and easy detection of VAD have both been advocated on these lists, and both simplify the large-scale mixing problem.

--
Colin Perkins
http://csperkins.org/



_______________________________________________
codec mailing list
codec@ietf.org<mailto:codec@ietf.org>
https://www.ietf.org/mailman/listinfo/codec