Re: [codec] #3: 2.2. Conferencing: Support of binaural audio?

"codec issue tracker" <trac@tools.ietf.org> Sat, 01 May 2010 10:44 UTC

#3: 2.2.  Conferencing: Support of binaural audio?
------------------------------------+---------------------------------------
 Reporter:  hoene@…                 |       Owner:     
     Type:  enhancement             |      Status:  new
 Priority:  major                   |   Milestone:     
Component:  requirements            |     Version:     
 Severity:  -                       |    Keywords:     
------------------------------------+---------------------------------------

Comment(by hoene@…):

 [Hoene]:
 I am trying to compile the different requirements that have been mentioned
 on this list.
 -       low complexity (with just one active speaker) vs. multiple speaker
 mixing vs. spatial audio/stereo mixing
 -       centralized vs. distributed
 -       few participants vs. hundreds of listeners and talkers
 -       individual distribution of audio streams vs. IP multicast or RTP
 group communication
 -       efficient encoding of multiple streams having the same content
 (but different quality).

 To make things easier, why not split the teleconferencing scenario into
 two: high quality and scalable?

 The high quality scenario, intended for a low number of users, could have
 features like
 -       Distributed processing and mixing
 -       High computational resources to support spatial audio mixing (at
 the receiver) and multiple encodings of the same audio stream at different
 qualities (at the sender)
 -       Enough bandwidth to allow direct N to N transmissions of audio
 streams (no multicast or group communication). This would be good for the
 latency, too.

 The scalable scenario is the opposite:
 -       Central processing and mixing for many participants.
 -       N to 1 and 1 to N communication using efficient distribution
 mechanisms (RTP group communication and IP multicast).
 -       Low-complexity mixing of many streams using tricks like VAD
 (voice activity detection), encoding at the lowest rate to support many
 receivers on different paths, and so on.
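
 As a rough illustration of why the two scenarios scale so differently, the
 sketch below counts unidirectional audio streams per topology. The function
 names and the counting model (one stream per sender-receiver pair in the
 mesh, one uplink per participant to the bridge, one shared multicast
 downlink) are assumptions made for illustration; nothing here was specified
 on the list.

```python
def _check(n: int) -> None:
    # A conference needs at least two participants in this simple model.
    if n < 2:
        raise ValueError("need at least two participants")


def mesh_streams(n: int) -> int:
    """Full mesh (N to N): every participant sends directly to every other."""
    _check(n)
    return n * (n - 1)


def bridge_unicast_streams(n: int) -> int:
    """Central bridge: N uplinks plus one unicast downlink per participant."""
    _check(n)
    return 2 * n


def bridge_multicast_streams(n: int) -> int:
    """Central bridge: N uplinks plus a single shared multicast downlink."""
    _check(n)
    return n + 1


if __name__ == "__main__":
    for n in (3, 10, 100):
        print(n, mesh_streams(n), bridge_unicast_streams(n),
              bridge_multicast_streams(n))
```

 The mesh count grows quadratically with the number of participants, while
 both bridge variants grow linearly, which matches the split into a
 "few users" and a "many users" scenario.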

 High quality:
 -       Much the same requirements as for end-to-end audio transmission:
 high quality and low latency.
 -       Maybe additionally: variable bit rate encoding to achieve a
 multiplexing gain at the receiver
 -       and thus, a fast control loop to cope with variable bitrates on
 transmission paths.
 -       Maybe stereo/multichannel support to send the spatial audio to the
 headphone or loudspeakers.

 Scalable:
 -       Efficient encoding/transcoding for multiple different qualities
 (at the conference bridge)
 -       The control loop need not react fast, because (multicast) group
 communication requires encoding at low quality anyhow.
 -       Low-complexity receiver-side activity detection for music and
 voice (for the conference bridge)
 -       Efficient mixing of two to four(?) active flows (is this
 achievable without the complete process of decoding and encoding again?)
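
 For reference, the baseline that the last two points try to improve on can
 be sketched as follows: the bridge fully decodes each flow to PCM, picks the
 few loudest flows, and sums them. This is only a sketch under stated
 assumptions: frames are already decoded 16-bit PCM, activity is judged by
 plain frame energy, and the names `frame_energy`, `select_active`, `mix`
 and the threshold value are hypothetical. It is not a compressed-domain
 mixer and does not distinguish music from voice.

```python
from typing import Dict, List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one PCM frame (a crude activity measure)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def select_active(frames: Dict[str, Sequence[int]], max_active: int = 4,
                  threshold: float = 1000.0) -> List[str]:
    """Return up to max_active flow ids above the threshold, loudest first."""
    energies = {fid: frame_energy(f) for fid, f in frames.items()}
    active = [fid for fid, e in energies.items() if e > threshold]
    active.sort(key=lambda fid: energies[fid], reverse=True)
    return active[:max_active]


def mix(frames: Dict[str, Sequence[int]], active: List[str]) -> List[int]:
    """Sum the selected flows sample by sample, saturating to 16 bits."""
    if not active:
        return []
    length = min(len(frames[fid]) for fid in active)
    mixed = []
    for i in range(length):
        total = sum(frames[fid][i] for fid in active)
        # Clamp instead of wrapping around on overflow.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

 A bridge would call `select_active` once per frame interval and pass the
 result to `mix` before re-encoding; the open question above is whether the
 decode and re-encode steps around this can be avoided.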

 [Raymond]: High quality is a given, but I would like to emphasize the
 importance of low latency.
 (1) It is well-known that the longer the latency, the lower the perceived
 quality of the communication link.  [...]
 (2) The lower the latency, the less audible the echo, and thus the lower
 the required echo return loss.  Hence, lower latency means easier echo
 control and a simpler echo canceller, and, as people have already
 mentioned, below a certain delay an echo is simply perceived as a
 harmless side-tone and no echo canceller is needed. It seems to me that
 echo control in conference calls is more difficult than in point-to-point
 calls.  While I hardly ever hear echoes in domestic point-to-point calls,
 in my experience with conference calls at work, even with the G.711 codec
 (which has almost no delay), sometimes I still hear echoes (I just heard
 another one this afternoon).  If a relatively long-delay IETF codec is
 used, the echo control will be even more problematic.
 (3) In normal phone calls or conference calls, people routinely have a
 need to interrupt each other, but beyond a certain point, long latency
 makes it very difficult for people to interrupt each other on the call.
 This is because when you try to interrupt another person, that person
 doesn’t hear your interruption until a certain time later, so he keeps
 talking, but when you hear that he did not stop talking when you
 interrupted, you stop; then, he hears your interruption, so he stops. When
 you hear he stops, you start talking again, but then he also hears you
 stopped (due to the long delay), so he also starts talking again.  The net
 result is that with a long latency, when you try to interrupt him, you and
 he end up stopping and starting at roughly the same time for a few cycles,
 making it difficult to interrupt each other.
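
 The first step of this cycle can be captured in a very small model. The
 sketch below assumes a symmetric one-way delay and a hypothetical
 "patience" interval after which the interrupter gives up; both parameter
 names and the model itself are illustrative assumptions, and no particular
 threshold value is implied.

```python
def interruption_succeeds(one_way_delay_ms: float, patience_ms: float) -> bool:
    """True if the talker's pause reaches the interrupter before they give up.

    Hypothetical model: the interruption starts at t = 0, the talker hears it
    and stops at t = one_way_delay_ms, and the interrupter hears that pause
    at t = 2 * one_way_delay_ms. The interrupter gives up at t = patience_ms.
    """
    if one_way_delay_ms < 0 or patience_ms < 0:
        raise ValueError("delays must be non-negative")
    return 2 * one_way_delay_ms <= patience_ms
```

 In this model the interruption fails, and the stop-and-start cycle begins,
 as soon as the round-trip delay exceeds the interrupter's patience.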


 [Jean-Marc]:
 Decoder complexity is very important, not only because of the mixing
 issue, but also because the decoder is generally not allowed to take
 shortcuts to save on complexity (unlike the encoder). As for
 compressed-domain mixing, as you say it is not always available, but *if*
 we can do it (even if only partially), then that can result in a "free"
 reduction in decoder complexity for mixing.

-- 
Ticket URL: <http://trac.tools.ietf.org/wg/codec/trac/ticket/3#comment:1>
codec <http://tools.ietf.org/codec/>