Re: [codec] #14: VAD and CNG?

"Christian Hoene" <hoene@uni-tuebingen.de> Mon, 24 May 2010 16:55 UTC

From: Christian Hoene <hoene@uni-tuebingen.de>
To: 'Michael Knappe' <mknappe@juniper.net>, 'Brian Rosen' <br@brianrosen.net>, codec@ietf.org
References: <C82009F2.35809%br@brianrosen.net> <C81FF1F7.16BE0%mknappe@juniper.net>
In-Reply-To: <C81FF1F7.16BE0%mknappe@juniper.net>
Date: Mon, 24 May 2010 18:55:28 +0200
Subject: Re: [codec] #14: VAD and CNG?

Hello,

I do not like the idea that users should have to decide on parameters they do not understand.
Can't we develop some intelligent, automatic decision on when to use DTX? For example, VAD could be switched on or off depending on whether
a) bandwidth is limited (dynamically controlled),
b) transmission energy needs to be saved on mobile devices (controlled/negotiated for both sender and receiver),
c) a lot of background noise is present (sender initiated, no negotiation required),
d) the background noise is important (sender initiated, e.g. in an emergency call; but then again, the sender could amplify the
background noise instead).
Should the control/negotiation take place in-band or via SDP?
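
As a rough illustration of what such an automatic decision could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `CallContext` and `should_use_dtx`, the fields, and the order of the checks are hypothetical and not part of any codec proposal. Case c) is left out because it depends on how the sender measures background noise.

```python
from dataclasses import dataclass


@dataclass
class CallContext:
    """Hypothetical inputs to the decision; not taken from any real codec API."""
    available_kbps: float      # current estimate of usable bandwidth
    codec_kbps: float          # bitrate needed for continuous transmission
    energy_saving_wanted: bool # a mobile endpoint wants to save transmission energy
    peer_agrees: bool          # result of the negotiation (in-band or via SDP)
    noise_is_important: bool   # e.g. a call known to be an emergency call


def should_use_dtx(ctx: CallContext) -> bool:
    """Return True if the sender should stop transmitting during inactivity."""
    # Case d): the background noise matters, so transmit everything.
    if ctx.noise_is_important:
        return False
    # Case a): bandwidth is too small for continuous transmission.
    if ctx.available_kbps < ctx.codec_kbps:
        return True
    # Case b): energy saving, but only if both ends have agreed to it.
    if ctx.energy_saving_wanted and ctx.peer_agrees:
        return True
    return False
```

The emergency-call check comes first so that no bandwidth or energy consideration can override it; whether that ordering is right is exactly the kind of policy question raised above.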

With best regards,

 Christian

---------------------------------------------------------------
Dr.-Ing. Christian Hoene
Interactive Communication Systems (ICS), University of Tübingen 
Sand 13, 72076 Tübingen, Germany, Phone +49 7071 2970532 
http://www.net.uni-tuebingen.de/


>-----Original Message-----
>From: Michael Knappe [mailto:mknappe@juniper.net]
>Sent: Monday, May 24, 2010 6:06 PM
>To: Brian Rosen; codec@ietf.org; hoene@uni-tuebingen.de
>Subject: Re: [codec] #14: VAD and CNG?
>
>I agree that VAD (or DTX) needs to be negotiable.
>
>Mike
>
>
>On 5/24/10 7:48 AM, "Brian Rosen" <br@brianrosen.net> wrote:
>
>> I hope we are assuming VAD (or DTX) is negotiable.  It is essential (for
>> emergency calls) that VAD is disabled.  While in many cases the endpoint
>> will know that it is in an emergency call, and disable VAD, it is not always
>> possible to know.
>>
>> Brian
>>
>>
>> On 5/24/10 10:22 AM, "codec issue tracker" <trac@tools.ietf.org> wrote:
>>
>>> #14: VAD and CNG?
>>> ------------------------------------+---------------------------------------
>>>  Reporter:  hoene@…                 |       Owner:
>>>      Type:  defect                  |      Status:  new
>>>  Priority:  major                   |   Milestone:
>>> Component:  requirements            |     Version:
>>>  Severity:  -                       |    Keywords:
>>> ------------------------------------+---------------------------------------
>>>
>>> Comment(by hoene@…):
>>>
>>> [Raymond]: I think comfort noise is more than just to let the
>>> telephone user know that the connection is not dead.  If voice
>>> packets are sent only during active voice regions of the signal (in
>>> DTX) and not during silence/background noise regions, and if there is
>>> audible background noise and comfort noise is not added during
>>> background noise regions, at the receiving end the background noise
>>> will be "on" only during active voice and "off" otherwise.  This
>>> frequent on-off switching will make the background noise sound
>>> unnatural and will bother many users.  Adding comfort noise during
>>> background noise regions will make such background noise sound more
>>> natural.
>>>
>>> [Benjamin]:  The cheapest solution, of course, is transmit-side
>>> activity detection.
>>> Maybe we need to specify a way for a receiver to request that the
>>> transmitter employ (or not employ) VAD.
>>>
>>> [Benjamin]:
>>> I know that CELT makes decoder VAD very efficient, but how is decoder
>>> VAD better than encoder VAD?  Encoder VAD saves even more CPU, saves
>>> bandwidth, and enables easier jitter buffering.
>>> Are you thinking about some sort of adaptive thresholding that
>>> requires knowing all streams' volume levels?
>>> Anyway, VAD can run on both encode and decode sides at the same time.
>>>
>>> [Stephen]: It might be valuable to be able to obtain the VAD without
>>> having to decrypt the bitstream, since that also can become a problem
>>> as the number of streams grows.
>>> There is a security consideration (traffic analysis); however, it
>>> still might be worth thinking about.
>>>
>>> [JM]:
>>> > I know that CELT makes decoder VAD very efficient,
>>> Not only CELT. You can do that with an LPC-based codec too.
>>>
>>> > but how is decoder VAD better than encoder VAD?  Encoder VAD saves
>>> > even more CPU, saves bandwidth, and enables easier jitter buffering.
>>>
>>> There are a few reasons why I think decoder-side is better:
>>> - The decision for an encoder-side VAD would take some amount of
>>>   space in the bit-stream.
>>> - If we make an encode-side VAD mandatory, then all encoders will
>>>   have to spend the CPU cycles, even when it's not needed. If it's
>>>   not mandatory, then the decoder cannot rely on it, so it still
>>>   needs to implement a VAD.
>>> - A decoder VAD does not need to be specified in an exact way, so
>>>   implementers can choose different implementations depending on the
>>>   information they need.
>>> - You cannot "game" a decode-side VAD.
>>>
>>> > Are you thinking about some sort of adaptive thresholding that
>>> > requires knowing all streams' volume levels?
>>>
>>> Well, knowing the relative amplitudes of each stream can allow you to
>>> take more intelligent decisions, e.g. when you have to choose the
>>> "most active speaker". That's something you can't really get from an
>>> encoder VAD.
>>>
>>> > Anyway, VAD can run on both encode and decode sides at the same
>>> > time.
>>>
>>> That would just mean nobody would bother implementing the encode
>>> side.
>>>
>>> [Benjamin]:
>>>
>>> I think I failed to communicate that by VAD I mean _not sending
>>> packets_ during inactivity.  For the packets that are sent, the
>>> overhead should average much less than 1 bit per frame.
>>>
>>> I'm not suggesting sending 200 packets a second containing a flag
>>> indicating no voice activity, followed by carefully coded background
>>> noise.  That would be silly.
>>>
>>> > - If we make an encode-side VAD mandatory, then all encoders will
>>> >   have to spend the CPU cycles, even when it's not needed. If it's
>>> >   not mandatory, then the decoder cannot rely on it, so it still
>>> >   needs to implement a VAD
>>>
>>> I don't see this as "mandatory".  The encoder can turn off VAD, and
>>> probably should for full-quality applications.
>>>
>>> > - A decoder VAD does not need to be specified in an exact way, so
>>> >   implementers can choose different implementations depending on the
>>> >   information they need.
>>>
>>> The only thing that needs exact specification is the signalling.  The
>>> encoder may use it or not use it as it pleases.
>>>
>>> > - You cannot "game" a decode-side VAD.
>>>
>>> I don't know what this means.
>>>
>>> >> Are you thinking about some sort of adaptive thresholding that
>>> >> requires knowing all streams' volume levels?
>>> >
>>> > Well, knowing the relative amplitudes of each stream can allow you
>>> > to take more intelligent decisions, e.g. when you have to choose the
>>> > "most active speaker". That's something you can't really get from an
>>> > encoder VAD.
>>> >
>>> >> Anyway, VAD can run on both encode and decode sides at the same
>>> >> time.
>>> >
>>> > That would just mean nobody would bother implementing the encode
>>> > side.
>>>
>>> I expect encode-side VAD on a conference call to save more than a
>>> factor of 2 in bandwidth, which makes it very desirable, especially
>>> for large deployments.  People will use it to save bandwidth
>>> (especially if it's on by default in the reference implementation).
>>> The decode-side CPU savings are just a minor bonus side-effect.
>>>
>>> [JM]:
>>> What you're describing is called DTX (discontinuous transmission).
>>> This can be a useful feature, but it's very different from what we
>>> were originally talking about in terms of conference mixing.
>>>
>>> [Ben]: Oops. Right.  What I'm trying to say is that DTX, based on
>>> encoder-side VAD, also greatly reduces the (average) computational
>>> burden on a conference mixer.  Of course, if everyone's really
>>> talking at once then VAD can't help.
>>>
>>> [Roman]: There is one more application of efficiently combining
>>> pre-encoded audio: playing announcements or recorded audio.  Standard
>>> network or IVR announcements can be encoded once and efficiently
>>> inserted or combined into the audio stream. If pre-encoded audio is
>>> supported and the client supports AVT tones, it is trivial to develop
>>> a very efficient IVR server which does not require any CODEC encoding
>>> or decoding.
>>>
>>> Efficient decoder-side VAD is also very helpful in the case of speech
>>> recognition, where it allows saving cycles in the end-pointer. This
>>> way audio needs to be decoded and passed to the speech recognition
>>> system only when voice is present.
>>>
>>> Bottom line, if we have both efficient decoder-side VAD and combining
>>> of pre-encoded audio, we can develop some very efficient VXML servers,
>>> voice mail and IVR systems, not just conferencing servers.
>>>
>>> Of course, this works well only if the background noise is
>>> relatively stationary.  If the background noise is dynamically
>>> changing, then comfort noise can't really sound like the true
>>> background noise.  Today I was put on hold in a phone call with music
>>> playing.  Apparently the system treated some parts of the music as
>>> background noise and replaced them with comfort noise. The result is
>>> pretty annoying, or amusing, depending on which way you look at it.
>>>
>>> Therefore, I think comfort noise has its value if DTX is used and the
>>> background noise is fairly stationary, but if the background noise or
>>> music is changing dynamically and high audio quality is desired, then
>>> the DTX and comfort noise should be turned off and all parts of the
>>> signal need to be transmitted.
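
The last point in the quoted discussion (use DTX and comfort noise only when the background noise is fairly stationary) could be approximated by a crude check on how much the energy of the inactive frames varies. The following Python sketch is only an illustration under stated assumptions: the function names, the use of frame energy as the only feature, and the 3 dB threshold are all hypothetical, and a real detector would likely also look at the spectrum.

```python
import math
from statistics import pstdev


def frame_energy_db(frame):
    """Mean-square energy of one frame of float samples, in dB."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log10(0)


def noise_is_stationary(noise_frames, max_spread_db=3.0):
    """True if the energies of the given (non-empty) list of inactive
    frames have a standard deviation of at most max_spread_db."""
    levels = [frame_energy_db(f) for f in noise_frames]
    return pstdev(levels) <= max_spread_db


# Steady hum: every frame has the same level, so comfort noise could match it.
steady = [[0.01] * 160 for _ in range(10)]
# Music-like signal: the level jumps between frames, so comfort noise cannot.
varying = [[0.01] * 160 if i % 2 == 0 else [0.5] * 160 for i in range(10)]
```

With these inputs, `noise_is_stationary(steady)` is true and `noise_is_stationary(varying)` is false, which matches the music-on-hold example: a sender seeing the second kind of signal would keep DTX and comfort noise switched off.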