Re: [codec] #14: VAD and CNG?

Michael Knappe <mknappe@juniper.net> Mon, 24 May 2010 16:09 UTC

From: Michael Knappe <mknappe@juniper.net>
To: Brian Rosen <br@brianrosen.net>, "codec@ietf.org" <codec@ietf.org>, "hoene@uni-tuebingen.de" <hoene@uni-tuebingen.de>
Date: Mon, 24 May 2010 09:05:59 -0700
Thread-Topic: [codec] #14: VAD and CNG?
Thread-Index: Acr7UCWVvG5Svlx6n0WnvziJ/5NDvAACtovz
Message-ID: <C81FF1F7.16BE0%mknappe@juniper.net>
In-Reply-To: <C82009F2.35809%br@brianrosen.net>
Subject: Re: [codec] #14: VAD and CNG?

I agree that VAD (or DTX) needs to be negotiable.

Mike


On 5/24/10 7:48 AM, "Brian Rosen" <br@brianrosen.net> wrote:

> I hope we are assuming VAD (or DTX) is negotiable.  It is essential (for
> emergency calls) that VAD is disabled.  While in many cases the endpoint
> will know that it is in an emergency call, and disable VAD, it is not always
> possible to know.
> 
> Brian
> 
> 
> On 5/24/10 10:22 AM, "codec issue tracker" <trac@tools.ietf.org> wrote:
> 
>> #14: VAD and CNG?
>> ------------------------------------+---------------------------------------
>>  Reporter:  hoene@…                 |       Owner:
>>      Type:  defect                  |      Status:  new
>>  Priority:  major                   |   Milestone:
>> Component:  requirements            |     Version:
>>  Severity:  -                       |    Keywords:
>> ------------------------------------+---------------------------------------
>>
>> Comment(by hoene@…):
>>
>>  [Raymond]: I think comfort noise is more than just to let the telephone
>>  user know that the connection is not dead.  If voice packets are sent only
>>  during active voice regions of the signal (in DTX) and not during
>>  silence/background noise regions, and if there is audible background noise
>>  and comfort noise is not added during background noise regions, at the
>>  receiving end the background noise will be "on" only during active voice
>>  and "off" otherwise.  This frequent on-off switching will make the
>>  background noise sound unnatural and will bother many users.  Adding
>>  comfort noise during background noise regions will make such background
>>  noise sound more natural.
>>
>>  [Benjamin]:  The cheapest solution, of course, is transmit-side activity
>>  detection.
>>  Maybe we need to specify a way for a receiver to request that the
>>  transmitter employ (or not employ) VAD.
>>
>>  [Benjamin]:
>>  I know that CELT makes decoder VAD very efficient, but how is decoder VAD
>>  better than encoder VAD?  Encoder VAD saves even more CPU, saves
>>  bandwidth, and enables easier jitter buffering.
>>  Are you thinking about some sort of adaptive thresholding that requires
>>  knowing all streams' volume levels?
>>  Anyway, VAD can run on both encode and decode sides at the same time.
>>
>>  [Stephen]: It might be valuable to be able to obtain the VAD without
>>  having to decrypt the bitstream, since that also can become a problem as
>>  the number of streams grows.
>>  There is a security consideration (traffic analysis), however, it still
>>  might be worth thinking about.
>>
>>  [JM]:
>>  > I know that CELT makes decoder VAD very efficient,
>>  Not only CELT. You can do that with an LPC-based codec too.
>>
>>  > but how is decoder VAD better than encoder VAD?  Encoder VAD saves
>>  > even more CPU, saves bandwidth, and enables easier jitter buffering.
>>
>>  There's a few reasons why I think decoder-side is better:
>>  - The decision for an encoder-side VAD would take some amount of space
>>    in the bit-stream
>>  - If we make an encode-side VAD mandatory, then all encoders will have to
>>    spend the CPU cycles, even when it's not needed. If it's not mandatory,
>>    then the decoder cannot rely on it, so it still needs to implement a VAD
>>  - A decoder VAD does not need to be specified in an exact way, so
>>    implementers can choose different implementations depending on the
>>    information they need.
>>  - You cannot "game" a decode-side VAD.
>>
>>  > Are you thinking about some sort of adaptive thresholding that
>>  > requires knowing all streams' volume levels?
>>
>>  Well, knowing the relative amplitudes of each stream can allow you to take
>>  more intelligent decisions, e.g. when you have to choose the "most active
>>  speaker". That's something you can't really get from an encoder VAD.
>>
>>  > Anyway, VAD can run on both encode and decode sides at the same time.
>>
>>  That would just mean nobody would bother implementing the encode side.
>>
>>  [Benjamin]:
>>
>>  I think I failed to communicate that by VAD I mean _not sending packets_
>>  during inactivity.  For the packets that are sent, the overhead should
>>  average much less than 1 bit per frame.
>>
>>  I'm not suggesting sending 200 packets a second containing a flag
>>  indicating no voice activity, followed by carefully coded background
>>  noise.  That would be silly.
>>
>>  > - If we make an encode-side VAD mandatory, then all encoders will have
>>  > to spend the CPU cycles, even when it's not needed. If it's not
>>  > mandatory, then the decoder cannot rely on it, so it still needs to
>>  > implement a VAD
>>
>>  I don't see this as "mandatory".  The encoder can turn off VAD, and
>>  probably should for full-quality applications.
>>
>>  > - A decoder VAD does not need to be specified in an exact way, so
>>  > implementers can choose different implementations depending on the
>>  > information they need.
>>
>>  The only thing that needs exact specification is the signalling.  The
>>  encoder may use it or not use it as it pleases.
>>
>>  > - You cannot "game" a decode-side VAD.
>>
>>  I don't know what this means.
>>
>>  >> Are you thinking about some sort of adaptive thresholding that
>>  >> requires knowing all streams' volume levels?
>>  >
>>  > Well, knowing the relative amplitudes of each stream can allow you to
>>  > take more intelligent decisions, e.g. when you have to choose the
>>  > "most active speaker". That's something you can't really get from an
>>  > encoder VAD.
>>  >
>>  >> Anyway, VAD can run on both encode and decode sides at the same time.
>>  >
>>  > That would just mean nobody would bother implementing the encode side.
>>
>>  I expect encode-side VAD on a conference call to save more than a factor
>>  of 2 in bandwidth, which makes it very desirable, especially for large
>>  deployments.  People will use it to save bandwidth (especially if it's on
>>  by default in the reference implementation).  The decode-side CPU savings
>>  are just a minor bonus side-effect.
>>
>>  [JM]:
>>  What you're describing is called DTX (discontinuous transmission). This
>>  can be a useful feature, but it's very different from what we were
>>  originally talking about in terms of conference mixing.
>>
>>  [Ben]: Oops. Right.  What I'm trying to say is that DTX, based on
>>  encoder-side VAD, also greatly reduces the (average) computational burden
>>  on a conference mixer.  Of course, if everyone's really talking at once
>>  then VAD can't help.
>>
>>  [Roman]: There is one more application to efficiently combining
>>  pre-encoded audio: playing announcements or recorded audio.
>>  Standard network or IVR announcements can be encoded once and efficiently
>>  inserted or combined into audio stream. If pre-encoded audio is supported
>>  and the client supports AVT tones, it is trivial to develop a very
>>  efficient IVR server which does not require any CODEC encoding or decoding.
>>
>>  Efficient decoder side VAD is also very helpful in case of speech
>>  recognition, where it allows to save cycles in end-pointer. This way audio
>>  only needs to be decoded and passed to the speech recognition system only
>>  when voice is present.
>>
>>  Bottom line, if we have both efficient decoder side VAD and combining
>>  pre-encoded audio we can develop some very efficient VXML servers, voice
>>  mail and IVR system, not just conferencing servers.
>>
>>  Of course, this works well only if the background noise is relatively
>>  stationary.  If the background noise is dynamically changing, then comfort
>>  noise can't really sound like the true background noise.  Today I was put
>>  on hold in a phone call with music playing.  Apparently the system treated
>>  some parts of the music as background noise and replaced them with comfort
>>  noise. The result is pretty annoying, or amusing, depending on which way
>>  you look at it.
>>
>>  Therefore, I think comfort noise has its value if DTX is used and the
>>  background noise is fairly stationary, but if the background noise or
>>  music is changing dynamically and high audio quality is desired, then the
>>  DTX and comfort noise should be turned off and all parts of the signal
>>  need to be transmitted.
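
For readers following the thread, here is a rough sketch of the mechanism under discussion: encoder-side VAD drives DTX (inactive frames are simply not sent), the receiver fills the gaps with comfort noise, and a flag lets the sender switch DTX off, as Brian asks for emergency calls. Everything in it is an assumption made for illustration and not something proposed on the list: the frame size, the RMS energy threshold, the white-noise comfort noise, and the names encode_with_dtx, receive and dtx_enabled. A real codec would use a far better detector and noise that matches the background spectrum.

```python
import math
import random

FRAME = 160  # samples per frame: 20 ms at 8 kHz (assumed for illustration)


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def encode_with_dtx(frames, threshold=500.0, dtx_enabled=True):
    """Return (index, frame) pairs for the frames that would be transmitted.

    With DTX enabled, frames whose RMS level is below the (hypothetical)
    threshold are treated as inactive and are not sent at all.
    With DTX disabled, every frame is sent.
    """
    sent = []
    for i, frame in enumerate(frames):
        if not dtx_enabled or rms(frame) >= threshold:
            sent.append((i, frame))
    return sent


def receive(sent, total_frames, noise_level=50.0, seed=0):
    """Rebuild the stream, filling untransmitted frames with comfort noise.

    The comfort noise here is plain Gaussian white noise at a fixed level.
    """
    rng = random.Random(seed)
    by_index = dict(sent)
    out = []
    for i in range(total_frames):
        if i in by_index:
            out.append(by_index[i])
        else:
            out.append([rng.gauss(0.0, noise_level) for _ in range(FRAME)])
    return out


# Two loud frames around two quiet ones.
speech = [4000.0 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(FRAME)]
quiet = [20.0 * math.sin(2 * math.pi * 50 * n / 8000) for n in range(FRAME)]
frames = [speech, quiet, quiet, speech]

sent = encode_with_dtx(frames)
print("frames sent with DTX:", [i for i, _ in sent])
print("frames sent without DTX:", len(encode_with_dtx(frames, dtx_enabled=False)))
```

With dtx_enabled=False every frame is transmitted, which is the behaviour wanted for emergency calls or for music on hold, where a simple detector like this one would misclassify quiet passages as background noise.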