Re: [codec] #14: VAD and CNG?

I hope we are assuming VAD (or DTX) is negotiable.  It is essential (for
emergency calls) that VAD is disabled.  While in many cases the endpoint
will know that it is in an emergency call, and disable VAD, it is not always
possible to know.

Brian

On 5/24/10 10:22 AM, "codec issue tracker" <trac@tools.ietf.org> wrote:

> #14: VAD and CNG?
> ------------------------------------+---------------------------------------
>  Reporter:  hoene@                  |       Owner:
>      Type:  defect                  |      Status:  new
>  Priority:  major                   |   Milestone:
> Component:  requirements            |     Version:
>  Severity:  -                       |    Keywords:
> ------------------------------------+---------------------------------------
>
> Comment(by hoene@):
>
>  [Raymond]: I think comfort noise is more than just to let the telephone
>  user know that the connection is not dead.  If voice packets are sent only
>  during active voice regions of the signal (in DTX) and not during
>  silence/background noise regions, and if there is audible background noise
>  and comfort noise is not added during background noise regions, at the
>  receiving end the background noise will be "on" only during active voice
>  and "off" otherwise.  This frequent on-off switching will make the
>  background noise sound unnatural and will bother many users.  Adding
>  comfort noise during background noise regions will make such background
>  noise sound more natural.
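
To make the on/off effect concrete, here is a minimal sketch of a receiver that fills DTX gaps with noise at the background level instead of dead silence. Everything in it is an assumption for illustration: the frame size, the names, and the idea that the receiver already knows the noise level (for instance from a silence descriptor of the kind RFC 3389 defines for RTP). It is not a proposal for how the codec should do it.

```python
import math
import random

FRAME = 160  # samples per frame; assumes 20 ms at 8 kHz


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def comfort_noise(level, n=FRAME, rng=random):
    """White Gaussian noise whose expected RMS is `level`."""
    return [rng.gauss(0.0, level) for _ in range(n)]


def play_out(received, noise_level, rng=random):
    """Replace frames the sender skipped (None) with comfort noise.

    `received` is a list of decoded frames; None marks a DTX gap.
    """
    out = []
    for frame in received:
        if frame is None:
            out.append(comfort_noise(noise_level, rng=rng))
        else:
            out.append(frame)
    return out
```

Without the `comfort_noise` branch the gaps would be played as zeros, which is exactly the "on during speech, off otherwise" background that Raymond describes.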

>  [Benjamin]:  The cheapest solution, of course, is transmit-side activity
>  detection.
>  Maybe we need to specify a way for a receiver to request that the
>  transmitter employ (or not employ) VAD.
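
A rough sketch of what such a receiver-side request could look like as a negotiation rule. The class and field names are hypothetical and do not correspond to any existing signalling; the only point is that the receiver gets a veto, so a sender does not have to know it is in an emergency call for VAD to end up disabled.

```python
from dataclasses import dataclass


@dataclass
class VadPreference:
    """What one endpoint declares about VAD/DTX (hypothetical fields)."""
    willing_to_send_dtx: bool  # as a sender, may suppress inactive frames
    accepts_dtx: bool          # as a receiver, tolerates a discontinuous stream


def sender_uses_dtx(sender: VadPreference, receiver: VadPreference) -> bool:
    """DTX is used in one direction only if the sender offers it
    and the receiver has not asked for it to be turned off."""
    return sender.willing_to_send_dtx and receiver.accepts_dtx
```

An emergency answering point would simply declare `accepts_dtx=False`, and every caller that honours the rule sends continuously to it.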

>  [Benjamin]:
>  I know that CELT makes decoder VAD very efficient, but how is decoder VAD
>  better than encoder VAD?  Encoder VAD saves even more CPU, saves
>  bandwidth, and enables easier jitter buffering.
>  Are you thinking about some sort of adaptive thresholding that requires
>  knowing all streams' volume levels?
>  Anyway, VAD can run on both encode and decode sides at the same time.

>  [Stephen]: It might be valuable to be able to obtain the VAD without
>  having to decrypt the bitstream, since that also can become a problem as
>  the number of streams grows.
>  There is a security consideration (traffic analysis); however, it still
>  might be worth thinking about.

>  [JM]:
>  > I know that CELT makes decoder VAD very efficient,
>
>  Not only CELT. You can do that with an LPC-based codec too.
>
>  > but how is decoder VAD
>  > better than encoder VAD?  Encoder VAD saves even more CPU, saves
>  > bandwidth, and enables easier jitter buffering.
>
>  There are a few reasons why I think decoder-side is better:
>  - The decision for an encoder-side VAD would take some amount of space in
>    the bit-stream.
>  - If we make an encode-side VAD mandatory, then all encoders will have to
>    spend the CPU cycles, even when it's not needed. If it's not mandatory,
>    then the decoder cannot rely on it, so it still needs to implement a VAD.
>  - A decoder VAD does not need to be specified in an exact way, so
>    implementers can choose different implementations depending on what
>    information they need.
>  - You cannot "game" a decode-side VAD.
>
>  > Are you thinking about some sort of adaptive thresholding that
>  > requires knowing all streams' volume levels?
>
>  Well, knowing the relative amplitudes of each stream can allow you to take
>  more intelligent decisions, e.g. when you have to choose the "most active
>  speaker". That's something you can't really get from an encoder VAD.
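
For illustration, the "most active speaker" choice could be as simple as the sketch below. It assumes the mixer already has a per-stream level estimate in dB obtained on the decoder side; the function name, the dictionary layout, and the -50 dB floor are all made up for the example.

```python
def most_active(levels_db, floor_db=-50.0):
    """Return the id of the loudest stream, or None if no stream
    is above `floor_db` (i.e. nobody appears to be talking)."""
    if not levels_db:
        return None
    stream, level = max(levels_db.items(), key=lambda kv: kv[1])
    return stream if level > floor_db else None
```

A per-stream yes/no flag from an encoder VAD would not support this comparison, because two streams both marked "active" cannot be ranked against each other.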

>  > Anyway, VAD can run on both encode and decode sides at the same time.
>
>  That would just mean nobody would bother implementing the encode side.

>  [Benjamin]:
>
>  I think I failed to communicate that by VAD I mean _not sending packets_
>  during inactivity.  For the packets that are sent, the overhead should
>  average much less than 1 bit per frame.
>
>  I'm not suggesting sending 200 packets a second containing a flag
>  indicating no voice activity, followed by carefully coded background
>  noise.  That would be silly.
>
>  > - If we make an encode-side VAD mandatory, then all encoders will have
>  > to spend the CPU cycles, even when it's not needed. If it's not
>  > mandatory, then the decoder cannot rely on it, so it still needs to
>  > implement a VAD
>
>  I don't see this as "mandatory".  The encoder can turn off VAD, and
>  probably should for full-quality applications.
>
>  > - A decoder VAD does not need to be specified in an exact way, so
>  > implementers can choose different implementations depending on what
>  > information they need.
>
>  The only thing that needs exact specification is the signalling.  The
>  encoder may use it or not use it as it pleases.
>
>  > - You cannot "game" a decode-side VAD.
>
>  I don't know what this means.
>
>  >> Are you thinking about some sort of adaptive thresholding that
>  >> requires knowing all streams' volume levels?
>  >
>  > Well, knowing the relative amplitudes of each stream can allow you to
>  > take more intelligent decisions, e.g. when you have to choose the
>  > "most active speaker". That's something you can't really get from an
>  > encoder VAD.
>  >
>  >> Anyway, VAD can run on both encode and decode sides at the same time.
>  >
>  > That would just mean nobody would bother implementing the encode side.
>
>  I expect encode-side VAD on a conference call to save more than a factor
>  of 2 in bandwidth, which makes it very desirable, especially for large
>  deployments.  People will use it to save bandwidth (especially if it's on
>  by default in the reference implementation).  The decode-side CPU savings
>  are just a minor bonus side-effect.
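
The factor-of-2 expectation follows from simple arithmetic once packets are not sent during inactivity. The sketch below only illustrates that arithmetic; the activity fractions are assumed inputs, not measurements, and comfort-noise update packets are ignored, so the result is an upper bound.

```python
def dtx_saving_factor(activity):
    """Ratio of continuous-transmission bandwidth to DTX bandwidth.

    `activity` is the fraction of frames in which the sender is
    judged to be talking, so only that fraction is transmitted.
    """
    if not 0.0 < activity <= 1.0:
        raise ValueError("activity must be in (0, 1]")
    return 1.0 / activity
```

If a participant is talking 40% of the time, the factor is about 2.5. On a conference call where the floor is shared among several people, each participant's activity is lower still and the saving per stream grows accordingly. When everyone talks at once, the activity approaches 1 and the saving disappears.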

>  [JM]:
>  What you're describing is called DTX (discontinuous transmission). This
>  can be a useful feature, but it's very different from what we were
>  originally talking about in terms of conference mixing.
>
>  [Ben]: Oops. Right.  What I'm trying to say is that DTX, based on
>  encoder-side VAD, also greatly reduces the (average) computational burden
>  on a conference mixer.  Of course, if everyone's really talking at once
>  then VAD can't help.

>  [Roman]: There is one more application of efficiently combining
>  pre-encoded audio: playing announcements or recorded audio. Standard
>  network or IVR announcements can be encoded once and efficiently inserted
>  or combined into an audio stream. If pre-encoded audio is supported and
>  the client supports AVT tones, it is trivial to develop a very efficient
>  IVR server which does not require any CODEC encoding or decoding.
>
>  Efficient decoder-side VAD is also very helpful in the case of speech
>  recognition, where it saves cycles in the end-pointer. This way audio
>  needs to be decoded and passed to the speech recognition system only
>  when voice is present.
>
>  Bottom line, if we have both efficient decoder-side VAD and combining of
>  pre-encoded audio, we can develop some very efficient VXML servers, voice
>  mail and IVR systems, not just conferencing servers.

>  Of course, this works well only if the background noise is relatively
>  stationary.  If the background noise is dynamically changing, then comfort
>  noise can't really sound like the true background noise.  Today I was put
>  on hold in a phone call with music playing.  Apparently the system treated
>  some parts of the music as background noise and replaced them with comfort
>  noise. The result is pretty annoying, or amusing, depending on which way
>  you look at it.
>
>  Therefore, I think comfort noise has its value if DTX is used and the
>  background noise is fairly stationary, but if the background noise or
>  music is changing dynamically and high audio quality is desired, then the
>  DTX and comfort noise should be turned off and all parts of the signal
>  need to be transmitted.
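
One way to picture the "fairly stationary" condition is a check on how much the background level moves around. The sketch below is only a guess at such a heuristic: the function name and the 3 dB spread threshold are assumptions, and a real detector would also have to look at spectral shape to catch cases like hold music.

```python
import statistics


def dtx_recommended(noise_levels_db, max_spread_db=3.0):
    """Return True if recent background-frame levels (in dB) vary
    little enough that comfort noise could plausibly match them."""
    if len(noise_levels_db) < 2:
        return False  # not enough history to judge stationarity
    return statistics.pstdev(noise_levels_db) <= max_spread_db
```

With a steady noise floor the check passes and DTX with comfort noise is reasonable. With levels that swing widely it fails, which corresponds to the case above where everything should be transmitted.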

> --
> Ticket URL: <http://trac.tools.ietf.org/wg/codec/trac/ticket/14#comment:2>
> codec <http://tools.ietf.org/codec/>
>
> _______________________________________________
> codec mailing list
> codec@ietf.org
> https://www.ietf.org/mailman/listinfo/codec