Re: [codec] #15: Efficiently combine pre-encoded audio

Roman Shpount <> Mon, 24 May 2010 17:28 UTC

I would like to remind everyone that this issue is not about efficient
VAD, but about efficiently combining pre-encoded streams. There are two
use cases that I can think of:

a. Conference servers, where the active speaker is determined by
either receiver-side or decoder-side VAD. As was mentioned before, it is
possible to implement an efficient decoder-side VAD without
implementing a complete decoder. If we can combine pre-encoded audio,
we can combine multiple streams on the conference server without going
through a decode/encode cycle, greatly decreasing both CPU requirements
and mixer delay.
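
A rough sketch of the selection step in (a), under stated assumptions:
each incoming frame carries an activity score obtained without a full
decode (for example from a partial bitstream parse), and the bridge
forwards the still-encoded payloads of the most active speakers. The
names EncodedFrame, select_active and frames_for_listener are
hypothetical, and the actual combining of the selected payloads is left
out because it depends on the codec supporting it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncodedFrame:
    stream_id: str    # which participant sent this frame
    payload: bytes    # opaque encoded audio; never decoded here
    activity: float   # assumed VAD score from a partial bitstream parse


def select_active(frames: List[EncodedFrame],
                  max_speakers: int = 3) -> List[EncodedFrame]:
    """Return the frames of the most active speakers, highest score first."""
    ranked = sorted(frames, key=lambda f: f.activity, reverse=True)
    return ranked[:max_speakers]


def frames_for_listener(active: List[EncodedFrame],
                        listener_id: str) -> List[EncodedFrame]:
    """Drop the listener's own stream so nobody hears themselves."""
    return [f for f in active if f.stream_id != listener_id]
```

With 7 participants and max_speakers=3, only three payloads per frame
interval are passed on, and none of the 7 streams is fully decoded.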

b. Announcement and IVR servers, where a small set of pre-encoded
announcements is played to the user. Standard network or IVR
announcements can be encoded once and efficiently inserted or combined
into the audio stream. If pre-encoded audio is supported and the client
supports AVT tones, it is trivial to develop a very efficient IVR
server which does not require any CODEC encoding or decoding.
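
A minimal sketch of the playback side of (b), assuming the announcement
is stored as a list of already-encoded frames and is simply wrapped in
RTP headers (RFC 3550 fixed header, no CSRCs or extensions). The
function name packetize, the dynamic payload type 96 and the 160
samples per frame (20 ms at 8 kHz) are illustrative assumptions, not
values from this thread.

```python
import struct
from typing import Iterable, Iterator


def packetize(frames: Iterable[bytes], ssrc: int, seq: int = 0,
              timestamp: int = 0, payload_type: int = 96,
              samples_per_frame: int = 160) -> Iterator[bytes]:
    """Wrap pre-encoded frames in RTP headers without touching the audio."""
    for payload in frames:
        header = struct.pack(
            "!BBHII",
            0x80,                     # version 2, no padding/extension/CSRC
            payload_type & 0x7F,      # marker bit clear
            seq & 0xFFFF,             # 16-bit sequence number, wraps
            timestamp & 0xFFFFFFFF,   # 32-bit media timestamp, wraps
            ssrc & 0xFFFFFFFF,
        )
        yield header + payload
        seq += 1
        timestamp += samples_per_frame
```

The server only copies bytes and advances two counters per frame; no
encoder or decoder is instantiated.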

Roman Shpount

On Mon, May 24, 2010 at 10:22 AM, codec issue tracker
<> wrote:
> #15: Efficiently combine pre-encoded audio
> ------------------------------------+---------------------------------------
>  Reporter:  hoene@…                 |       Owner:
>     Type:  enhancement             |      Status:  new
>  Priority:  minor                   |   Milestone:
> Component:  requirements            |     Version:
>  Severity:  Active WG Document      |    Keywords:
> ------------------------------------+---------------------------------------
> Comment(by hoene@…):
>  [Cullen]:
>  For conference bridges, it's probably more important to be able to decide
>  who the active speakers are with low CPU complexity than the actual act
>  of mixing the selected speakers. Consider a typical call with 7 people
>  who might be speakers and the 3 most active are selected and mixed. In
>  many systems today, most of the MIPS go to decoding all 7 streams to do
>  speaker detection before the resulting 4 streams are formed and encoded.
>  If there was a cheap way to figure out who the active speakers were
>  without doing a full decode of all 7 streams, that would be sort of nice
>  for conference bridges.
>  [Brian]: Excellent idea.  Been there, never really did it.  It's complex.
>  Effectively, you need a distributed adaptive threshold mechanism.
>  However, if you had it, user experience in multispeaker environments gets
>  a win.
>  [Benjamin]:  The cheapest solution, of course, is transmit-side activity
>  detection.
>  Maybe we need to specify a way for a receiver to request that the
>  transmitter employ (or not employ) VAD.
>  [JM]:
>  I think you can do better than an encoder VAD. All you need to do is make
>  sure that the relevant information you need for a VAD can easily be
>  decoded from the bit-stream without having to do a full decoding. For
>  example, if you're able to easily extract the gain and spectral envelope,
>  you can do a VAD based on that without even having to look at the other
>  parameters in the bit-stream.
>  [Brian]:
>  The adaptive threshold doesn't have to be distributed, as the conference
>  bridge is selecting the highest scores.
>  You do need a consistent way to compute the scores in the endpoints,
>  ideally using a method which is not simply energy.
>  I realize the bridge can alternatively generate scores from bitstream
>  information; I am thinking that is equivalent to including a metric in the
>  RTP payload.