Re: [codec] #15: Efficiently combine pre-encoded audio

"Benjamin M. Schwartz" <> Wed, 12 May 2010 16:40 UTC


On 05/12/2010 12:03 PM, Jean-Marc Valin wrote:
> Benjamin M. Schwartz wrote:
>> but how is decoder VAD
>> better than encoder VAD? Encoder VAD saves even more CPU, saves
>> bandwidth, and enables easier jitter buffering.
> There's a few reasons why I think decoder-side is better:
> - The decision for an encoder-side VAD would take some amount of space
> in the bit-stream

I think I failed to communicate that by VAD I mean _not sending packets_ 
during inactivity.  For the packets that are sent, the overhead should 
average much less than 1 bit per frame.

I'm not suggesting sending 200 packets a second containing a flag 
indicating no voice activity, followed by carefully coded background 
noise.  That would be silly.
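
To make the "not sending packets" idea concrete, here is a minimal sketch of an encoder-side gate. It is only an illustration under stated assumptions, not anything proposed in this thread: the energy threshold, the hangover count, and the names `frame_energy` and `packets_to_send` are all hypothetical, and a real VAD would be considerably more sophisticated than a mean-square energy test.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame of samples (assumed in [-1.0, 1.0])."""
    return sum(s * s for s in frame) / len(frame)


def packets_to_send(
    frames: Sequence[Sequence[float]],
    threshold: float = 1e-4,  # hypothetical energy threshold
    hangover: int = 1,        # hypothetical: extra frames sent after activity ends
) -> List[int]:
    """Return the indices of the frames that would be transmitted.

    Frames judged inactive are simply not sent; no per-frame
    "silence" flag is transmitted for them.
    """
    sent = []
    remaining = 0  # hangover frames still to send after the last active frame
    for index, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            remaining = hangover
            sent.append(index)
        elif remaining > 0:
            remaining -= 1
            sent.append(index)
        # otherwise: inactive, nothing goes on the wire for this frame
    return sent


# One silent frame, one active frame, then three silent frames.
example_frames = [[0.0] * 4, [0.5] * 4, [0.0] * 4, [0.0] * 4, [0.0] * 4]
sent_indices = packets_to_send(example_frames, hangover=1)
```

With these inputs only the active frame and one hangover frame are sent; the remaining silent frames cost no bandwidth at all.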

> - If we make an encode-side VAD mandatory, then all encoders will have
> to spend the CPU cycles, even when it's not needed. If it's not
> mandatory, then the decoder cannot rely on it, so it still needs to
> implement a VAD

I don't see this as "mandatory".  The encoder can turn off VAD, and 
probably should for full-quality applications.

> - A decoder VAD does not need to be specified in an exact way, so
> implementers can choose different implementations depending on what
> information they need.

The only thing that needs exact specification is the signalling.  The 
encoder may use it or not use it as it pleases.

> - You cannot "game" a decode-side VAD.

I don't know what this means.

>> Are you thinking about some sort of adaptive thresholding that requires
>> knowing all streams' volume levels?
> Well, knowing the relative amplitudes of each stream can allow you to
> take more intelligent decisions, e.g. when you have to choose the "most
> active speaker". That's something you can't really get from an encoder VAD.
>> Anyway, VAD can run on both encode and decode sides at the same time.
> That would just mean nobody would bother implementing the encode side.

I expect encode-side VAD on a conference call to save more than a factor 
of 2 in bandwidth, which makes it very desirable, especially for large 
deployments.  People will use it to save bandwidth (especially if it's on 
by default in the reference implementation).  The decode-side CPU savings 
are just a minor bonus side-effect.
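
The arithmetic behind a saving of this kind can be sketched as follows. This is a back-of-the-envelope model, not a measurement: it assumes a stream costs its full bitrate while the participant is active and nothing otherwise, and the activity fractions and bitrate used below are illustrative assumptions. The function names are hypothetical.

```python
def average_bitrate(active_bitrate_bps: float, activity_fraction: float) -> float:
    """Long-run average bitrate of one stream when silent frames are not sent."""
    if not 0.0 < activity_fraction <= 1.0:
        raise ValueError("activity_fraction must be in (0, 1]")
    return active_bitrate_bps * activity_fraction


def saving_factor(activity_fraction: float) -> float:
    """Ratio of always-on bandwidth to VAD-gated bandwidth for one stream."""
    if not 0.0 < activity_fraction <= 1.0:
        raise ValueError("activity_fraction must be in (0, 1]")
    return 1.0 / activity_fraction


# Illustrative numbers only: a participant who speaks a quarter of the time
# on a hypothetical 32 kbit/s stream.
gated_bps = average_bitrate(32000.0, 0.25)
factor = saving_factor(0.25)
```

Under this model, a saving of more than a factor of 2 corresponds to each participant being active less than half of the time, which seems plausible for a call with several participants.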