Re: [codec] #15: Efficiently combine pre-encoded audio

"Benjamin M. Schwartz" <bmschwar@fas.harvard.edu> Wed, 12 May 2010 16:40 UTC

On 05/12/2010 12:03 PM, Jean-Marc Valin wrote:
> Benjamin M. Schwartz wrote:
>> but how is decoder VAD
>> better than encoder VAD? Encoder VAD saves even more CPU, saves
>> bandwidth, and enables easier jitter buffering.
>
> There's a few reasons why I think decoder-side is better:
> - The decision for an encoder-side VAD would take some amount of space
> in the bit-stream

I think I failed to communicate that by VAD I mean _not sending packets_ 
during inactivity.  For the packets that are sent, the overhead should 
average much less than 1 bit per frame.

I'm not suggesting sending 200 packets a second containing a flag 
indicating no voice activity, followed by carefully coded background 
noise.  That would be silly.
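
To make the distinction concrete, here is a minimal sketch of what I mean by "not sending packets during inactivity". It is only an illustration, not a proposal for the codec: the energy-threshold detector, the names frame_energy_db and frames_to_send, the -40 dB threshold, and the two-frame hangover are all assumptions made up for this example.

```python
import math


def frame_energy_db(frame):
    """Mean power of one frame of float samples in [-1, 1], in dB."""
    if not frame:
        return -math.inf
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power) if power > 0.0 else -math.inf


def frames_to_send(frames, threshold_db=-40.0, hangover=2):
    """Return the indices of the frames the encoder would transmit.

    threshold_db and hangover are hypothetical tuning values. A frame is
    sent if it is above the threshold, or if it falls within `hangover`
    frames after the last active frame. Otherwise nothing is sent at all.
    """
    sent = []
    quiet_run = hangover  # start in the inactive state
    for i, frame in enumerate(frames):
        if frame_energy_db(frame) >= threshold_db:
            quiet_run = 0
            sent.append(i)
        elif quiet_run < hangover:
            quiet_run += 1
            sent.append(i)
        # else: inactivity, so no packet is produced for this frame
    return sent


if __name__ == "__main__":
    loud = [0.5] * 160
    quiet = [0.0] * 160
    print(frames_to_send([quiet, loud, quiet, quiet, quiet, quiet, loud]))
```

The point is that silent frames cost zero packets, rather than one packet each carrying a "no voice" flag.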

> - If we make an encode-side VAD mandatory, then all encoders will have
> to spend the CPU cycles, even when it's not needed. If it's not
> mandatory, then the decoder cannot rely on it, so it still needs to
> implement a VAD

I don't see this as "mandatory".  The encoder can turn off VAD, and 
probably should for full-quality applications.

> - A decoder VAD does not need to be specified in an exact way, so
> implementers can choose different implementations depending on the
> information they need.

The only thing that needs exact specification is the signalling.  The 
encoder may use it or not use it as it pleases.

> - You cannot "game" a decode-side VAD.

I don't know what this means.

>> Are you thinking about some sort of adaptive thresholding that requires
>> knowing all streams' volume levels?
>
> Well, knowing the relative amplitudes of each stream can allow you to
> take more intelligent decisions, e.g. when you have to choose the "most
> active speaker". That's something you can't really get from an encoder VAD.
>
>> Anyway, VAD can run on both encode and decode sides at the same time.
>
> That would just mean nobody would bother implementing the encode side.

I expect encode-side VAD on a conference call to save more than a factor 
of 2 in bandwidth, which makes it very desirable, especially for large 
deployments.  People will use it to save bandwidth (especially if it's on 
by default in the reference implementation).  The decode-side CPU savings 
are just a minor bonus side-effect.
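
As a rough back-of-the-envelope check on the factor-of-2 estimate, the sketch below compares "everyone sends all the time" against "each participant sends only while active". The function name and the activity fractions are hypothetical values chosen for illustration, not measurements, and the model ignores hangover frames and per-packet header overhead.

```python
def bandwidth_saving_factor(activity_fractions):
    """Ratio of always-on send time to send time with encode-side VAD.

    activity_fractions holds, for each participant, the assumed fraction
    of the call during which that participant is actually sending.
    """
    for fraction in activity_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("activity fractions must lie in [0, 1]")
    always_on = float(len(activity_fractions))  # each stream sends 100% of the time
    with_vad = sum(activity_fractions)          # each stream sends only when active
    if with_vad == 0.0:
        return float("inf")
    return always_on / with_vad


if __name__ == "__main__":
    # Hypothetical four-party call with one dominant speaker.
    print(bandwidth_saving_factor([0.5, 0.25, 0.125, 0.125]))
```

Under this simple model the saving exceeds a factor of 2 whenever the average participant is active less than half of the time, which is the usual situation once more than two people are on a call.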

--Ben