Re: [codec] #16: Multicast?

Jean-Marc Valin <> Sat, 24 April 2010 02:52 UTC

Date: Fri, 23 Apr 2010 22:52:45 -0400
From: Jean-Marc Valin <>
To: "Raymond (Juin-Hwey) Chen" <>
Cc: "" <>, 'stephen botzko' <>
Subject: Re: [codec] #16: Multicast?
List-Id: Codec WG <>

Hi Raymond,

I think I may have been a bit ambiguous in my previous email. I am 
totally in favor of supporting 5 ms frames. To me, it is becoming clear 
that we want to support both 5 ms *and* 20 ms frames, and I don't think 
it would be too hard to support both.

Regarding my comment about the overhead, my point was not that 64 
kb/s of overhead is necessarily unacceptable. My main point is that the 
codec's "sweet spot" should probably be such that the payload bit-rate 
is on the same order as the header overhead at the selected frame size. 
So when operating with 5 ms frames, since you're already paying 64 kb/s 
in headers, it's probably worth also having 64 kb/s of payload so that 
you can transmit full-band audio. On the other hand, when operating with 
20 ms frames, where the overhead is 16 kb/s, a 16 kb/s payload is 
probably the sweet spot. That way you scale audio quality at the same 
time as you reduce latency.
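For concreteness, the overhead figures above follow directly from the 
size of the uncompressed IP/UDP/RTP headers. A minimal sketch of the 
arithmetic (assuming IPv4 and excluding the link layer, as in the 
discussion above):

```python
# Header overhead vs. packet interval, assuming uncompressed
# IPv4 (20 B) + UDP (8 B) + RTP (12 B) headers and no link layer.
HEADER_BITS = (20 + 8 + 12) * 8  # 320 bits per packet

def header_overhead_kbps(packet_ms):
    """Header bit-rate when one packet is sent every packet_ms ms."""
    return HEADER_BITS / packet_ms  # bits per ms == kb/s

for ms in (5, 10, 20):
    print(f"{ms:2d} ms packets -> {header_overhead_kbps(ms):.0f} kb/s of headers")
```

This reproduces the figures quoted in the thread: 64 kb/s of headers at 
5 ms packets, 32 kb/s at 10 ms, and 16 kb/s at 20 ms.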

I hope this clears up the misunderstanding.



On 2010-04-23 21:43, Raymond (Juin-Hwey) Chen wrote:
> Hi Jean-Marc,
>
> I agree that the 20 ms frame size or packet size is more efficient in
> bit-rate.  However, this comment doesn't address my original point on
> the need to have a low-delay IETF codec for the conferencing bridge
> scenario, where the voice signal will travel through the codec twice
> (2 tandems), thus doubling the one-way codec delay.
>
> As you are well aware, codec design involves many trade-offs
> between the four major attributes of a codec: delay, complexity,
> bit-rate, and quality.  For a given codec architecture, improving one
> attribute normally means sacrificing at least one other attribute.
> Nothing comes for free.  Therefore, yes, to get low delay, you need
> to pay the price of lower bit-rate efficiency, but you can also view
> it another way: to get higher bit-rate efficiency by using a 20 ms
> frame size, you pay the price of a higher codec delay.  The question
> to ask then, is not which frame size is more bit-rate efficient, but
> whether there are application scenarios where a 20 ms frame size will
> simply make the one-way delay way too long and greatly degrade the
> users' communication experience. I believe the answer to the latter
> question is a definite "yes".
>
> Let's do some math to see why that is so.  Essentially all cellular
> codecs use a frame size of 20 ms, yet the one-way delay of a
> cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times
> the codec frame size.  This is because you have not only the codec
> buffering delay, but also processing delay, transmission delay, and
> delay due to processor sharing using real-time OS, etc.  An IP phone
> guru told me that for a typical IP phone application, it is also
> quite common to see a one-way delay of 5 times the codec frame size.
> Let's just take 5X codec frame size as the one-way delay of a typical
> implementation.  Then, even if all conference participants use their
> computers to call the conference bridge, if the IETF codec has a
> frame size of 20 ms, then after the voice signal of a talker goes
> through the IETF codec to the bridge, it already takes 100 ms one-way
> delay.  After the bridge decodes all channels, mixes, and re-encodes
> with the IETF codec and sends it to every participant, the one-way delay
> is now already up to 200 ms, way more than the 150 ms limit I
> mentioned in my last email.  Now if a talker calls into the conference
> bridge through a cell phone call that has 100 ms one-way delay to the
> edge of the Internet, by the time everyone else hears his voice, it
> is already 300 ms later.  Anyone trying to interrupt that cell phone
> caller will experience the talk-stop-talk-stop problem I mentioned
> before.  Now if another cell phone caller calls into the conference
> bridge, then the one-way delay of his voice to the first cell phone
> caller will be a whopping 400 ms! That would effectively turn the
> call into half-duplex.
>
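[The delay accumulation walked through above can be sketched 
numerically; the 5x-frame-size rule of thumb and the ~100 ms cellular 
leg are the assumptions stated in this email, not measured values.]

```python
# One-way delay in the conference-bridge scenario, assuming:
# - each codec pass (buffering, processing, transmission) costs
#   about 5x the codec frame size,
# - the bridge adds a second codec pass (decode, mix, re-encode),
# - each cellular leg adds roughly 100 ms one-way.

def one_way_delay_ms(frame_ms, codec_passes=2, cell_legs=0):
    return codec_passes * 5 * frame_ms + cell_legs * 100

print(one_way_delay_ms(20))               # PC to PC via bridge, 20 ms frames
print(one_way_delay_ms(20, cell_legs=1))  # one cellular participant
print(one_way_delay_ms(20, cell_legs=2))  # cell-to-cell via the bridge
print(one_way_delay_ms(5))                # same bridge path, 5 ms frames
```

Under these assumptions the 20 ms frame size yields 200, 300, and 400 ms 
in the three bridge scenarios, versus 50 ms with 5 ms frames.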
> When we talk about a "high-quality" conference call, it is much more
> than just the quality or distortion level of the voice signal; the
> one-way delay is also an important and integral part of the perceived
> quality of the communication link.  This is clearly documented and
> well-modeled in the E-model of the ITU-T G.107, and the 150 ms limit,
> beyond which the perceived quality sort of "falls off the cliff", was
> also obtained after careful study by telephony experts at the ITU-T.
> It would be wise for the IETF codec WG to heed the warning of the
> ITU-T experts and keep the one-way delay less than 150 ms.
>
> In contrast, if the IETF codec has a codec frame size and packet size
> of 5 ms, then the on-the-net one-way conferencing delay is only 50
> ms. Even if you use a longer jitter buffer, the one-way delay is
> still unlikely to go above 100 ms, which is still well within the
> ITU-T's 150 ms guideline.
>
> True, sending 5 ms packets means the packet header overhead would be
> higher, but that's a small price to pay to enable the conference
> participants to have a high-quality experience by avoiding the
> problems associated with a long one-way delay.  The bit-rate penalty
> is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because the
> packet header overhead at a 20 ms frame size is not zero but 16
> kb/s, so 64 - 16 = 48.
>
> Now, with the exception of a small percentage of Internet users who
> still use dial-up modems, the vast majority of the Internet users
> today connect to the Internet at a speed of at least several hundred
> kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a
> fairly small price to pay for the majority of Internet users when it
> can give them a much better high-quality experience with a much
> lower delay.
>
> Furthermore, it is possible to use header compression technology to
> shrink that 48 kb/s penalty to almost nothing.
>
> Also, even if a 5 ms packet size is overkill in some situations, a
> codec with a 5 ms frame size can easily pack two frames of
> compressed bit-stream into a 10 ms packet.  Then the packet header
> overhead bit-rate would be 32 kb/s, so the penalty shrinks by a
> factor of 3 from 48 kb/s to 32 - 16 = 16 kb/s. With 10 ms packets,
> the one-way conferencing delay would be 100 ms, still well within the
> 150 ms guideline. (Actually, since the internal "thread rate" of the
> real-time OS can still run at 5 ms intervals, the one-way delay can
> be made less than 100 ms, but that's too much detail to go into.) In
> contrast, a codec with a 20 ms frame size cannot send its bit-stream
> with 10 ms packets, unless it splits each frame across two packets,
> which IETF AVT advises against because it would effectively double
> the packet loss rate.
>
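[The penalty figures in the two paragraphs above reduce to one formula; 
a small sketch under the same 40-byte uncompressed IPv4/UDP/RTP header 
assumption used earlier in the thread:]

```python
# Header-rate penalty relative to 20 ms packets, assuming uncompressed
# IPv4/UDP/RTP headers: 40 bytes = 320 bits per packet.
HEADER_BITS = 320

def penalty_kbps(frame_ms, frames_per_packet=1):
    """Extra header bit-rate vs. 20 ms packets (bits/ms == kb/s)."""
    packet_ms = frame_ms * frames_per_packet
    return HEADER_BITS / packet_ms - HEADER_BITS / 20

print(penalty_kbps(5))     # one 5 ms frame per packet
print(penalty_kbps(5, 2))  # two 5 ms frames packed into a 10 ms packet
```

This gives the 48 kb/s penalty for 5 ms packets and the 16 kb/s penalty 
when two 5 ms frames share a 10 ms packet, as computed above.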
> The way I see it, for conference bridge applications at least, I
> think it would be a big mistake for IETF to recommend a codec with a
> frame size of 20 ms or higher.  From my analysis above, doing that
> would leave us stuck with too long a delay and the associated
> problems.
>
> Best Regards,
> Raymond
> -----Original Message-----
> From: Jean-Marc Valin []
> Sent: Thursday, April 22, 2010 9:05 PM
> To: Raymond (Juin-Hwey) Chen
> Cc: Christian Hoene; 'stephen botzko';
> Subject: Re: [codec] #16: Multicast?
>
> Hi,
> See my comments below.
>> [Raymond]: High quality is a given, but I would like to emphasize
>> the importance of low latency.
>> (1) It is well-known that the longer the latency, the lower the
>> perceived quality of the communication link. The E-model in the
>> ITU-T Recommendation G.107 models such communication quality in
>> MOS_cqe, which among other things depends on the so-called "delay
>> impairment factor" /Id/. Basically, MOS_cqe is a monotonically
>> decreasing function of increasing latency, and beyond about 150 ms
>> one-way delay, the perceived quality of the communication link
>> drops rapidly with further delay increase.
> As the author of CELT, I obviously agree that latency is an
> important aspect for this codec :-) That being said, I tend to say
> that 20 ms is still the most widely used frame size, so we might as
> well optimise for that. This is not really a problem because as the
> frame size goes down, the overhead of the IP/UDP/RTP headers goes
> up, so the codec bit-rate becomes a bit less of an issue. For example,
> with 5 ms frames, we would already be sending 64 kb/s worth of
> headers (excluding the link layer), so we might as well spend about
> as many bits on the actual payload as we spend on the headers. And
> with 64 kb/s of payload, we can actually have high-quality full-band
> audio.
>> 1) If a conference bridge has to decode a large number of voice
>> channels, mix, and re-encode, and if compressed-domain mixing
>> cannot be done (which is usually the case), then it is important to
>> keep the decoder complexity low.
> Definitely agree here. The decoder complexity is very important. Not
> only because of mixing issue, but also because the decoder is
> generally not allowed to take shortcuts to save on complexity (unlike
> the encoder). As for compressed-domain mixing, as you say it is not
> always available, but *if* we can do it (even if only partially),
> then that can result in a "free" reduction in decoder complexity for
> mixing.
>> 2) In topology b) of your other email
>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway,
>> or VoIP gateway, often has to encode and decode thousands of voice
>> channels in a single box, so not only the computational
>> complexity, but also the per-instance RAM size requirement of the
>> codec become very important for achieving high channel density in
>> the gateway.
> Agreed here, although I would say that per-instance RAM -- as long
> as it's reasonable -- is probably a bit less important than
> complexity.
>> 3) Many telephone terminal devices at the edge of the Internet use
>> embedded processors with limited processing power, and the
>> processors also have to handle many tasks other than speech coding.
>> If the IETF codec complexity is too high, some of such devices may
>> not have sufficient processing power to run it. Even if the codec
>> can fit, some battery-powered mobile devices may prefer to run a
>> lower-complexity codec to reduce power consumption and battery
>> drain. For example, even if you make an Internet phone call from a
>> computer, you may like the convenience of using a Bluetooth headset
>> that allows you to walk around a bit and have hands-free operation.
>> Currently most Bluetooth headsets have small form factors with a
>> tiny battery. This puts a severe constraint on power consumption.
>> Bluetooth headset chips typically have very limited processing
>> capability, and they have to handle many other tasks such as echo
>> cancellation and noise reduction. There is just not enough
>> processing power to handle a relatively high-complexity codec. Most
>> BT headsets today rely on the extremely low-complexity,
>> hardware-based CVSD codec at 64 kb/s to transmit narrowband voice,
>> but CVSD has audible coding noise, so it degrades the overall audio
>> quality. If the IETF codec has low enough complexity, it would be
>> possible to directly encode and decode the IETF codec bit-stream at
>> the BT headset, thus avoiding the quality degradation of CVSD
>> transcoding.
> Any idea what the complexity requirements would be for this use-case
> to be possible?
> Cheers,
> Jean-Marc