Re: [codec] #16: Multicast?

"Raymond (Juin-Hwey) Chen" <> Sat, 24 April 2010 01:44 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id C21363A681C for <>; Fri, 23 Apr 2010 18:44:20 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -0.432
X-Spam-Status: No, score=-0.432 tagged_above=-999 required=5 tests=[AWL=-0.433, BAYES_50=0.001]
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id q3gTjfsFqqLe for <>; Fri, 23 Apr 2010 18:44:18 -0700 (PDT)
Received: from ( []) by (Postfix) with ESMTP id 290D93A67A6 for <>; Fri, 23 Apr 2010 18:44:18 -0700 (PDT)
Received: from [] by with ESMTP (Broadcom SMTP Relay (Email Firewall v6.3.2)); Fri, 23 Apr 2010 18:43:56 -0700
X-Server-Uuid: B55A25B1-5D7D-41F8-BC53-C57E7AD3C201
Received: from ([]) by ([]) with mapi; Fri, 23 Apr 2010 18:45:19 -0700
From: "Raymond (Juin-Hwey) Chen" <>
To: Jean-Marc Valin <>
Date: Fri, 23 Apr 2010 18:43:45 -0700
Thread-Topic: [codec] #16: Multicast?
Thread-Index: AcrimhyNUh3NLsf5RvSfcmuOhmvRfwAo3/xA
Message-ID: <>
References: <> <> <> <> <> <000001cae173$dba012f0$92e038d0$@de> <> <001101cae177$e8aa6780$b9ff3680$@de> <> <002d01cae188$a330b2c0$e9921840$@de> <> <>
In-Reply-To: <>
Accept-Language: en-US
Content-Language: en-US
x-cr-puzzleid: {2C1AE9B2-5797-4739-8F73-2E93DAD8954B}
acceptlanguage: en-US
MIME-Version: 1.0
X-WSS-ID: 67CC935631G101627727-01-01
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Cc: "" <>, 'stephen botzko' <>
Subject: Re: [codec] #16: Multicast?
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: Codec WG <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Sat, 24 Apr 2010 01:44:20 -0000

Hi Jean-Marc,

I agree that a 20 ms frame size or packet size is more efficient in bit-rate.  However, this does not address my original point about the need for a low-delay IETF codec in the conferencing bridge scenario, where the voice signal travels through the codec twice (2 tandems), thus doubling the one-way codec delay.

As you are well aware, codec design involves many trade-offs among the four major attributes of a codec: delay, complexity, bit-rate, and quality.  For a given codec architecture, improving one attribute normally means sacrificing at least one other.  Nothing comes for free.  So yes, to get low delay you pay the price of lower bit-rate efficiency, but you can also view it the other way: to get higher bit-rate efficiency with a 20 ms frame size, you pay the price of a higher codec delay.  The question to ask, then, is not which frame size is more bit-rate efficient, but whether there are application scenarios where a 20 ms frame size will simply make the one-way delay far too long and greatly degrade the users' communication experience. I believe the answer to the latter question is a definite "yes".

Let's do some math to see why.  Essentially all cellular codecs use a frame size of 20 ms, yet the one-way delay of a cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times the codec frame size.  This is because there is not only the codec buffering delay, but also processing delay, transmission delay, delay due to processor sharing in a real-time OS, and so on.  An IP phone guru told me that for a typical IP phone application it is also quite common to see a one-way delay of 5 times the codec frame size, so let's take 5X the codec frame size as the one-way delay of a typical implementation.  Then, even if all conference participants use their computers to call the conference bridge, with a 20 ms frame size the voice signal of a talker already incurs 100 ms of one-way delay by the time it reaches the bridge.  After the bridge decodes all channels, mixes, re-encodes with the IETF codec, and sends the result to every participant, the one-way delay is already up to 200 ms, well over the 150 ms limit I mentioned in my last email.  Now if a talker calls into the conference bridge through a cell phone connection that has 100 ms of one-way delay to the edge of the Internet, by the time everyone else hears his voice, it is already 300 ms later.  Anyone trying to interrupt that cell phone caller will experience the talk-stop-talk-stop problem I mentioned before.  And if another cell phone caller calls into the conference bridge, the one-way delay of his voice to the first cell phone caller will be a whopping 400 ms! That would probably make the call effectively half-duplex.
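To make the arithmetic above concrete, here is a small sketch of the delay budget.  It simply encodes the assumptions stated above (one-way delay per codec pass of roughly 5x the codec frame size, and roughly 100 ms per cell-phone leg to the edge of the Internet); the function name is just for illustration:

```python
# Delay-budget sketch for the conference-bridge scenario above.
# Assumptions: one-way delay per codec pass ~= 5x the codec frame size,
# and each cell-phone leg adds ~100 ms to the edge of the Internet.

def one_way_delay_ms(frame_ms, cell_legs=0, cell_leg_ms=100):
    leg_ms = 5 * frame_ms                          # one codec pass
    return 2 * leg_ms + cell_legs * cell_leg_ms    # tandem: two codec passes

print(one_way_delay_ms(20))               # PC-to-PC via bridge: 200 ms
print(one_way_delay_ms(20, cell_legs=1))  # one cell caller:     300 ms
print(one_way_delay_ms(20, cell_legs=2))  # cell-to-cell:        400 ms
print(one_way_delay_ms(5))                # 5 ms frames, PC-PC:   50 ms
```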

When we talk about a "high-quality" conference call, it is about much more than just the quality or distortion level of the voice signal; the one-way delay is also an important and integral part of the perceived quality of the communication link.  This is clearly documented and well modeled in the E-model of ITU-T Recommendation G.107, and the 150 ms limit, beyond which the perceived quality "falls off a cliff", was also obtained after careful study by telephony experts at the ITU-T.  It would be wise for the IETF codec WG to heed the warning of the ITU-T experts and keep the one-way delay below 150 ms.

In contrast, if the IETF codec has a codec frame size and packet size of 5 ms, then the on-the-net one-way conferencing delay is only 50 ms. Even if you use a longer jitter buffer, the one-way delay is still unlikely to go above 100 ms, which is still well within the ITU-T's 150 ms guideline.

True, sending 5 ms packets means the packet header overhead would be higher, but that is a small price to pay to give conference participants a high-quality experience by avoiding the problems associated with a long one-way delay.  Note that the bit-rate penalty is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because the header overhead for a 20 ms frame size is not zero but 16 kb/s, so the penalty is 64 - 16 = 48 kb/s.
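The 48 kb/s figure can be checked with a quick sketch, assuming 40 bytes of IPv4/UDP/RTP headers per packet (20 + 8 + 12, excluding the link layer) and one codec frame per packet:

```python
# Header-overhead sketch: 40 bytes of IPv4/UDP/RTP headers per packet
# (20 + 8 + 12), one codec frame per packet, link-layer headers excluded.
HEADER_BITS = 40 * 8  # 320 bits per packet

def header_rate_kbps(packet_ms):
    packets_per_second = 1000 / packet_ms
    return HEADER_BITS * packets_per_second / 1000

print(header_rate_kbps(5))                         # 64.0 kb/s
print(header_rate_kbps(20))                        # 16.0 kb/s
print(header_rate_kbps(5) - header_rate_kbps(20))  # 48.0 kb/s penalty
```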

Now, with the exception of the small percentage of Internet users who still use dial-up modems, the vast majority of Internet users today connect to the Internet at a speed of at least several hundred kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a fairly small price to pay for the majority of Internet users when it can give them a much better, high-quality experience with a much lower delay.

Furthermore, it is possible to use header compression technology to shrink that 48 kb/s penalty to almost nothing.

Also, even if a 5 ms packet size is overkill in some situations, a codec with a 5 ms frame size can easily pack two frames of compressed bit-stream into a 10 ms packet.  Then the packet header overhead bit-rate would be 32 kb/s, so the penalty shrinks by a factor of 3, from 48 kb/s to 32 - 16 = 16 kb/s. With 10 ms packets, the one-way conferencing delay would be 100 ms, still well within the 150 ms guideline. (Actually, since the internal "thread rate" of a real-time OS can still run at 5 ms intervals, the one-way delay can be made less than 100 ms, but that's too much detail to go into here.) In contrast, a codec with a 20 ms frame size cannot send its bit-stream in 10 ms packets, unless it spreads each frame across two packets, which the IETF AVT working group advises against, because it effectively doubles the packet loss rate.
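The effect of packing several codec frames into one packet can be sketched the same way (same assumed 40 bytes, i.e. 320 bits, of IPv4/UDP/RTP headers per packet; the function is illustrative only):

```python
# Overhead when packing several codec frames into one packet,
# assuming 320 bits (40 bytes) of IPv4/UDP/RTP headers per packet.
def header_rate_kbps(frame_ms, frames_per_packet=1):
    packet_ms = frame_ms * frames_per_packet       # one packet per N frames
    return 320 * (1000 / packet_ms) / 1000

# Two 5 ms frames per 10 ms packet halves the packet rate vs 5 ms packets:
print(header_rate_kbps(5, frames_per_packet=2))    # 32.0 kb/s
# Penalty relative to 20 ms framing shrinks from 48 to 16 kb/s:
print(header_rate_kbps(5, frames_per_packet=2) - header_rate_kbps(20))  # 16.0
```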

The way I see it, for conference bridge applications at least, it would be a big mistake for the IETF to recommend a codec with a frame size of 20 ms or higher.  As my analysis above shows, doing so would leave us stuck with too long a delay and the problems that come with it.

Best Regards,


-----Original Message-----
From: Jean-Marc Valin [] 
Sent: Thursday, April 22, 2010 9:05 PM
To: Raymond (Juin-Hwey) Chen
Cc: Christian Hoene; 'stephen botzko';
Subject: Re: [codec] #16: Multicast?


See my comments below.

> [Raymond]: High quality is a given, but I would like to emphasize the 
> importance of low latency.
> (1) It is well-known that the longer the latency, the lower the 
> perceived quality of the communication link. The E-model in the ITU-T 
> Recommendation G.107 models such communication quality in MOS_cqe, 
> which among other things depends on the so-called "delay impairment 
> factor" /Id/. Basically, MOS_cqe is a monotonically decreasing 
> function of increasing latency, and beyond about 150 ms one-way delay, 
> the perceived quality of the communication link drops rapidly with 
> further delay increase.

As the author of CELT, I obviously agree that latency is an important 
aspect for this codec :-) That being said, I tend to say that 20 ms is 
still the most widely used frame size, so we might as well optimise for 
that. This is not really a problem because as the frame size goes down, 
the overhead of the IP/UDP/RTP headers goes up, so the codec bit-rate 
becomes a bit less of an issue. For example, with 5 ms frames, we would 
already be sending 64 kb/s worth of headers (excluding the link layer), 
so we might as well spend about as many bits on the actual payload as we 
spend on the headers. And with 64 kb/s of payload, we can actually have 
high-quality full-band audio.

> 1) If a conference bridge has to decode a large number of voice 
> channels, mix, and re-encode, and if compressed-domain mixing cannot 
> be done (which is usually the case), then it is important to keep the 
> decoder complexity low.

Definitely agree here. The decoder complexity is very important. Not 
only because of the mixing issue, but also because the decoder is generally 
not allowed to take shortcuts to save on complexity (unlike the 
encoder). As for compressed-domain mixing, as you say it is not always 
available, but *if* we can do it (even if only partially), then that can 
result in a "free" reduction in decoder complexity for mixing.

> 2) In topology b) of your other email 
> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or 
> VoIP gateway, often has to encode and decode thousands of voice 
> channels in a single box, so not only the computational complexity, 
> but also the per-instance RAM size requirement of the codec become 
> very important for achieving high channel density in the gateway.

Agreed here, although I would say that per-instance RAM -- as long as 
it's reasonable -- is probably a bit less important than complexity.

> 3) Many telephone terminal devices at the edge of the Internet use 
> embedded processors with limited processing power, and the processors 
> also have to handle many tasks other than speech coding. If the IETF 
> codec complexity is too high, some of such devices may not have 
> sufficient processing power to run it. Even if the codec can fit, some 
> battery-powered mobile devices may prefer to run a lower-complexity 
> codec to reduce power consumption and battery drain. For example, even 
> if you make an Internet phone call from a computer, you may like the 
> convenience of using a Bluetooth headset that allows you to walk 
> around a bit and have hands-free operation. Currently most Bluetooth 
> headsets have small form factors with a tiny battery. This puts a 
> severe constraint on power consumption. Bluetooth headset chips 
> typically have very limited processing capability, and they have to 
> handle many other tasks such as echo cancellation and noise reduction. 
> There is just not enough processing power to handle a relatively 
> high-complexity codec. Most BT headsets today rely on the extremely 
> low-complexity, hardware-based CVSD codec at 64 kb/s to transmit 
> narrowband voice, but CVSD has audible coding noise, so it degrades 
> the overall audio quality. If the IETF codec has low enough 
> complexity, it would be possible to directly encode and decode the 
> IETF codec bit-stream at the BT headset, thus avoiding the quality 
> degradation of CVSD transcoding.

Any idea what the complexity requirements would be for this use-case to 
be possible?