Re: [codec] #16: Multicast?

Koen Vos <> Sat, 24 April 2010 20:56 UTC


Quoting "Raymond (Juin-Hwey) Chen":

> An IP phone  guru told me that for a typical IP phone application,  
> it is also quite common to see a one-way delay of 5 times the codec  
> frame size.

Sure - for certain frame sizes.  But 1 ms frames won't give you 5 ms  
one-way delay.

For a well-designed system and a typical Internet connection:
- most delay comes from the network and is not codec related, and
- one-way delay grows almost linearly with frame size.
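The two bullets above can be sketched with a toy delay budget. The fixed network delay and the number of buffered frame intervals below are illustrative assumptions, not measurements:

```python
# Toy one-way delay budget: a fixed network component plus a few
# frame intervals of buffering (packetization + jitter buffer).
# The constants are illustrative assumptions, not measurements.

def one_way_delay_ms(frame_ms, network_fixed_ms=40.0, frames_buffered=3):
    """Delay grows almost linearly with frame size."""
    return network_fixed_ms + frames_buffered * frame_ms

print(one_way_delay_ms(20.0))  # 100.0 -> about 5x a 20 ms frame
print(one_way_delay_ms(1.0))   # 43.0  -> nowhere near 5x a 1 ms frame
```

Under these assumptions the "5x the frame size" rule of thumb happens to hold at 20 ms frames, but not at 1 ms frames, because the fixed network component dominates.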

> Furthermore, it is possible to use header compression technology to  
> shrink that 48 kb/s penalty to almost nothing.

AFAIK, only RTP headers can be compressed between arbitrary Internet  
endpoints.  You're still stuck with the IP and UDP headers.
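For reference, the 64 / 16 / 48 kb/s figures traded back and forth in this thread all follow from the uncompressed 40-byte IPv4 + UDP + RTP header (20 + 8 + 12 bytes) divided by the packet interval; a quick sketch:

```python
# Per-packet header overhead expressed as a bit-rate, assuming one
# codec frame per packet and uncompressed IPv4 (20 B) + UDP (8 B) +
# RTP (12 B) headers, i.e. 40 bytes of headers per packet.

HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP

def header_overhead_kbps(packet_ms):
    """Header bit-rate in kb/s (bits per millisecond equals kb/s)."""
    return HEADER_BYTES * 8 / packet_ms

print(header_overhead_kbps(5))                             # 64.0 kb/s
print(header_overhead_kbps(20))                            # 16.0 kb/s
print(header_overhead_kbps(5) - header_overhead_kbps(20))  # 48.0 kb/s penalty
```

With IPv6 the per-packet header cost is 60 bytes rather than 40, so these numbers would be proportionally larger.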


Quoting "Raymond (Juin-Hwey) Chen" <>:
> Hi Jean-Marc,
>
> I agree that the 20 ms frame size or packet size is more efficient  
> in bit-rate.  However, this comment doesn't address my original  
> point on the need to have a low-delay IETF codec for the  
> conferencing bridge scenario, where the voice signal will travel  
> through the codec twice (2 tandems), thus doubling the one-way codec  
> delay.
>
> As you are well aware, codec design involves many trade-offs  
> between the four major attributes of a codec: delay, complexity,  
> bit-rate, and quality.  For a given codec architecture, improving  
> one attribute normally means sacrificing at least one other  
> attribute.  Nothing comes for free.  Therefore, yes, to get low  
> delay, you need to pay the price of lower bit-rate efficiency, but  
> you can also view it another way: to get higher bit-rate efficiency  
> by using a 20 ms frame size, you pay the price of a higher codec  
> delay.  The question to ask then, is not which frame size is more  
> bit-rate efficient, but whether there are application scenarios  
> where a 20 ms frame size will simply make the one-way delay way too  
> long and greatly degrade the users' communication experience. I  
> believe the answer to the latter question is a definite "yes".
>
> Let's do some math to see why that is so.  Essentially all cellular  
> codecs use a frame size of 20 ms, yet the one-way delay of a  
> cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times  
> the codec frame size.  This is because you have not only the codec  
> buffering delay, but also processing delay, transmission delay, and  
> delay due to processor sharing using real-time OS, etc.  An IP phone  
> guru told me that for a typical IP phone application, it is also  
> quite common to see a one-way delay of 5 times the codec frame size.  
>  Let's just take 5X codec frame size as the one-way delay of a  
> typical implementation.  Then, even if all conference participants  
> use their computers to call the conference bridge, if the IETF codec  
> has a frame size of 20 ms, then after the voice signal of a talker  
> goes through the IETF codec to the bridge, it already takes 100 ms  
> one-way delay.  After the bridge decodes all channels, mixes, and  
> re-encodes with the IETF codec and sends to every participant, the  
> one-way delay is now already up to 200 ms, way more than  
> the 150 ms limit I mentioned in my last email.  Now if a talker calls  
> into the conference bridge through a cell phone call that has 100 ms  
> one-way delay to the edge of the Internet, by the time everyone else  
> hears his voice, it is already 300 ms later.  Anyone trying to  
> interrupt that cell phone caller will experience the  
> talk-stop-talk-stop problem I mentioned before.  Now if another cell  
> phone caller calls into the conference bridge, then the one-way delay  
> of his voice to the first cell phone caller will be a whopping 400  
> ms! That would probably turn it into half-duplex effectively.
>
> When we talk about a "high-quality" conference call, it is much more  
> than just the quality or distortion level of the voice signal; the  
> one-way delay is also an important and integral part of the  
> perceived quality of the communication link.  This is clearly  
> documented and well-modeled in the E-model of the ITU-T G.107, and  
> the 150 ms limit, beyond which the perceived quality sort of "falls  
> off the cliff", was also obtained after careful study by telephony  
> experts at the ITU-T.  It would be wise for the IETF codec WG to  
> heed the warning of the ITU-T experts and keep the one-way delay  
> less than 150 ms.
>
> In contrast, if the IETF codec has a codec frame size and packet  
> size of 5 ms, then the on-the-net one-way conferencing delay is only  
> 50 ms. Even if you use a longer jitter buffer, the one-way delay is  
> still unlikely to go above 100 ms, which is still well within the  
> ITU-T's 150 ms guideline.
>
> True, sending 5 ms packets means the packet header overhead would be  
> higher, but that's a small price to pay to enable the conference  
> participants to have a high-quality experience by avoiding the  
> problems associated with a long one-way delay.  The bit-rate penalty  
> is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because you  
> don't get zero packet header overhead for a 20 ms frame size, but 16  
> kb/s, so 64 - 16 = 48.
>
> Now, with the exception of a small percentage of Internet users who  
> still use dial-up modems, the vast majority of the Internet users  
> today connect to the Internet at a speed of at least several hundred  
> kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a  
> fairly small price to pay for the majority of Internet users when it  
> can give them a much better high-quality experience with a much  
> lower delay.
>
> Furthermore, it is possible to use header compression technology to  
> shrink that 48 kb/s penalty to almost nothing.
>
> Also, even if a 5 ms packet size is an overkill in some situations,  
> a codec with a 5 ms frame size can easily pack two frames of  
> compressed bit-stream into a 10 ms packet.  Then the packet header  
> overhead bit-rate would be 32 kb/s, so the penalty shrinks by a  
> factor of 3 from 48 kb/s to 32 - 16 = 16 kb/s. With 10 ms packets,  
> the one-way conferencing delay would be 100 ms, still well within  
> the 150 ms guideline. (Actually, since the internal "thread rate" of  
> real-time OS can still run at 5 ms intervals, the one-way delay can  
> be made less than 100 ms, but that's too much detail to go into.) In  
> contrast, a codec with a 20 ms frame size cannot send its bit-stream  
> with 10 ms packets, unless it spreads each frame into two packets,  
> which is what IETF AVT advises against, because it will effectively  
> double the packet loss rate.
>
> The way I see it, for conference bridge applications at least, I  
> think it would be a big mistake for IETF to recommend a codec with a  
> frame size of 20 ms or higher.  From my analysis above, by doing  
> that we will be stuck with too long a delay and the associated  
> problems.
>
> Best Regards,
> Raymond
> -----Original Message-----
> From: Jean-Marc Valin []
> Sent: Thursday, April 22, 2010 9:05 PM
> To: Raymond (Juin-Hwey) Chen
> Cc: Christian Hoene; 'stephen botzko';
> Subject: Re: [codec] #16: Multicast?
> Hi,
> See my comments below.
>> [Raymond]: High quality is a given, but I would like to emphasize the
>> importance of low latency.
>> (1) It is well-known that the longer the latency, the lower the
>> perceived quality of the communication link. The E-model in the ITU-T
>> Recommendation G.107 models such communication quality in MOS_cqe,
>> which among other things depends on the so-called "delay impairment
>> factor" /Id/. Basically, MOS_cqe is a monotonically decreasing
>> function of increasing latency, and beyond about 150 ms one-way delay,
>> the perceived quality of the communication link drops rapidly with
>> further delay increase.
> As the author of CELT, I obviously agree that latency is an important
> aspect for this codec :-) That being said, I tend to say that 20 ms is
> still the most widely used frame size, so we might as well optimise for
> that. This is not really a problem because as the frame size goes down,
> the overhead of the IP/UDP/RTP headers goes up, so the codec bit-rate
> becomes a bit less of an issue. For example, with 5 ms frames, we would
> already be sending 64 kb/s worth of headers (excluding the link layer),
> so we might as well spend about as many bits on the actual payload as we
> spend on the headers. And with 64 kb/s of payload, we can actually have
> high-quality full-band audio.
>> 1) If a conference bridge has to decode a large number of voice
>> channels, mix, and re-encode, and if compressed-domain mixing cannot
>> be done (which is usually the case), then it is important to keep the
>> decoder complexity low.
> Definitely agree here. The decoder complexity is very important. Not
> only because of mixing issue, but also because the decoder is generally
> not allowed to take shortcuts to save on complexity (unlike the
> encoder). As for compressed-domain mixing, as you say it is not always
> available, but *if* we can do it (even if only partially), then that can
> result in a "free" reduction in decoder complexity for mixing.
>> 2) In topology b) of your other email
>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or
>> VoIP gateway, often has to encode and decode thousands of voice
>> channels in a single box, so not only the computational complexity,
>> but also the per-instance RAM size requirement of the codec become
>> very important for achieving high channel density in the gateway.
> Agreed here, although I would say that per-instance RAM -- as long as
> it's reasonable -- is probably a bit less important than complexity.
>> 3) Many telephone terminal devices at the edge of the Internet use
>> embedded processors with limited processing power, and the processors
>> also have to handle many tasks other than speech coding. If the IETF
>> codec complexity is too high, some of such devices may not have
>> sufficient processing power to run it. Even if the codec can fit, some
>> battery-powered mobile devices may prefer to run a lower-complexity
>> codec to reduce power consumption and battery drain. For example, even
>> if you make an Internet phone call from a computer, you may like the
>> convenience of using a Bluetooth headset that allows you to walk
>> around a bit and have hands-free operation. Currently most Bluetooth
>> headsets have small form factors with a tiny battery. This puts a
>> severe constraint on power consumption. Bluetooth headset chips
>> typically have very limited processing capability, and it has to
>> handle many other tasks such as echo cancellation and noise reduction.
>> There is just not enough processing power to handle a relatively
>> high-complexity codec. Most BT headsets today rely on the extremely
>> low-complexity, hardware-based CVSD codec at 64 kb/s to transmit
>> narrowband voice, but CVSD has audible coding noise, so it degrades
>> the overall audio quality. If the IETF codec has low enough
>> complexity, it would be possible to directly encode and decode the
>> IETF codec bit-stream at the BT headset, thus avoiding the quality
>> degradation of CVSD transcoding.
> Any idea what the complexity requirements would be for this use-case to
> be possible?
> Cheers,
> Jean-Marc