Re: [codec] #16: Multicast?

Mikael Abrahamsson <> Sat, 01 May 2010 04:05 UTC

Date: Sat, 01 May 2010 06:04:50 +0200
From: Mikael Abrahamsson <>
To: "Raymond (Juin-Hwey) Chen" <>
Organization: People's Front Against WWW
Subject: Re: [codec] #16: Multicast?

On Fri, 30 Apr 2010, Raymond (Juin-Hwey) Chen wrote:

I think what the 2X or 3X factor is capturing is what we in the networking 
world call "serialisation delay". Most equipment today receives the whole 
packet, looks at it, then sends it out (store and forward). That means 
that on a 2 megabit/s link, it takes:

0.0004 s (0.4 ms) to send out a 100-byte packet.

With multiple such links for the packet to traverse, it might be possible 
to see that kind of amplification of the frame size (even though I can't 
get it to amplify quite that much, for instance when going from a 5 ms to 
a 20 ms packet interval). That could be the case if there are a lot of 
slow links for the packet to traverse (512 kilobit/s, for instance).

We stop worrying about serialisation delays when speeds go over several 
hundred megabit/s, because on a gigabit Ethernet link the serialisation 
delay for a 1500-byte packet is only 0.012 ms.
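
As a back-of-the-envelope sketch, the per-hop serialisation delay is just 
the packet size in bits divided by the link speed. In Python (the function 
name is illustrative; the figures are the ones above):

    # Serialisation delay: time to clock a packet onto the wire. With
    # store-and-forward equipment this cost is paid again at every hop.
    def serialisation_delay_ms(packet_bytes, link_bps):
        return packet_bytes * 8 / link_bps * 1000

    print(serialisation_delay_ms(100, 2_000_000))       # 2 Mbit/s   -> 0.4 ms
    print(serialisation_delay_ms(100, 512_000))         # 512 kbit/s -> ~1.6 ms
    print(serialisation_delay_ms(1500, 1_000_000_000))  # GigE, 1500 B -> 0.012 ms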

> Hi Cullen,
> After my original email below, there were several follow-up emails on this topic, and in them I replaced the oversimplified 5X formula for total delay with the following more detailed formula for typical IP phone implementations:
>  One-way delay = codec-independent delay + 3*(codec frame size) + (codec look-ahead) + (codec filtering delay if any)
> This formula came from an experienced engineer who has been working in IP phone-related fields for more than a decade, and it is based on one-way delays actually observed in real-world IP phone implementations.  A similar 3X multiplier is also observed in VoIP gateways.  Even with a fast processor and a system optimized from the ground up to be low-delay, the measured "codec-dependent" one-way delay of such a VoIP gateway using the G.711 codec with a 5 ms frame/packet size is between 12 and 17 ms, or around 3X the frame size.
> The ITU-T uses 2X codec frame size + look-ahead as the one-way codec delay.  In some ideal situations that 2X multiplier is probably achievable, but 3X is what you are much more likely to find in typical real-world implementations.
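> As a minimal sketch of that formula (the numeric inputs below are assumed examples for illustration, not measurements):
>
>     # One-way delay = codec-independent delay + 3*(codec frame size)
>     #                 + look-ahead + filtering delay (if any)
>     def one_way_delay_ms(frame_ms, lookahead_ms, filter_ms, codec_independent_ms):
>         return codec_independent_ms + 3 * frame_ms + lookahead_ms + filter_ms
>
>     # Codec-dependent part alone for 5 ms frames: 3 * 5 = 15 ms, inside
>     # the 12-17 ms range measured for the G.711 gateway mentioned above.
>     print(one_way_delay_ms(frame_ms=20, lookahead_ms=5,
>                            filter_ms=0, codec_independent_ms=40))  # assumed inputs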
> Raymond
> -----Original Message-----
> From: Cullen Jennings []
> Sent: Friday, April 30, 2010 8:02 AM
> To: Raymond (Juin-Hwey) Chen
> Cc:
> Subject: Re: [codec] #16: Multicast?
> I don't agree with this logic.
> You seem to be claiming that network delay scales as a factor of audio frame length. I don't think this is true. If the network takes 100 ms to deliver a 5 ms audio packet from a DSL subscriber in Europe to a conference bridge in the US, I see no reason to believe that a packet with 20 ms of audio is going to have a significantly different network transport time. So if we were on a conference call with over 300 ms of latency (which is very common, BTW), it does not seem that changing from 20 ms to 5 ms is going to result in a significant (say, over 25%) reduction of latency.
> I'm not arguing for or against 5 ms packets; I'm just saying the logic of "network latency is 5X the packet audio size" makes no sense to me.
> Cullen
> On Apr 23, 2010, at 7:43 PM, Raymond (Juin-Hwey) Chen wrote:
>> Hi Jean-Marc,
>> I agree that the 20 ms frame size or packet size is more efficient in bit-rate.  However, this comment doesn't address my original point about the need for a low-delay IETF codec in the conferencing bridge scenario, where the voice signal travels through the codec twice (2 tandems), thus doubling the one-way codec delay.
>> As you are well aware, codec design involves many trade-offs between the four major attributes of a codec: delay, complexity, bit-rate, and quality.  For a given codec architecture, improving one attribute normally means sacrificing at least one other attribute.  Nothing comes for free.  Therefore, yes, to get low delay you need to pay the price of lower bit-rate efficiency, but you can also view it another way: to get higher bit-rate efficiency by using a 20 ms frame size, you pay the price of a higher codec delay.  The question to ask, then, is not which frame size is more bit-rate efficient, but whether there are application scenarios where a 20 ms frame size will simply make the one-way delay far too long and greatly degrade the users' communication experience.  I believe the answer to the latter question is a definite "yes".
>> Let's do some math to see why that is so.  Essentially all cellular codecs use a frame size of 20 ms, yet the one-way delay of a cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times the codec frame size.  This is because you have not only the codec buffering delay, but also processing delay, transmission delay, delay due to processor sharing under a real-time OS, etc.  An IP phone guru told me that for a typical IP phone application it is also quite common to see a one-way delay of 5 times the codec frame size.  Let's just take 5X the codec frame size as the one-way delay of a typical implementation.  Then, even if all conference participants use their computers to call the conference bridge, if the IETF codec has a frame size of 20 ms, the voice signal of a talker already incurs 100 ms of one-way delay by the time it reaches the bridge.  After the bridge decodes all channels, mixes, re-encodes with the IETF codec, and sends to every participant, the one-way delay is already up to 200 ms, well over the 150 ms limit I mentioned in my last email.  Now, if a talker calls into the conference bridge through a cell phone call that has 100 ms one-way delay to the edge of the Internet, by the time everyone else hears his voice, it is already 300 ms later.  Anyone trying to interrupt that cell phone caller will experience the talk-stop-talk-stop problem I mentioned before.  And if another cell phone caller calls into the conference bridge, the one-way delay of his voice to the first cell phone caller will be a whopping 400 ms!  That would probably make the call effectively half-duplex.
>> When we talk about "high-quality" conference call, it is much more than just the quality or distortion level of the voice signal; the one-way delay is also an important and integral part of the perceived quality of the communication link.  This is clearly documented and well-modeled in the E-model of the ITU-T G.107, and the 150 ms limit, beyond which the perceived quality sort of "falls off the cliff", was also obtained after careful study by telephony experts at the ITU-T.  It would be wise for the IETF codec WG to heed the warning of the ITU-T experts and keep the one-way delay less than 150 ms.
>> In contrast, if the IETF codec has a codec frame size and packet size of 5 ms, then the on-the-net one-way conferencing delay is only 50 ms. Even if you use a longer jitter buffer, the one-way delay is still unlikely to go above 100 ms, which is still well within the ITU-T's 150 ms guideline.
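>> As a quick sketch of this arithmetic (the 5X multiplier is the rule of thumb above; the access-leg delays are the assumed cellular figures):
>>
>>     # One-way conferencing delay: 5X the frame size per codec pass, two
>>     # passes (talker -> bridge, bridge -> listener), plus any access leg.
>>     def conferencing_delay_ms(frame_ms, multiplier=5, tandems=2, access_ms=0):
>>         return multiplier * frame_ms * tandems + access_ms
>>
>>     print(conferencing_delay_ms(20))                 # 200 ms, 20 ms frames
>>     print(conferencing_delay_ms(20, access_ms=100))  # 300 ms, one cellular leg
>>     print(conferencing_delay_ms(20, access_ms=200))  # 400 ms, cell to cell
>>     print(conferencing_delay_ms(5))                  #  50 ms, 5 ms frames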
>> True, sending 5 ms packets means the packet header overhead would be higher, but that's a small price to pay to enable the conference participants to have a high-quality experience by avoiding the problems associated with a long one-way delay.  The bit-rate penalty is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because the packet header overhead for a 20 ms frame size is not zero but 16 kb/s, so 64 - 16 = 48.
>> Now, with the exception of a small percentage of Internet users who still use dial-up modems, the vast majority of Internet users today connect to the Internet at a speed of at least several hundred kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a fairly small price to pay for the majority of Internet users when it can give them a much better, high-quality experience with a much lower delay.
>> Furthermore, it is possible to use header compression technology to shrink that 48 kb/s penalty to almost nothing.
>> Also, even if a 5 ms packet size is overkill in some situations, a codec with a 5 ms frame size can easily pack two frames of compressed bit-stream into a 10 ms packet.  Then the packet header overhead bit-rate would be 32 kb/s, so the penalty shrinks by a factor of 3, from 48 kb/s to 32 - 16 = 16 kb/s.  With 10 ms packets, the one-way conferencing delay would be 100 ms, still well within the 150 ms guideline.  (Actually, since the internal "thread rate" of the real-time OS can still run at 5 ms intervals, the one-way delay can be made less than 100 ms, but that's too much detail to go into.)  In contrast, a codec with a 20 ms frame size cannot send its bit-stream in 10 ms packets, unless it spreads each frame across two packets, which is what the IETF AVT advises against, because it effectively doubles the packet loss rate.
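>> These overhead figures follow from the roughly 40 bytes of uncompressed IP/UDP/RTP headers per packet (IPv4 20 + UDP 8 + RTP 12); as a sketch:
>>
>>     # Header overhead bit-rate as a function of packetization interval.
>>     HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12), link layer excluded
>>
>>     def header_overhead_kbps(packet_ms):
>>         return HEADER_BYTES * 8 / packet_ms  # bits per ms == kb/s
>>
>>     for ms in (5, 10, 20):
>>         print(ms, header_overhead_kbps(ms))  # 64, 32, 16 kb/s
>>     # Penalties vs. 20 ms packets: 64 - 16 = 48 kb/s and 32 - 16 = 16 kb/s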
>> The way I see it, for conference bridge applications at least, it would be a big mistake for the IETF to recommend a codec with a frame size of 20 ms or higher.  From my analysis above, doing so would leave us stuck with too long a delay and the problems that come with it.
>> Best Regards,
>> Raymond
>> -----Original Message-----
>> From: Jean-Marc Valin []
>> Sent: Thursday, April 22, 2010 9:05 PM
>> To: Raymond (Juin-Hwey) Chen
>> Cc: Christian Hoene; 'stephen botzko';
>> Subject: Re: [codec] #16: Multicast?
>> Hi,
>> See my comments below.
>>> [Raymond]: High quality is a given, but I would like to emphasize the
>>> importance of low latency.
>>> (1) It is well-known that the longer the latency, the lower the
>>> perceived quality of the communication link. The E-model in the ITU-T
>>> Recommendation G.107 models such communication quality in MOS_cqe,
>>> which among other things depends on the so-called "delay impairment
>>> factor" /Id/. Basically, MOS_cqe is a monotonically decreasing
>>> function of increasing latency, and beyond about 150 ms one-way delay,
>>> the perceived quality of the communication link drops rapidly with
>>> further delay increase.
>> As the author of CELT, I obviously agree that latency is an important
>> aspect for this codec :-) That being said, I tend to say that 20 ms is
>> still the most widely used frame size, so we might as well optimise for
>> that. This is not really a problem because as the frame size goes down,
>> the overhead of the IP/UDP/RTP headers goes up, so the codec bit-rate
>> becomes a bit less of an issue. For example, with 5 ms frames, we would
>> already be sending 64 kb/s worth of headers (excluding the link layer),
>> so we might as well spend about as many bits on the actual payload as we
>> spend on the headers. And with 64 kb/s of payload, we can actually have
>> high-quality full-band audio.
>>> 1) If a conference bridge has to decode a large number of voice
>>> channels, mix, and re-encode, and if compressed-domain mixing cannot
>>> be done (which is usually the case), then it is important to keep the
>>> decoder complexity low.
>> Definitely agree here. The decoder complexity is very important. Not
>> only because of the mixing issue, but also because the decoder is generally
>> not allowed to take shortcuts to save on complexity (unlike the
>> encoder). As for compressed-domain mixing, as you say it is not always
>> available, but *if* we can do it (even if only partially), then that can
>> result in a "free" reduction in decoder complexity for mixing.
>>> 2) In topology b) of your other email
>>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or
>>> VoIP gateway, often has to encode and decode thousands of voice
>>> channels in a single box, so not only the computational complexity,
>>> but also the per-instance RAM size requirement of the codec become
>>> very important for achieving high channel density in the gateway.
>> Agreed here, although I would say that per-instance RAM -- as long as
>> it's reasonable -- is probably a bit less important than complexity.
>>> 3) Many telephone terminal devices at the edge of the Internet use
>>> embedded processors with limited processing power, and the processors
>>> also have to handle many tasks other than speech coding. If the IETF
>>> codec complexity is too high, some of such devices may not have
>>> sufficient processing power to run it. Even if the codec can fit, some
>>> battery-powered mobile devices may prefer to run a lower-complexity
>>> codec to reduce power consumption and battery drain. For example, even
>>> if you make an Internet phone call from a computer, you may like the
>>> convenience of using a Bluetooth headset that allows you to walk
>>> around a bit and have hands-free operation. Currently most Bluetooth
>>> headsets have small form factors with a tiny battery. This puts a
>>> severe constraint on power consumption. Bluetooth headset chips
>>> typically have very limited processing capability, and it has to
>>> handle many other tasks such as echo cancellation and noise reduction.
>>> There is just not enough processing power to handle a relatively
>>> high-complexity codec. Most BT headsets today rely on the extremely
>>> low-complexity, hardware-based CVSD codec at 64 kb/s to transmit
>>> narrowband voice, but CVSD has audible coding noise, so it degrades
>>> the overall audio quality. If the IETF codec has low enough
>>> complexity, it would be possible to directly encode and decode the
>>> IETF codec bit-stream at the BT headset, thus avoiding the quality
>>> degradation of CVSD transcoding.
>> Any idea what the complexity requirements would be for this use-case to
>> be possible?
>> Cheers,
>> Jean-Marc

Mikael Abrahamsson    email: