Re: [codec] #16: Multicast?

"Raymond (Juin-Hwey) Chen" <> Sat, 24 April 2010 20:31 UTC

From: "Raymond (Juin-Hwey) Chen" <>
To: Jean-Marc Valin <>
Date: Sat, 24 Apr 2010 13:31:06 -0700
Cc: "" <>, 'stephen botzko' <>
Subject: Re: [codec] #16: Multicast?
List-Id: Codec WG <>

Hi Jean-Marc,

Thanks for your helpful clarification.

I totally agree with you and Mikael Abrahamsson that a single codec frame size is unlikely to be optimal for all applications.  Different applications have very different requirements in delay, complexity, bit-rate, and quality, and thus call for different codec trade-offs to reach an optimal solution.  One size just doesn't fit all.  I believe that's why there are so many different speech and audio codecs out there: no single codec at any single operating point of delay, complexity, bit-rate, and quality can meet the requirements of all possible codec applications.

Given the broad range of applications that the IETF codec is trying to cover, it is clear to me that the IETF codec needs multiple "modes" (or "profiles", as some other standards bodies call them) to cover not only different bit-rates and sampling rates, but also different levels of delay and complexity.  Thus, the IETF codec might need a low-bit-rate mode (or high-bit-rate-efficiency mode), a low-delay mode, a low-complexity mode, and so on.

For example, for dial-up modem users, the low-bit-rate mode will be suitable.  For delay-sensitive applications, such as conference bridging or any Internet phone call that reaches a cell phone through a cellular network (even point-to-point, as opposed to conferencing), the low-delay mode will be suitable.  For devices with very limited processing capability, such as mono Bluetooth telephone headsets, the low-complexity mode will be suitable.  A conference bridge can even use different IETF codec modes to communicate with different devices in the same conference call at the same time.

Of course, sometimes it may be possible or desirable to combine different modes.  For example, we could have a low-delay-high-quality mode for full-band interactive music performance over the Internet, and we could have a low-delay-low-complexity mode when a Bluetooth headset is used to call a conference bridge.
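As a rough illustration only, such mode definitions and a simple selection rule might look like the sketch below (every mode name and number here is hypothetical, not taken from any actual codec specification):

```python
# Illustrative sketch of codec "modes" as discussed above.
# All names and figures are hypothetical, not real IETF codec parameters.

MODES = {
    "low-bitrate":    {"frame_ms": 20, "bitrate_kbps": 16},
    "low-delay":      {"frame_ms": 5,  "bitrate_kbps": 64},
    "low-complexity": {"frame_ms": 20, "bitrate_kbps": 32},
}

def pick_mode(max_frame_ms, max_bitrate_kbps):
    """Return the first mode that fits both constraints, or None."""
    for name, mode in MODES.items():
        if (mode["frame_ms"] <= max_frame_ms
                and mode["bitrate_kbps"] <= max_bitrate_kbps):
            return name
    return None

print(pick_mode(10, 128))  # a delay-sensitive caller -> low-delay
print(pick_mode(40, 20))   # a dial-up user -> low-bitrate
```

A real negotiation would of course involve more dimensions (complexity, sampling rate), but the point is simply that different endpoints in the same call can end up on different modes.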

Regarding your comment that the payload bit-rate should be of the same order as the header bit-rate: although it makes sense intuitively, I think in reality there are situations where it does not.  For instance, in your example of a 5 ms frame/packet size with a 64 kb/s header bit-rate and a 64 kb/s full-band audio bit-rate, what if an IP phone with an 8 kHz sampling rate is used to make the call?  Then all the extra bit-rate spent on audio bandwidth above 3.4 kHz (or 4 kHz at most) is wasted.  Or what if the resulting 128 kb/s is too high for the link, because the link still needs to carry other data streams?  Similarly, for the 20 ms frame size, what if you want to transmit full-band audio and 16 kb/s just won't give you the fidelity you are looking for?  I understand that you are only talking about a "sweet spot" and are not saying the payload and header bit-rates must be equal; I am just saying that in certain situations the sweet spot may actually be somewhere else.
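The arithmetic behind these numbers is easy to check; the sketch below assumes a 40-byte IPv4/UDP/RTP header, one codec frame per packet, and no link-layer overhead:

```python
# Header bit-rate as a function of frame (= packet) duration, assuming a
# 40-byte IPv4/UDP/RTP header and one codec frame per packet.

def header_kbps(frame_ms, header_bytes=40):
    # bytes * 8 bits, divided by milliseconds, gives kilobits per second
    return header_bytes * 8 / frame_ms

print(header_kbps(5))        # 64.0 kb/s of headers at 5 ms packets
print(header_kbps(20))       # 16.0 kb/s of headers at 20 ms packets
print(header_kbps(5) + 64)   # 128.0 kb/s on the wire with a 64 kb/s payload
```

That 128 kb/s total is the figure that may be too high for a shared link, while a narrowband (8 kHz) endpoint cannot make use of the full-band payload anyway.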

Also, I think ideally we should use header compression to minimize the header bit-rate and increase the overall bit-rate efficiency of the system; then there is no "sweet spot" issue as discussed above.  I understand this is only possible when the network gear at both ends supports it, but from a system perspective it makes little sense for speech/audio codec developers to work so hard to squeeze the speech/audio signal into as few bits as possible, only to have the packet header send highly redundant, repetitive bits packet after packet.  There are probably reasons why header compression is not more widely used (which I don't know, since it is outside my area of expertise), but it seems extremely unbalanced and inefficient to send so many unchanged header bits packet after packet, especially in light of the high degree of compression applied to the payload.
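As a rough illustration of the potential savings, the sketch below assumes the 40-byte IPv4/UDP/RTP header compresses to about 3 bytes, a typical steady-state figure for ROHC (RFC 3095); actual results vary with the compressor and the stream:

```python
# Rough estimate of header-compression savings at a 5 ms packet rate,
# assuming a 40-byte uncompressed IPv4/UDP/RTP header shrinks to ~3 bytes
# (an assumed ROHC-like steady-state size, for illustration only).

def overhead_kbps(header_bytes, frame_ms):
    return header_bytes * 8 / frame_ms

uncompressed = overhead_kbps(40, 5)   # 64.0 kb/s
compressed = overhead_kbps(3, 5)      # 4.8 kb/s
print(uncompressed - compressed)      # roughly 59 kb/s saved per stream
```

Under that assumption, nearly the entire 5 ms header penalty disappears, which is why the "sweet spot" argument weakens wherever compression is deployed end to end.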

Best Regards,


-----Original Message-----
From: Jean-Marc Valin []
Sent: Friday, April 23, 2010 7:53 PM
To: Raymond (Juin-Hwey) Chen
Cc: Christian Hoene; 'stephen botzko';
Subject: Re: [codec] #16: Multicast?

Hi Raymond,

I think I may have been a bit ambiguous in my previous email. I am
totally in favor of supporting 5 ms frames. To me, it is becoming clear
that we want to support both 5 ms *and* 20 ms frames. I don't think it
would be too hard to support both.

Regarding my comment about the overhead: my point was not that 64
kb/s of overhead is necessarily unacceptable. My main point is that the
codec's "sweet spot" should probably be such that the payload bit-rate
is of the same order as the header overhead at the selected frame size.
So when operating with 5 ms frames, since you're already paying 64 kb/s
in headers, it's probably worth also having 64 kb/s of payload so that
you can transmit full-band audio. On the other hand, when operating with
20 ms frames where the overhead is 16 kb/s, then a 16 kb/s payload is
probably the sweet spot. That way you scale audio quality at the same
time as you reduce latency.
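That scaling can be sketched numerically, assuming the 40-byte IP/UDP/RTP header implied by the 64 kb/s figure:

```python
# Under the "payload ~ header overhead" rule of thumb above, the suggested
# payload bit-rate at each frame size simply equals the header bit-rate.

def sweet_spot_payload_kbps(frame_ms, header_bytes=40):
    return header_bytes * 8 / frame_ms

for frame_ms in (5, 10, 20):
    print(f"{frame_ms} ms frames -> ~{sweet_spot_payload_kbps(frame_ms):.0f} kb/s payload")
```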

I hope this clears up the misunderstanding.



On 2010-04-23 21:43, Raymond (Juin-Hwey) Chen wrote:
> Hi Jean-Marc,
> I agree that the 20 ms frame size or packet size is more efficient in
> bit-rate.  However, this comment doesn't address my original point on
> the need to have a low-delay IETF codec for the conferencing bridge
> scenario, where the voice signal will travel through the codec twice
> (2 tandems), thus doubling the one-way codec delay.
> As you are well aware, codec design involves many trade-offs
> between the four major attributes of a codec: delay, complexity,
> bit-rate, and quality.  For a given codec architecture, improving one
> attribute normally means sacrificing at least one other attribute.
> Nothing comes for free.  Therefore, yes, to get low delay, you need
> to pay the price of lower bit-rate efficiency, but you can also view
> it another way: to get higher bit-rate efficiency by using a 20 ms
> frame size, you pay the price of a higher codec delay.  The question
> to ask then, is not which frame size is more bit-rate efficient, but
> whether there are application scenarios where a 20 ms frame size will
> simply make the one-way delay way too long and greatly degrade the
> users' communication experience. I believe the answer to the latter
> question is a definite "yes".
> Let's do some math to see why that is so.  Essentially all cellular
> codecs use a frame size of 20 ms, yet the one-way delay of a
> cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times
> the codec frame size.  This is because you have not only the codec
> buffering delay, but also processing delay, transmission delay, and
> delay due to processor sharing under a real-time OS, etc.  An IP phone
> guru told me that for a typical IP phone application, it is also
> quite common to see a one-way delay of 5 times the codec frame size.
> Let's just take 5X codec frame size as the one-way delay of a typical
> implementation.  Then, even if all conference participants use their
> computers to call the conference bridge, if the IETF codec has a
> frame size of 20 ms, then after the voice signal of a talker goes
> through the IETF codec to the bridge, it already takes 100 ms one-way
> delay.  After the bridge decodes all channels, mixes, and re-encodes
> with the IETF codec and sends to every participant, the one-way delay
> is now already up to 200 ms, way more than the 150 ms limit I
> mentioned in my last email.  Now if a talker calls into the conference
> bridge through a cell phone call that has 100 ms one-way delay to the
> edge of the Internet, by the time everyone else hears his voice, it
> is already 300 ms later.  Anyone trying to interrupt that cell phone
> caller will experience the talk-stop-talk-stop problem I mentioned
> before.  Now if another cell phone caller calls into the conference
> bridge, then the one-way delay of his voice to the first cell phone
> caller will be a whopping 400 ms! That would probably make the call
> effectively half-duplex.
> When we talk about "high-quality" conference call, it is much more
> than just the quality or distortion level of the voice signal; the
> one-way delay is also an important and integral part of the perceived
> quality of the communication link.  This is clearly documented and
> well-modeled in the E-model of the ITU-T G.107, and the 150 ms limit,
> beyond which the perceived quality sort of "falls off the cliff", was
> also obtained after careful study by telephony experts at the ITU-T.
> It would be wise for the IETF codec WG to heed the warning of the
> ITU-T experts and keep the one-way delay less than 150 ms.
> In contrast, if the IETF codec has a codec frame size and packet size
> of 5 ms, then the on-the-net one-way conferencing delay is only 50
> ms. Even if you use a longer jitter buffer, the one-way delay is
> still unlikely to go above 100 ms, which is still well within the
> ITU-T's 150 ms guideline.
> True, sending 5 ms packets means the packet header overhead would be
> higher, but that's a small price to pay to enable the conference
> participants to have a high-quality experience by avoiding the
> problems associated with a long one-way delay.  The bit-rate penalty
> is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because you
> don't get zero packet header overhead for a 20 ms frame size, but 16
> kb/s, so 64 - 16 = 48.
> Now, with the exception of a small percentage of Internet users who
> still use dial-up modems, the vast majority of the Internet users
> today connect to the Internet at a speed of at least several hundred
> kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a
> fairly small price to pay for the majority of Internet users when it
> can give them a much better, higher-quality experience with a much
> lower delay.
> Furthermore, it is possible to use header compression technology to
> shrink that 48 kb/s penalty to almost nothing.
> Also, even if a 5 ms packet size is overkill in some situations, a
> codec with a 5 ms frame size can easily pack two frames of
> compressed bit-stream into a 10 ms packet.  Then the packet header
> overhead bit-rate would be 32 kb/s, so the penalty shrinks by a
> factor of 3 from 48 kb/s to 32 - 16 = 16 kb/s. With 10 ms packets,
> the one-way conferencing delay would be 100 ms, still well within the
> 150 ms guideline. (Actually, since the internal "thread rate" of
> real-time OS can still run at 5 ms intervals, the one-way delay can
> be made less than 100 ms, but that's too much detail to go into.) In
> contrast, a codec with a 20 ms frame size cannot send its bit-stream
> with 10 ms packets, unless it spreads each frame into two packets,
> which is what IETF AVT advises against, because it will effectively
> double the packet loss rate.
> The way I see it, for conference bridge applications at least, I
> think it would be a big mistake for IETF to recommend a codec with a
> frame size of 20 ms or higher.  From my analysis above, by doing that
> we will be stuck with too long a delay and the associated problems.
> Best Regards,
> Raymond
> -----Original Message-----
> From: Jean-Marc Valin []
> Sent: Thursday, April 22, 2010 9:05 PM
> To: Raymond (Juin-Hwey) Chen
> Cc: Christian Hoene; 'stephen botzko';
> Subject: Re: [codec] #16: Multicast?
> Hi,
> See my comments below.
>> [Raymond]: High quality is a given, but I would like to emphasize
>> the importance of low latency.
>> (1) It is well-known that the longer the latency, the lower the
>> perceived quality of the communication link. The E-model in the
>> ITU-T Recommendation G.107 models such communication quality in
>> MOS_cqe, which among other things depends on the so-called "delay
>> impairment factor" /Id/. Basically, MOS_cqe is a monotonically
>> decreasing function of increasing latency, and beyond about 150 ms
>> one-way delay, the perceived quality of the communication link
>> drops rapidly with further delay increase.
> As the author of CELT, I obviously agree that latency is an
> important aspect for this codec :-) That being said, I tend to say
> that 20 ms is still the most widely used frame size, so we might as
> well optimise for that. This is not really a problem because as the
> frame size goes down, the overhead of the IP/UDP/RTP headers goes up,
> so the codec bit-rate becomes a bit less of an issue. For example,
> with 5 ms frames, we would already be sending 64 kb/s worth of
> headers (excluding the link layer), so we might as well spend about
> as many bits on the actual payload as we spend on the headers. And
> with 64 kb/s of payload, we can actually have high-quality full-band
> audio.
>> 1) If a conference bridge has to decode a large number of voice
>> channels, mix, and re-encode, and if compressed-domain mixing
>> cannot be done (which is usually the case), then it is important to
>> keep the decoder complexity low.
> Definitely agree here. The decoder complexity is very important. Not
> only because of the mixing issue, but also because the decoder is
> generally not allowed to take shortcuts to save on complexity (unlike
> the encoder). As for compressed-domain mixing, as you say it is not
> always available, but *if* we can do it (even if only partially),
> then that can result in a "free" reduction in decoder complexity for
> mixing.
>> 2) In topology b) of your other email
>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway,
>> or VoIP gateway, often has to encode and decode thousands of voice
>> channels in a single box, so not only the computational
>> complexity, but also the per-instance RAM size requirement of the
>> codec become very important for achieving high channel density in
>> the gateway.
> Agreed here, although I would say that per-instance RAM -- as long
> as it's reasonable -- is probably a bit less important than
> complexity.
>> 3) Many telephone terminal devices at the edge of the Internet use
>> embedded processors with limited processing power, and the
>> processors also have to handle many tasks other than speech coding.
>> If the IETF codec complexity is too high, some of such devices may
>> not have sufficient processing power to run it. Even if the codec
>> can fit, some battery-powered mobile devices may prefer to run a
>> lower-complexity codec to reduce power consumption and battery
>> drain. For example, even if you make an Internet phone call from a
>> computer, you may like the convenience of using a Bluetooth headset
>> that allows you to walk around a bit and have hands-free operation.
>> Currently most Bluetooth headsets have small form factors with a
>> tiny battery. This puts a severe constraint on power consumption.
>> Bluetooth headset chips typically have very limited processing
>> capability, and they have to handle many other tasks such as echo
>> cancellation and noise reduction. There is just not enough
>> processing power to handle a relatively high-complexity codec. Most
>> BT headsets today rely on the extremely low-complexity,
>> hardware-based CVSD codec at 64 kb/s to transmit narrowband voice,
>> but CVSD has audible coding noise, so it degrades the overall audio
>> quality. If the IETF codec has low enough complexity, it would be
>> possible to directly encode and decode the IETF codec bit-stream at
>> the BT headset, thus avoiding the quality degradation of CVSD
>> transcoding.
> Any idea what the complexity requirements would be for this use-case
> to be possible?
> Cheers,
> Jean-Marc