Re: [codec] #16: Multicast?

stephen botzko <> Sat, 24 April 2010 23:37 UTC

Date: Sat, 24 Apr 2010 19:36:51 -0400
From: stephen botzko <>
To: Koen Vos <>
Subject: Re: [codec] #16: Multicast?

   Sure - for certain frame sizes.  But 1 ms frames won't give you 5 ms
one-way delay.
Agreed, this delay model is too simplistic to be useful.

There's algorithmic delay (including framing) + flight time + dejittering.
Flight time depends on the network path, not on the frame size.  And the
amount of jitter is due principally to congestion from cross-traffic.
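A toy sketch of that three-part model (the numbers are purely illustrative assumptions, not measurements; the point is that only the first term scales with the codec frame size):

```python
# Sketch of the one-way delay decomposition discussed above:
# algorithmic delay (framing + lookahead) + flight time + dejittering.
# All inputs are hypothetical example values.

def one_way_delay_ms(frame_ms, lookahead_ms, flight_ms, jitter_buffer_ms):
    """Return total one-way delay in milliseconds.

    Only (frame_ms + lookahead_ms) depends on the codec; flight time is
    a property of the network path, and the jitter buffer absorbs the
    variation introduced mostly by cross-traffic.
    """
    algorithmic_ms = frame_ms + lookahead_ms
    return algorithmic_ms + flight_ms + jitter_buffer_ms

# 20 ms frames, 5 ms lookahead, 40 ms flight, 30 ms jitter buffer
print(one_way_delay_ms(20, 5, 40, 30))  # 95

# Shrinking the frame to 1 ms removes only the algorithmic term,
# which is why 1 ms frames cannot yield 5 ms of one-way delay.
print(one_way_delay_ms(1, 0, 40, 30))  # 71
```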

Stephen Botzko

On Sat, Apr 24, 2010 at 4:56 PM, Koen Vos <> wrote:

> Quoting "Raymond (Juin-Hwey) Chen"
>> An IP phone guru told me that for a typical IP phone application, it is
>> also quite common to see a one-way delay of 5 times the codec frame size.
> Sure - for certain frame sizes.  But 1 ms frames won't give you 5 ms
> one-way delay.
> For a well-designed system and a typical Internet connection:
> - most delay comes from the network and is not codec related, and
> - one-way delay grows almost linearly with frame size.
>> Furthermore, it is possible to use header compression technology to shrink
>> that 48 kb/s penalty to almost nothing.
> Afaik, only RTP headers can be compressed between arbitrary Internet end
> points.  You're still stuck with IP and UDP headers.
> best,
> koen.
> Quoting "Raymond (Juin-Hwey) Chen" <>:
>> Hi Jean-Marc,
>> I agree that the 20 ms frame size or packet size is more efficient in
>> bit-rate.  However, this comment doesn't address my original point on the
>> need to have a low-delay IETF codec for the conferencing bridge scenario,
>> where the voice signal will travel through the codec twice (2 tandems), thus
>> doubling the one-way codec delay.
>> As you are well aware of, codec design involves many trade-offs between
>> the four major attributes of a codec: delay, complexity, bit-rate, and
>> quality.  For a given codec architecture, improving one attribute normally
>> means sacrificing at least one other attribute.  Nothing comes for free.
>>  Therefore, yes, to get low delay, you need to pay the price of lower
>> bit-rate efficiency, but you can also view it another way: to get higher
>> bit-rate efficiency by using a 20 ms frame size, you pay the price of a
>> higher codec delay.  The question to ask then, is not which frame size is
>> more bit-rate efficient, but whether there are application scenarios where a
>> 20 ms frame size will simply make the one-way delay way too long and greatly
>> degrade the users' communication experience. I believe the answer to the
>> latter question is a definite "yes".
>> Let's do some math to see why that is so.  Essentially all cellular codecs
>> use a frame size of 20 ms, yet the one-way delay of a cell-to-landline call
>> is typically 80 to 110 ms, or 4 to 5.5 times the codec frame size.  This is
>> because you have not only the codec buffering delay, but also processing
>> delay, transmission delay, and delay due to processor sharing using
>> real-time OS, etc.  An IP phone guru told me that for a typical IP phone
>> application, it is also quite common to see a one-way delay of 5 times the
>> codec frame size.  Let's just take 5X codec frame size as the one-way delay
>> of a typical implementation.  Then, even if all conference participants use
>> their computers to call the conference bridge, if the IETF codec has a frame
>> size of 20 ms, then after the voice signal of a talker goes through the IETF
>> codec to the bridge, it already takes 100 ms one-way delay.  After the
>> bridge decodes all channels, mixes, re-encodes with the IETF codec, and
>> sends it to every participant, the one-way delay is now already up to 200
>> ms, way more than the 150
>> ms limit I mentioned in my last email.  Now if a talker calls into the
>> conference bridge through a cell phone call that has 100 ms one-way delay to
>> the edge of the Internet, by the time everyone else hears his voice, it is
>> already 300 ms later.  Anyone trying to interrupt that cell phone caller
>> will experience the talk-stop-talk-stop problem I mentioned before.  Now if
>> another cell phone caller calls into the conference bridge, then the one-way
>> delay of his voice to the first cell phone caller will be a whopping 400 ms!
>> That would probably make the conversation effectively half-duplex.
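Raymond's arithmetic above can be sketched as follows (the 5x-frame-size rule of thumb and the 100 ms cell leg are his stated assumptions, not measurements):

```python
# Sketch of the conference-bridge delay arithmetic in the email above.
# Assumption from the text: one-way delay of each codec leg is roughly
# 5x the codec frame size; a cell phone leg adds ~100 ms to the edge.

FRAME_MS = 20
LEG_MS = 5 * FRAME_MS   # one codec pass (endpoint <-> bridge): 100 ms
CELL_MS = 100           # assumed cell leg delay to the Internet edge

# Two tandems through the bridge: talker -> bridge, bridge -> listener.
pc_to_pc = 2 * LEG_MS

# One participant on a cell phone adds one cell leg.
cell_to_pc = CELL_MS + 2 * LEG_MS

# Cell phone to cell phone adds a cell leg at both ends.
cell_to_cell = 2 * CELL_MS + 2 * LEG_MS

print(pc_to_pc, cell_to_pc, cell_to_cell)  # 200 300 400
```

All three figures exceed the 150 ms guideline cited from ITU-T G.107.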
>> When we talk about "high-quality" conference call, it is much more than
>> just the quality or distortion level of the voice signal; the one-way delay
>> is also an important and integral part of the perceived quality of the
>> communication link.  This is clearly documented and well-modeled in the
>> E-model of the ITU-T G.107, and the 150 ms limit, beyond which the perceived
>> quality sort of "falls off the cliff", was also obtained after careful study
>> by telephony experts at the ITU-T.  It would be wise for the IETF codec WG
>> to heed the warning of the ITU-T experts and keep the one-way delay less
>> than 150 ms.
>> In contrast, if the IETF codec has a codec frame size and packet size of 5
>> ms, then the on-the-net one-way conferencing delay is only 50 ms. Even if
>> you use a longer jitter buffer, the one-way delay is still unlikely to go
>> above 100 ms, which is still well within the ITU-T's 150 ms guideline.
>> True, sending 5 ms packets means the packet header overhead would be
>> higher, but that's a small price to pay to enable the conference
>> participants to have a high-quality experience by avoiding the problems
>> associated with a long one-way delay.  The bit-rate penalty is not 64 kb/s
>> as you said, but 3/4 of that, or 48 kb/s, because you don't get zero packet
>> header overhead for a 20 ms frame size, but 16 kb/s, so 64 - 16 = 48.
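The 64/16/48 kb/s figures follow from the usual 40 bytes of IP+UDP+RTP headers per packet; a quick sketch (assuming IPv4 with no link-layer overhead and no header compression):

```python
# Sketch of the header-overhead arithmetic: 40 bytes of headers per
# packet (20 IPv4 + 8 UDP + 12 RTP), no link layer, no compression.

HEADER_BITS = (20 + 8 + 12) * 8  # 320 bits per packet

def header_rate_kbps(packet_ms):
    """Header bit-rate in kb/s for a given packetization interval."""
    packets_per_sec = 1000 / packet_ms
    return packets_per_sec * HEADER_BITS / 1000

print(header_rate_kbps(5))    # 64.0 kb/s for 5 ms packets
print(header_rate_kbps(20))   # 16.0 kb/s for 20 ms packets
print(header_rate_kbps(10))   # 32.0 kb/s for 10 ms packets

# The penalty of 5 ms vs 20 ms packets is the difference:
print(header_rate_kbps(5) - header_rate_kbps(20))   # 48.0 kb/s
```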
>> Now, with the exception of a small percentage of Internet users who still
>> use dial-up modems, the vast majority of the Internet users today connect to
>> the Internet at a speed of at least several hundred kb/s, and most are in
>> the Mbps range.  A 48 kb/s penalty is really a fairly small price to pay for
>> the majority of Internet users when it can give them a much better
>> high-quality experience with a much lower delay.
>> Furthermore, it is possible to use header compression technology to shrink
>> that 48 kb/s penalty to almost nothing.
>> Also, even if a 5 ms packet size is overkill in some situations, a
>> codec with a 5 ms frame size can easily pack two frames of compressed
>> bit-stream into a 10 ms packet.  Then the packet header overhead bit-rate
>> would be 32 kb/s, so the penalty shrinks by a factor of 3 from 48 kb/s to 32
>> - 16 = 16 kb/s. With 10 ms packets, the one-way conferencing delay would be
>> 100 ms, still well within the 150 ms guideline. (Actually, since the
>> internal "thread rate" of real-time OS can still run at 5 ms intervals, the
>> one-way delay can be made less than 100 ms, but that's too much detail to go
>> into.) In contrast, a codec with a 20 ms frame size cannot send its
>> bit-stream with 10 ms packets, unless it spreads each frame into two
>> packets, which is what IETF AVT advises against, because it will effectively
>> double the packet loss rate.
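The "doubles the packet loss rate" point can be made concrete: if one frame is spread across two packets, the frame is lost whenever either packet is lost. A sketch, assuming independent packet losses:

```python
# Sketch: effective frame loss rate when a frame spans multiple packets,
# assuming independent packet losses (an idealizing assumption).

def frame_loss(p_packet, packets_per_frame):
    # The frame is unusable if ANY of its packets is lost.
    return 1 - (1 - p_packet) ** packets_per_frame

p = 0.01  # 1% packet loss
print(frame_loss(p, 1))   # ~0.01: one packet per frame
print(frame_loss(p, 2))   # ~0.0199, roughly 2*p for small p
```

For small loss rates, 1 - (1 - p)^2 = 2p - p^2 ≈ 2p, which is the doubling AVT warns about.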
>> The way I see it, for conference bridge applications at least, I think it
>> would be a big mistake for IETF to recommend a codec with a frame size of 20
>> ms or higher.  From my analysis above, by doing that we will be stuck with
>> too long a delay and the associated problems.
>> Best Regards,
>> Raymond
>> -----Original Message-----
>> From: Jean-Marc Valin []
>> Sent: Thursday, April 22, 2010 9:05 PM
>> To: Raymond (Juin-Hwey) Chen
>> Cc: Christian Hoene; 'stephen botzko';
>> Subject: Re: [codec] #16: Multicast?
>> Hi,
>> See my comments below.
>>> [Raymond]: High quality is a given, but I would like to emphasize the
>>> importance of low latency.
>>> (1) It is well-known that the longer the latency, the lower the
>>> perceived quality of the communication link. The E-model in the ITU-T
>>> Recommendation G.107 models such communication quality in MOS_cqe,
>>> which among other things depends on the so-called "delay impairment
>>> factor" /Id/. Basically, MOS_cqe is a monotonically decreasing
>>> function of increasing latency, and beyond about 150 ms one-way delay,
>>> the perceived quality of the communication link drops rapidly with
>>> further delay increase.
>> As the author of CELT, I obviously agree that latency is an important
>> aspect for this codec :-) That being said, I tend to say that 20 ms is
>> still the most widely used frame size, so we might as well optimise for
>> that. This is not really a problem because as the frame size goes down,
>> the overhead of the IP/UDP/RTP headers goes up, so the codec bit-rate
>> becomes a bit less of an issue. For example, with 5 ms frames, we would
>> already be sending 64 kb/s worth of headers (excluding the link layer),
>> so we might as well spend about as many bits on the actual payload as we
>> spend on the headers. And with 64 kb/s of payload, we can actually have
>> high-quality full-band audio.
>>> 1) If a conference bridge has to decode a large number of voice
>>> channels, mix, and re-encode, and if compressed-domain mixing cannot
>>> be done (which is usually the case), then it is important to keep the
>>> decoder complexity low.
>> Definitely agree here. The decoder complexity is very important. Not
>> only because of mixing issue, but also because the decoder is generally
>> not allowed to take shortcuts to save on complexity (unlike the
>> encoder). As for compressed-domain mixing, as you say it is not always
>> available, but *if* we can do it (even if only partially), then that can
>> result in a "free" reduction in decoder complexity for mixing.
>>> 2) In topology b) of your other email
>>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or
>>> VoIP gateway, often has to encode and decode thousands of voice
>>> channels in a single box, so not only the computational complexity,
>>> but also the per-instance RAM size requirement of the codec become
>>> very important for achieving high channel density in the gateway.
>> Agreed here, although I would say that per-instance RAM -- as long as
>> it's reasonable -- is probably a bit less important than complexity.
>>> 3) Many telephone terminal devices at the edge of the Internet use
>>> embedded processors with limited processing power, and the processors
>>> also have to handle many tasks other than speech coding. If the IETF
>>> codec complexity is too high, some of such devices may not have
>>> sufficient processing power to run it. Even if the codec can fit, some
>>> battery-powered mobile devices may prefer to run a lower-complexity
>>> codec to reduce power consumption and battery drain. For example, even
>>> if you make an Internet phone call from a computer, you may like the
>>> convenience of using a Bluetooth headset that allows you to walk
>>> around a bit and have hands-free operation. Currently most Bluetooth
>>> headsets have small form factors with a tiny battery. This puts a
>>> severe constraint on power consumption. Bluetooth headset chips
>>> typically have very limited processing capability, and they have to
>>> handle many other tasks such as echo cancellation and noise reduction.
>>> There is just not enough processing power to handle a relatively
>>> high-complexity codec. Most BT headsets today rely on the extremely
>>> low-complexity, hardware-based CVSD codec at 64 kb/s to transmit
>>> narrowband voice, but CVSD has audible coding noise, so it degrades
>>> the overall audio quality. If the IETF codec has low enough
>>> complexity, it would be possible to directly encode and decode the
>>> IETF codec bit-stream at the BT headset, thus avoiding the quality
>>> degradation of CVSD transcoding.
>> Any idea what the complexity requirements would be for this use-case to
>> be possible?
>> Cheers,
>> Jean-Marc
>> _______________________________________________
>> codec mailing list