Re: [codec] #16: Multicast?

Quoting "Raymond (Juin-Hwey) Chen"

> An IP phone guru told me that for a typical IP phone application,
> it is also quite common to see a one-way delay of 5 times the codec  
> frame size.

Sure - for certain frame sizes.  But 1 ms frames won't give you 5 ms  
one-way delay.

For a well-designed system and a typical Internet connection:
- most delay comes from the network and is not codec related, and
- one-way delay grows almost linearly with frame size.
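To make those two points concrete, here is a minimal sketch of such a delay model. The numbers are illustrative assumptions of mine, not measurements: a fixed 40 ms network delay and 3 frame-times of codec-related buffering. The only point is that the frame-dependent part scales with frame size while the network part does not.

```python
def one_way_delay_ms(frame_ms, network_ms=40.0, frames_buffered=3.0):
    """Hypothetical model: fixed network delay plus a frame-proportional part.

    network_ms and frames_buffered are assumed values for illustration only.
    """
    return network_ms + frames_buffered * frame_ms


for frame_ms in (1, 5, 10, 20):
    delay = one_way_delay_ms(frame_ms)
    # Ratio of total delay to frame size; a "5x frame size" rule would
    # require this ratio to stay near 5 for every frame size.
    print(frame_ms, delay, delay / frame_ms)
```

Under these assumptions the ratio of delay to frame size is 5 only at 20 ms frames and grows quickly as frames get shorter, which is why a fixed multiple of the frame size does not hold across frame sizes.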

> Furthermore, it is possible to use header compression technology to  
> shrink that 48 kb/s penalty to almost nothing.

AFAIK, only RTP headers can be compressed between arbitrary Internet  
end points.  You're still stuck with IP and UDP headers.
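A rough sketch of what that means for the numbers in this thread. Assumptions: IPv4 without options (20 bytes), UDP (8 bytes), the fixed RTP header (12 bytes), one frame per packet, and a best case where compression removes the RTP header entirely.

```python
IP_BYTES = 20   # IPv4 header without options (assumed)
UDP_BYTES = 8   # UDP header
RTP_BYTES = 12  # fixed RTP header, no CSRC list or extensions (assumed)


def header_kbps(packet_ms, header_bytes):
    """Header overhead in kb/s when one packet is sent every packet_ms."""
    packets_per_second = 1000.0 / packet_ms
    return header_bytes * 8 * packets_per_second / 1000.0


full = IP_BYTES + UDP_BYTES + RTP_BYTES  # 40 bytes
no_rtp = IP_BYTES + UDP_BYTES            # 28 bytes, RTP fully compressed away

for packet_ms in (5, 10, 20):
    print(packet_ms, header_kbps(packet_ms, full), header_kbps(packet_ms, no_rtp))

# Penalty of 5 ms packets relative to 20 ms packets, with and without RTP headers.
penalty_full = header_kbps(5, full) - header_kbps(20, full)
penalty_no_rtp = header_kbps(5, no_rtp) - header_kbps(20, no_rtp)
print(round(penalty_full, 1), round(penalty_no_rtp, 1))
```

With uncompressed headers this reproduces the 64, 32 and 16 kb/s figures quoted below and the 48 kb/s penalty. With only the RTP header removed, the penalty for 5 ms packets is still about 33.6 kb/s, not "almost nothing".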

best,
koen.

Quoting "Raymond (Juin-Hwey) Chen" <rchen@broadcom.com>:
> Hi Jean-Marc,
>
> I agree that the 20 ms frame size or packet size is more efficient  
> in bit-rate.  However, this comment doesn't address my original  
> point on the need to have a low-delay IETF codec for the  
> conferencing bridge scenario, where the voice signal will travel  
> through the codec twice (2 tandems), thus doubling the one-way codec  
> delay.
>
> As you are well aware, codec design involves many trade-offs  
> between the four major attributes of a codec: delay, complexity,  
> bit-rate, and quality.  For a given codec architecture, improving  
> one attribute normally means sacrificing at least one other  
> attribute.  Nothing comes for free.  Therefore, yes, to get low  
> delay, you need to pay the price of lower bit-rate efficiency, but  
> you can also view it another way: to get higher bit-rate efficiency  
> by using a 20 ms frame size, you pay the price of a higher codec  
> delay.  The question to ask then, is not which frame size is more  
> bit-rate efficient, but whether there are application scenarios  
> where a 20 ms frame size will simply make the one-way delay way too  
> long and greatly degrade the users' communication experience. I  
> believe the answer to the latter question is a definite "yes".
>
> Let's do some math to see why that is so.  Essentially all cellular  
> codecs use a frame size of 20 ms, yet the one-way delay of a  
> cell-to-landline call is typically 80 to 110 ms, or 4 to 5.5 times  
> the codec frame size.  This is because you have not only the codec  
> buffering delay, but also processing delay, transmission delay, and  
> delay due to processor sharing using a real-time OS, etc.  An IP phone  
> guru told me that for a typical IP phone application, it is also  
> quite common to see a one-way delay of 5 times the codec frame size.
> Let's just take 5X the codec frame size as the one-way delay of a
> typical implementation.  Then, even if all conference participants  
> use their computers to call the conference bridge, if the IETF codec  
> has a frame size of 20 ms, then after the voice signal of a talker  
> goes through the IETF codec to the bridge, it already takes 100 ms  
> one-way delay.  After the bridge decodes all channels, mixes,
> re-encodes with the IETF codec, and sends to every participant, the
> one-way delay is now already up to 200 ms, way more than the 150 ms
> limit I mentioned in my last email.  Now if a talker calls into the
> conference bridge through a cell phone call that has 100 ms
> one-way delay to the edge of the Internet, by the time everyone else  
> hears his voice, it is already 300 ms later.  Anyone trying to  
> interrupt that cell phone caller will experience the  
> talk-stop-talk-stop problem I mentioned before.  Now if another cell
> phone caller calls into the conference bridge, then the one-way delay
> of his voice to the first cell phone caller will be a whopping 400  
> ms! That would probably turn it into half-duplex effectively.
>
> When we talk about "high-quality" conference call, it is much more  
> than just the quality or distortion level of the voice signal; the  
> one-way delay is also an important and integral part of the  
> perceived quality of the communication link.  This is clearly  
> documented and well-modeled in the E-model of the ITU-T G.107, and  
> the 150 ms limit, beyond which the perceived quality sort of "falls  
> off the cliff", was also obtained after careful study by telephony  
> experts at the ITU-T.  It would be wise for the IETF codec WG to  
> heed the warning of the ITU-T experts and keep the one-way delay  
> less than 150 ms.
>
> In contrast, if the IETF codec has a codec frame size and packet  
> size of 5 ms, then the on-the-net one-way conferencing delay is only  
> 50 ms. Even if you use a longer jitter buffer, the one-way delay is  
> still unlikely to go above 100 ms, which is still well within the  
> ITU-T's 150 ms guideline.
>
> True, sending 5 ms packets means the packet header overhead would be  
> higher, but that's a small price to pay to enable the conference  
> participants to have a high-quality experience by avoiding the  
> problems associated with a long one-way delay.  The bit-rate penalty  
> is not 64 kb/s as you said, but 3/4 of that, or 48 kb/s, because you  
> don't get zero packet header overhead for a 20 ms frame size, but 16  
> kb/s, so 64 - 16 = 48.
>
> Now, with the exception of a small percentage of Internet users who  
> still use dial-up modems, the vast majority of the Internet users  
> today connect to the Internet at a speed of at least several hundred  
> kb/s, and most are in the Mbps range.  A 48 kb/s penalty is really a  
> fairly small price to pay for the majority of Internet users when it  
> can give them a much better high-quality experience with a much
> lower delay.
>
> Furthermore, it is possible to use header compression technology to  
> shrink that 48 kb/s penalty to almost nothing.
>
> Also, even if a 5 ms packet size is an overkill in some situations,  
> a codec with a 5 ms frame size can easily pack two frames of  
> compressed bit-stream into a 10 ms packet.  Then the packet header  
> overhead bit-rate would be 32 kb/s, so the penalty shrinks by a  
> factor of 3 from 48 kb/s to 32 - 16 = 16 kb/s. With 10 ms packets,  
> the one-way conferencing delay would be 100 ms, still well within  
> the 150 ms guideline. (Actually, since the internal "thread rate" of  
> real-time OS can still run at 5 ms intervals, the one-way delay can  
> be made less than 100 ms, but that's too much detail to go into.) In  
> contrast, a codec with a 20 ms frame size cannot send its bit-stream  
> with 10 ms packets, unless it spreads each frame into two packets,  
> which is what IETF AVT advises against, because it will effectively  
> double the packet loss rate.
>
> The way I see it, for conference bridge applications at least, I  
> think it would be a big mistake for IETF to recommend a codec with a  
> frame size of 20 ms or higher.  From my analysis above, by doing  
> that we will be stuck with too long a delay and the associated  
> problems.
>
> Best Regards,
>
> Raymond
>
> -----Original Message-----
> From: Jean-Marc Valin [mailto:jean-marc.valin@usherbrooke.ca]
> Sent: Thursday, April 22, 2010 9:05 PM
> To: Raymond (Juin-Hwey) Chen
> Cc: Christian Hoene; 'stephen botzko'; codec@ietf.org
> Subject: Re: [codec] #16: Multicast?
>
> Hi,
>
> See my comments below.
>
>> [Raymond]: High quality is a given, but I would like to emphasize the
>> importance of low latency.
>>
>> (1) It is well-known that the longer the latency, the lower the
>> perceived quality of the communication link. The E-model in the ITU-T
>> Recommendation G.107 models such communication quality in MOS_cqe,
>> which among other things depends on the so-called "delay impairment
>> factor" /Id/. Basically, MOS_cqe is a monotonically decreasing
>> function of increasing latency, and beyond about 150 ms one-way delay,
>> the perceived quality of the communication link drops rapidly with
>> further delay increase.
>>
>
> As the author of CELT, I obviously agree that latency is an important
> aspect for this codec :-) That being said, I tend to say that 20 ms is
> still the most widely used frame size, so we might as well optimise for
> that. This is not really a problem because as the frame size goes down,
> the overhead of the IP/UDP/RTP headers goes up, so the codec bit-rate
> becomes a bit less of an issue. For example, with 5 ms frames, we would
> already be sending 64 kb/s worth of headers (excluding the link layer),
> so we might as well spend about as many bits on the actual payload as we
> spend on the headers. And with 64 kb/s of payload, we can actually have
> high-quality full-band audio.
>
>> 1) If a conference bridge has to decode a large number of voice
>> channels, mix, and re-encode, and if compressed-domain mixing cannot
>> be done (which is usually the case), then it is important to keep the
>> decoder complexity low.
>
> Definitely agree here. The decoder complexity is very important. Not
> only because of the mixing issue, but also because the decoder is generally
> not allowed to take shortcuts to save on complexity (unlike the
> encoder). As for compressed-domain mixing, as you say it is not always
> available, but *if* we can do it (even if only partially), then that can
> result in a "free" reduction in decoder complexity for mixing.
>
>> 2) In topology b) of your other email
>> (IPend-to-transcoding_gateway-to-PSTNend), the transcoding gateway, or
>> VoIP gateway, often has to encode and decode thousands of voice
>> channels in a single box, so not only the computational complexity,
>> but also the per-instance RAM size requirement of the codec becomes
>> very important for achieving high channel density in the gateway.
>>
>
> Agreed here, although I would say that per-instance RAM -- as long as
> it's reasonable -- is probably a bit less important than complexity.
>
>> 3) Many telephone terminal devices at the edge of the Internet use
>> embedded processors with limited processing power, and the processors
>> also have to handle many tasks other than speech coding. If the IETF
>> codec complexity is too high, some of such devices may not have
>> sufficient processing power to run it. Even if the codec can fit, some
>> battery-powered mobile devices may prefer to run a lower-complexity
>> codec to reduce power consumption and battery drain. For example, even
>> if you make an Internet phone call from a computer, you may like the
>> convenience of using a Bluetooth headset that allows you to walk
>> around a bit and have hands-free operation. Currently most Bluetooth
>> headsets have small form factors with a tiny battery. This puts a
>> severe constraint on power consumption. Bluetooth headset chips
>> typically have very limited processing capability, and they have to
>> handle many other tasks such as echo cancellation and noise reduction.
>> There is just not enough processing power to handle a relatively
>> high-complexity codec. Most BT headsets today rely on the extremely
>> low-complexity, hardware-based CVSD codec at 64 kb/s to transmit
>> narrowband voice, but CVSD has audible coding noise, so it degrades
>> the overall audio quality. If the IETF codec has low enough
>> complexity, it would be possible to directly encode and decode the
>> IETF codec bit-stream at the BT headset, thus avoiding the quality
>> degradation of CVSD transcoding.
>>
>
> Any idea what the complexity requirements would be for this use-case to
> be possible?
>
> Cheers,
>
> Jean-Marc
>
>
>
> _______________________________________________
> codec mailing list
> codec@ietf.org
> https://www.ietf.org/mailman/listinfo/codec
>