[codec] #33: Impact of transmission delay

"codec issue tracker" <trac@tools.ietf.org> Mon, 24 May 2010 14:16 UTC

Return-Path: <trac@tools.ietf.org>
X-Original-To: codec@core3.amsl.com
Delivered-To: codec@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 32D883A6E9D for <codec@core3.amsl.com>; Mon, 24 May 2010 07:16:02 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -101.055
X-Spam-Level:
X-Spam-Status: No, score=-101.055 tagged_above=-999 required=5 tests=[AWL=-1.055, BAYES_50=0.001, NO_RELAYS=-0.001, USER_IN_WHITELIST=-100]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id FkCYGUkavo7j for <codec@core3.amsl.com>; Mon, 24 May 2010 07:15:54 -0700 (PDT)
Received: from zinfandel.tools.ietf.org (unknown [IPv6:2001:1890:1112:1::2a]) by core3.amsl.com (Postfix) with ESMTP id C11383A6AC1 for <codec@ietf.org>; Mon, 24 May 2010 07:15:54 -0700 (PDT)
Received: from localhost ([::1] helo=zinfandel.tools.ietf.org) by zinfandel.tools.ietf.org with esmtp (Exim 4.71) (envelope-from <trac@tools.ietf.org>) id 1OGYRX-00088D-Dr; Mon, 24 May 2010 07:15:47 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: codec issue tracker <trac@tools.ietf.org>
X-Trac-Version: 0.11.7
Precedence: bulk
Auto-Submitted: auto-generated
X-Mailer: Trac 0.11.7, by Edgewall Software
To: hoene@uni-tuebingen.de
X-Trac-Project: codec
Date: Mon, 24 May 2010 14:15:47 -0000
X-URL: http://tools.ietf.org/codec/
X-Trac-Ticket-URL: http://trac.tools.ietf.org/wg/codec/trac/ticket/33
Message-ID: <062.f33215ed513b5540d72cce71ceca2a9a@tools.ietf.org>
X-Trac-Ticket-ID: 33
X-SA-Exim-Connect-IP: ::1
X-SA-Exim-Rcpt-To: hoene@uni-tuebingen.de, codec@ietf.org
X-SA-Exim-Mail-From: trac@tools.ietf.org
X-SA-Exim-Scanned: No (on zinfandel.tools.ietf.org); SAEximRunCond expanded to false
Cc: codec@ietf.org
Subject: [codec] #33: Impact of transmission delay
X-BeenThere: codec@ietf.org
X-Mailman-Version: 2.1.9
Reply-To: codec@ietf.org
List-Id: Codec WG <codec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/codec>
List-Post: <mailto:codec@ietf.org>
List-Help: <mailto:codec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 24 May 2010 14:16:02 -0000

#33: Impact of transmission delay
------------------------------------+---------------------------------------
 Reporter:  hoene@…                 |       Owner:     
     Type:  defect                  |      Status:  new
 Priority:  major                   |   Milestone:     
Component:  requirements            |     Version:     
 Severity:  -                       |    Keywords:     
------------------------------------+---------------------------------------
 [Koen]:
 For typical VoIP applications, Moore's law has lessened the pressure to
 reduce bitrates, delay and complexity, and has shifted the focus to
 fidelity instead.

 [Benjamin]:
 I think this is a typo, and you mean "lessened the pressure to reduce
 bitrates and complexity, and has shifted the focus to fidelity and delay
 instead".

 [Koen]:
 Not a typo: codecs have become more wasteful with delay, while delivering
 better fidelity.  G.718 evolved out of AMR-WB and has more than twice the
 delay.  Same for G.729.1 versus G.729.  This is not by accident.

 The main rationale for codec delay being less important today is that
 faster hardware has reduced end-to-end delay in every step along the way.
 As a result, a typical VoIP connection now operates at a flatter part of
 the "impairment-vs-delay" curve, meaning that reducing delay by N ms at a
 given fidelity gives a smaller improvement to end users today than it did
 some years ago.  Therefore, the weight on minimizing delay in the "codec
 design problem" has gone down, and the optimum codec operating point has
 naturally shifted towards higher delay, in favor of fidelity.
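
 As a toy sketch of that shift (all curve shapes below are assumptions
 made up purely to illustrate the argument, not measured data): if
 fidelity grows concavely with codec delay d while the delay penalty
 grows convexly with the total delay network + d, the best d moves up
 as the network delay shrinks.

     # Toy model: utility(d) = fidelity(d) - penalty(network + d).
     # Both curve shapes are assumptions chosen only for illustration.
     import math

     def utility(d, network_ms):
         fidelity = math.log(1 + d)              # assumed concave fidelity gain
         penalty = 0.01 * (network_ms + d)**1.5  # assumed convex delay penalty
         return fidelity - penalty

     for network in (150.0, 50.0):
         best = max(range(5, 60), key=lambda d: utility(d, network))
         print(network, best)  # optimum codec delay rises as network delay falls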

 I've mentioned before that average delay on Internet connections seems to
 be 40% to 50% lower now than just 5 years ago, which is just one
 contributor to lower end-to-end delay.  That doesn't mean high-delay
 connections don't exist - they do, for instance over dial-up or 3G.
 But in those cases it's still better to use a moderate packet rate (and
 bitrate), to minimize congestion risk.

 The confusion may come from the fact that the trade-off between fidelity
 and delay changes towards high quality levels: once fidelity saturates,
 delay gets priority.  Even more so because such high fidelity enables new,
 delay-sensitive applications like distributed music performances.  This is
 reflected in the ultra-low delay requirements in the requirements
 document.

 To summarize, the case for using sub-20 ms frame sizes with medium-
 fidelity quality is now weaker than ever, because the relative importance
 of fidelity has gone up.

 [Christian]:
 May I present some results from ITU-T SG12 on the perceptual effects
 of delay?
 For many years, it was assumed that 150 ms is the boundary for
 interactive voice conversations (see Nobuhiko Kitawaki and Kenzo
 Itoh: Pure Delay Effects on Speech Quality in Telecommunications,
 IEEE J. on Selected Areas in Commun., Vol. 9, No. 4, pp. 586-593, May
 1991).  Up to 400 ms, quality is still acceptable (about toll
 quality).  The ITU-T G.107 quality model reflects this view.
 However, in recent years, new results have shown that the impact of
 delay on conversational quality is NOT as strong as assumed.  At the
 ITU-T, numerous contributions have been made on this issue:
 Contribution of BT, “Comparison of E-Model and subjective test data
 for pure-delay conditions”, from 2007-01-08:
 Link: http://www.itu.int/md/T05-SG12-C-0030/en
 The conversational tests were done in controlled environments with
 nine pairs of subjects.  Each pair had the common task of sorting
 their sets of pictures into the same order.  Other conditions: no
 echoes, G.711, no frame loss.

 [PICTURE at http://www.ietf.org/mail-
 archive/web/codec/current/msg01588.html]

 Legend:
 MOS-CQS are results of subjective conversational tests.
 MOS-CQE is the prediction of the E-model (G.107).
 MOS-LQO are results from PESQ.
 The delay is a one-way delay.  Besides MOS values, they also studied
 the subjective rating of percentage difficulty (%D).  Starting at
 about 150 ms, it goes up and reaches 35% at 900 ms; after that it
 remains constant.

 Also, LM Ericsson described very interesting results in
 “Investigation of the influence of pure delay, packet loss and
 audio-video synchronization for different conversation tasks” from
 2007-09-24:
 http://www.itu.int/md/T05-SG12-C-0119/en
 For example: they did conversational tests similar to ITU-T P.805.
 Each conversation lasted about 3 to 5 minutes.  Eleven pairs of
 experts took part.

 [PICTURE at http://www.ietf.org/mail-
 archive/web/codec/current/msg01588.html]

 The tasks at 160 ms were completed about 50 s faster than the same
 tasks at 600 ms.

 In the second test, about 60 naïve subjects and experts took part in
 solving a conversational task.


 If they were asked about interactivity, the ratings look worse.


 Overall, it seems that the limit of 150 ms is greatly overestimated.
 A much more relaxed timing is allowed.

 [Benjamin]:

 (1) The results conflict with common sense.  A round-trip delay of 800 ms
 makes normal conversation extremely irritating in practice.  I'm not
 surprised these results don't show up in laboratory tests, because fast
 conversations with interjections and rapid responses typically require a
 social context not available in a lab test.

 It's possible that the ITU regards "extremely irritating" as "acceptable",
 since effective conversation is still possible.  In that case, I would say
 that the working group intends to enable applications with much better
 than "acceptable" quality.

 (2) Tests may have been done in G.711 narrowband, which introduces its own
 intelligibility problems and reduces quality expectation.  Higher fidelity
 makes latency more apparent.  Similarly, the equipment used may have
 introduced quality impairments that make the delay merely one problem
 among many.

 (3) I presume the tests were done with careful equipment setup to avoid
 echo.  The perceived quality impact of echo at 200 ms one-way delay is
 enormous, as shown in

 http://downloads.hindawi.com/journals/asp/2008/185248.pdf

 Using an echo-canceller impairs quality significantly.  Imperfect echo
 cancellation leaves some residual artifact, which is also irritating at
 long delays.

 The tests (even in the paper above) were performed using a telephone
 handset and earpiece.  High-quality telephony with a freestanding speaker
 instead of an earpiece demands especially low delay due to the
 difficulties with echo cancellation.

 [Marshall]:

 This depends a lot on what sort of discussion is at issue (and also
 on the culture of the participants).

 For example, in my experience telepresence sessions tend to be
 structured meetings and can typically tolerate even half-second
 delays without too much disruption, while for a one-on-one
 conversation on the same equipment the same delay can be pretty
 objectionable.

 Having said that, I myself also find the previously attached graphs a
 little odd, and want to see a written description of just what sort of
 experiments they describe.

 [Brian]:
 I agree with this.  I was in a group that did some research on this
 (unpublished, unfortunately) and we confirmed that there is a cliff,
 around 500 ms round trip, after which conversation is impaired.  It
 is remarkably consistent, is more or less independent of culture
 (with one interesting exception), and is really a cliff: below it,
 further improvement is hard to notice; above it, conversation is
 impaired; and the difference between, say, 750 and 1500 ms isn't all
 that significant.

 Engineers who believe delay is a "less is better" quantity need to be
 educated that it is not.  It is a threshold.

 [JM]:
 Considering that the network delay is not a constant, you no longer have
 an absolute cliff. So reducing the delay means you can increase the
 distance without falling off the cliff.
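
 To put rough numbers on that (a toy budget; every figure below is an
 assumption, not a measurement): signals in fibre travel roughly
 200 km per ms, so each millisecond saved in the codec buys extra
 distance before a fixed one-way budget is exceeded.

     # Hypothetical one-way delay budget; all numbers are assumptions.
     budget_ms = 250.0   # half of the ~500 ms round-trip cliff above
     other_ms = 60.0     # assumed jitter buffer + processing + access delay
     for codec_ms in (25.0, 42.875):  # e.g. a low-delay vs a normal mode
         headroom_ms = budget_ms - other_ms - codec_ms
         print(codec_ms, headroom_ms * 200.0)  # max fibre span in km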

 [Benjamin]:
 One test in that paper told trained subjects to "Take turns reading random
 numbers aloud as fast as possible", on a pair of handsets with narrowband
 uncompressed audio and no echo.  Subjects were able to detect round-trip
 delays down to 90 ms.  Conversational efficiency was impaired even
 with a round-trip delay of 100 ms.

 Let me emphasize again that these delays are round-trip, not one-way,
 there is no echo, and the task, while designed to expose latency, is
 probably less demanding than musical performance.

 ...

 I accept Brian Rosen's claim that a slow conversation doesn't normally
 suffer greatly from round-trip latencies up to 500 ms, but under some
 circumstances much lower latencies are valuable.  Let's make sure they're
 achievable for those who can use them.

 [Raymond]: Other than potential echo issues, the biggest problem with a
 one-way delay longer than a few hundred ms is that such a long delay makes
 it very difficult to interrupt each other, resulting in the start-stop-
 start-stop cycles I previously talked about.  Therefore, I agree with Ben
 that if the lab test did not have echoes and did not involve the test
 subjects trying to interrupt each other, then the test results may appear
 more benign than what one would experience in the real world.

 Note that the top curve in the first figure below is for
 “listening-only tests”.  Well, in that case there was no
 interaction/interruption at all, so if there were no echoes either,
 then it is no wonder that the curve stayed essentially flat.  I do
 wonder what made the curve go down at 1300 ms; I guess to understand
 this we need to know what the lab setup was for this test.  Thus, I
 echo Marshall’s opinion that we need the original paper/contribution.

 My personal experience with the delay impairment is much worse than the
 middle curve (MOS-CQS) would suggest and is close to the bottom curve
 (MOS-CQE).  Back in the early 1980s, the phone calls I made from southern
 California to East Asia were carried through geosynchronous satellites
 with a one-way delay slightly more than 500 ms (see
 http://en.wikipedia.org/wiki/Geostationary_orbit).  I absolutely hated it,
 because turn-taking was severely impaired and the only way to interrupt
 the person at the other side was to keep talking (rudely, I may say) until
 the other person finally stopped.  Then, starting in the late 1980s,
 undersea
 cables were used to carry my traditional circuit-switched calls to the
 same person in East Asia, and all of a sudden the delay was much shorter
 and interrupting each other felt as easy as face-to-face conversation.
 It’s a night-and-day difference!  Even in the early 2000s, when I used my cell
 phone to call my son’s cell phone in another cellular network, I could
 tell that there was a significant delay that noticeably impaired our turn-
 taking and our ability to interrupt each other, and I didn’t like it at
 all.  Now you know why I advocate low-delay voice communications, have
 been working on low-delay speech coding for two decades, and have even
 published a book chapter on low-delay speech coding :o)

 [Stephen]:
 From my own experience (not testing) I agree with Brian's claim that 500
 ms round trip is acceptable for most conversation.
 It does depend on what you are doing, and there are certainly tasks where
 much lower delays are needed.

 [Mike]:
 Agreed that achieving low enough latencies for sidetone perception should
 not be a goal of the wg, but we should be aiming if at all possible for
 better than 250 ms one-way delay in typical (and non-tandemed)
 deployments. The knee of the one-way delay impairment factor begins rising
 non-linearly somewhere between 150 and 250 ms.
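
 For reference, the pure-delay impairment Idd of the ITU-T G.107
 E-model can be computed directly; a minimal sketch of the echo-free
 formula (per G.107; the printed values just locate the knee):

     # Echo-free delay impairment Idd per ITU-T G.107 (E-model):
     # zero up to 100 ms one-way delay, then rising non-linearly.
     import math

     def idd(ta_ms):
         if ta_ms <= 100.0:
             return 0.0
         x = math.log10(ta_ms / 100.0) / math.log10(2.0)
         return 25.0 * ((1 + x**6)**(1/6)
                        - 3 * (1 + (x / 3)**6)**(1/6) + 2)

     for ta in (100, 150, 200, 250, 400):
         print(ta, round(idd(ta), 1))  # 0.0, ~0.2, ~3.0, ~8.9, ~24.1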

 [Raymond]: If you read the published technical papers on G.718 and
 G.729.1 carefully, I think you will find that the real reason for the
 increased delay is not that they needed a longer delay to achieve
 better fidelity for speech, but that they wanted to extend speech
 codecs to also get good performance when coding general audio (music,
 etc.).  To get good music coding performance, most audio codecs use a
 Modified Discrete Cosine Transform (MDCT) with a fairly large
 transform window size, so most audio codecs have longer coding delays
 than speech codecs.
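
 As a back-of-the-envelope illustration of why (the hop sizes below
 are assumed typical values, not taken from any particular spec): a
 50%-overlap MDCT with hop size N buffers N samples per frame and
 needs N more samples of look-ahead for the second half of the window,
 so the algorithmic delay is about 2N samples.

     # Rough MDCT delay arithmetic; hop sizes are illustrative only.
     fs = 48000.0                  # sample rate in Hz
     for hop in (120, 512, 1024):  # e.g. a low-delay vs typical audio codec
         print(hop, 2 * hop / fs * 1000.0)  # 5.0, ~21.3, ~42.7 ms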

 To code music well, G.718 and G.729.1 developers naturally had to use
 long MDCT transform windows on top of the codec delay already in
 AMR-WB and G.729.  Even so, the resulting longer delays of G.718 and
 G.729.1 are still not any longer than typical delays of audio codecs;
 in fact, they are probably somewhat shorter.

 My point is that the increased delays of G.718 and G.729.1 are purely a
 result of changing from "speech-only" to "speech and music". It's not
 because the G.718 and G.729.1 developers knew the network delay was
 getting shorter so they could be more wasteful with delay.
 Furthermore, even after they changed the codecs to handle music as well as
 speech, they still chose to make their codec delays shorter than the
 delays of most audio codecs.  Why?  They wanted to make their codec delays
 as short as they could.  In fact, they even made an effort to introduce a
 "low-delay mode" into both G.718 and G.729.1. That shows they were pretty
 concerned about the higher delays they needed to have in order to code
 music well.

 By the way, G.718 does NOT have "more than twice the delay" of AMR-WB as
 you said.  AMR-WB has a 20 ms frame size, 5 ms look-ahead, and
 1.875 ms of filtering delay, for a total algorithmic buffering delay of
 26.875 ms.  The "normal mode" of G.718 has a buffering delay of
 42.875 ms for 16 kHz wideband input/output. That's only 59.5% higher than
 AMR-WB.  For Layers 1 and 2 coding of speech, the "low-delay mode" shaves
 10 ms off to give a delay of 32.875 ms, or only 22.3% higher than AMR-WB.
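
 The arithmetic behind those percentages, as a quick check (all
 numbers are the ones quoted above):

     # Algorithmic buffering delays in ms, per the figures quoted above.
     amr_wb = 20.0 + 5.0 + 1.875      # frame + look-ahead + filtering = 26.875
     g718_normal = 42.875             # G.718 normal mode, 16 kHz wideband I/O
     g718_low = g718_normal - 10.0    # low-delay mode shaves 10 ms -> 32.875
     print((g718_normal / amr_wb - 1) * 100)  # ~59.5% above AMR-WB
     print((g718_low / amr_wb - 1) * 100)     # ~22.3% above AMR-WB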

 When G.729.1 was first standardized in May 2006, there was already a low-
 delay mode for narrowband speech at 8 and 12 kb/s with an algorithmic
 buffering delay of 25 ms.  Later, in August 2007, the developers made an
 effort to add another low-delay mode for wideband at 14 kb/s that has a
 buffering delay of 28.94 ms.  If they wanted to sacrifice delay to get
 higher fidelity as you suggested, then why would they bother to go back
 and add another low-delay mode for wideband?

 In fact, only a few months ago in their G.729.1 paper in IEEE
 Communications Magazine, October 2009, Varga, Proust, and Taddei still
 emphasized in multiple instances the importance of achieving a low coding
 delay.  I will quote two of the instances:

 "The low-delay mode... was added to the first wideband layer at 14 kb/s of
 G.729.1 (August 2007).  The motivation was to address applications such as
 VoIP in enterprise networks where low end-to-end delay is crucial" and

 "Indeed, delay is an important performance parameter, and transmitting
 speech with low end-to-end delay is also required in several applications
 making use of wideband signals".

 In summary, I do not see a clear trend where codec developers are
 becoming more wasteful with delay in order to get higher fidelity.
 If anything, in recent years I have seen a trend toward low-delay
 audio coding, such as low-delay AAC and the CELT codec, and I have
 seen the effort by G.718 and G.729.1 developers to introduce
 low-delay modes.

 In any case, I thought a few days ago a consensus was already reached
 in the WG email reflector that the IETF codec needs to have a
 low-delay mode with a 5 to 10 ms codec frame size so that it can
 handle delay-sensitive applications (that is, 5 out of the 6
 applications listed in the charter and codec requirements document).
 Therefore, I think the discussion in your last email and my current
 email is mostly of academic interest and doesn't and shouldn't affect
 how the IETF codec is to be designed.

 CONSENSUS: Impairments start somewhere between 150 and 250ms one-way
 delay.

-- 
Ticket URL: <http://trac.tools.ietf.org/wg/codec/trac/ticket/33>
codec <http://tools.ietf.org/codec/>