Re: [codec] #16: Multicast?

"Raymond (Juin-Hwey) Chen" <> Tue, 25 May 2010 01:40 UTC


Hi Cullen,

Sorry for the delay in my reply.  I was busy last week and could not respond.

Thank you for sharing the details of your delay measurements on Cisco 7960 IP
phones.  What you observed does NOT conflict with what I have been saying.
The reason is that the 20 ms and 30 ms you quoted are the "packet sizes", not
the "codec frame sizes".  Codec frame size and packet size have different
impacts on one-way delay.  The G.711 codec that you used is a sample-by-
sample codec. Theoretically its "codec frame size" is only one sample, or
0.125 ms, so the (3 x 30 ms - 3 x 20 ms) formula is not the right target
for comparison.

Furthermore, many telephones have G.711 encoder and decoder directly built
into the chip hardware of A/D and D/A, so they can directly digitize the
input audio signal into 8-bit G.711 codewords and directly playback 8-bit
G.711 codewords as the output audio signal; thus, there is essentially no
processing delay for G.711.  Even if the G.711 encoding/decoding is done in
software or firmware, the G.711 codec complexity is so low that it takes
almost no time to do G.711 processing.  The almost-zero processing delay can
contribute to the extra low delay of G.711-based VoIP systems.

There has been much discussion about how the codec frame size and packet
size affect the one-way delay, along with some confusion and the criticism
that no rigorous theoretical analysis has been given.  I therefore spent some
time on a more rigorous delay analysis, given below, in the hope that it can
settle such disputes.  At the end of my analysis, you will see how the lower
bound and upper bound of the one-way delay depend on the codec frame size AND
the packet size under various conditions.  Please read on if you are
interested and ignore the rest if you are not; alternatively, scroll down to
Equations (1) through (3), which are the main results of my delay analysis,
and read the last few paragraphs after Eq. (3).

Before I did the following delay analysis, I consulted extensively with three
Broadcom senior technical leads who have many years of real-time system
architecture and design experience in IP phones, VoIP gateways, and video
systems (such as cable/satellite set-top boxes), respectively.  Their accounts
were consistent with each other and with what I have been saying.

Before I start the analysis, let me first discuss the multi-tasking, or Real-
Time Scheduling (RTS) delay, because it is a critical component of the total
one-way delay and needs to be clarified first.

In real-time audio or video systems, many tasks have definite completion
deadlines beyond which the real-time operation will be lost and there will be
audible or visible glitches. One way to handle a real-time task is by
interrupting the processor in the hope that the processor will put down
whatever it is doing and service the interrupt first.  If there is only one
real-time task and all other tasks in the system do not have real-time
requirements, then the interrupt will be serviced immediately and there is no
RTS delay.  However, this is rarely the case, since the system typically also
has other real-time tasks. (For example, an IP phone needs to handle the
encoding of the send-path signal, decoding of the receive-path signal, echo
canceller, side-tone, and other real-time tasks at the same time.) Then, the
interrupts generated by different real-time tasks need to be prioritized.
There can be only one highest-priority task.  Any of the other tasks will
have a lower priority and need to wait for its turn if it tries to interrupt
a higher-priority task. That wait time, plus the time it takes the processor
to complete the task, is the RTS delay of that task. The entire audio or
video stream will need to be buffered and delayed by at least the worst-case
wait time in order to have a smooth playback without any gaps or glitches.

If there are a large number of real-time tasks in the system, then a
prioritized interrupt-driven RTS system will become very complex and messy,
and the associated context switching for all the interrupts will reduce the
system efficiency.  Therefore, in IP phones, VoIP gateways, and
cable/satellite set-top boxes, usually a different kind of real-time
scheduling scheme is used, where each real-time task is allowed to run to
completion, but to simplify RT scheduling, all real-time tasks are requested
in a periodic manner, or with similar assumptions such as a minimum interval.
In many of these designs, all real time tasks on any one processor have the
same period (or "thread interval") for maximum real time efficiency.  In the
case of real-time voice communication systems, the most convenient and common
thread interval is the codec frame size.  Thus, the codec frame size
determines how much RTS delay the system has.  I have consulted my Broadcom
colleague Sandy MacInnis, a senior architect who specializes in video and
system design, and who is knowledgeable about real time scheduling.  He was
the chair of the MPEG Systems committee for MPEG-1 and MPEG-2 (i.e. MPEG
Transport, MPEG Program streams, and MPEG-1 Systems).  I will quote him:

"For most efficient scheduling, all tasks should have the same period, and in
the general case, each task may be served any time from immediately after the
request to the last instant before the next request. So, for such efficient,
general and robust systems, the RTS (real time scheduling) latency is up to
one request period, which in this case is a frame duration. When the request
is serviced earlier, the data has to be buffered up because the end-end delay
needs to be constant. While someone might say that they think an RTS scheme
can service requests with consistently less latency than a frame time, I
would challenge them for a theoretical basis that shows they can do so
reliably. What happens when all the requests happen at the same time? That
can certainly happen, in general.  ...  An extremely standard basic
assumption of RTS, and in particular Rate Monotonic Scheduling (RMS), is that
for each task, the deadline equals the period. That means that from the time
a requester makes a request, the RTS system needs to ensure that the request
is completely serviced (finished, not just started) before the period from
that request to the next request expires. Other assumptions are possible, but
longer deadlines don't usually help much and they make the system more
complex, and shorter deadlines make scheduling harder.  If there is a set of
tasks with exactly the same period, i.e. synchronous, then it's possible to
schedule the shared resource to 100% of capacity while ensuring RT
performance. However, in the more typical case, the various tasks do not have
the same period, in which case in general the maximum utilization of the
shared resource that can be scheduled for real time tasks is significantly
less than 100%. Whether the system is real-time schedulable or not can be
determined in various ways, including critical instant analysis.  In either
case, in general the latency of any given request can be anywhere from zero
plus processing time, to exactly the period = deadline."
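The utilization point in the quote can be made concrete with a small sketch.  It assumes the classic Liu and Layland sufficient bound for Rate Monotonic Scheduling, n*(2^(1/n) - 1), for tasks with unrelated periods, and a simple "total processing time fits in one period" test for run-to-completion tasks that share a period.  The function names and the task set are hypothetical and are not taken from any IP phone design.

```python
def rms_utilization_bound(n_tasks):
    """Liu-Layland sufficient utilization bound for n periodic tasks under RMS."""
    if n_tasks < 1:
        raise ValueError("need at least one task")
    return n_tasks * (2.0 ** (1.0 / n_tasks) - 1.0)


def schedulable_same_period(tasks):
    """tasks: list of (processing_time_ms, period_ms) pairs sharing one period.

    With deadline == period and every task run to completion, the set fits
    if and only if the summed processing time does not exceed the period
    (scheduling overhead is ignored in this sketch).
    """
    periods = {period for _, period in tasks}
    if len(periods) != 1:
        raise ValueError("tasks do not share a period")
    period = periods.pop()
    return sum(cost for cost, _ in tasks) <= period


if __name__ == "__main__":
    # Hypothetical task set: encoder, decoder, echo canceller, all on a
    # 20 ms thread interval and together using the full interval.
    tasks = [(4, 20), (6, 20), (10, 20)]
    print(schedulable_same_period(tasks))
    # Bound for three tasks with unrelated periods, well below 100%.
    print(round(rms_utilization_bound(3), 3))
```

The bound falls toward ln 2 (about 69%) as the number of tasks grows, which is one way to read "significantly less than 100%" in the quote; it is a sufficient test only, so a task set above the bound may still be schedulable.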

For a PC with a very powerful processor and a very light real-time load, it
may be reasonable to expect the processor to perform the encoding and
decoding tasks very shortly after they are requested, with the requests being
driven by interrupts, and the processing time of each task may be very short
relative to the interval between requests. The resulting RTS delay may be as
low as a few percent of the frame interval.  This is possible because a
typical PC has much higher processing power than is required by a speech
codec.

The same is not true for VoIP gateways or IP phones, where the processor is
heavily loaded with real-time tasks and is often just barely fast enough to
handle the designated number of voice channels (many for gateways and one for
IP phones).  For example, rather than having a 2 to 3 GHz processor as in a
PC, the processor used to do speech coding in a low-end IP phone may only
have a clock rate of slightly more than 100 MHz.  In this case, it is
reasonable to expect that the time required to service each request,
including processing time, may be as much as the full frame interval.

OK, now that the RTS delay has been discussed, let me proceed with my delay
analysis.  I will break down the delay into many components, with each
component occurring after the components listed earlier.  Let the codec frame
size be F ms and the packet size be P ms.  Let each packet contain N codec
frames, so P = N*F.  For simplicity, we will not consider the codec look-
ahead L ms and codec filtering delay R ms in this analysis and will just add
them at the end because we know their multiplier is 1X.

The one-way mouth-to-ear delay includes the following codec-dependent delay
components:

(1) Encoder buffering delay: d1 = a1*F, where a1 = 1.
This is the time it takes to buffer all input samples of a codec frame.

(2) Encoder RTS delay: d2 = a2*F, where 0 < a2 <= 1.
This includes the encoder processing delay; see the discussion above.

(3) Packetization delay: d3 = a3*F, where a3 = (N-1).
This is the amount of time the first frame in the packet needs to wait until
the last frame of encoded bits in the packet is ready.

(4) Packet transmission delay: d4 = a4*F, where 0 < a4 <= N.
This is the time it takes to ship all bits in the packet out of the
transmitter; this can also be considered the decoder bit buffering delay,
since it is the time the decoder needs to wait to get all bits in the packet.
If the speed of the communication channel is very high, then d4 can be a very
small fraction of the packet size P = N*F ms, but it will not be zero.  If
the channel speed is exactly the same as the bit-rate of the packet
(including the packet header), then d4 = P = N*F ms.  Even for the case of
high-speed channel, if we view the bit transmission task as a real-time
scheduling problem for the micro-controller (which may run at a different
thread rate than the DSP), then the scheduling wait time plus the processing
time (i.e. the time to actually transmit bits) may still take up to one
thread interval, which is P = N*F ms in this case.

(5) Decoder RTS delay: d5 = a5*F, where 0 < a5 <= 1.
This includes the decoder processing delay; see the discussion above.
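As an illustration of the pure bit-shipping part of item (4), here is a minimal sketch of how long one packet takes to leave the transmitter at a given channel speed.  It assumes G.711 at 64 kbit/s (160 payload bytes per 20 ms) and 40 bytes of RTP/UDP/IPv4 headers, and it ignores link-layer framing and any scheduling wait; the function name is hypothetical.

```python
def transmission_delay_ms(payload_bytes, header_bytes, channel_bps):
    """Time in ms to ship every bit of one packet at channel_bps bits/s."""
    if channel_bps <= 0:
        raise ValueError("channel_bps must be positive")
    bits = 8 * (payload_bytes + header_bytes)
    return 1000.0 * bits / channel_bps


if __name__ == "__main__":
    payload = 160   # G.711, 20 ms of audio at 8 bytes per ms
    headers = 40    # RTP (12) + UDP (8) + IPv4 (20), no link-layer framing
    # Fast channel: d4 is a very small fraction of P, but not zero.
    print(transmission_delay_ms(payload, headers, 100_000_000))
    # Channel speed equal to the packet bit-rate including headers: d4 = P.
    print(transmission_delay_ms(payload, headers, 80_000))
```

The two calls correspond to the two extremes described in item (4); the real-time scheduling wait on the transmit side, which can stretch d4 toward a full thread interval even on a fast channel, is not modelled here.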

There may be other delay components that may depend on the codec frame size.
For example, in gateways where a few layers of processors are used, each
processor may have its own real-time scheduling delays for all tasks that it
handles.  However, at least the delay components listed above are the major
ones that are commonly encountered.  If we omit the other possible codec-
dependent components for the moment but add back the codec look-ahead L and
codec filtering delay R (if any), the total codec-dependent one-way delay is

D = d1 + d2 +... + d5 + L + R = {1 + (0,1] + (N-1) + (0,N] + (0,1]}*F + L + R

Hence, the one-way delay D has a possible range of

N*F + L + R < D <= (2*N + 2)*F + L + R, or

P + L + R < D <= 2*P + 2*F + L + R             Eq. (1)

For heavily loaded real-time systems such as VoIP gateways or IP phones, if
we assume the worst case of one full frame of encoder RTS delay and decoder
RTS delay, then a2 = 1 and a5 = 1, and we get a tighter range for the one-way
delay:

P + 2*F + L + R < D <= 2*P + 2*F + L + R       Eq. (2)

In the special case of N = 1 (each packet contains only one codec frame),
then we get

3*F + L + R < D <= 4*F + L + R                 Eq. (3)

The delay lower bounds in Eq. (1) through Eq. (3) above (under their
individual assumptions) are consistent with what I have been saying.
If the other omitted codec-dependent delay components are significant, or
if the system implementers have not been careful about minimizing the delay,
then the delay upper bounds can be even higher than those shown in Eq. (1)
through Eq. (3).
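For readers who want to plug in their own numbers, Eq. (1) through Eq. (3) can be evaluated with a short sketch.  The function name, the parameter names, and the 10 ms and 20 ms frame sizes in the usage lines are illustrative assumptions; only the coefficients a1 through a5 come from the analysis above.

```python
def one_way_delay_bounds(frame_ms, frames_per_packet,
                         lookahead_ms=0.0, filter_ms=0.0,
                         worst_case_rts=False):
    """Return (lower, upper) bounds in ms on the codec-dependent one-way delay.

    The lower bound is exclusive and the upper bound inclusive, as in
    Eq. (1).  worst_case_rts=True fixes a2 = a5 = 1, as in Eq. (2).
    """
    if frame_ms <= 0 or frames_per_packet < 1:
        raise ValueError("frame_ms must be > 0 and frames_per_packet >= 1")
    F, N = frame_ms, frames_per_packet
    extra = lookahead_ms + filter_ms          # L + R, multiplier 1X
    rts_min = 1.0 if worst_case_rts else 0.0  # smallest value of a2 and a5
    # a1 = 1, a3 = N - 1, a4 in (0, N], a2 and a5 in (0, 1]
    lower = (1 + rts_min + (N - 1) + 0 + rts_min) * F + extra
    upper = (1 + 1 + (N - 1) + N + 1) * F + extra
    return lower, upper


if __name__ == "__main__":
    # Hypothetical values: no look-ahead or filtering delay.
    print(one_way_delay_bounds(20, 1))                       # Eq. (1), N = 1
    print(one_way_delay_bounds(20, 1, worst_case_rts=True))  # Eq. (3)
    print(one_way_delay_bounds(10, 2, worst_case_rts=True))  # Eq. (2), N = 2
```

With N = 1 and worst-case RTS delay, the returned pair is (3*F, 4*F) plus L + R, which is Eq. (3).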

In your Cisco 7960 IP phone delay measurements, P = 20 ms or 30 ms, L = 0, R
= 0, and theoretically F = 0.125 ms.  If you look at Eq. (2) above, then it
is clear that you won't see 3 times the packet size difference as the delay
difference.  However, here the codec frame size is 0.125 ms, not 20 or 30 ms,
so this result doesn't conflict with what I have been saying (i.e. 3X codec
frame size).

Of course, in reality it is unlikely that an IP phone will use 0.125 ms as
the thread interval.  A more likely thread interval is P.  Then, my delay
analysis above does not apply directly.  However, it is not difficult to
follow the same logic and procedure to see what will happen in this case.  If
G.711 encoding and decoding are built right into the A/D and D/A, then the
8-bit G.711 codewords arrive directly at the input buffer or leave the output
buffer and the RTS system does not need to schedule G.711 encoding and
decoding tasks, so d2 = d5 = 0. Also, in this case d1 = P, d3 = 0, and 0 < d4
<= P.  Thus, the total one-way delay is P < D <= 2*P.

Even if the G.711 encoding and decoding operations are done in
software/firmware, the G.711 complexity is so low that it takes the processor
almost no time to do encoding and decoding.  In this case, the IP phone is
closer to the case of a PC that has much more processing power than is
required for speech coding, and if the Cisco engineers did a good job of
optimizing RTS to minimize d2 and d5, then d2 and d5 would be closer to 0
than to P.  Then, the total one-way codec-dependent delay would be closer to
P than to 3*P.  This is probably what you have observed.
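The G.711 case with a thread interval of P can be checked numerically with a small sketch.  It assumes d1 = P, d2 = d3 = d5 = 0 and d4 = a4*P, and, as an extra simplification that is not part of the analysis above, that the fraction a4 is the same for both packet sizes; the function names are hypothetical.

```python
def g711_delay_ms(packet_ms, a4):
    """One-way codec-dependent delay when G.711 runs inside the A/D and D/A.

    d1 = P, d2 = d3 = d5 = 0, d4 = a4 * P with 0 < a4 <= 1, so P < D <= 2*P.
    """
    if not 0 < a4 <= 1:
        raise ValueError("a4 must be in (0, 1]")
    return (1 + a4) * packet_ms


def delay_change_ms(p_old_ms, p_new_ms, a4):
    """Change in delay between two packet sizes, assuming the same a4 for both."""
    return g711_delay_ms(p_new_ms, a4) - g711_delay_ms(p_old_ms, a4)


if __name__ == "__main__":
    # Packet sizes from the 7960 measurements; a4 values are illustrative.
    for a4 in (0.5, 1.0):
        print(a4, delay_change_ms(20, 30, a4))
```

Under these assumptions the change between 20 ms and 30 ms packets is (1 + a4) * 10 ms, i.e. between 10 ms and 20 ms, compared with the 30 ms that (3 x 30 ms - 3 x 20 ms) would predict.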

Best Regards,


-----Original Message-----
From: Cullen Jennings
Sent: Tuesday, May 18, 2010 10:34 AM
To: Raymond (Juin-Hwey) Chen
Cc: Koen Vos;
Subject: Re: [codec] #16: Multicast?

On May 12, 2010, at 12:28 PM, Raymond (Juin-Hwey) Chen wrote:

> Hi Cullen,
> Hmm... That's interesting.  Would you please share more details of
> your measurement equipment setup, the codec used, the codec frame
> size, the number of codec frames in each packet, the way you
> measured the delay, and the measured delay value, etc.?

Sure - it's really simple to set up. I use a signal generator that makes a tone burst. Typically I do something like a 550 Hz tone in a 200 ms long burst that occurs once a second. I play this into a speaker near the microphone on one phone and also put it into one channel of a scope. I then put a microphone near the speaker of the other phone, put that on another channel of the scope, and measure the mouth-to-ear delay. It's really easy to see the starts of the two bursts from the speaker and microphone and measure the delay within a few ms. I've done lots of clever things over the years using stats from software in the phones, but this technique is easy and pretty foolproof for getting good results. The phones were plugged into a Netgear 100 Mbps hub, as I sometimes look at the timing of the Ethernet packets too. I set up the phones for G.711, one frame per packet - I chose this as it is easy to change the length of each frame, but results are similar with other codecs.

I was using two Cisco 7960 phones - no idea what version of the software and may have been a development build but it's pretty unlikely that current production software would have different results from what I tested. I set the phones for G.711 with 20 ms packets, measured delay, then set the phones for 30 ms packets and measured delay. For a given packet length, when I make multiple phone calls or reboot one of the phones, the measurements stay consistent.

The resulting change in delay between the two experiments was much less than the 30 ms that (3 x 30 ms - 3 x 20 ms) would have predicted. I feel like a total tool not providing the details of the exact measurements, but Cisco considers lots of measurements like this confidential and I'm just not up to arguing with folks about what can and cannot be said publicly. I probably should have done a test with 10 ms packets too, but I did not. Yes, I realize how nuts it is to consider something that anyone can easily measure as confidential. If anyone really cares, I will go do the work to be able to provide the numbers.

> I didn't come up with this 3*(codec frame size) delay number for IP
> phones myself.  A very senior technical lead in Broadcom's IP phone
> chips group told me that, and Broadcom is currently the #1 world-
> wide market share leader in IP phone chips, accounting for more than
> half of the world's IP phone chip shipments.
> Most of the world's
> tier-1 IP phone manufacturers use our IP phone chips at least in
> some of their product lines.

Yah, again, Cisco considers chipsets confidential but if you poke around with google,  such as

Folks claim the 7960 uses a Broadcom chip - of course that looks like it is for the ethernet switch on the phone not much to do with software that impacts audio latency.

> I would be very interested to learn more about your measurements to
> try to reconcile these seemingly contradictory statements from two
> different sources.  Thanks.

Be glad to talk to whoever it is. I really don't know how relevant this all is to figuring out what packet sizes we need to support.

> Best Regards,
> Raymond
> -----Original Message-----
> From: Cullen Jennings []
> Sent: Wednesday, May 12, 2010 8:00 AM
> To: Raymond (Juin-Hwey) Chen
> Cc: Koen Vos;
> Subject: Re: [codec] #16: Multicast?
> On May 4, 2010, at 7:15 PM, Raymond (Juin-Hwey) Chen wrote:
>> the 3*(codec frame size) delay is very real for IP phone
> This does not match the measurements I have. And I certainly don't have 100+ year voip experience but I do have two of the #1 selling enterprise phones connected to an oscilloscope. Test with other phones suggest most the major vendors of IP hard phones have fairly comparable performance when it comes to delay.
> Cullen

Cullen Jennings