Re: [codec] #19: How large is the frame size depended delay / the serialization delay / frame size depended processing delay?
"codec issue tracker" <trac@tools.ietf.org> Thu, 24 June 2010 15:31 UTC
Return-Path: <trac@tools.ietf.org>
X-Original-To: codec@core3.amsl.com
Delivered-To: codec@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 87E153A680D for <codec@core3.amsl.com>; Thu, 24 Jun 2010 08:31:11 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -101.153
X-Spam-Level:
X-Spam-Status: No, score=-101.153 tagged_above=-999 required=5 tests=[AWL=-1.153, BAYES_50=0.001, NO_RELAYS=-0.001, USER_IN_WHITELIST=-100]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id pr8KxjFc0+Zz for <codec@core3.amsl.com>; Thu, 24 Jun 2010 08:30:44 -0700 (PDT)
Received: from zinfandel.tools.ietf.org (unknown [IPv6:2001:1890:1112:1::2a]) by core3.amsl.com (Postfix) with ESMTP id 7FC8928C0DC for <codec@ietf.org>; Thu, 24 Jun 2010 08:30:40 -0700 (PDT)
Received: from localhost ([::1] helo=zinfandel.tools.ietf.org) by zinfandel.tools.ietf.org with esmtp (Exim 4.72) (envelope-from <trac@tools.ietf.org>) id 1ORoO9-0000sO-7D; Thu, 24 Jun 2010 08:30:49 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: codec issue tracker <trac@tools.ietf.org>
X-Trac-Version: 0.11.7
Precedence: bulk
Auto-Submitted: auto-generated
X-Mailer: Trac 0.11.7, by Edgewall Software
To: hoene@uni-tuebingen.de
X-Trac-Project: codec
Date: Thu, 24 Jun 2010 15:30:49 -0000
X-URL: http://tools.ietf.org/codec/
X-Trac-Ticket-URL: http://trac.tools.ietf.org/wg/codec/trac/ticket/19#comment:5
Message-ID: <071.69c574ff48426175f73948ea16470629@tools.ietf.org>
References: <062.f8b0d2abf056a9655a81ee25366bb354@tools.ietf.org>
X-Trac-Ticket-ID: 19
In-Reply-To: <062.f8b0d2abf056a9655a81ee25366bb354@tools.ietf.org>
X-SA-Exim-Connect-IP: ::1
X-SA-Exim-Rcpt-To: hoene@uni-tuebingen.de, codec@ietf.org
X-SA-Exim-Mail-From: trac@tools.ietf.org
X-SA-Exim-Scanned: No (on zinfandel.tools.ietf.org); SAEximRunCond expanded to false
Cc: codec@ietf.org
Subject: Re: [codec] #19: How large is the frame size depended delay / the serialization delay / frame size depended processing delay?
X-BeenThere: codec@ietf.org
X-Mailman-Version: 2.1.9
Reply-To: codec@ietf.org
List-Id: Codec WG <codec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/codec>
List-Post: <mailto:codec@ietf.org>
List-Help: <mailto:codec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Jun 2010 15:31:48 -0000
#19: How large is the frame size depended delay / the serialization delay / frame size depended processing delay?
------------------------------------+---------------------------------------
 Reporter:  hoene@…                 |       Owner:
     Type:  enhancement             |      Status:  new
 Priority:  minor                   |   Milestone:
Component:  requirements            |     Version:
 Severity:  -                       |    Keywords:
------------------------------------+---------------------------------------

Comment (by hoene@…):

[Raymond]: Thank you [Cullen] for sharing the details of your delay measurements on Cisco 7960 IP phones. What you observed does NOT conflict with what I have been saying. The reason is that the 20 ms and 30 ms you quoted are the "packet sizes", not the "codec frame sizes". Codec frame size and packet size have different impacts on one-way delay. The G.711 codec that you used is a sample-by-sample codec. Theoretically its "codec frame size" is only one sample, or 0.125 ms, so the (3 x 30 ms - 3 x 20 ms) formula is not the right target for comparison.

Furthermore, many telephones have the G.711 encoder and decoder built directly into the A/D and D/A chip hardware, so they can digitize the input audio signal directly into 8-bit G.711 codewords and play back 8-bit G.711 codewords directly as the output audio signal; thus, there is essentially no processing delay for G.711. Even if the G.711 encoding/decoding is done in software or firmware, the G.711 codec complexity is so low that it takes almost no time to do G.711 processing. The almost-zero processing delay can contribute to the extra-low delay of G.711-based VoIP systems.

There have been many discussions about how the codec frame size and packet size may affect the one-way delay, there has been confusion, and there has been criticism that there wasn't any rigorous theoretical analysis, so I thought I would spend some time to give a more rigorous delay analysis below so we can hopefully settle such disputes.
At the end of my analysis, you will see how the lower bound and upper bound of the one-way delay depend on the codec frame size AND the packet size under various conditions. Please read on if you are interested; ignore it if you are not; or you can quickly scroll down to Equations (1) through (3), which are the main results of my delay analysis, and read the last few paragraphs after Eq. (3).

Before I did the following delay analysis, I consulted extensively with three Broadcom senior technical leads who have many years of real-time system architecture and design experience in IP phones, VoIP gateways, and video systems (such as cable/satellite set-top boxes), respectively. What they told me was consistent with each other and with what I have been saying.

Before I start the analysis, let me first discuss the multi-tasking, or Real-Time Scheduling (RTS), delay, because it is a critical component of the total one-way delay and needs to be clarified first. In real-time audio or video systems, many tasks have definite completion deadlines beyond which the real-time operation will be lost and there will be audible or visible glitches. One way to handle a real-time task is by interrupting the processor in the hope that the processor will put down whatever it is doing and service the interrupt first. If there is only one real-time task and all other tasks in the system do not have real-time requirements, then the interrupt will be serviced immediately and there is no RTS delay. However, this is rarely the case, since the system typically also has other real-time tasks. (For example, an IP phone needs to handle the encoding of the send-path signal, the decoding of the receive-path signal, the echo canceller, the side-tone, and other real-time tasks at the same time.) Then, the interrupts generated by different real-time tasks need to be prioritized. There can be only one highest-priority task.
Any of the other tasks will have a lower priority and will need to wait for its turn if it tries to interrupt a higher-priority task. That wait time, plus the time it takes the processor to complete the task, is the RTS delay of that task. The entire audio or video stream will need to be buffered and delayed by at least the worst-case wait time in order to have a smooth playback without any gaps or glitches.

If there are a large number of real-time tasks in the system, then a prioritized interrupt-driven RTS system will become very complex and messy, and the associated context switching for all the interrupts will reduce the system efficiency. Therefore, in IP phones, VoIP gateways, and cable/satellite set-top boxes, usually a different kind of real-time scheduling scheme is used, where each real-time task is allowed to run to completion, but to simplify RT scheduling, all real-time tasks are requested in a periodic manner, or with similar assumptions such as a minimum interval. In many of these designs, all real-time tasks on any one processor have the same period (or "thread interval") for maximum real-time efficiency. In the case of real-time voice communication systems, the most convenient and common thread interval is the codec frame size. Thus, the codec frame size determines how much RTS delay the system has.

I have consulted my Broadcom colleague Sandy MacInnis, a senior architect who specializes in video and system design and who is knowledgeable about real-time scheduling. He was the chair of the MPEG Systems committee for MPEG-1 and MPEG-2 (i.e. MPEG Transport, MPEG Program streams, and MPEG-1 Systems). I will quote him below:

"For most efficient scheduling, all tasks should have the same period, and in the general case, each task may be served any time from immediately after the request to the last instant before the next request.
So, for such efficient, general and robust systems, the RTS (real time scheduling) latency is up to one request period, which in this case is a frame duration. When the request is serviced earlier, the data has to be buffered up because the end-end delay needs to be constant. While someone might say that they think an RTS scheme can service requests with consistently less latency than a frame time, I would challenge them for a theoretical basis that shows they can do so reliably. What happens when all the requests happen at the same time? That can certainly happen, in general.

... An extremely standard basic assumption of RTS, and in particular Rate Monotonic Scheduling (RMS), is that for each task, the deadline equals the period. That means that from the time a requester makes a request, the RTS system needs to ensure that the request is completely serviced (finished, not just started) before the period from that request to the next request expires. Other assumptions are possible, but longer deadlines don't usually help much and they make the system more complex, and shorter deadlines make scheduling harder.

If there is a set of tasks with exactly the same period, i.e. synchronous, then it's possible to schedule the shared resource to 100% of capacity while ensuring RT performance. However, in the more typical case, the various tasks do not have the same period, in which case in general the maximum utilization of the shared resource that can be scheduled for real time tasks is significantly less than 100%. Whether the system is real-time schedulable or not can be determined in various ways, including critical instant analysis. In either case, in general the latency of any given request can be anywhere from zero plus processing time, to exactly the period = deadline."
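The "what happens when all the requests happen at the same time?" question can be illustrated with a small numeric sketch. The task set and execution times below are purely illustrative (not from this thread): with run-to-completion scheduling at a common thread interval, the task served last at a critical instant finishes only after the sum of all execution times, which on a heavily loaded processor approaches a full period.

```python
# Hypothetical task set sharing one thread interval (all numbers illustrative).
# Run-to-completion scheduling: at a critical instant, every task requests
# service at t = 0 and each task runs to completion in turn.
period_ms = 10.0  # common thread interval, e.g. one codec frame
tasks = {"encoder": 3.0, "decoder": 2.5, "echo_canceller": 2.0, "side_tone": 1.5}

finish = 0.0
for name, exec_ms in tasks.items():
    finish += exec_ms  # task i finishes after all earlier tasks plus itself
    print(f"{name}: finishes at {finish:.1f} ms")

utilization = sum(tasks.values()) / period_ms
assert finish <= period_ms, "task set would miss its deadline"
print(f"worst-case latency {finish:.1f} ms of a {period_ms:.0f} ms period "
      f"(utilization {utilization:.0%})")
```

At 90% utilization the last task's worst-case latency is already 9 ms of the 10 ms period, matching the claim that the RTS latency can be up to one request period.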
For a PC with a very powerful processor and a very light real-time load, it may be reasonable to expect the processor to perform the encoding and decoding tasks very shortly after they are requested, with the requests being driven by interrupts, and the processing time of each task may be very short relative to the interval between requests. The resulting RTS delay may be as low as a few percent of the frame interval. This is possible because a typical PC has much higher processing power than is required by a speech coder.

The same is not true for VoIP gateways or IP phones, where the processor is heavily loaded with real-time tasks and is often just barely fast enough to handle the designated number of voice channels (many for gateways and one for IP phones). For example, rather than having a 2 to 3 GHz processor as in a PC, the processor used to do speech coding in a low-end IP phone may only have a clock rate of slightly more than 100 MHz. In this case, it is reasonable to expect that the time required to service each request, including processing time, may be as much as the full frame interval.

OK, now that the RTS delay has been discussed, let me proceed with my delay analysis. I will break down the delay into components, with each component occurring after the components listed earlier. Let the codec frame size be F ms and the packet size be P ms. Let each packet contain N codec frames, so P = N*F. For simplicity, we will not consider the codec look-ahead L ms and the codec filtering delay R ms in this analysis and will just add them at the end, because we know their multiplier is 1X. The one-way mouth-to-ear delay includes the following codec-dependent delay components:

(1) Encoder buffering delay: d1 = a1*F, where a1 = 1. This is the time it takes to buffer all input samples of a codec frame.

(2) Encoder RTS delay: d2 = a2*F, where 0 < a2 <= 1. This includes the encoder processing delay; see the discussion above.
(3) Packetization delay: d3 = a3*F, where a3 = (N-1). This is the amount of time the first frame in the packet needs to wait until the last frame of encoded bits in the packet is ready.

(4) Packet transmission delay: d4 = a4*F, where 0 < a4 <= N. This is the time it takes to ship all bits in the packet out of the transmitter; this can also be considered the decoder bit buffering delay, since it is the time the decoder needs to wait to get all bits in the packet. If the speed of the communication channel is very high, then d4 can be a very small fraction of the packet size P = N*F ms, but it will not be zero. If the channel speed is exactly the same as the bit-rate of the packet (including the packet header), then d4 = P = N*F ms. Even in the case of a high-speed channel, if we view the bit transmission task as a real-time scheduling problem for the micro-controller (which may run at a different thread rate than the DSP), then the scheduling wait time plus the processing time (i.e. the time to actually transmit bits) may still take up to one thread interval, which is P = N*F ms in this case.

(5) Decoder RTS delay: d5 = a5*F, where 0 < a5 <= 1. This includes the decoder processing delay; see the discussion above.

There may be other delay components that depend on the codec frame size. For example, in gateways where a few layers of processors are used, each processor may have its own real-time scheduling delays for all tasks that it handles. However, at least the delay components listed above are the major ones that are commonly encountered. If we omit the other possible codec-dependent components for the moment but add back the codec look-ahead L and codec filtering delay R (if any), the total codec-dependent one-way delay is then

   D = d1 + d2 + ... + d5 + L + R = {1 + (0,1] + (N-1) + (0,N] + (0,1]}*F + L + R

Hence, the one-way delay D has a possible range of N*F + L + R < D <= (2*N + 2)*F + L + R, or

   P + L + R < D <= 2*P + 2*F + L + R                    Eq. (1)

For heavily loaded real-time systems such as VoIP gateways or IP phones, if we assume the worst case of one full frame of encoder RTS delay and one full frame of decoder RTS delay, then a2 = 1 and a5 = 1, and we get a tighter range for the one-way delay:

   P + 2*F + L + R < D <= 2*P + 2*F + L + R              Eq. (2)

In the special case of N = 1 (each packet contains only one codec frame), we get

   3*F + L + R < D <= 4*F + L + R                        Eq. (3)

The delay lower bounds in Eq. (1) through Eq. (3) above (under their individual assumptions) are consistent with what I have been saying. If the other omitted codec-dependent delay components are significant, or if the system implementers have not been careful about minimizing the delay, then the delay upper bounds can be even higher than what is shown in Eq. (1) through Eq. (3).

In your Cisco 7960 IP phone delay measurements, P = 20 ms or 30 ms, L = 0, R = 0, and theoretically F = 0.125 ms. If you look at Eq. (2) above, then it is clear that you won't see 3 times the packet size difference as the delay difference. However, here the codec frame size is 0.125 ms, not 20 or 30 ms, so this result doesn't conflict with what I have been saying (i.e. 3X codec frame size).

Of course, in reality it is unlikely that an IP phone will use 0.125 ms as the thread interval. A more likely thread interval is P. Then, my delay analysis above does not apply directly. However, it is not difficult to follow the same logic and procedure to see what happens in this case. If G.711 encoding and decoding is built right into the A/D and D/A, then the 8-bit G.711 codewords directly arrive at the input buffer or leave the output buffer, and the RTS system does not need to schedule G.711 encoding and decoding tasks, so d2 = d5 = 0. Also, in this case d1 = P, d3 = 0, and 0 < d4 <= P. Thus, the total one-way delay is P < D <= 2*P.
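As a sanity check on the bounds above, here is a small sketch that evaluates Eq. (1) through Eq. (3) directly; the function name and example numbers are mine, but the arithmetic follows the equations in this analysis.

```python
def one_way_delay_bounds(F, N, L=0.0, R=0.0, loaded=False):
    """Codec-dependent one-way delay bounds in ms, per the analysis above.

    F: codec frame size (ms), N: frames per packet (so P = N*F),
    L: codec look-ahead (ms), R: codec filtering delay (ms).
    loaded=True assumes a full frame of encoder and decoder RTS delay
    (a2 = a5 = 1), giving the tighter range of Eq. (2).
    Returns (exclusive lower bound, inclusive upper bound).
    """
    P = N * F
    if loaded:
        lower = P + 2 * F + L + R   # Eq. (2) lower bound
    else:
        lower = P + L + R           # Eq. (1) lower bound
    upper = 2 * P + 2 * F + L + R   # same upper bound in both cases
    return lower, upper

# Eq. (3): N = 1 on a heavily loaded system gives 3*F < D <= 4*F.
lo, hi = one_way_delay_bounds(F=20.0, N=1, loaded=True)
print(lo, hi)  # 60.0 80.0
```

With F = 20 ms and N = 1 this reproduces the 3X-to-4X frame-size range of Eq. (3); setting loaded=False recovers the looser Eq. (1) range.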
Even if the G.711 encoding and decoding operations are done in software/firmware, the G.711 complexity is so low that it takes the processor almost no time to do encoding and decoding. In this case, the IP phone is closer to the case of a PC that has much more processing power than is required for speech coding, and if the Cisco engineers did a good job of optimizing RTS to minimize d2 and d5, then d2 and d5 would be closer to 0 than to P. Then, the total one-way codec-dependent delay would be closer to P than to 3*P. This is probably what you have observed.

[Koen]: Thanks for the detailed explanation; this clarifies your earlier statements about the 3x multiplier. The essence, if I understand you correctly, is that there still exist low-end platforms with barely enough processing power to run a VoIP call. If such platforms use a naive FIFO scheduler, they'll create up to one frame of processing delay each for the encoder and decoder, on top of the frame of buffering delay. The good news is that Moore's law will continue to drive down the fraction of platforms with such processing delay problems. I'm a bit surprised by your analysis of "packet transmission delay", as it has little bearing on our multiplier (i.e. the change in delay as a function of frame size). See old posts.

[Raymond]: It doesn't have to be low-end platforms. I wouldn't consider high-density VoIP gateways "low-end". What matters is whether the processor is heavily loaded (i.e. busy a high percentage of the time) with real-time tasks (and thus is just fast enough). I think this is true for typical implementations of IP phones and VoIP gateways. I also wouldn't use the term "a naïve FIFO scheduler" to describe the "run to completion" real-time scheduler that I talked about in my last email, because that term seems to imply that it is a very simple-minded and inferior approach used by an inexperienced person who doesn't know any better.
My understanding from talking to the three senior technical leads at Broadcom is that, in reality, when you have many real-time tasks to handle concurrently, using a prioritized interrupt-driven scheduler is just way too complex and messy, and it doesn't even guarantee a lower delay if you do go through the trouble. In contrast, the kind of "run to completion" real-time scheduler that I talked about is a more elegant solution, as it simplifies the scheduling problem substantially and also allows more efficient utilization of the processor. Other than these two points, your understanding of my main point is correct.

> The good news is that Moore's law will continue to drive down the
> fraction of platforms with such processing delay problems.

[Raymond]: This may be true for PCs but probably not in general. A PC is a general-purpose computing device that has to handle numerous possible tasks, and a voice phone call takes only a very small fraction of the worst-case computational power requirement of a PC. In contrast, for special-purpose dedicated hardware devices such as IP phones or VoIP gateways, it would make no sense to use a processor that is many times faster than the worst-case computational power requirement. For the sake of cost and power efficiency, the designers of such special-purpose devices will want to use a processor that's just slightly faster than required, because then they can use the cheapest and/or lowest-power processor that's fast enough to get the job done. If they choose a processor much faster than is required, then competitors using processors just fast enough can have lower costs and power consumption and can take market share away from them.
A case in point: several decades after their first appearance, 8-bit microprocessors are still widely used in many devices today despite the several orders of magnitude of speed improvement provided by Moore's Law, because those devices just don't need anything faster, so using anything faster would be a waste of money and power. My point is that we should not expect future IP phones or gateways to operate at a very low percentage of processor load just because Moore's Law can improve processor speed over time. Therefore, don't expect the 3X multiplier for codec frame size to go down much below where it is now.

In fact, if in addition to a VoIP call a PC is heavily loaded with a lot of other concurrent tasks, many of which may be real-time tasks (e.g. video, playing/burning CD/DVD, networking, etc.), then it will be difficult for the PC to have small encoding and decoding RTS delays (d2 and d5 in my delay analysis). In this case, the codec frame size multiplier will be closer to 3X than to 1X, unless you are willing to let the voice stream occasionally run out of real time and produce an audible glitch (which is not acceptable from the voice quality perspective). If you agree with this and agree that a PC sometimes does get very heavily loaded, then if you don't want the voice stream to run out of real time, the worst-case codec-dependent delay for a PC can still be around 3X the codec frame size.

> I'm a bit surprised by your analysis of "packet transmission delay",
> as it has little bearing on our multiplier (ie the change in delay as
> a function of frame size). See old posts.

[Raymond]: I am not sure I understand what you are saying. You probably misunderstood the goal of my analysis. I mentioned in my last email that my delay analysis aimed to derive the lower and upper bounds of the codec-dependent one-way delay as functions of both the codec frame size AND the packet size.
That "packet transmission delay" does depend on the packet size, so it should be included. Also, including it doesn't increase the lower bound of the delay (and the codec frame size multiplier there); it only affects the upper bound. Or, are you saying that the "packet transmission delay" depends on the packet size, not the codec frame size, and therefore is not codec-dependent? Well, we know the packet size should be a positive integer multiple of the codec frame size. Once the codec frame size is determined, there are only limited choices of packet sizes you can use, so in this sense the packet size does depend on the codec frame size. Therefore, the "packet transmission delay" indirectly depends on the choice of the codec.

[Koen]: In other words, future manufacturers won't spend a few dimes on reducing delay, even though today they're happy to add several dollars to the price just to enable wideband? That's a statement about the relative importance of delay. For the discussion about transmission delay vs. frame size, see e.g. http://www.ietf.org/mail-archive/web/codec/current/msg01477.html

[Hoene]: Yesterday, I had a brief look at ITU-T G.114 http://www1.cs.columbia.edu/~andreaf/new/documents/other/T-REC-G.114-200305.pdf It might help in your discussion...

[Sandy MacInnis]: Sorry for stepping in here... full disclosure: I'm not a speech coding expert, and I work at Broadcom, where Raymond works. I too would like to end this discussion; it seems to have diverged from a discussion of the requirement for the CODEC algorithm to have a mode with low algorithmic delay, which AFAIK is already agreed anyway, to some rather tangential discussions related to, but not really addressing, real-time scheduling of the algorithm on a processor. The point from Raymond that is at the head of this particular discussion trail is RTS, i.e. real-time scheduling. I know his note about that is long; it might be worth reading it again.
It's not a fair assumption that 100% of a shared resource - in this instance, a processor - can be spent performing real-time-scheduled tasks. If there is a set of RT (real time) tasks that have different periods, with periods = deadlines, all being scheduled on the same processor, the best you can do is less than 100%. How close you can get depends on the details; it might be e.g. 68%, or it could be significantly less; there's a lot of literature on this. If the system is optimally designed for the purposes of RTS, i.e. all other tasks are treated as non-real-time and have lower priority than all real-time tasks, there are no priority inversions, task switching is very efficient, etc., then the RTS performance can come close to theory; but if any of these assumptions is not true, it can be significantly worse.

If the total RT demands are only a very small fraction of the total shared resource, i.e. processor cycles, it tends to be easier to perform the scheduling and ensure that it works correctly. Such headroom may be more important than RTS theory indicates if the system is not well designed for real-time operation, e.g. a PC. And such systems draw MUCH more power than well-designed embedded products. Conversely, low power and modest clock rates are good design principles for embedded products, even those that are wall (mains) powered. E.g. someone noted leakage power at 65nm - have you looked at 40nm? It just keeps getting worse. Designing for a slower max clock rate saves substantial power.

There are good reasons why a common convention of real-time scheduling is the assumption that period = deadline. As Raymond noted, other design assumptions are possible, but they have their own problems. Note also, as Raymond pointed out, that RTS also applies to intermediate points in the end-end system, such as gateways.
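The "e.g. 68%" figure above is in the neighborhood of the classic Liu & Layland schedulability bound for rate-monotonic scheduling with deadline = period: n independent periodic tasks are guaranteed schedulable if total utilization does not exceed n*(2^(1/n) - 1), which decreases toward ln 2 ≈ 69.3% as n grows. A quick sketch (my connection, not stated in the thread):

```python
import math

def rms_utilization_bound(n):
    """Liu & Layland bound: n periodic tasks with deadline = period are
    guaranteed schedulable under rate-monotonic priorities if their total
    utilization is at most n * (2**(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)

for n in (1, 2, 3, 10):
    print(n, round(rms_utilization_bound(n), 3))
# n = 1 gives 1.0 (a single task may use 100%); the bound falls toward
print(round(math.log(2), 3))  # the limit ln 2, about 0.693
```

This is a sufficient condition only: a specific task set above the bound may still be schedulable, which critical instant analysis (mentioned above) can decide exactly.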
Such a device may have very powerful processors, and if so, it should be for the specific purpose of performing a large number of RT tasks, loading the processors as much as can be guaranteed. I would hope that this committee is not planning to be in the position of dictating that all implementations of the algorithm require a processor so fast that the system can guarantee a service latency much less than the period of an audio frame. And if not, then a reasonable assumption is that, in general, the deadline of the service latency does equal the period of an audio frame. That assumption is part of one of the upper-limit calculations from Raymond.

[Raymond]: I too don't want to see this discussion drag on, but some of your comments seem misleading to me, so I would like to respond with some quick comments. Wideband is a new feature in some devices and is a check box that a product manager needs to check off to remain competitive. That doesn't mean wideband is more important than existing features in a device. Also, I am not sure the cost difference is a few dimes versus several dollars. In some devices the extra cost of adding wideband is minimal. Furthermore, it is not only a cost issue but also a power consumption issue. No one in his or her right mind will use a processor that's 5X to 10X faster than necessary just in order to reduce the encoder and decoder RTS delays to a small fraction of the codec frame size; this is just the way it is and has nothing to do with the relative importance of delay or anything else. You were presenting it as if this were a reasonable choice that device designers could easily make but chose not to, but that's just not true. It has always been the case that designers will use processors just fast enough for the job, perhaps with a little margin for the unexpected, but not 5X or 10X.
Given this, the bottom line is that ~3X the codec frame size is the "norm" or a "necessary result" for special-purpose hardware devices rather than a design choice, and you are just lucky to get < 2X in PC-based VoIP calls because PCs were not designed for voice calls but for other, much more computationally demanding tasks. (Even there you can't guarantee that PCs will always give you a multiplier of < 2X. What if the PC is heavily loaded with other tasks? Then you are more likely to get 3X if you don't want your voice stream to run out of real time.)

--
Ticket URL: <http://trac.tools.ietf.org/wg/codec/trac/ticket/19#comment:5>
codec <http://tools.ietf.org/codec/>