Re: [codec] #20: Computational complexity?

"codec issue tracker" <trac@tools.ietf.org> Thu, 24 June 2010 15:33 UTC

Return-Path: <trac@tools.ietf.org>
X-Original-To: codec@core3.amsl.com
Delivered-To: codec@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 159753A67FB for <codec@core3.amsl.com>; Thu, 24 Jun 2010 08:33:33 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -100.848
X-Spam-Level:
X-Spam-Status: No, score=-100.848 tagged_above=-999 required=5 tests=[AWL=-1.448, BAYES_50=0.001, J_CHICKENPOX_14=0.6, NO_RELAYS=-0.001, USER_IN_WHITELIST=-100]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HLCTtv0wohZE for <codec@core3.amsl.com>; Thu, 24 Jun 2010 08:33:30 -0700 (PDT)
Received: from zinfandel.tools.ietf.org (unknown [IPv6:2001:1890:1112:1::2a]) by core3.amsl.com (Postfix) with ESMTP id 812653A680D for <codec@ietf.org>; Thu, 24 Jun 2010 08:33:29 -0700 (PDT)
Received: from localhost ([::1] helo=zinfandel.tools.ietf.org) by zinfandel.tools.ietf.org with esmtp (Exim 4.72) (envelope-from <trac@tools.ietf.org>) id 1ORoQs-0000xd-7E; Thu, 24 Jun 2010 08:33:38 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: codec issue tracker <trac@tools.ietf.org>
X-Trac-Version: 0.11.7
Precedence: bulk
Auto-Submitted: auto-generated
X-Mailer: Trac 0.11.7, by Edgewall Software
To: hoene@uni-tuebingen.de
X-Trac-Project: codec
Date: Thu, 24 Jun 2010 15:33:38 -0000
X-URL: http://tools.ietf.org/codec/
X-Trac-Ticket-URL: https://svn.tools.ietf.org/wg/codec/trac/ticket/20#comment:3
Message-ID: <071.5e6ef09350c42a52fe51191cc3cf486e@tools.ietf.org>
References: <062.8524135614c0f45c18915362cc459235@tools.ietf.org>
X-Trac-Ticket-ID: 20
In-Reply-To: <062.8524135614c0f45c18915362cc459235@tools.ietf.org>
X-SA-Exim-Connect-IP: ::1
X-SA-Exim-Rcpt-To: hoene@uni-tuebingen.de, codec@ietf.org
X-SA-Exim-Mail-From: trac@tools.ietf.org
X-SA-Exim-Scanned: No (on zinfandel.tools.ietf.org); SAEximRunCond expanded to false
Cc: codec@ietf.org
Subject: Re: [codec] #20: Computational complexity?
X-BeenThere: codec@ietf.org
X-Mailman-Version: 2.1.9
Reply-To: codec@ietf.org
List-Id: Codec WG <codec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/codec>
List-Post: <mailto:codec@ietf.org>
List-Help: <mailto:codec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Jun 2010 15:33:34 -0000

#20: Computational complexity?
------------------------------------+---------------------------------------
 Reporter:  hoene@…                 |       Owner:     
     Type:  defect                  |      Status:  new
 Priority:  major                   |   Milestone:     
Component:  requirements            |     Version:     
 Severity:  -                       |    Keywords:     
------------------------------------+---------------------------------------

Comment(by hoene@…):

 [Hoene]: In the last weeks we had a couple of discussions on the
 complexity or algorithmic efficiency of the codec. Knowing the algorithmic
 efficiency is important because:
 1.      the complexity has an impact on power consumption and system
 costs,
 2.      the hardware can be selected to fit pre-known complexity
 requirements, and
 3.      different codec proposals can be compared if they show similar
 performance in other aspects.
 Before any complexity comparisons can be made, we need an objective,
 precise, reliable, and repeatable metric for measuring the algorithmic
 efficiency. Let me describe three different approaches:

 Classic Approach used in Codec Standardization

 For the last 17 years, ITU-T Study Group 16 has measured the complexity
 of codecs using a library called ITU-T Basic Operators (described in ITU-T
 G.191), which counts the kind and number of operations and the amount of
 memory used. The latest version of the standard (03/10), not yet publicly
 available, supports - besides fixed-point operations of different widths -
 also floating-point operations. Each operation can be counted
 automatically and weighted accordingly. The following source code is an
 excerpt from baseop32.h:

 /*___________________________________________________________________________
  |                                                                           |
  |   Prototypes for basic arithmetic operators                               |
  |___________________________________________________________________________|
 */

 Word16 add (Word16 var1, Word16 var2);    /* Short add,           1   */
 Word16 sub (Word16 var1, Word16 var2);    /* Short sub,           1   */
 Word16 abs_s (Word16 var1);               /* Short abs,           1   */
 Word16 shl (Word16 var1, Word16 var2);    /* Short shift left,    1   */
 Word16 shr (Word16 var1, Word16 var2);    /* Short shift right,   1   */
 …
 Word16 div_s (Word16 var1, Word16 var2); /* Short division,       18  */
 Word16 norm_l (Word32 L_var1);           /* Long norm,             1  */

 In the upcoming ITU-T G.GSAD standard, another approach has been used, as
 shown in the following code example. For each operation, a WMOPS counting
 function has been added, which counts the number of operations. To measure
 the algorithmic efficiency of an algorithm, the program is run and the
 operations are counted.

     for (i=0; i<NUM_BAND; i++)
     {
 #ifdef WMOPS_FX
         move32();move32();
         move32();move32();
 #endif

         state_fx->band_enrg_long_fx[i] = 30;
         state_fx->band_enrg_fx[i] = 30;
         state_fx->band_enrg_bgd_fx[i] = 30;
         state_fx->min_band_enrg_fx[i] = 30;
     }

 These two similar approaches have worked well over the years and are an
 established procedure for counting operations and complexity. Still, they
 have some drawbacks:
 a)      Existing algorithms must be instrumented manually to count the
 operations. This is time consuming.
 b)      The CPU model is simplistic, as it does not consider memory access
 (e.g. caches), parallel execution, or other kinds of optimizations done in
 modern microprocessors and compilers. Thus, the number of instructions
 might not correlate well with the actual execution time on modern CPUs.

 Profiling: The Classic Approach used in Software Engineering

 Citing Wikipedia: „In software engineering, program profiling, software
 profiling or simply profiling, a form of dynamic program analysis (as
 opposed to static code analysis), is the investigation of a program's
 behavior using information gathered as the program executes. The usual
 purpose of this analysis is to determine which sections of a program to
 optimize - to increase its overall speed, decrease its memory requirement
 or sometimes both.
 •       A (code) profiler is a performance analysis tool that, most
 commonly, measures only the frequency and duration of function calls, but
 there are other specific types of profilers (e.g. memory profilers) in
 addition to more comprehensive profilers, capable of gathering extensive
 performance data.
 •       An instruction set simulator which is also — by necessity — a
 profiler, can measure the totality of a program's behaviour from
 invocation to termination.“
 Thus, a typical profiler such as gprof can be used to measure and
 understand the complexity of a codec implementation. It is precise because
 it measures actual execution on modern computers. However, if used for
 standardization, it has the following drawback:
 a)      The execution times depend on the CPU architecture, the PC in
 general, the OS, and programs running in parallel. As such, it cannot
 produce reliable and repeatable results. The results are not comparable if
 obtained under slightly changed conditions.

 Suggestion: Use a Cycle Accurate Simulator

 Wikipedia: „A Cycle Accurate Simulator (CAS) is a computer program that
 simulates a microarchitecture cycle-accurate. In contrast an instruction
 set simulator simulates an Instruction Set Architecture usually faster but
 not cycle-accurate to a specific implementation of this architecture.“
 Then, the execution times are precise and repeatable. If two parties make
 measurements using different CPUs, they still get the same results. Of
 course, a cycle-accurate simulator is slower than the real CPU by a factor
 of about 100.

 Actually, there is an open-source cycle-accurate simulator called PTLsim
 available that simulates a Pentium 4. „PTLsim is a cycle accurate x86
 microprocessor simulator and virtual machine for the x86 and x86-64
 instruction sets. PTLsim models a modern superscalar out of order x86-64
 compatible processor core at a configurable level of detail ranging from
 full-speed native execution on the host CPU all the way down to RTL level
 models of all key pipeline structures. In addition, all microcode, the
 complete cache hierarchy, memory subsystem and supporting hardware devices
 are modeled with true cycle accuracy. PTLsim supports the full x86-64
 instruction set of the Pentium 4+, Athlon 64 and similar machines with all
 extensions (x86-64, SSE/SSE2/SSE3, MMX, x87). It is currently the only
 tool available to the public to support true cycle accurate modeling of
 real x86 microarchitectures. […] PTLsim is used extensively at hundreds of
 major universities, industry research labs and the well known x86
 microprocessor vendors Intel and AMD.“
 Thus, I would suggest using the Cycle Accurate Simulator PTLsim to
 measure the complexity of the IETF audio codec contributions, because
 a)      it makes precise performance measurements using a modern, widely
 deployed CPU platform,
 b)      the measurements are repeatable and can be made by everybody,
 c)      existing source code can be used without significant changes, and
 d)      it is time-saving and easy to use.

 [Botzko]:
 Are you implying that the complexity is only a requirement for x86
 machines?

 [Hoene]: No, complexity is a requirement for all machines. However, x86
 machines are widely distributed and used in PCs, notebooks, and sometimes
 also in embedded systems. Also, the contributed source code is only
 available for PC-like systems, not for DSPs. Thus, any testing on DSPs
 will not be possible.
 However, you are free to suggest other reference platforms such as
 •       ARM: http://facsim.snu.ac.kr/
 •       C64x+ CPU Cycle Accurate Simulator:
 http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accurate_Simulator
 Just a matter of choice (and availability).

 [Botzko]: One significant disadvantage of profiling is that it gives an
 unfair advantage to code which is optimized for the x86 platform.

 This may not be a problem at the end of the development process; however,
 in the beginning it is best to stay focused on the core algorithm.

 Even at the end, more generic code may be the best reference, since the
 x86 optimizations don't have to be undone if you are re-optimizing for a
 non-x86 platform.

-- 
Ticket URL: <https://svn.tools.ietf.org/wg/codec/trac/ticket/20#comment:3>
codec <http://tools.ietf.org/codec/>