Re: [codec] Measuring the algorithmic efficiency of the IWAC

"Christian Hoene" <hoene@uni-tuebingen.de> Mon, 31 May 2010 11:50 UTC

Return-Path: <hoene@uni-tuebingen.de>
X-Original-To: codec@core3.amsl.com
Delivered-To: codec@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id B046E3A69D7 for <codec@core3.amsl.com>; Mon, 31 May 2010 04:50:30 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.374
X-Spam-Level:
X-Spam-Status: No, score=-1.374 tagged_above=-999 required=5 tests=[AWL=-1.825, BAYES_99=3.5, HELO_EQ_DE=0.35, HTML_MESSAGE=0.001, J_CHICKENPOX_14=0.6, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id LikIA95q1nIu for <codec@core3.amsl.com>; Mon, 31 May 2010 04:50:19 -0700 (PDT)
Received: from mx06.uni-tuebingen.de (mx06.uni-tuebingen.de [134.2.3.3]) by core3.amsl.com (Postfix) with ESMTP id E27A43A6A21 for <codec@ietf.org>; Mon, 31 May 2010 04:50:17 -0700 (PDT)
Received: from hoeneT60 (u-173-c044.cs.uni-tuebingen.de [134.2.173.44]) (authenticated bits=0) by mx06.uni-tuebingen.de (8.13.6/8.13.6) with ESMTP id o4VBnm1V026113 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Mon, 31 May 2010 13:49:49 +0200
From: Christian Hoene <hoene@uni-tuebingen.de>
To: 'stephen botzko' <stephen.botzko@gmail.com>
References: <000b01cb00a3$8c9ab710$a5d02530$@de> <AANLkTimk-0hXkfyFqyr8_C7KdiP2Q4QBc9BdDy3Qz0e1@mail.gmail.com>
In-Reply-To: <AANLkTimk-0hXkfyFqyr8_C7KdiP2Q4QBc9BdDy3Qz0e1@mail.gmail.com>
Date: Mon, 31 May 2010 13:49:46 +0200
Message-ID: <000f01cb00b7$5f7ddb90$1e7992b0$@de>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0010_01CB00C8.2306AB90"
X-Mailer: Microsoft Office Outlook 12.0
Thread-Index: AcsAr6SKeLufu6bqS2+nVo/3LEUOzgABaGFA
Content-Language: de
X-AntiVirus: NOT checked by Avira MailGate (version: 3.0.0-4; host: mx06)
Cc: codec@ietf.org
Subject: Re: [codec] Measuring the algorithmic efficiency of the IWAC
X-BeenThere: codec@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: Codec WG <codec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/codec>
List-Post: <mailto:codec@ietf.org>
List-Help: <mailto:codec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/codec>, <mailto:codec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 31 May 2010 11:50:30 -0000

Hi Stephen,

 

comments inline

 

Are you implying that the complexity is only a requirement for x86 machines?  



Hoene: No, complexity is a requirement for all machines. However, x86 machines are widely deployed in PCs and notebooks, and
sometimes also in embedded systems. Also, the contributed source code is only available for PC-like systems, not for DSPs. Thus,
testing on a DSP will not be possible.

However, you are free to suggest other reference platforms such as

*         ARM: http://facsim.snu.ac.kr/

*         C64x+ CPU Cycle Accurate Simulator: http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accurate_Simulator

Just a matter of choice (and availability).

>>>
different codec proposals can be compared if they show similar performance in other aspects.
>>>
There's been no discussion on this point.  In my view, at some point in the process we set a complexity bound which all codecs
proposals will have to meet.  

Hoene: Agreed. However, before agreeing on a bound, we have to agree on a measure and a metric. My last email was a suggestion on how
to measure precisely and time-efficiently while covering the most typical usage scenarios.

Christian

 

That is quite different from the above statement.



 


Regards
Stephen Botzko




On Mon, May 31, 2010 at 5:27 AM, Christian Hoene <hoene@uni-tuebingen.de> wrote:

Hi,

 

in the last weeks we have had a couple of discussions on the complexity or algorithmic efficiency
<http://en.wikipedia.org/wiki/Algorithmic_efficiency> of the codec. Knowing the algorithmic efficiency is important
because:

1.      the complexity has an impact on power consumption and system costs, 

2.      the hardware can be selected to fit pre-known complexity requirement, and

3.      different codec proposals can be compared if they show similar performance in other aspects.

Before any complexity comparisons can be made, we need an objective, precise, reliable, and repeatable metric on how to measure the
algorithmic efficiency. Let me describe three different approaches:

 

Classic Approach used in Codec Standardization

 

For the last 17 years, ITU-T Study Group 16 has measured the complexity of codecs using a library called ITU-T Basic Operators
(described <http://www.itu.int/rec/T-REC-G.191-200508-I/en> in ITU-T G.191), which counts the kind and number of operations and the
amount of memory used. The latest version of the standard (03/10), not yet publicly available, also supports - besides fixed-point
operations of different widths - floating-point operations. Each operation can be counted automatically and weighted accordingly.
The following source code is an excerpt from basop32.h

 

/*___________________________________________________________________________
 |                                                                           |
 |   Prototypes for basic arithmetic operators                               |
 |___________________________________________________________________________|
*/

Word16 add (Word16 var1, Word16 var2);    /* Short add,           1   */
Word16 sub (Word16 var1, Word16 var2);    /* Short sub,           1   */
Word16 abs_s (Word16 var1);               /* Short abs,           1   */
Word16 shl (Word16 var1, Word16 var2);    /* Short shift left,    1   */
Word16 shr (Word16 var1, Word16 var2);    /* Short shift right,   1   */
…
Word16 div_s (Word16 var1, Word16 var2); /* Short division,       18  */
Word16 norm_l (Word32 L_var1);           /* Long norm,             1  */

 

In the upcoming ITU-T G.GSAD standard, another approach has been used, as shown in the following code example. For each operation,
WMOPS counting functions have been added, which count the number of operations. If the algorithmic efficiency of an algorithm has to
be measured, the program is run and the operations are counted.

 

    for (i=0; i<NUM_BAND; i++)
    {
#ifdef WMOPS_FX
        move32();move32();
        move32();move32();
#endif
        state_fx->band_enrg_long_fx[i] = 30;
        state_fx->band_enrg_fx[i] = 30;
        state_fx->band_enrg_bgd_fx[i] = 30;
        state_fx->min_band_enrg_fx[i] = 30;
    }

 

These two similar approaches have worked well over the years and are an established procedure for counting operations and
complexity. Still, they have some drawbacks:

a)      Existing algorithms must be modified manually to add the counting of operations. This is time consuming.

b)      The CPU model is simplistic, as it does not consider memory access (e.g. caches), parallel execution, or other kinds of
optimization done in modern microprocessors and compilers. Thus, the number of instructions might not correlate well with the actual
execution time on modern CPUs.

Profiling: The Classic Approach used in Software Engineering

 

Citing Wikipedia <http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29>: „In software engineering
<http://en.wikipedia.org/wiki/Software_engineering>, program profiling, software profiling or simply profiling, a form of dynamic
program analysis <http://en.wikipedia.org/wiki/Dynamic_program_analysis> (as opposed to static code analysis
<http://en.wikipedia.org/wiki/Static_code_analysis>), is the investigation of a program's behavior using information gathered as
the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize
<http://en.wikipedia.org/wiki/Optimization_%28computer_science%29> - to increase its overall speed, decrease its memory
requirement, or sometimes both.

*         A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function
calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable
of gathering extensive performance data.

*         An instruction set simulator <http://en.wikipedia.org/wiki/Instruction_set_simulator>, which is also — by necessity — a
profiler, can measure the totality of a program's behaviour from invocation to termination.“

Thus, a typical profiler such as gprof can be used to measure and understand the complexity of a codec implementation. It is precise
because it measures actual execution on modern computers. However, if used for standardization, it has the following drawback:

a)      The execution times depend on the CPU architecture, the PC in general, the OS, and programs running in parallel. As such, it
cannot produce reliable and repeatable results; measurements are not comparable if done under slightly changed conditions.

Suggestion: Use a Cycle Accurate Simulator

Wikipedia: „A Cycle Accurate Simulator (CAS) is a computer program that simulates a microarchitecture
<http://en.wikipedia.org/wiki/Microarchitecture> cycle-accurately. In contrast, an instruction set simulator
<http://en.wikipedia.org/wiki/Instruction_set_simulator> simulates an Instruction Set Architecture
<http://en.wikipedia.org/wiki/Instruction_Set_Architecture>, usually faster but not cycle-accurate with respect to a specific
implementation of this architecture.“

Then, the execution times are precise and repeatable. If two parties make measurements using different CPUs, they still get the same
results. Of course, a cycle accurate simulator is slower than the real CPU, by a factor of about 100.

 

Actually, there is an open-source cycle accurate simulator called PTLsim <http://www.ptlsim.org/> available that simulates a Pentium
4. „PTLsim is a cycle accurate x86 microprocessor simulator and virtual machine for the x86 and x86-64 instruction sets. PTLsim
models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging from full-speed
native execution on the host CPU all the way down to RTL level models of all key pipeline structures. In addition, all microcode,
the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy. PTLsim supports
the full x86-64 instruction set of the Pentium 4+, Athlon 64 and similar machines with all extensions (x86-64, SSE/SSE2/SSE3, MMX,
x87). It is currently the only tool available to the public to support true cycle accurate modeling of real x86 microarchitectures.
[…] PTLsim is used extensively at hundreds of major universities, industry research labs and the well known x86 microprocessor
vendors Intel and AMD.“

Thus, I would suggest using the Cycle Accurate Simulator PTLsim to measure the complexity of the IETF audio codec contributions
because

a)        it makes precise performance measurements using a modern, widely deployed CPU platform,

b)        the measurements are repeatable and can be made by everybody,

c)        existing source code can be used without significant changes, and 

d)        it is time-saving and easy to use.

With best regards,

 Christian Hoene

 

---------------------------------------------------------------

Dr.-Ing. Christian Hoene

Interactive Communication Systems (ICS), University of Tübingen 

Sand 13, 72076 Tübingen, Germany, Phone +49 7071 2970532 

http://www.net.uni-tuebingen.de/

 


_______________________________________________
codec mailing list
codec@ietf.org
https://www.ietf.org/mailman/listinfo/codec