Re: [AVT] RE: <draft-ietf-avt-rtp-vmr-wb-03.txt>: sampling rate

Colin Perkins <csp@csperkins.org> writes:
>On 13 Sep 2004, at 16:41, Magnus Westerlund wrote:
>> I think we have two issues:
>>
>> A. Is there any benefit in indicating or requesting the sampling
>> frequency used at the sender?
>
>Yes. This is why RTP has the "rate" parameter, and uses the sampling rate
>as the RTP timestamp rate.

        My apologies, but I don't see how that answers the question as to
what the benefit is.  How does the receiver make use of this information?
Why is this better for the receiver?

        I can see some advantage in having the RTP rate parameter be the
normal _output_ sample rate from the codec (which may be uncorrelated to
the input rate!) in that you could use that to implement an audio codec
with non-fixed frame sizes.

>> B. Is it necessary to use the sampling frequency as the RTP timestamp rate?
>
>It's highly desirable.

        Again, asserting this without a reason doesn't convince me.

>> Colin, if one looks at issue B. Is it really needed to use the RTP
>> timestamp frequency equal to the sampling rate used? I would say NO to
>> that question.
>
>Yes, it is necessary to use an RTP timestamp rate equal to the sampling rate.

        Again, no reasons given.

>> My reasoning is the following.
>>
>> - Much audio input is sampled from a source at a higher rate than the
>> encoder can handle. Thus a resampling and pre-processing stage is
>> employed to match the encoder's input frequency, rather than producing
>> that rate directly in the hardware. One reason is that the
>> pre-processing may actually yield better results than the hardware can
>> achieve at the given input rate. Another reason may be that one would
>> like to avoid switching the hardware between rates when changing the
>> encoding.
>>
>> - The frame-based decoders do not need to know the encoder's input
>> rate. The encoder may in any case resample this into other rates for
>> internal processing and band-limited signals. I would claim that VMR-WB,
>> AMR-WB+ and AAC are all examples of codecs that perform this kind of
>> trick. On the receiver side they produce an output signal at whatever
>> sampling frequency the receiver finds most useful. This either clips the
>> higher frequencies or, more commonly, uses a higher clock rate simply
>> for ease of use, even though no more information is provided.
>>
>> - The frame-based codecs only need an RTP timestamp that allows the
>> receiver to correctly reconstruct the timeline when the encoding is done
>> with the most audio bandwidth. In the VMR-WB case this is 16 kHz. AMR-WB+
>> is even stranger, as we have selected an RTP timestamp rate at which all
>> internal sampling frequencies yield integer timestamp ticks, thus
>> allowing one to correctly calculate frame alignment when the internal
>> sampling frequency changes. We also considered that this rate can be
>> converted into several common sampling frequencies with few
>> partial-sample alignments.
>>
>> Thus I would use this to argue that indicating the actual sampling
>> frequency is not necessary as long as the receiver is capable of
>> correctly reconstructing the media stream with its timing information in
>> full resolution.
>
>True, but it greatly simplifies the system if all codecs use the sampling
>rate as the RTP clock rate. You can make things work if each codec uses a
>different rate, but it's desirable that RTP is consistent where
>possible. Why is this codec so special that it needs to break this rule?

        Again, I simply don't see how things are simplified by having the
RTP timestamp rate be the sender-side sample rate.  If the timestamp rate
is 2x or 4x the input rate, or if it's the least common multiple (if you
know what I mean) of a range of sample rates the encoder might use, how
does that hurt the receiver?  In fact, if it _is_ possible for the encoder
to use multiple rates (like VMR-WB), isn't it a lot easier on both sides
if we can just use 16 kHz for RTP and avoid having to play games if we want
to switch input rates dynamically?  Setting the timestamp at the sender end
is trivial.
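
        As an illustration only, the common-multiple idea can be sketched
as follows.  The function names and the example rate sets are hypothetical
and are not taken from any codec specification; the point is just that a
clock rate which is a multiple of every rate the encoder might use gives
each sample an integer number of ticks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor (Euclid). */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple; divide first to limit overflow risk. */
static uint64_t lcm_u64(uint64_t a, uint64_t b)
{
    assert(a != 0 && b != 0);
    return a / gcd_u64(a, b) * b;
}

/*
 * Hypothetical helper: smallest clock rate (Hz) that is an integer
 * multiple of every sample rate in rates[0..n-1].
 */
static uint64_t common_clock_rate(const uint32_t *rates, size_t n)
{
    uint64_t acc = 1;
    for (size_t i = 0; i < n; i++)
        acc = lcm_u64(acc, rates[i]);
    return acc;
}
```

For a sender that only switches between 8 kHz and 16 kHz input, this yields
16000, which matches the single fixed rate argued for here.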

        Also, sticking with a strict "timestamp rate == sample rate" dictate
means that various classes of codecs won't fit well, will be harder to
write and/or use with RTP, or won't be developed because the transports
for them would be painful.  Or, perhaps worse for the participants of this
group, such codecs will push the use of alternative mechanisms such as
IAX2 or other possible alternatives.  I could simplify my life CONSIDERABLY
by ignoring RTP and using a single multiplexed stream, with lower overhead
due to piggybacked nacks and acks, and much less hassle blowing multiple
holes through firewalls/NATs.  But for the sake of standards and
compatibility I've gone through the pain of dealing with all the
issues surrounding RTP, feedback, RTP-through-NATs, etc, and I suspect I
have hewn (much) closer to the specs than most.

        Don't put up roadblocks where there isn't a reason to.  In this
case, there is a good argument _for this codec_ to use a fixed 16 kHz
timestamp rate.

>> In the VMR-WB case I would think that having only one timestamp rate of
>> 16 kHz does not affect codec operation and would simplify the handling
>> when one has some senders that use 8 kHz, especially when gateways need
>> to encode sometimes 8 kHz material from pre-recorded responses and in
>> other cases WB channel data. This avoids the need to perform RTP
>> timestamp rate switches.
>
>But in the process you make senders that support multiple codecs more
>complex, since they can't use the sampling rate to drive the media clock
>for all codecs.

The sender "can't use the sampling rate to drive the media clock for all
codecs"?

        Worst case (since this may fall out of all the tons of other work
the sender is doing) is the sender does:

        rtp_packet->timestamp = first_sample_time * conversion_for_codec;

(and no it doesn't have to use floating point even if it's fractional, of
course).
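
        A minimal integer-only sketch of that conversion, assuming the
per-codec factor is expressed as the ratio rtp_rate / sample_rate.  The
helper name and its parameters are hypothetical, and the random initial
timestamp offset a real RTP stack would add is left out:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: map the index of a packet's first sample, counted
 * at sample_rate Hz, to RTP timestamp ticks at rtp_rate Hz.
 *
 * Multiplying before dividing in 64 bits keeps fractional ratios exact up
 * to the final truncation toward zero.  The cast to uint32_t wraps modulo
 * 2^32, as RTP timestamps do.  The product first_sample * rtp_rate is
 * assumed to fit in 64 bits.
 */
static uint32_t samples_to_rtp_ticks(uint64_t first_sample,
                                     uint32_t sample_rate,
                                     uint32_t rtp_rate)
{
    assert(sample_rate != 0);
    return (uint32_t)((first_sample * rtp_rate) / sample_rate);
}
```

With an 8 kHz input and a 16 kHz RTP clock, sample 160 maps to tick 320;
with a 16 kHz input the mapping is the identity.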

        More typically, I suspect for frame-oriented codecs the code will
initialize the RTP stack with the 'normal' timestamp increment, or
calculate that value once and store it in a variable to pass with each
packet to the RTP code.  Not to mention Magnus's comments about senders
who actually sample at much higher rates and downsample for encoding.
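
        The "calculate once and store" approach might look like the
sketch below.  The struct, the function names and the 20 ms frame length
used in the example are assumptions made for illustration, not details
taken from the draft:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-stream clock state held by the sender. */
struct rtp_clock {
    uint32_t timestamp;       /* timestamp for the next packet */
    uint32_t ticks_per_frame; /* fixed increment, computed once */
};

/*
 * Compute the per-frame increment once at setup.  frame_ms is the codec
 * frame duration in milliseconds; rtp_rate * frame_ms is assumed to be
 * divisible by 1000 so the increment is an exact integer.
 */
static void rtp_clock_init(struct rtp_clock *c, uint32_t rtp_rate,
                           uint32_t frame_ms, uint32_t initial_timestamp)
{
    assert(((uint64_t)rtp_rate * frame_ms) % 1000 == 0);
    c->ticks_per_frame = (uint32_t)(((uint64_t)rtp_rate * frame_ms) / 1000);
    c->timestamp = initial_timestamp;
}

/* Return the timestamp for this packet and advance by one frame. */
static uint32_t rtp_clock_next(struct rtp_clock *c)
{
    uint32_t ts = c->timestamp;
    c->timestamp += c->ticks_per_frame; /* unsigned add wraps modulo 2^32 */
    return ts;
}
```

With a 16 kHz RTP clock and 20 ms frames the increment is 320 ticks per
packet, whatever rate the audio was actually captured at.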

        My apologies for disagreeing, but I don't see any serious attempt
to answer Magnus's, the other commenters' or my arguments and examples
on this issue; I merely see the assertions "that's the way it's done" and
"it's simpler", without significant explanation (assertions we've disagreed
with and given arguments against).

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team
rjesup@wgate.com

_______________________________________________
Audio/Video Transport Working Group
avt@ietf.org
https://www1.ietf.org/mailman/listinfo/avt