Re: [AVT] RE: <draft-ietf-avt-rtp-vmr-wb-03.txt>: sampling rate

My apologies for the delayed response; I accidentally deleted my
half-written response a few days ago.

Colin Perkins <csp@csperkins.org> writes:
>>> Yes. This is why RTP has the "rate" parameter, and uses the sampling
>>> rate as the RTP timestamp rate.
>>
>>         My apologies, but I don't see how that answers the question as to
>> what the benefit is.  How does the receiver make use of this information?
>> Why is this better for the receiver?
>
>If the codec can have different rates for source material and output
>material, it's clearly advantageous to signal that. More information is
>clearly better, as I hope is obvious.

        Sure - but that doesn't answer the question, which is how 
timestamp-rate == input-sample-rate benefits the receiver.

>For RTP audio, the convention is to use the sampling rate at the input to
>drive the timestamp. The benefit is that it gives a consistent RTP timing
>model, with well defined semantics independent of the codec.  This makes it
>simpler to design senders that operate with a range of codecs, and to
>design receivers that perform sample-accurate synchronisation (since the
>timestamp always increases by exactly 1 per audio sample in the input,
>meaning receivers can do synchronisation independent of the codec).

Ok, here there are some claims.  Let me list them:

1) "consistent RTP timing model, with well defined semantics independent of
   the codec"

   My apologies again, but that still seems to be hand-waving.  What
   semantics are you speaking of, and how does this do something useful?

2) "simpler to design senders that operate with a range of codecs"

   In all cases, the codec must tell the "sender" (RTP sender API) how
   many input samples a frame of RTP data corresponds to (directly or
   indirectly).  You could design things such that higher-level code (that
   feeds input data to the codec) "knows" the codec's frame size - but
   you really don't gain by doing so, and you remove a bunch of
   possibilities in codec design.  Also, doing so rather inappropriately
   mixes codec-specific knowledge into higher-level code - the higher level
   code would have to learn about silence suppression, etc, etc.

   Also the codec might want higher-sample-rate input in some cases to do
   its own filtering before encoding.  And there are variable-framesize
   codecs, adaptive bandwidth codecs that might adapt input sample rates
   depending on guesses about the input, etc, etc.

   Certainly keeping timestamp-rate == input-sample-rate isn't necessary
   if the RTP send API includes a (possibly optional) parameter for the
   number of RTP timestamp units the block corresponds to.  And for all
   sorts of reasons it makes sense to have that, at least as an option.
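
   As a minimal sketch of what I mean (every name here, RtpSender,
   send_frame, frame_ts_units, is hypothetical and not taken from any real
   RTP stack; the frame sizes and rates are made up for illustration, and
   the parameter is required here rather than optional to keep it short):
   the codec reports each frame's duration in timestamp units, and the
   sender just adds it up.

```python
from dataclasses import dataclass, field


def frame_ts_units(input_samples: int, input_rate: int, clock_rate: int) -> int:
    """Convert a frame's length in input samples to RTP timestamp units."""
    units, remainder = divmod(input_samples * clock_rate, input_rate)
    if remainder:
        raise ValueError("frame length is not a whole number of timestamp units")
    return units


@dataclass
class RtpSender:
    """Hypothetical sender API; not taken from any real RTP stack."""
    clock_rate: int                 # RTP timestamp rate in Hz
    timestamp: int = 0              # a real sender starts at a random value
    seq: int = 0
    sent: list = field(default_factory=list)

    def send_frame(self, payload: bytes, ts_units: int) -> None:
        """Record one packet, then advance by the codec-reported duration."""
        self.sent.append((self.seq, self.timestamp, payload))
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp = (self.timestamp + ts_units) & 0xFFFFFFFF


# The codec, not the higher-level code, knows how long each frame is.
sender = RtpSender(clock_rate=16000)
sender.send_frame(b"frame-1", frame_ts_units(640, 32000, 16000))  # 20 ms of 32 kHz input
sender.send_frame(b"frame-2", frame_ts_units(160, 8000, 16000))   # 20 ms of 8 kHz input
```

   Both frames advance the timestamp by the same amount even though the
   input rate changed between them; the higher-level code never needed to
   know the input rate at all.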

3) [simpler to] "design receivers that perform sample-accurate
   synchronisation"

   Sample-accurate synchronization:
   a) if sticking by the RTP/RTCP model to the letter it isn't guaranteed
      to be possible, and a receiver can't assume it can do it.  RTCP will
      give you "this RTP timestamp == this NTP timestamp" for each stream
      independently, generally at wide intervals.  There is no guarantee
      even for streams from a single source that the sample times are
      synchronized - in fact in multimedia synchronization typically they
      aren't (video is a fixed 90000Hz timestamp rate, and video frames
      are usually sampled from an external clock, or if from an internal
      clock not the same one used to drive the audio D->A).  Even two audio
      streams may (and in fact in many cases will) be based on different
      hardware clocks.

      So no receiver can (in a spec-compliant way) assume it can do sample-
      accurate synchronization of streams from the same source, even if
      they're both audio.

   b) Since the timestamp-rates we're talking about are typically fixed
      in the RTP spec for the codec, one can assume that they will be
      chosen to make sense - typically to correspond to a useful rate
      for the codec data or a multiple of it (or a lowest-common-multiple
      if it's a multirate codec).  So from that end, sample accuracy
      shouldn't come into it.

   c) Receivers don't get to process input data.  They get to process
      output data from the codec.  Synchronization is typically done on
      output samples from the codec.  So you _can_ make an argument that
      the timestamp rate should correspond to the codec output sample
      rate, or for multirate codecs the lowest-common-multiple (though
      I would expect that multirate codecs would output at their highest
      rate regardless of the intermediate rate, but there may be some
      reasons not to - in which case the codec will have to provide data
      on the output rate for each block of output).
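
   To illustrate points a) and c), here is a rough sketch of what a
   receiver does with the timestamp.  The function names are mine, and I'm
   assuming the NTP time from the RTCP sender report has already been
   converted to seconds.  Note that the only rate that appears is the
   declared RTP clock rate; the sender's input sample rate never enters
   into it.

```python
def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds, clock_rate):
    """Map an RTP timestamp to sender wallclock time via one RTCP SR pair."""
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:          # interpret as a signed 32-bit difference
        delta -= 0x100000000
    return sr_ntp_seconds + delta / clock_rate


def playout_offset(wallclock, playout_start, output_rate):
    """Position of a wallclock instant in the receiver's output sample stream."""
    return round((wallclock - playout_start) * output_rate)


# Illustrative values only: one audio and one video stream, each with its
# own (RTP timestamp, NTP time) pair from its own sender report.
audio_t = rtp_to_wallclock(17000, 1000, 100.0, 16000)   # 16 kHz audio clock
video_t = rtp_to_wallclock(45500, 500, 100.0, 90000)    # 90 kHz video clock

# The receiver then places the audio in its own output stream, at whatever
# output rate it happens to be running (48 kHz assumed here).
audio_pos = playout_offset(audio_t, 100.0, 48000)
```

   The same code handles the 90000Hz video clock and any audio clock rate,
   which is rather the point: the receiver needs a known rate, not a rate
   equal to someone's input sampling frequency.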

>>         I can see some advantage in having the RTP rate parameter be the
>> normal _output_ sample rate from the codec (which may be uncorrelated to
>> the input rate!) in that you could use that to implement an audio codec
>> with non-fixed frame sizes.
>
>I agree that might be useful. However, it doesn't fit with the semantics
>assigned to the RTP timestamp.

        I'd say it doesn't fit the original "fixed hardware
A->D/D->A"/G711/etc mindset/convention.  But I don't see how sticking to
that mindset/convention gets us anything, and as people have shown, it
definitely costs us something.

>>>> B. Is it necessary to use the sampling frequency as RTP timestamp rate.
>>>
>>> It's highly desirable.
>>
>>         Again, asserting this without a reason doesn't convince me.
>
>To make a consistent definition of the RTP timestamp.

        a) consistency is only useful if it provides you with an advantage.
        b) how is this consistent with other uses of RTP timestamps, such
           as video?  They're even further afield.

>>>> Colin, if one looks at issue B. Is it really needed to use the RTP
>>>> timestamp frequency equal to the sampling rate used? I would say NO to
>>>> that question.
>>>
>>> Yes, it is necessary to use an RTP timestamp equal to the sampling rate.
>>
>>         Again, no reasons given.
>
>Again, to avoid breaking standard RTP timing model.

        Again, what does this "break" other than a convention that so far
as I can see exists merely by happenstance and with no particular reason
behind it?

>>>> My reasoning is the following.
>>>>
>>>> - Many audio input is sampled from a source at a higher rate then the
>>>> encoder may handle. Thus a resampling and pre-processing stage is
>>>> employed based on the encoders input frequency rather then producing
>>>> that
>>>> rate initially from the hardware. Some of the reason is that the
>>>> pre-processing may actually yield better results than what the hardware
>>>> at given input rate can gain. Another reason may be that one like to
>>>> avoid switching the hardware between rate if changing the encoding.
>>>>
>>>> - The frame based decoders does not need to know the encoders input
>>>> rate. The encoder may anyway resample this into other rates for internal
>>>> processing and band limited signals. I would claim that VMR-WB, AMR-WB+
>>>> and AAC are all example of codecs that perform this kind of tricks. On
>>>> the receiver side they produce a output signal that has any sampling
>>>> frequency the receiver finds most useful. Either causing clipping of the
>>>> higher frequencies, but more commonly to a higher clock rate, despite
>>>> that no more information is provided simply for ease of use.
>>>>
>>>> - The frame based codecs do only need a RTP timestamp that allows the
>>>> receiver to correctly reconstruct the time line when the encoding is
>>>> done
>>>> with the most audio bandwidth. In the VMR-WB case this is 16kHz. AMR-WB+
>>>> is even more strange, as we have selected an RTP timestamp rate that
>>>> results in that all internal sampling frequencies will result in integer
>>>> timestamp ticks. Thus actually allowing one to correctly calculate frame
>>>> alignment when the internal sampling frequency changes. That the
>>>> frequency also is possible to recalculate into several common sampling
>>>> frequencies with few partial sample alignments was also considered.
>>>>
>>>> Thus I would use this to argue that indicating the actual sampling
>>>> frequency is not necessarily as long as the receiver is capable of
>>>> correctly reconstruct the media stream with its timing information in
>>>> full resolution.
>>>
>>> True, but it greatly simplifies the system if all codecs use the
>>> sampling rate as the RTP clock rate. You can make things work if each
>>> codec uses a different rate, but it's desirable that RTP is consistent
>>> where possible. Why is this codec so special that it needs to break this
>>> rule?
>>
>>         Again, I simply don't see how things are simplified by having the
>> RTP timestamp rate be the sender-side sample rate.  If the timestamp rate
>> is 2x or 4x the input rate, or if it's the lowest common multiple (if you
>> know what I mean) of a range of sample rates the encoder might use, how
>> does that hurt the receiver?
>
>See above.

        The above didn't answer it.  The closest thing to an answer was
the points 1-3 above that I pulled out of your response, and so far
as I can see they don't answer the question I pose.
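
        For what it's worth, the lowest-common-multiple case I asked about
is easy to sketch.  The rates below are purely illustrative and not taken
from any codec spec, and the function name is mine:

```python
from functools import reduce
from math import gcd


def lowest_common_multiple_rate(rates):
    """Smallest timestamp rate that is a whole multiple of every given rate."""
    return reduce(lambda a, b: a * b // gcd(a, b), rates)


# Illustrative sample rates an encoder might switch between.
rates = [8000, 12000, 16000]
clock_rate = lowest_common_multiple_rate(rates)

# Timestamp ticks per sample at each rate; all are whole numbers, so frame
# boundaries stay exact whichever rate is in use.
ticks_per_sample = {rate: clock_rate // rate for rate in rates}
```

With a timestamp rate chosen this way, a sample at any of the rates lands
on a whole number of ticks, and the receiver simply divides by the one
declared clock rate.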

>> In fact, if it _is_ possible for the encoder to use multiple rates (like
>> VMR-WB), isn't it a lot easier on both sides if we can just us 16KHz for
>> RTP and avoid having the play games if we want to switch input rates
>> dynamically?  Setting the timestamp at the sender end is trivial.
>>
>>         Also, sticking with a strict timestamp rate == sample rate
>> dictate means that various classes of codecs won't fit well, will be
>> harder to write and/or use with RTP, or won't be developed because the
>> transports for them would be painful.
>
>I agree. This is a limitation of the way RTP has evolved, using the input
>sampling rate for the audio timestamp. Unfortunately changing the meaning
>of the RTP timestamp will break things, so we're stuck with the current
>model.

        So, breaking that down, you agree that sticking with the convention
limits the application of RTP, but you're worried about "break[ing]
things".  So if you can be convinced that the convention is not a
requirement and that defining new codecs that don't follow that convention
won't break anything, then I take it you'd be in favor?

>> Or, perhaps worse for the participant of this group, such codecs will
>> push the use of alternative mechanisms such as IAX2 or other possible
>> alternatives.  I could simplify my life CONSIDERABLY by ignoring RTP and
>> using a single multiplexed stream, with lower overhead due to piggybacked
>> nacks and acks, and much less hassle blowing multiple holes through
>> firewall/NATS.  But for the sake of standards and compatibility I've gone
>> through the pain of getting dealing with all the issues surrounding RTP,
>> feedback, RTP-through-NATs, etc, and have probably hewn (much) closer to
>> the specs than most, I suspect.
>>
>> Don't put up roadblocks where there isn't a reason to.  In this
>> case, there is a good argument _for this codec_ to use a fixed 16K
>> timestamp rate.
>
>However, doing so would fragment the consistent RTP timing model. I agree
>that using a 16kHz clock for this codec solves a short term implementation
>issue; however it makes RTP implementations that support multiple codec
>more complex in the long term, and fragments the standard.

        To be horribly repetitious: please show how this will make
multiple-codec implementations noticeably more complex, or how it will
"fragment" the standard and what the practical fallout would be.

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team
rjesup@wgate.com

_______________________________________________
Audio/Video Transport Working Group
avt@ietf.org
https://www1.ietf.org/mailman/listinfo/avt