Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt

Randell Jesup <rjesup@wgate.com> Mon, 09 March 2009 19:44 UTC

To: Jean-Marc Valin <jean-marc.valin@octasic.com>
References: <C5664E27013B564EBFA8884606D2439106B33589@antihadron.jnpr.net> <ybu1vt8hay0.fsf@jesup.eng.wgate.com> <49B52906.5000300@octasic.com>
From: Randell Jesup <rjesup@wgate.com>
Date: Mon, 09 Mar 2009 15:45:05 -0400
In-Reply-To: <49B52906.5000300@octasic.com> (Jean-Marc Valin's message of "Mon, 09 Mar 2009 10:34:46 -0400")
Message-ID: <ybusklmfpku.fsf@jesup.eng.wgate.com>
User-Agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc: avt@ietf.org
Subject: Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt
Precedence: list
Reply-To: Randell Jesup <rjesup@wgate.com>

Jean-Marc Valin <jean-marc.valin@octasic.com> writes:
>Hi Randall,
>
>Thanks very much for taking time to carefully read our draft and comment
>it. Some more comments/answers below.

No problem.  Easier to fix problems up front.

>>> 5.  SDP usage of CELT
>>>
>>>    When conveying information by SDP [rfc2327], the encoding name MUST
>>>    be set to "CELT".  The sampling frequency is typically between 32000
>>>    and 48000 Hz.  Implementations SHOULD support both 44100 Hz and 48000
>>>    Hz.  The maximum bandwidth permitted for the CELT audio is encoded
>>>    using the "b=AS:" header, as explained in SDP [rfc2327].
>>>
>>
>> Above it said 48000 is a MUST, so it should say that here.
>Noted. We missed that when changing some SHOULDs for MUSTs. We hesitated a
>lot in deciding what's a SHOULD and what's a MUST. Basically, CELT's
>advantage is that it can operate with almost any sampling rate, frame size
>or bit-rate. The disadvantage is that unless we specify "baseline
>requirements", we might end up with several implementations that are unable
>to inter-operate. Also, feel free to suggest a better baseline if you think
>we didn't select the right one.

Is this intended to be used as a "speech" codec at all, i.e. as an
alternative to iLBC/G.729/G.722.x/etc.?  If so, then support for 8 kHz
and/or 16 kHz may be important to mandate for interoperability reasons.
You DON'T want to be mixing sample rates if multiple codecs are accepted.
(In theory it can be done, but in practice it would be risky, especially in
the face of packet loss.)  For example, consider an offer like this (I
probably have the G722 media type wrong):

m=audio 4321 RTP/AVP 0 97 98
a=rtpmap:0 PCMU/8000
a=rtpmap:97 G722/16000
a=rtpmap:98 CELT/48000

To quote from an earlier AVT email I wrote on this subject on 3 Dec 2007:

    Subject: Re: [AVT] I-D ACTION:draft-ietf-avt-rtcpxr-audio-01.txt
    [big SNIP]
    This means the timestamp rate can change at any point, on a
    packet-by-packet basis.  It's even theoretically allowable to alternate
    G711 and G722 packets.  Totally odd and non-useful, but it illustrates
    the point.  More realistic is a change from one to the other half-way
    through an RTCP monitoring period.

There were earlier discussions about this; no work has been done on finding
ways to deal with the issues raised by that sort of SDP, especially in the
face of packet loss.  Such a shift in timestamp rate can mess up
NTP<->timestamp conversion, and mess up things like RFC 2833 DTMF packets
that might overlap a change (or be near one).
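
To illustrate the conversion problem, here is a minimal sketch in Python.
The payload-type table is copied from the example SDP above as written;
the function names and the two-packet stream are made up for illustration.
It only shows how far off elapsed media time gets if ticks from one
payload type are read with another payload type's clock rate.

```python
# Payload type -> RTP clock rate, copied from the example SDP above.
CLOCK_RATES = {0: 8000, 97: 16000, 98: 48000}


def media_seconds(packets):
    """Media time for a list of (payload_type, duration_in_ticks) packets,
    converting each packet with its own payload type's clock rate."""
    return sum(ticks / CLOCK_RATES[pt] for pt, ticks in packets)


def media_seconds_single_rate(packets, assumed_pt):
    """Same packets, but all ticks converted with one assumed clock rate,
    as a receiver might do if it missed the payload type change."""
    total_ticks = sum(ticks for _, ticks in packets)
    return total_ticks / CLOCK_RATES[assumed_pt]


# Hypothetical stream: 20 ms of PCMU followed by 20 ms of CELT at 48 kHz.
stream = [(0, 160), (98, 960)]
print(media_seconds(stream))                 # about 40 ms of audio
print(media_seconds_single_rate(stream, 0))  # about 140 ms if read as 8 kHz
```

With loss around the switch, the receiver may not even know which rate
applied to the missing packets, which is the harder part of the problem.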

>> You should not be using b=AS: here for codec bandwidth settings.  b=AS
>> applies to ALL payloads negotiated for a session; the bandwidth for
>> the CELT payload may be different (and in fact you could offer several
>> payload values, each with a different bandwidth specified in the a=fmtp
>> line for each payload).  So, bandwidth moves to fmtp.
>>
>I was under the impression that b=CT was applied to all payloads while
>b=AS: was just for one media:
>
>"AS gives a bandwidth figure for a single media at a single site, although
>there may be many sites sending simultaneously." 
>
>Isn't that what we want?

The wording there is very lawyer-ese, and it's really addressing how b=AS is
affected by a multicast or conference setting.

If the b=AS is at the m= level of the SDP (not above all the m='s), then it
only applies to that one media stream.  However, exactly what b=AS *means*
is very fuzzy.  b=AS is not a codec parameter; it's a stream parameter, and
it's also a reception parameter, not an "I plan to send" parameter.
There's been a lot of discussion about b=TIAS (RFC 3890) as a better way to
specify bandwidth.  Note that b=AS INCLUDES RTP/UDP/IP overhead, and thus
implicitly is dependent on packet rate.
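
To put rough numbers on the overhead point, here is a sketch in Python.
It assumes IPv4 with no options, plain UDP, and a 12-byte RTP header with
no CSRCs or extensions; as_kbps is a made-up helper name, and the 64 kbit/s
codec rate is just an example.

```python
import math

# Assumed per-packet overhead: IPv4 (20) + UDP (8) + RTP (12) header bytes.
HEADER_BYTES = 20 + 8 + 12


def as_kbps(codec_bps, ptime_ms):
    """b=AS value (kilobits/s, rounded up) for a given codec bitrate and
    packetization time, including IP/UDP/RTP headers."""
    packets_per_second = 1000 / ptime_ms
    overhead_bps = HEADER_BYTES * 8 * packets_per_second
    return math.ceil((codec_bps + overhead_bps) / 1000)


print(as_kbps(64000, 20))  # 80
print(as_kbps(64000, 10))  # 96
```

The same codec bitrate gives a different b=AS figure as soon as the
packetization time changes.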

More to the point, b=AS is just one way to specify bandwidth.  

Another huge blocker for using b=AS (or b=anything) in this way: what if
another codec also offered in this stream wanted to re-use b=AS as well?
And what if the preferred bitrates (or max-bitrate) for each codec were
different?

How much do you *need* to specify bandwidth here?  Realize that most
devices don't have a good idea what receive bandwidth is even theoretically
available, let alone practically.  Most configuration is done at the sender
end, or by explicit choice of codec and bitrate.

I suggest reviewing how other multiple-bitrate codecs (G.722.x, AMR-WB,
etc.) handle this.

Also, isn't CELT true variable-bitrate?  If so, the bitrate to use
(initially?) might be very different than the "maximum" bitrate.

>>>       ptime: The desired packetization time.  The sender SHOULD choose a
>>>       number of frames per packet that corresponds to the smallest
>>>       packetization time greater or equal to the specified ptime for the
>>>       selected frame size.  The default is 20 ms as specified in
>>>       [rfc3551]
>>
>> Ok, though realize again that ptime is shared by all payloads.  You may
>> want to consider negotiating it directly much as iLBC does via fmtp.
>>
>Well, I didn't see that to be a problem considering that one would probably
>want the same packetization time for any codec. Do you see a case where you
>wouldn't want that? I'm not quite sure how people use the ptime in
>practice and how much it is followed.

Sure.  You might prefer 10ms, which G.711 and some others support, but iLBC
only supports 20 and 30 ms frames, and thus only multiples of those for the
actual packetization time. You can still specify a ptime of 10 when using
iLBC. 

ptime is merely an "I would prefer to receive" parameter.  Do not rely on it
for anything.  The actual packetization time does NOT have to be the same
in each direction, or even the same from one RTP packet to the next.
(packet 1 could have 1 frame and packet 2 could have 10).  

iLBC is really negotiating the framesize, not the packetization time.

>>>       maxptime: The maximum packetization time desired.  If the maximum
>>>       is lower than the smallest packetization time determined from the
>>>       chosen frame size (as described above), then that packtization
>>>       time SHOULD be used despite the maxptime value.  The default is
>>>       "no maximum".
>>
>> Ok, but this really isn't part of the codec.  Doesn't hurt to tell people.
>> (I'm assuming that the SDP spec for maxptime allows ignoring it (somewhat)
>> - if it doesn't, then you have to reject that payload.)
>>
>As far as I understand, maxptime is a SHOULD in the rfc, so I thought we'd
>just mention its interpretation wrt CELT.

Ok, then say something like "per [RFC 4566], if the maximum is lower..."

>>>    CELT-specific parameters can be given via the "a=fmtp:" directive.
>>>    Several parameters can be given in a single a=fmtp line provided that
>>>    they are separated by a semi-colon.  The following parameters are
>>>    defined for use in this way:
>>>
>>>       frame-size: The frame size is the duration of each frame in
>>>       samples.  If more than one frame size is supported, a comma-
>>>       separated list can be used.  It is possible to use "any" to denote
>>>       that all even frame sizes are supported.  The default is 480.
>>>
>>
>> You should give an example.  Also make sure there's a BNF. (I haven't read
>> ahead yet.)
>>
>OK, so maybe we should add a BNF. As for the examples, I think they're
>adequate, but let me know if that's not the case.

BNF may not be *needed*, but if there's anything at all complex it's handy
for avoiding mistakes.

>>>       mapping: Optional string describing the multi-channel mapping.
>>>
>>>    Because the frame-size is not transmitted in-band, an SDP answer MUST
>>>    contain only one frame-size, even if multiple frame sizes were
>>>    offered.
>>>
>>
>> Why?  I thought I knew, but then I thought about it.  Framesize is an fmtp
>> entry that defines the framesizes you expect to receive.
>>
>> You also should define how the set of allowable framesizes should be
>> determined, and if the answered frame size MUST be one of the offered ones
>> or not.  If the answer does lock-down the framesize in both directions,
>> then you need to remind the offerer that they must be ready to receive any
>> of the framesizes they offered, without any answer, since the answer can
>> be delayed or lost, and media has already started flowing.
>>
>The fundamental issue here is that one needs to know the frame size to be
>able to initialise the decoder. Just like a codec like iLBC had two modes
>(for 20 ms and 30 ms frame), CELT has a *very large* number of modes: one
>for each combination of frame size and sampling rate. So the idea was that
>one side offers a list of frame sizes and the other side responds with the
>one it likes best and both sides use that. There is no way to decode media
>without knowing the frame size. Any idea what's the best way to handle
>that?

a) use different payloads.  Clear, easy, wastes space in SDP
b) include framesize in the bitstream in *every* packet.  Clear, easy,
   no-fuss, wastes some bandwidth all the time.  And isn't this required
   anyways if there's more than one channel?
c) ignore media until you get an answer with a clear selection.  Clear,
   easy, could be a major loss of media on a delayed answer.
d) require media sent (at least until acknowledgment of an answer) be
   clearly distinguishable - i.e. do not vary framesize from the offered
   values until an ACK, and do not allow packet sizes that are common
   multiples of offered framesizes.
   For example, if you offer (say) 20 and 30, do not send a packet 
   containing 60, 120, etc.  You can send 20, 30, 40, 80, 90, 100, since
   they can't be mis-understood.
   Complex, doesn't waste bits, artificial constraints, no adaptation to
   congestion/etc until ACK.
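
The rule in (d) can be sketched in a few lines of Python.  This is only an
illustration of the constraint as I described it (a packet is unambiguous
when its total size is a multiple of exactly one offered frame size); the
function name is made up and the units are whatever the offer uses.

```python
def unambiguous_sizes(offered, limit):
    """Total packet sizes up to `limit` that are a whole multiple of
    exactly one of the offered frame sizes."""
    result = []
    for total in range(1, limit + 1):
        matches = [f for f in offered if total % f == 0]
        if len(matches) == 1:
            result.append(total)
    return result


print(unambiguous_sizes([20, 30], 120))
# [20, 30, 40, 80, 90, 100]
```

60 and 120 drop out because they are multiples of both 20 and 30.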

>We're also considering having a "configuration" packet to be sent at the
>beginning of any stream and that includes even more mode-specific data to
>increase flexibility. However, we haven't found a good way of doing that
>yet (wrt loss of the configuration packet). Any thought on that?

Yes: you can't assume 0-loss.  Also, if possible, the bitstream should be
decodable without the out-of-band channel information.  Video people have
struggled with this with sprop-parameter-sets in H.264 (RFC 3984).
The downside is bandwidth.  You can amortize the overhead by sending
the config packets only periodically, but you'll probably need to send
them reasonably often to allow mid-stream join (think conferences).
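
For a feel of the tradeoff, a back-of-the-envelope sketch in Python.  The
config packet size and intervals here are invented numbers, the 40-byte
header figure assumes IPv4/UDP/RTP with no options, and the function name
is hypothetical.

```python
def config_overhead_bps(config_bytes, interval_s, header_bytes=40):
    """Average extra bitrate from resending one config packet every
    `interval_s` seconds, including assumed IP/UDP/RTP headers."""
    return (config_bytes + header_bytes) * 8 / interval_s


# Hypothetical 24-byte config payload, resent every 1 s and every 5 s.
for interval in (1, 5):
    print(interval, config_overhead_bps(24, interval))
```

The resend interval is also roughly the worst-case wait before a
mid-stream joiner can start decoding, so a longer interval saves bits at
the cost of join latency.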

>>>    The selected frame-size values MUST be even.  They SHOULD be
>>>    divisible by 8 and have a prime factorization which consists only of
>>>    2, 3, or 5 factors.  For example, powers-of-two and values such as
>>>    160, 320, 240, and 480 are recommended.  Implementations MUST support
>>>    receiving and sending the default value of 480, and if the size 480
>>>    is supported it MUST be offered.  Implementations SHOULD also support
>>>    frame sizes of 256 and 512 since these are the ones that lead to the
>>>    lowest complexity.  When frame sizes that are powers of two are
>>>    supported, they SHOULD be listed first in the offer and chosen over
>>>    non powers of two in the answer.
>>>
>>
>> (editorial) Needs hyphenation.
>>
>> Why 2, 3 or 5?  I think the draft needs to speak to why.
>>
>OK, that's mainly because CELT uses an FFT, which is faster for
>small radices. We'll explain that better. In the end, I don't think many
>people will really want frame sizes like 238 samples.
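
The frame-size rules quoted above are easy to check mechanically.  A small
Python sketch, reading the draft text as: MUST be even; SHOULD be divisible
by 8 with no prime factors other than 2, 3 and 5.  The function name and
return shape are my own.

```python
def check_frame_size(n):
    """Return (allowed, recommended) for a frame size in samples.

    allowed:     the MUST (positive and even).
    recommended: the SHOULD (divisible by 8, prime factors only 2, 3, 5).
    """
    allowed = n > 0 and n % 2 == 0
    m = n
    for p in (2, 3, 5):
        while m > 0 and m % p == 0:
            m //= p
    recommended = allowed and n % 8 == 0 and m == 1
    return allowed, recommended


for size in (480, 256, 238, 481):
    print(size, check_frame_size(size))
```

480 and 256 pass both tests; 238 (2 x 7 x 17) is allowed but not
recommended; 481 is odd and so not allowed at all.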

>> Also again note that ptime is a shared value, and most other codecs use
>> multiples of 10ms.  That said, it's merely a recommendation/preference, but
>> if so it should generally not be specific to CELT.  In this case, it
>> probably should be 20, and CELT would round up.  If this is an important
>> parameter that may want to be different from the values other payloads
>> might want to use, you may want to move it to fmtp.
>>
>I don't see a reason to make it different from other codecs considering
>that CELT can handle about any ptime. I wouldn't mind doing it though if
>there's a use case for it.

Then specify a value, and let CELT round up (or down) as needed.
It might be nice to show how this might interact with alternative payloads
that might want multiples of 10.
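
As a sketch of the round-up case (Python; the function names are mine, and
this follows the draft's "smallest packetization time greater or equal to
the specified ptime" wording rather than rounding down):

```python
def frames_per_packet(ptime_ms, frame_size, sample_rate):
    """Smallest whole number of frames whose duration is >= ptime_ms."""
    wanted = ptime_ms * sample_rate         # samples * 1000
    per_frame = frame_size * 1000           # samples * 1000
    return max(1, -(-wanted // per_frame))  # integer ceiling division


def packet_ms(ptime_ms, frame_size, sample_rate):
    """Actual packet duration in ms after rounding up to whole frames."""
    n = frames_per_packet(ptime_ms, frame_size, sample_rate)
    return n * frame_size * 1000 / sample_rate


# ptime=20 at 48000 Hz for the frame sizes the draft singles out.
for size in (480, 256, 512):
    print(size, frames_per_packet(20, size, 48000), packet_ms(20, size, 48000))
```

With 480-sample frames a ptime of 20 lands exactly on 2 frames; with 256 or
512 the packet comes out a little over 21 ms, which is where it stops
lining up with payloads that want multiples of 10.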

>>> 5.2.  Low-Overhead Mode
>>>
>>>    A low-overhead mode is defined to make more efficient use of
>>>    bandwidth when transmitting CELT frames.  In that mode none of the
>>>    length values need to be transmitted.  One the a=fmtp: parameter low-
>>>
>>                                             ^^^ remove
>>
>>>    overhead: is defined and contains a single frame size, followed by a
>>>    '/', followed by the number of frames (per channel) per packet,
>>>    followed by a '/', followed by a comma-separated list of the number
>>>    of bytes per frame for each stream defined in the channel mapping.
>>>    The frame-size: parameter MUST not be specified and SHOULD be ignored
>>>    if encountered in an SDP offer or answer.  The ptime:, maxptime: and
>>>    b=AS: parameters SHOULD also be ignored since the low-overhead:
>>>    parameter makes them redundant.  When the low-overhead: parameter is
>>>    specified, the length of each frame MUST NOT be encoded in the
>>>    payload and the bit-rate MUST NOT be changed during the session.
>>>
>>
>> So the packet layout in the initial payload diagram is not really *the*
>> layout; there's an alternative.
>>
>Actually, we're not yet sure the reduced overhead is worth the trouble of
>adding another layout. Any thought?

Probably not.  How big a saving is it?  This mandates (effectively) CBR.
Too bad there's no way in-stream to know when you've finished decoding
the channel (stop token or the like).  Then you don't need channel
sizes at all (though perhaps you might want them).

>> Also, how does this work with Offer/Answer?  If the
>> answerer starts sending immediately, and the answer itself is delayed or
>> lost, then the offerer might start receiving media without being able to
>> know if it has length bytes.  You probably should make it clear it's a
>> *receive*-only parameter - that removes the issue with delayed/lost OK's.
>> You're saying "I support receiving low-bandwidth data on this payload".
>> It says nothing about what you're sending.
>>
>Well, we have the same fundamental issue everywhere. If we don't know what
>frame size was selected, we cannot decode anything. So we always need the
>answer. The only way I can see to go around that would be to use a
>different payload type for every parameter combination, but I think that
>would be ugly.

You really *need* to handle the media-before-answer case somehow, even if
it's separate payloads.

>>> 6.  Congestion Control
>>>
>>>    CELT allows for bitrate adjustment in one byte per frame increments
>>>    without any signaling requirement or overhead.  Applications SHOULD
>>>    utilize congestion control to regulate the transmitted bitrate.  In
>>>    some applications it may make sense to increase the packetization
>>>    interval rather than decreasing the codec bitrate.  Congestion
>>>    control implementations should consider the users differential
>>>    tolerance for high latency and low quality.
>>>
>>
>> That would be (I assume) really 1 byte per channel per frame, not 1 byte
>> per frame.
>>
>Well, technically you can increase the bit-rate by one byte for only one
>channel...

I meant that the change rate is one byte per channel per frame (max).  This
implies the frame-size rate of change could be anywhere in the range of
plus/minus the number of channels (in bytes).

I assume the decoder would handle large frame gaps with corresponding large
changes in framesize?

-- 
Randell Jesup, Worldgate (developers of the Ojo videophone), ex-Amiga OS team
rjesup@wgate.com
"The fetters imposed on liberty at home have ever been forged out of the weapons
provided for defence against real, pretended, or imaginary dangers from abroad."
		- James Madison, 4th US president (1751-1836)
