Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt

Randell Jesup <rjesup@wgate.com> Tue, 10 March 2009 03:43 UTC

To: Jean-Marc Valin <jean-marc.valin@octasic.com>
References: <C5664E27013B564EBFA8884606D2439106B33589@antihadron.jnpr.net> <ybu1vt8hay0.fsf@jesup.eng.wgate.com> <49B52906.5000300@octasic.com> <ybusklmfpku.fsf@jesup.eng.wgate.com> <49B58D1E.702@octasic.com>
From: Randell Jesup <rjesup@wgate.com>
Date: Mon, 09 Mar 2009 23:43:56 -0400
In-Reply-To: <49B58D1E.702@octasic.com> (Jean-Marc Valin's message of "Mon\, 09 Mar 2009 17\:41\:50 -0400")
Message-ID: <ybuvdqiavpf.fsf@jesup.eng.wgate.com>
User-Agent: Gnus/5.101 (Gnus v5.10.10) Emacs/21.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc: avt@ietf.org
Subject: Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt
Reply-To: Randell Jesup <rjesup@wgate.com>
List-Id: Audio/Video Transport Working Group <avt.ietf.org>

Jean-Marc Valin <jean-marc.valin@octasic.com> writes:

>Randell Jesup wrote:
>>> Noted. We missed that when changing some SHOULDs for MUSTs. We hesitated a
>>> lot in deciding what's a SHOULD and what's a MUST. Basically, CELT's
>>> advantage is that it can operate with almost any sampling rate, frame size
>>> or bit-rate. The disadvantage is that unless we specify "baseline
>>> requirements", we might end up with several implementations that are unable
>>> to inter-operate. Also, feel free to suggest a better baseline if you think
>>> we didn't select the right one.
>>>
>> Is this intended to be used as a "speech" codec at all; as an alternative
>> to iLBC/G.729/G.722.x/etc?  If so, then support for 8KHz and/or 16KHz may
>> be important to mandate for interoperability reasons.
>CELT is (so far at least) not intended for lower sampling rates like 8 kHz
>or 16 kHz and doesn't operate in the same space as the codecs above or
>Speex. It's closer to codecs such as AAC-LD, G.722.1C, G.719, and ULD,
>though only ULD has a delay as short as CELT's.

Ok, though the issue still remains.  And since the offerer might not know who
it's calling, it may need to offer both low-bandwidth/sampling codecs and
higher-quality/bandwidth codecs.

Even among higher-quality codecs like AAC-LD (which I believe goes down to
16 kHz), you still have the same issue: offering a mix of sampling rates is
potentially a boatload of fun.  With these codecs it's almost impossible to
avoid unless the offerer insists on a single sampling rate, and then you
have problems if the answerer wants something different.

I'm not expecting you to solve the general issue all codecs have here, but
I do want to sensitize you to it, since your codec is much more flexible in
sampling rate and bandwidth than most others.  You may want to take it into
consideration when choosing required rates - picking rates that the other
codecs it might be combined with also use (like 48000 or 44100).  The
downside is that if the HW codec doesn't match the negotiated rate, you
have to resample after decode and before encode.

>> You DON'T want to be
>> mixing sample rates if multiple codecs are accepted.  (In theory it can be
>> done, but in practice it would be risky, especially in the face of packet
>> loss.)  For example, if you have this:
>>
>> Random example (I probably have the G722 media type wrong):
>> m=audio 4321 RTP/AVP 0 97 98
>> a=rtpmap:0 PCMU/8000
>> a=rtpmap:97 G722/8000
>> a=rtpmap:98 CELT/48000
>>
>> To quote from an earlier AVT email I wrote on this subject on 3 Dec 2007:
>>
>>     Subject: Re: [AVT] I-D ACTION:draft-ietf-avt-rtcpxr-audio-01.txt
>>     [big SNIP]
>>     This means the timestamp rate can change at any point, on a
>>     packet-by-packet basis.  It's even theoretically allowable to alternate
>>     G711 and G722 packets.  Totally odd and non-useful, but it illustrates
>>     the point.  More realistic is a change from one to the other half-way
>>     through an RTCP monitoring period.
>>
>So you mean that you'd need to maintain a coherent timestamp despite
>changing between codecs that have different sampling rates? I wasn't aware
>of that, so I guess it's something that needs to be addressed. Maybe just
>by saying that different sampling rates SHOULD NOT be used with the same m=

As written, when you switch payloads to one with a different sample
rate/timestamp rate, the timestamp rate would change at that point as well.
This causes considerable confusion in other related specs, since most
writers don't consider this case.  And it really causes confusion when a
packet loss occurs at the point of timestamp change (and to a slightly
lesser extent, when timestamp changes occur between RTCP SR/RR's, or during
the playout of an RFC 2833 packet).
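To make the clock-rate hazard concrete, here's a short sketch (mine, not
from the draft; `timestamp_delta` is a hypothetical helper, and the clock
rates are the standard RTP ones from RFC 3551): switching payloads
mid-stream changes how much the timestamp advances per packet, so a
receiver that loses the packet at the switch point can't tell loss from a
clock change.

```python
# Sketch (not from the draft): the RTP timestamp advance per packet
# depends on the payload's clock rate, so a mid-stream payload switch
# changes the per-packet timestamp delta.
CLOCK_RATE = {
    "PCMU": 8000,
    "G722": 8000,   # G.722's RTP clock is 8 kHz despite 16 kHz audio (RFC 3551)
    "CELT": 48000,
}

def timestamp_delta(payload: str, ptime_ms: int) -> int:
    """Timestamp increment for one packet of ptime_ms milliseconds."""
    return CLOCK_RATE[payload] * ptime_ms // 1000

# For 20 ms packets: G.711 advances the timestamp by 160, CELT by 960.
print(timestamp_delta("PCMU", 20))  # 160
print(timestamp_delta("CELT", 20))  # 960
```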

I would NOT try to outlaw this within a codec; it's an issue for general
AVT resolution.

>> The wording there is very lawyer-ese, and really addressing how it's
>> affected by a multicast or conference setting.
>>
>> If the b=AS is at the m= level of the SDP (not above all the m='s), then it
>> only applies to that one media stream.  However, exactly what b=AS *means*
>> is very fuzzy.  b=AS is not a codec parameter; it's a stream parameter, and
>> it's also a reception parameter, not a "I plan to send" parameter.
>> There's been a lot of discussion about b=TIAS (RFC 3890) as a better way to
>> specify bandwidth.  Note that b=AS INCLUDES RTP/UDP/IP overhead, and thus
>> implicitly is dependent on packet rate.
>>
>I've seen many G.711 implementations using b=AS:64, so I thought we could
>use it in a way that excludes the overhead.

Those implementations are wrong... :-)  Yet another reason not to rely on 
it.  Also, why specify the bandwidth for a fixed-bandwidth codec? 
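To illustrate why b=AS is implicitly packet-rate dependent, a rough
back-of-the-envelope calculation (my numbers and helper, not from the
draft): G.711 at 64 kbit/s with 20 ms packets carries about 40 bytes of
IP/UDP/RTP headers per packet, so b=AS comes out to roughly 80, not 64 -
and halving the ptime doubles the overhead.

```python
# Rough sketch (my numbers): b=AS counts RTP/UDP/IP overhead, so it
# depends on packetization time, not just the codec's bit-rate.
IP_UDP_RTP_OVERHEAD = 40  # bytes per packet (IPv4 20 + UDP 8 + RTP 12)

def b_as_kbps(codec_kbps: float, ptime_ms: int) -> float:
    """Approximate b=AS value for a CBR codec at a given ptime."""
    packets_per_sec = 1000 / ptime_ms
    overhead_kbps = IP_UDP_RTP_OVERHEAD * 8 * packets_per_sec / 1000
    return codec_kbps + overhead_kbps

print(b_as_kbps(64, 20))  # 80.0 -> not 64, even for G.711
print(b_as_kbps(64, 10))  # 96.0 -> same codec, double the packet rate
```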

>> More to the point, b=AS is just one way to specify bandwidth.
>>
>> Another huge blocker for using b=AS (or b=anything) in this way: what if
>> another codec also offered in this stream wanted to re-use b=AS as well?
>> And what if the preferred bitrates (or max-bitrate) for each codec was
>> different?
>>
>> How much do you *need* to specify bandwidth here?  Realize that most
>> devices don't have a good idea what receive bandwidth is even theoretically
>> available, let alone practically.  Most configuration is done at the sender
>> end, or by explicit choice of codec and bitrate.
>>
>> I suggest reviewing how other multiple-bitrate codecs like G.722.x handle
>> this (AMR-WB, etc).
>>
>> Also, isn't CELT true variable-bitrate?  If so, the bitrate to use
>> (initially?) might be very different than the "maximum" bitrate.
>>
>CELT can change bit-rate at any time, but so far it only changes based on
>what the sender wants to use, i.e. to adapt to congestion. Even if used
>with b=AS:, I was thinking that it would be more like a max bit-rate anyway.

If you're using bandwidth parameters as specified in RFC 4566, then you
probably don't need to address them here at all (and then you don't have
to worry about newer ways to specify bandwidth, like b=TIAS - it's the
problem of a higher level).

>>> Well, I didn't see that to be a problem considering that one would probably
>>> want the same packetization time for any codec. Do you see a case where you
>>> wouldn't want that? I'm not quite sure how people use the ptime in
>>> practice and how much it is followed.
>>>
>>
>> Sure.  You might prefer 10ms, which G.711 and some others support, but iLBC
>> only supports 20 and 30 ms frames, and thus only multiples of those for the
>> actual packetization time. You can still specify a ptime of 10 when using
>> iLBC.
>>
>> ptime is merely a "I would prefer to receive" parameter.  Do not rely on it
>> for anything.  The actual packetization time does NOT have to be the same
>> in each direction, or even the same from one RTP packet to the next.
>> (packet 1 could have 1 frame and packet 2 could have 10).
>>
>> iLBC is really negotiating the framesize, not the packetization time.
>>
>Well, I was thinking of using the ptime just for a preference. You specify
>the frame size and ptime helps decide how many frames get sent per packet.

Sure.  That's the intention of ptime.  It's a preference.  Sounds like
you're using it as intended, so you may not need to spend much time
discussing it, except to mention the (unusual) flexibility of CELT.
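That ptime-as-preference behavior amounts to something like this sketch
(`frames_per_packet` is a hypothetical helper of mine, not from the
draft): the sender rounds the receiver's preferred ptime to a whole number
of codec frames, and never fails just because the preference doesn't
divide evenly.

```python
# Sketch: map a receiver's preferred ptime onto a whole number of codec
# frames.  ptime is only a preference, so we round rather than reject.
def frames_per_packet(ptime_ms: int, frame_ms: int) -> int:
    """Number of frames to pack per RTP packet; always at least one."""
    return max(1, round(ptime_ms / frame_ms))

# iLBC with 30 ms frames and a requested ptime of 10 still sends one frame.
print(frames_per_packet(10, 30))  # 1
print(frames_per_packet(60, 20))  # 3
```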

>>> The fundamental issue here is that one needs to know the frame size to be
>>> able to initialise the decoder. Just like a codec like iLBC had two modes
>>> (for 20 ms and 30 ms frame), CELT has a *very large* number of modes: one
>>> for each combination of frame size and sampling rate. So the idea was that
>>> one side offers a list of frame sizes and the other side responds with the
>>> one it likes best and both sides use that. There is no way to decode media
>>> without knowing the frame size. Any idea what's the best way to handle
>>> that?
>>>
>>
>> a) use different payloads.  Clear, easy, wastes space in SDP
>> b) include framesize in the bitstream in *every* packet.  Clear, easy,
>>    no-fuss, wastes some bandwidth all the time.  And isn't this required
>>    anyways if there's more than one channel?
>> c) ignore media until you get an answer with a clear selection.  Clear,
>>    easy, could be a major loss of media on a delayed answer.
>> d) require media sent (at least until acknowledgment of an answer) be
>>    clearly distinguishable - i.e. do not vary framesize from the offered
>>    values until an ACK, and do not allow packet sizes that are common
>>    multiples of offered framesizes.
>>    For example, if you offer (say) 20 and 30, do not send a packet
>> containing 60, 120, etc.  You can send 20, 30, 40, 80, 90, 100, since
>>    they can't be mis-understood.
>>    Complex, doesn't waste bits, artificial constraints, no adaptation to
>>    congestion/etc until ACK.
>>
>OK, I'll need to give this a bit more thought.

My suggestion is (a), with a second choice of (b).  Or drop the option entirely.
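Option (d) boils down to a simple ambiguity check - roughly this (my
construction, not from the draft): a packet duration is safe to send
before the answer arrives only if it's consistent with exactly one of the
offered frame sizes.

```python
# Sketch of option (d): before the answer arrives, only send packet
# durations that are a multiple of exactly one offered frame size, so the
# receiver can infer the frame size unambiguously.
def is_unambiguous(duration_ms: int, offered_ms: list[int]) -> bool:
    matches = [f for f in offered_ms if duration_ms % f == 0]
    return len(matches) == 1

# With 20 and 30 ms offered: 40, 90, 100 are fine; 60 and 120 are not.
print(is_unambiguous(40, [20, 30]))   # True
print(is_unambiguous(60, [20, 30]))   # False (60 = 3*20 = 2*30)
```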

>>> We're also considering having a "configuration" packet to be sent at the
>>> beginning of any stream and that includes even more mode-specific data to
>>> increase flexibility. However, we haven't found a good way of doing that
>>> yet (wrt loss of the configuration packet). Any thought on that?
>>>
>>
>> Yes: you can't assume 0-loss.  Also, if possible, the bitstream should be
>> decodable without the out-of-band channel information.  Video people have
>> struggled with this with sprop-parameter-sets in H.264 (RFC 3984).
>> Downside will be the bandwidth used.  You can amortize the overhead by sending
>> the config packets only periodically, but you'll probably need to send
>> them reasonably often to allow mid-stream join (think conferences).
>>
>Yes, I'm aware of the loss problem and I'm not sure how to handle
>that. I'll have a look at the RFC you mention.

It may not be obvious.  You'll want to look at the discussions over the last
year leading up to RFC 3984bis, now at WGLC or so.  You won't have the
complications that led to sprop-level-parameter-sets, but you do have the
choice of putting the data out-of-band (like sprop-parameter-sets) or in-band
(as H.264 Picture and Sequence Parameter Set NAL units).

>>> Actually, we're not yet sure the reduced overhead is worth the trouble of
>>> adding another layout. Any thought?
>
>> Probably not.  How big a saving is it?  This mandates (effectively) CBR.
>> Too bad there's no way in-stream to know when you've finished decoding
>> the channel (stop token or the like).  Then you don't need channel
>> sizes at all (though perhaps you might want them).
>>
>Well, the stop token would waste about the same space as the size value...

Well, if the frame size is fixed, then (depending on how things are
encoded) you may be able to use an implicit length (stop when you've
generated enough bytes) - but if the codec is more frequency-domain, that
idea wouldn't fly.

Also, I assumed it might be a variable bitstring encoding, and so a "STOP"
symbol might only cost a few bits.

>>> Well, we have the same fundamental issue everywhere. If we don't know what
>>> frame size was selected, we cannot decode anything. So we always need the
>>> answer. The only way I can see to go around that would be to use a
>>> different payload type for every parameter combination, but I think that
>>> would be ugly.
>>>
>>
>> You really *need* to handle the media-before-answer case somehow, even if
>> it's separate payloads.
>>
>Hadn't realised it was an issue. Need to give more thought into this. Other
>suggestions welcome.

Anyone else?  Basically, you need to be able to reasonably decode when you
have the offer and received data, but no answer.  This might (or might not)
be related to how you'd handle people joining a conference or multicast.
See also the RTP topologies RFC (forget the number).

>>>> That would be (I assume) really 1 byte per channel per frame, not 1 byte
>>>> per frame.
>>>>
>>>>
>>> Well, technically you can increase the bit-rate by one byte for only one
>>> channel...
>>>
>>
>> I meant that the change rate is one byte per channel per frame (max).  This
>> implies the frame-size rate of change could be anywhere in the range of
>> plus/minus the number of channels (in bytes).
>>
>> I assume the decoder would handle large frame gaps with corresponding large
>> changes in framesize.
>Basically, any channel of any frame can have a different size and change
>arbitrarily from one frame to the next. You can have one channel jumping
>from 64 to 128 kbps while the other goes from 96 to 48 kbps. That's why
>there's one byte per channel per frame used for the size.

So what, then, was the "one-byte change in bitrate per frame"?

I was referring to what happens if 10 packets in a row are lost - can the
receiver decode a packet that might be 10 bytes smaller or larger?  Does it
need as input the number of packets lost (if any) since the last decoded
packet? 
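For what it's worth, the layout under discussion (one size byte per
channel per frame) would parse roughly like this - a sketch based on my
reading of the thread, not the draft's normative wire format:

```python
# Sketch (my reading of the thread, not the draft's wire format): each
# frame carries one size byte per channel, followed by that many bytes of
# compressed data for each channel in turn.
def split_channels(packet: bytes, channels: int) -> list[bytes]:
    sizes = packet[:channels]            # one size byte per channel
    out, pos = [], channels
    for size in sizes:
        out.append(packet[pos:pos + size])
        pos += size
    return out

pkt = bytes([2, 3]) + b"ab" + b"xyz"     # 2 channels: sizes 2 and 3
print(split_channels(pkt, 2))            # [b'ab', b'xyz']
```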

-- 
Randell Jesup, Worldgate (developers of the Ojo videophone), ex-Amiga OS team
rjesup@wgate.com
"The fetters imposed on liberty at home have ever been forged out of the weapons
provided for defence against real, pretended, or imaginary dangers from abroad."
		- James Madison, 4th US president (1751-1836)