Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt

Jean-Marc Valin <jean-marc.valin@octasic.com> Mon, 09 March 2009 21:40 UTC

Randell Jesup wrote:
>> Noted. We missed that when changing some SHOULDs for MUSTs. We hesitated a
>> lot in deciding what's a SHOULD and what's a MUST. Basically, CELT's
>> advantage is that it can operate with almost any sampling rate, frame size
>> or bit-rate. The disadvantage is that unless we specify "baseline
>> requirements", we might end up with several implementations that are unable
>> to inter-operate. Also, feel free to suggest a better baseline if you think
>> we didn't select the right one.
>>     
> Is this intended to be used as a "speech" codec at all; as an alternative
> to iLBC/G.729/G.722.x/etc?  If so, then support for 8KHz and/or 16KHz may
> be important to mandate for interoperability reasons.  
CELT is (so far at least) not intended for lower sampling rates like 8 
kHz or 16 kHz and doesn't operate in the same space as the codecs above 
or Speex. It's closer to codecs such as AAC-LD, G.722.1C, G.719, and 
ULD, though only ULD has a delay as short as CELT's.
> You DON'T want to be
> mixing sample rates if multiple codecs are accepted.  (In theory it can be
> done, but in practice it would be risky, especially in the face of packet
> loss.)  For example, if you have this:
>
> Random example (I probably have the G722 media type wrong):
> m=audio 4321 RTP/AVP 0 97 98
> a=rtpmap:0 PCMU/8000
> a=rtpmap:97 G722/16000
> a=rtpmap:98 CELT/48000
>
> To quote from an earlier AVT email I wrote on this subject on 3 Dec 2007:
>
>     Subject: Re: [AVT] I-D ACTION:draft-ietf-avt-rtcpxr-audio-01.txt
>     [big SNIP]
>     This means the timestamp rate can change at any point, on a
>     packet-by-packet basis.  It's even theoretically allowable to alternate
>     G711 and G722 packets.  Totally odd and non-useful, but it illustrates
>     the point.  More realistic is a change from one to the other half-way
>     through an RTCP monitoring period.
>   
So you mean that you'd need to maintain a coherent timestamp despite 
changing between codecs that have different sampling rates? I wasn't 
aware of that, so I guess it's something that needs to be addressed. 
Maybe just by saying that different sampling rates SHOULD NOT be used 
with the same m=
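
A minimal sketch of how an implementation might detect the situation described above, using the example SDP quoted earlier in this message. The function names are hypothetical, and the parsing assumes a single m= section with well-formed a=rtpmap lines; it is not part of the draft.

```python
import re

def clock_rates(sdp: str) -> dict:
    """Map payload type -> RTP clock rate, read from a=rtpmap lines.

    Assumption: the SDP holds a single m= section, as in the example.
    """
    rates = {}
    for line in sdp.splitlines():
        m = re.match(r"a=rtpmap:(\d+)\s+[^/\s]+/(\d+)", line.strip())
        if m:
            rates[int(m.group(1))] = int(m.group(2))
    return rates

def has_mixed_clock_rates(sdp: str) -> bool:
    """True if the payload types do not all share one clock rate."""
    return len(set(clock_rates(sdp).values())) > 1

# The example SDP from the quoted message, reproduced as written there.
EXAMPLE_SDP = """m=audio 4321 RTP/AVP 0 97 98
a=rtpmap:0 PCMU/8000
a=rtpmap:97 G722/16000
a=rtpmap:98 CELT/48000"""

if __name__ == "__main__":
    print(clock_rates(EXAMPLE_SDP))
    print(has_mixed_clock_rates(EXAMPLE_SDP))
```

An offer flagged this way is the case where the timestamp rate could change from one packet to the next.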

> The wording there is very lawyer-ese, and really addressing how it's
> affected by a multicast or conference setting. 
>
> If the b=AS is at the m= level of the SDP (not above all the m='s), then it
> only applies to that one media stream.  However, exactly what b=AS *means*
> is very fuzzy.  b=AS is not a codec parameter; it's a stream parameter, and
> it's also a reception parameter, not a "I plan to send" parameter.
> There's been a lot of discussion about b=TIAS (RFC 3890) as a better way to
> specify bandwidth.  Note that b=AS INCLUDES RTP/UDP/IP overhead, and thus
> implicitly is dependent on packet rate.
>   
I've seen many G.711 implementations using b=AS:64, so I thought we 
could use it in a way that excludes the overhead.
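
To make the overhead point concrete, here is a rough sketch of the arithmetic, assuming IPv4 with no options, UDP, and a 12-byte RTP header with no CSRCs or extensions. The function name is hypothetical.

```python
# Per-packet header bytes: IPv4 (20) + UDP (8) + RTP (12).
# Assumption: no IP options, no RTP header extensions, no CSRC entries.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12

def as_bandwidth_kbps(codec_kbps: float, ptime_ms: float) -> float:
    """Codec bit-rate plus RTP/UDP/IP overhead, in kbit/s."""
    packets_per_second = 1000.0 / ptime_ms
    overhead_kbps = IP_UDP_RTP_HEADER_BYTES * 8 * packets_per_second / 1000.0
    return codec_kbps + overhead_kbps

if __name__ == "__main__":
    # G.711 payload is 64 kbit/s; the total depends on the packet rate.
    for ptime in (10, 20, 30):
        print(ptime, as_bandwidth_kbps(64, ptime))
```

Under these assumptions, 64 kbit/s of G.711 at 20 ms per packet comes to 80 kbit/s on the wire, and 96 kbit/s at 10 ms, which is why a value that includes overhead depends on the packet rate.
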
> More to the point, b=AS is just one way to specify bandwidth.  
>
> Another huge blocker for using b=AS (or b=anything) in this way: what if
> another codec also offered in this stream wanted to re-use b=AS as well?
> And what if the preferred bitrates (or max-bitrate) for each codec was
> different?
>
> How much do you *need* to specify bandwidth here?  Realize that most
> devices don't have a good idea what receive bandwidth is even theoretically
> available, let alone practically.  Most configuration is done at the sender
> end, or by explicit choice of codec and bitrate.
>
> I suggest reviewing how other multiple-bitrate codecs like G.722.x handle
> this (AMR-WB, etc).
>
> Also, isn't CELT true variable-bitrate?  If so, the bitrate to use
> (initially?) might be very different than the "maximum" bitrate.
>   
CELT can change bit-rate at any time, but so far it only changes based 
on what the sender wants to use, i.e. to adapt to congestion. Even if 
used with b=AS:, I was thinking that it would be more like a max 
bit-rate anyway.

>> Well, I didn't see that to be a problem considering that one would probably
>> want the same packetization time for any codec. Do you see a case where you
>> wouldn't want that? I'm not quite sure how people use the ptime in
>> practice and how much it is followed.
>>     
>
> Sure.  You might prefer 10ms, which G.711 and some others support, but iLBC
> only supports 20 and 30 ms frames, and thus only multiples of those for the
> actual packetization time. You can still specify a ptime of 10 when using
> iLBC. 
>
> ptime is merely a "I would prefer to receive" parameter.  Do not rely on it
> for anything.  The actual packetization time does NOT have to be the same
> in each direction, or even the same from one RTP packet to the next.
> (packet 1 could have 1 frame and packet 2 could have 10).  
>
> iLBC is really negotiating the framesize, not the packetization time.
>   
Well, I was thinking of using the ptime just for a preference. You 
specify the frame size and ptime helps decide how many frames get sent 
per packet.
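
A small sketch of that idea: given a negotiated frame size and a preferred ptime, pick the whole number of frames per packet whose duration is closest to the preference. The frame sizes of 256 and 480 samples are illustrative assumptions, not values taken from the draft, and the function name is hypothetical.

```python
import math

def frames_per_packet(frame_samples: int, sample_rate: int, ptime_ms: float):
    """Return (frame count, actual packet duration in ms).

    Rounds to the nearest whole number of frames, with a minimum of one,
    since a packet cannot carry a fraction of a frame.
    """
    frame_ms = 1000.0 * frame_samples / sample_rate
    count = max(1, math.floor(ptime_ms / frame_ms + 0.5))
    return count, count * frame_ms

if __name__ == "__main__":
    # Assumed example: 256-sample frames at 48 kHz with a preferred ptime of 20 ms.
    print(frames_per_packet(256, 48000, 20))
    # Assumed example: 480-sample (10 ms) frames match a ptime of 20 ms exactly.
    print(frames_per_packet(480, 48000, 20))
```

With 256-sample frames at 48 kHz, each frame lasts about 5.33 ms, so a ptime of 20 rounds to four frames and roughly 21.3 ms per packet.
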
>>> Ok, but this really isn't part of the codec.  Doesn't hurt to tell people.
>>> (I'm assuming that the SDP spec for maxptime allows ignoring it (somewhat)
>>> - if it doesn't, then you have to reject that payload.)
>>>
>>>       
>> As far as I understand, maxptime is a SHOULD in the rfc, so I thought we'd
>> just mention its interpretation wrt CELT.
>>     
>
> Ok, then say something like "per [RFC 4566], if the maximum is lower..."
>   
OK, will do that.
>> OK, so maybe we should add a BNF. As for the examples, I think they're
>> adequate, but let me know if that's not the case.
>>     
>
> BNF may not be *needed*, but if there's anything at all complex it's handy for
> avoiding mistakes.
>   
Noted.
>> The fundamental issue here is that one needs to know the frame size to be
>> able to initialise the decoder. Just like a codec like iLBC had two modes
>> (for 20 ms and 30 ms frame), CELT has a *very large* number of modes: one
>> for each combination of frame size and sampling rate. So the idea was that
>> one side offers a list of frame sizes and the other side responds with the
>> one it likes best and both sides use that. There is no way to decode media
>> without knowing the frame size. Any idea what's the best way to handle
>> that?
>>     
>
> a) use different payloads.  Clear, easy, wastes space in SDP
> b) include framesize in the bitstream in *every* packet.  Clear, easy,
>    no-fuss, wastes some bandwidth all the time.  And isn't this required
>    anyways if there's more than one channel?
> c) ignore media until you get an answer with a clear selection.  Clear,
>    easy, could be a major loss of media on a delayed answer.
> d) require media sent (at least until acknowledgment of an answer) be
>    clearly distinguishable - i.e. do not vary framesize from the offered
>    values until an ACK, and do not allow packet sizes that are common
>    multiples of offered framesizes.
>    For example, if you offer (say) 20 and 30, do not send a packet 
>    containing 60, 120, etc.  You can send 20, 30, 40, 80, 90, 100, since
>    they can't be mis-understood.
>    Complex, doesn't waste bits, artificial constraints, no adaptation to
>    congestion/etc until ACK.
>   
OK, I'll need to give this a bit more thought.
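
As a sketch of what option (d) would imply, the check below treats a packet duration as safe only if it is a whole multiple of exactly one offered frame size. The function name is hypothetical, and durations are assumed to be whole milliseconds.

```python
def is_unambiguous(duration_ms: int, offered_frame_ms: list) -> bool:
    """True if the packet duration is a multiple of exactly one offered frame size."""
    matches = [f for f in offered_frame_ms if duration_ms % f == 0]
    return len(matches) == 1

if __name__ == "__main__":
    offered = [20, 30]
    safe = [d for d in range(10, 130, 10) if is_unambiguous(d, offered)]
    print(safe)
```

For an offer of 20 and 30, this accepts 20, 30, 40, 80, 90 and 100 and rejects 60 and 120, matching the example in the quoted text.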

>   
>> We're also considering having a "configuration" packet to be sent at the
>> beginning of any stream and that includes even more mode-specific data to
>> increase flexibility. However, we haven't found a good way of doing that
>> yet (wrt loss of the configuration packet). Any thought on that?
>>     
>
> Yes: you can't assume 0-loss.  Also, if possible, the bitstream should be
> decodable without the out-of-band channel information.  Video people have
> struggled with this with sprop-parameter-sets in H.264 (RFC 3984).
> Downside will be bandwidth used.  You can amortize the overhead by sending
> the config packets only periodically, but you'll probably need to send
> them reasonably often to allow mid-stream join (think conferences).
>   
Yes, I'm aware of the loss problem and I'm not sure how to handle that. 
I'll have a look at the RFC you mention.
>> I don't see a reason to make it different from other codecs considering
>> that CELT can handle about any ptime. I wouldn't mind doing it though if
>> there's a use case for it.
>>     
>
> Then specify a value, and let CELT round up (or down) as needed.
> It might be nice to show how this might interact with alternative payloads
> that might want multiples of 10.
>   
I guess the rounding would just be a bit more. Will add an example.
>> Actually, we're not yet sure the reduced overhead is worth the trouble of
>> adding another layout. Any thought?
>>     
>
> Probably not.  How big a saving is it?  This mandates (effectively) CBR.
> Too bad there's no way in-stream to know when you've finished decoding
> the channel (stop token or the like).  Then you don't need channel
> sizes at all (though perhaps you might want them).
>   
Well, the stop token would waste about the same space as the size value...

>> Well, we have the same fundamental issue everywhere. If we don't know what
>> frame size was selected, we cannot decode anything. So we always need the
>> answer. The only way I can see to go around that would be to use a
>> different payload type for every parameter combination, but I think that
>> would be ugly.
>>     
>
> You really *need* to handle the media-before-answer case somehow, even if
> it's separate payloads.
>   
Hadn't realised it was an issue. Need to give more thought to this. 
Other suggestions welcome.
>>> That would be (I assume) really 1 byte per channel per frame, not 1 byte
>>> per frame.
>>>
>>>       
>> Well, technically you can increase the bit-rate by one byte for only one
>> channel...
>>     
>
> I meant that the change rate is one byte per channel per frame (max).  This
> implies the frame-size rate of change could be anywhere in the range of
> plus/minus the number of channels (in bytes).
>
> I assume the decoder would handle large frame gaps with corresponding large
> changes in framesize.
Basically, any channel of any frame can have a different size and change 
arbitrarily from one frame to the next. You can have one channel jumping 
from 64 to 128 kbps while the other goes from 96 to 48 kbps. That's why 
there's one byte per channel per frame used for the size.
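
For a sense of scale, here is a rough sketch of what one size byte per channel per frame costs in bit-rate. The 256-sample frame at 48 kHz is an illustrative assumption, and the function name is hypothetical.

```python
def size_field_overhead_kbps(channels: int, frame_samples: int, sample_rate: int) -> float:
    """Bit-rate spent on size bytes: one byte per channel per frame, in kbit/s."""
    frames_per_second = sample_rate / frame_samples
    return channels * 8 * frames_per_second / 1000.0

if __name__ == "__main__":
    # Assumed example: stereo, 256-sample frames at 48 kHz (187.5 frames/s).
    print(size_field_overhead_kbps(2, 256, 48000))
```

Under those assumptions the size bytes cost 3 kbit/s for a stereo stream, and the cost grows as the frame size shrinks.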

Cheers,

    Jean-Marc