Re: [codec] Next Steps for WG

That should work. 
Make sure to get the time alignment right by adjusting for look-ahead etc. 
And only the first frame of the double talk period has to be coded independently; for the others you could let the encoder decide. In general, whenever you're adding or removing a stream from the mix you want an independent frame. 
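
As a rough illustration of that rule, here is a minimal Python sketch that flags which output frames of the mix should be coded independently. The function name and the representation of the mix as a set of stream ids per frame are hypothetical, chosen only for this example. It also assumes the very first output frame needs no flag, since nothing precedes it.

```python
from typing import FrozenSet, List


def frames_needing_independent_coding(
    active_sets: List[FrozenSet[str]],
) -> List[bool]:
    """Flag frames where a stream was added to or removed from the mix.

    active_sets[i] is the (hypothetical) set of stream ids that
    contribute to output frame i.
    """
    flags = []
    previous = None
    for current in active_sets:
        # Assumption: the very first frame has no predecessor, so it is
        # not flagged here. Any later change in the mix is flagged.
        flags.append(previous is not None and current != previous)
        previous = current
    return flags


# Speaker A alone, then A and B in double talk, then B alone.
mix = [
    frozenset({"A"}),
    frozenset({"A"}),
    frozenset({"A", "B"}),
    frozenset({"A", "B"}),
    frozenset({"B"}),
    frozenset({"B"}),
]
flags = frames_needing_independent_coding(mix)
```

Only the first double-talk frame and the first frame after B takes over are flagged; for the remaining frames the encoder is free to decide.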

best, 
koen. 

----- Original Message -----
From: "Roman Shpount" <roman@telurix.com> 
To: "Koen Vos" <koen.vos@skype.net> 
Cc: codec@ietf.org 
Sent: Tuesday, January 18, 2011 2:55:09 PM 
Subject: Re: [codec] Next Steps for WG 

Thank you, this is very helpful. 

What I was considering was something similar. Typically you have one speaker in the conference with short periods of double talk. I was thinking of decoding, mixing, and re-encoding (using independent frames) the double talk period together with the first few frames of the new speaker, and then switching to the new speaker's stream. This should sound the most natural and would only require re-encoding for short periods of time.

Thank you for your great work, 
_____________________________ 
Roman Shpount - www.telurix.com 

On Tue, Jan 18, 2011 at 5:35 PM, Koen Vos < koen.vos@skype.net > wrote: 

Roman: 

Jean-Marc and I discussed this a bit, and this is how we see it: 

Resetting the decoder state would lead to a discontinuity in the output signal and may create an objectionable "click" sound. Moreover, the incoming stream that's enabled in the switching node at the moment of the decoder reset may depend on the last frame before the switch. So the first few frames would not be decoded correctly. 

Coding and transmitting the decoder state doesn't overcome the discontinuity problem. And it would add a lot of complications and code to Opus. 

A better method seems to be: 
1. Determine if the first frame after switching has significant dependencies on the previous frame (i.e., long-term prediction in the case of SILK, or energy prediction in the case of CELT) 
2a. If yes: decode the first frame and re-encode it independently. After that, switch to the new stream.
2b. If no: simply switch to the new stream immediately. 
This ensures a smooth transition and fast convergence to the correct output signal for the new stream. 
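
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Opus code: the Frame type, the depends_on_previous flag, and the decode and encode_independent callables are all hypothetical placeholders. In particular, decode is assumed to be backed by a decoder in the switching node that has been following the new stream, so that the first frame can be decoded correctly.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    """Hypothetical stand-in for one coded frame of a stream."""
    payload: bytes
    # True if the frame leans significantly on the previous frame, e.g.
    # long-term prediction (SILK) or energy prediction (CELT).
    depends_on_previous: bool


def switch_streams(
    new_stream_frames: List[Frame],
    decode: Callable[[Frame], bytes],
    encode_independent: Callable[[bytes], bytes],
) -> List[Frame]:
    """Return the frames to forward once the node switches streams."""
    if not new_stream_frames:
        return []
    first, rest = new_stream_frames[0], new_stream_frames[1:]
    if first.depends_on_previous:
        # Step 2a: decode the first frame and re-encode it independently.
        pcm = decode(first)
        first = Frame(encode_independent(pcm), depends_on_previous=False)
    # Step 2b (or after 2a): forward the new stream unchanged.
    return [first] + rest


# Toy stand-ins for the codec, used only to exercise the control flow.
fake_decode = lambda frame: frame.payload.upper()
fake_encode = lambda pcm: b"I:" + pcm

dependent = [Frame(b"a", True), Frame(b"b", True)]
forwarded = switch_streams(dependent, fake_decode, fake_encode)
```

Only the first frame is ever transcoded; every later frame of the new stream is passed through as received.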

best, 
koen. 

From: "Roman Shpount" < roman@telurix.com > 
To: "Kevin P. Fleming" < kpfleming@digium.com > 
Cc: codec@ietf.org 
Sent: Saturday, January 15, 2011 2:07:28 PM 

Subject: Re: [codec] Next Steps for WG 

First of all, the codec definition should be independent of the transport protocol, and we might need this functionality when RTP SSRCs are not available.

Furthermore, in case of RTP, there are two problems with your suggestion: 

1. Quite a few clients do not allow the remote party to change the SSRC without a re-INVITE. These SIP clients note the SSRC of the first received RTP packet and ignore RTP packets with a different SSRC.

2. Even when the client allows switching of the SSRC on the fly, a few initial RTP packets are typically discarded due to RTP probation. After this, clients typically need to pre-fill the jitter buffer, and only then start audio playback. This produces a gap of 40 ms to 100 ms in the audio, which is very audible and highly undesirable.
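
To make the arithmetic behind that gap concrete, here is a small Python sketch. The packet duration and the packet counts are illustrative assumptions only, picked so that the result spans the range quoted above; real clients differ in both probation and jitter buffer depth.

```python
def switch_gap_ms(discarded_packets: int, prefill_packets: int,
                  packet_ms: int = 20) -> int:
    """Audio gap after an SSRC switch, in milliseconds.

    discarded_packets: packets dropped while the new source is on
        probation (assumed value, client dependent).
    prefill_packets: packets buffered before playback restarts
        (assumed value, client dependent).
    packet_ms: audio duration carried by one packet (assumed 20 ms).
    """
    return (discarded_packets + prefill_packets) * packet_ms


# Assumed best case: one packet lost to probation, one packet of pre-fill.
short_gap = switch_gap_ms(discarded_packets=1, prefill_packets=1)

# Assumed worse case: two packets lost to probation, three of pre-fill.
long_gap = switch_gap_ms(discarded_packets=2, prefill_packets=3)
```

With these assumed values the gap comes out at 40 ms and 100 ms respectively.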

Finally, what I was looking for was a bit more than just a decoder reset. Ideally I wanted to set the decoder state to a known value before it starts decoding the audio packets that I will send to it.

P.S. As a side note, the Intel Performance Primitives codecs implement a packet type in their wave file format that resets the decoder. This packet is used to simplify the implementation of performance and regression tests. I think they use a standard-sized packet with all bits set to zero for this purpose. We could at least do something similar.

_____________ 
Roman Shpount 

On Sat, Jan 15, 2011 at 8:12 AM, Kevin P. Fleming < kpfleming@digium.com > wrote: 

On 01/14/2011 11:27 PM, Roman Shpount wrote: 

My question was not so much about the VAD as about multiple stream
combining. What I want to implement is switching between multiple
talkers based on VAD (computation of which is done by some external
methods). In this case we need to indicate to the remote party that this 
is a new stream and ideally get the remote decoder in such a state that 
it can immediately start correctly decoding the new audio without a 
glitch. So, the minimum requirement would be to have a flag or a packet 
that indicates to the decoder that it needs to reset its state. Ideally, we
need an algorithm to get the proper decoder state based on some audio 
history stored on the conferencing server, and to send this decoder 
state to the remote party so that it can be synchronized. I don't think 
this should affect the codec performance. This is just something that 
needs to be accommodated in the bitstream by adding a packet type which 
either resets the decoder state or sets it to some specified value. 

This sort of mechanism already exists when using RTP as the transport mechanism, by setting the marker bit and changing the SSRC to indicate that the payload in the packet is from a different source than the previous packet. In my opinion there's no need for the codec bitstream to have any provisions for such an indication. 
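
For reference, a minimal Python sketch of that signalling: it packs the fixed 12-byte RTP header defined in RFC 3550 with the marker bit set and a new SSRC. The helper name, the dynamic payload type 96, and the sequence, timestamp, and SSRC values are assumptions made up for this example.

```python
import struct

RTP_VERSION = 2


def build_rtp_header(payload_type: int, sequence: int, timestamp: int,
                     ssrc: int, marker: bool = False) -> bytes:
    """Pack the fixed 12-byte RTP header (no padding, extension or CSRCs)."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (0x80 if marker else 0x00) | payload_type  # M bit, then PT
    return struct.pack("!BBHII", byte0, byte1,
                       sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


# First packet from the new talker: marker bit set, different SSRC.
# Payload type 96 and all field values are placeholders.
header = build_rtp_header(payload_type=96, sequence=1, timestamp=160,
                          ssrc=0x11223344, marker=True)
```

The receiver sees the change of source in bytes 8 to 11 (the SSRC) and the start of a new talk spurt in the top bit of the second byte.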

-- 
Kevin P. Fleming 
Digium, Inc. | Director of Software Technologies 
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA 
skype: kpfleming | jabber: kfleming@digium.com 
Check us out at www.digium.com & www.asterisk.org 

_______________________________________________ 
codec mailing list 
codec@ietf.org 
https://www.ietf.org/mailman/listinfo/codec 
