[xrblock] Video loss concealment support in draft-ietf-xrblock-rtcp-xr-concsec

Qin Wu <bill.wu@huawei.com> Tue, 16 October 2012 06:31 UTC

Message-ID: <DD2E3ADF9AA44E5BAFBDD56E4AC818F6@china.huawei.com>
From: Qin Wu <bill.wu@huawei.com>
To: xrblock@ietf.org
Date: Tue, 16 Oct 2012 14:30:51 +0800
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0494_01CDABAA.D7C47960"
Subject: [xrblock] Video loss concealment support in draft-ietf-xrblock-rtcp-xr-concsec
Precedence: list

Hi,
In order to support video loss concealment, I would like to propose the following changes to draft-ietf-xrblock-rtcp-xr-concsec:
1. Abstract:
OLD TEXT:
"
This document defines an RTP Control Protocol (RTCP) Extended Report
(XR) Block that allows the reporting of Concealed Seconds metrics for
a range of RTP applications primarily for audio applications of RTP.
"
NEW TEXT:
"
This document defines an RTP Control Protocol (RTCP) Extended Report
(XR) Block that allows the reporting of Concealed Seconds metrics for
a range of RTP applications.
"
2. Section 1.1 Editor's Note
OLD TEXT:

At any instant, the audio output at a receiver may be classified as
either 'normal' or 'concealed'. 'Normal' refers to playout of audio
payload received from the remote end, and also includes locally
generated signals such as announcements, tones and comfort noise.
Concealment refers to playout of locally-generated signals used to
mask the impact of network impairments such as lost packets or to
reduce the audibility of jitter buffer adaptations.

Editor's Note: For video application, the output at a receiver
should also be classified as either normal or concealed. Should
this paragraph be clear about this?

NEW TEXT:

At any instant, the media output at a receiver may be classified as
either 'normal' or 'concealed'. 'Normal' refers to playout of media
payload received from the remote end, and also includes locally
generated signals such as announcements, tones and comfort noise.
Concealment refers to playout of locally-generated signals used to
mask the impact of network impairments such as lost packets or to
reduce discontinuities in the media playout (e.g., the audibility
of jitter buffer adaptations).

3. Section 1.4 Editor's Note

OLD TEXT:

This metric is primarily applicable to audio applications of RTP.
EDITOR'S NOTE: are there metrics for concealment of transport errors
for video.

NEW TEXT:

These metrics are primarily applicable to audio applications of RTP.
In addition, they may be used to report the concealment of transport
errors for video applications of RTP.

4. Section 2.1 Editor's Note

OLD TEXT:

2.1. Standards Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

In addition, the following terms are defined:

Editor's Note: For Video loss concealment, at least the following
four methods are used,i.e., Frame freeze,inter-frame
extrapolation, interpolation, Noise insertation, should this
section consider giving definition of these four methods for video
loss concealment?

NEW TEXT:

2.1. Standards Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

In addition, the following terms are defined:

Frame freeze

The impaired video frame is not displayed; instead, the previously
displayed frame is "frozen" for the duration of the loss event.

Inter-frame extrapolation

If an area of the video frame is damaged by loss, the same area from
the previous frame(s) can be used to estimate what the missing pixels
would have been. This can work well in a scene with no motion but can
be very noticeable if there is significant movement from one frame to
another. Simple decoders may simply re-use the pixels that were in the
missing area, while more complex decoders may try to use several
frames to do a more complex extrapolation.

Interpolation

A decoder may use the undamaged pixels in the image to estimate what
the missing block of the image should have contained.

Noise insertion

A decoder may insert random pixel values, which would generally be
less noticeable than a blank rectangle in the image.
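
As a rough illustration of three of the four definitions above (frame freeze, the simple form of inter-frame extrapolation, and noise insertion), here is a minimal sketch. Frames are modelled as lists of rows of 8-bit values; the function name, the block tuple layout and the method strings are hypothetical and are not taken from the draft. Interpolation is left out for brevity.

```python
import random

METHODS = ("frame_freeze", "inter_frame_extrapolation", "noise_insertion")

def conceal(current, previous, block, method, rng=None):
    """Return a concealed copy of `current`.

    `block` is (row_start, row_end, col_start, col_end), end-exclusive,
    and marks the area damaged by loss (an assumed representation).
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    if method == "frame_freeze":
        # The impaired frame is not displayed; the previous one is repeated.
        return [row[:] for row in previous]
    rng = rng or random.Random()
    out = [row[:] for row in current]
    r0, r1, c0, c1 = block
    for r in range(r0, r1):
        for c in range(c0, c1):
            if method == "inter_frame_extrapolation":
                # Simple decoder behaviour: re-use the co-located pixel.
                out[r][c] = previous[r][c]
            else:
                # Noise insertion: fill with a random 8-bit value.
                out[r][c] = rng.randrange(256)
    return out
```

A more complex decoder would replace the co-located copy with an estimate built from several earlier frames, as the definition notes.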

5. Section 3, 1st paragraph, Editor's Note

OLD TEXT:

This sub-block provides a description of potentially audible
impairments due to lost and discarded packets at the endpoint,
expressed on a time basis analogous to a traditional PSTN T1/E1
errored seconds metric.

Editor's Note: Should impairment also cover video application?

NEW TEXT:

This sub-block provides a description of potential network
impairments due to lost and discarded packets at the endpoint,
expressed on a time basis analogous to a traditional PSTN T1/E1
errored seconds metric.

6. Section 3.2, Packet Loss Concealment method definition, Editor's Note

OLD TEXT:

Packet Loss Concealment Method (plc): 2 bits

This field is used to identify the packet loss concealment method
in use at the receiver, according to the following code:

bits 014-015

0 = silence insertion
1 = simple replay, no attenuation
2 = simple replay, with attenuation
3 = enhanced

Other values reserved

Editor's Note 1 : In the packet loss concealment
methods,"Enhanced" is defines as one new Packet loss
Concealment method? However it is not clear what this
packet loss concealment method looks like?

Editor's Note 2: For Video loss concealment, there are a
range of methods used, for example:

(i) Frame freeze In this case the impaired video frame
is not displayed and the previously displayed frame is
hence "frozen" for the duration of the loss event

(ii) Inter-frame extrapolation If an area of the video
frame is damaged by loss, the same area from the
previous frame(s) can be used to estimate what the
missing pixels would have been. This can work well in
a scene with no motion but can be very noticeable if
there is significant movement from one frame to
another. Simple decoders may simply re-use the pixels
that were in the missing area, more complex decoders
may try to use several frames to do a more complex
extrapolation.

(iii) Interpolation A decoder may use the undamaged
pixels in the image to estimate what the missing block
of image should have

(iv) Noise insertion A decoder may insert random pixel
values - which would generally be less noticeable than
a blank rectangle in the image.

Therefore more text required in the future draft to
discuss Techniques for Video Loss Concealment method in
this document.

NEW TEXT:

Packet Loss Concealment Method (plc): 4 bits

This field is used to identify the packet loss concealment method
in use at the receiver, according to the following code:

bits 011-014

0 = silence insertion (audio)
1 = simple replay, no attenuation (audio)
2 = simple replay, with attenuation (audio)
3 = enhanced (audio)
4 = frame freeze (video)
5 = inter-frame extrapolation (video)
6 = interpolation (video)
7 = noise insertion (video)

Other values reserved
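
To make the proposed code points concrete, here is a small sketch of how a receiver of the report might decode the 4-bit plc field. It assumes that "bits 011-014" counts from the most significant bit (bit 0) of the first 32-bit word of the block, as in typical RFC packet diagrams; that reading, and the names PlcMethod, parse_plc and is_video, are assumptions for illustration only.

```python
from enum import IntEnum

class PlcMethod(IntEnum):
    # Values 0-7 as listed in the proposed NEW TEXT.
    SILENCE_INSERTION = 0
    SIMPLE_REPLAY_NO_ATTENUATION = 1
    SIMPLE_REPLAY_WITH_ATTENUATION = 2
    ENHANCED = 3
    FRAME_FREEZE = 4
    INTER_FRAME_EXTRAPOLATION = 5
    INTERPOLATION = 6
    NOISE_INSERTION = 7

# Assumed bit positions, numbered from the MSB (bit 0) of a 32-bit word.
PLC_FIRST_BIT = 11
PLC_LAST_BIT = 14

def parse_plc(word):
    """Extract the plc field; return None for reserved values (8-15)."""
    width = PLC_LAST_BIT - PLC_FIRST_BIT + 1
    shift = 31 - PLC_LAST_BIT
    value = (word >> shift) & ((1 << width) - 1)
    try:
        return PlcMethod(value)
    except ValueError:
        return None

def is_video(method):
    """True for the video concealment methods (values 4-7)."""
    return method >= PlcMethod.FRAME_FREEZE
```

Returning None for values 8 to 15 keeps "Other values reserved" explicit instead of mapping unknown codes onto a defined method.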

7. Section 3.2, Unimpaired Seconds, Editor's Note

OLD TEXT:

Normal playout of comfort noise or other silence concealment
signal during periods of talker silence, if VAD [VAD] is used,
shall be counted as unimpaired seconds.

Editor's Note: It should be clear that VAD does not apply to
video.

NEW TEXT:

For speech applications, normal playout of comfort noise or other
silence concealment signal during periods of talker silence, if VAD
[VAD] is used, shall be counted as unimpaired seconds.

8. Section 3.2, Concealed Seconds, Editor's Note

OLD TEXT:

Equivalently, a concealed second is one in which some Loss-type
concealment has occurred. Buffer adjustment-type concealment
SHALL not cause Concealed Seconds to be incremented, with the
following exception. An implementation MAY cause Concealed
Seconds to be incremented for 'emergency' buffer adjustments made
during talkspurts.

Loss-type concealment is reactive insertion or deletion of samples
in the audio playout stream due to effective frame loss at the
audio decoder. "Effective frame loss" is the event in which a
frame of coded audio is simply not present at the audio decoder
when required. In this case, substitute audio samples are
generally formed, at the decoder or elsewhere, to reduce audible
impairment.

Because this insertion is controlled, rather than occurring
randomly in response to losses, it is typically less audible than
loss-type concealment. For example, jitter buffer adaptation
events may be constrained to occur during periods of talker
silence, in which case only silence duration is affected, or
sophisticated time-stretching methods for insertion/deletion
during favorable periods in active speech may be employed. For
these reasons, buffer adjustment-type concealment MAY be exempted
from inclusion in calculations of Concealed Seconds and Severely
Concealed Seconds.

Editor's Note: In this document, two kind of concealments are
defined: a. Loss-type concealment b. Buffer Adjustment-type
concealment Loss-type concealment is applicable to both audio
and video. However Buffer Adjustment-type concealment is
usually applied to audio. Should this section be clear about
this?

NEW TEXT:

Equivalently, a concealed second is one in which some Loss-type
concealment has occurred. Buffer adjustment-type concealment
is usually designed for audio applications and SHALL NOT cause
Concealed Seconds to be incremented, with the
following exception. An implementation MAY cause Concealed
Seconds to be incremented for 'emergency' buffer adjustments made
during talkspurts.

Loss-type concealment is reactive insertion or deletion of samples
in the media playout stream due to effective frame loss at the
media decoder. "Effective frame loss" is the event in which a
frame of coded media is simply not present at the media decoder
when required. In this case, substitute media samples are
generally formed, at the decoder or elsewhere, to reduce audible or
perceivable impairment.

Buffer Adjustment-type concealment is proactive or controlled
insertion or deletion of samples in the audio playout stream due
to jitter buffer adaptation, re-sizing or re-centering decisions
within the endpoint. Because this insertion is controlled, rather than occurring
randomly in response to losses, it is typically less audible than
loss-type concealment. For example, jitter buffer adaptation
events may be constrained to occur during periods of talker
silence, in which case only silence duration is affected, or
sophisticated time-stretching methods for insertion/deletion
during favorable periods in active speech may be employed. For
these reasons, buffer adjustment-type concealment MAY be exempted
from inclusion in calculations of Concealed Seconds and Severely
Concealed Seconds.
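
The counting rule in the text above could be sketched as follows. Each one-second interval is summarised by three flags; the SecondReport type, its field names and the count_emergency switch (standing in for the MAY clause on 'emergency' adjustments during talkspurts) are hypothetical and not part of the draft.

```python
from dataclasses import dataclass

@dataclass
class SecondReport:
    # Flags describing what happened within one second of playout.
    loss_concealment: bool = False
    buffer_adjustment: bool = False
    emergency_adjustment_in_talkspurt: bool = False

def concealed_seconds(reports, count_emergency=False):
    """Count seconds in which some loss-type concealment occurred.

    Ordinary buffer adjustment never increments the count. Emergency
    adjustments during talkspurts are counted only when the
    implementation opts in via `count_emergency`.
    """
    total = 0
    for r in reports:
        if r.loss_concealment:
            total += 1
        elif count_emergency and r.emergency_adjustment_in_talkspurt:
            total += 1
    return total
```

For example, a second containing only a routine jitter buffer re-centering contributes nothing, whichever way count_emergency is set.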

Regards!

-Qin
