Re: [Speechsc] [RAI] RAI review of draft-ietf-speechsc-mrcpv2-19

"Roni Even" <ron.even.tlv@gmail.com> Tue, 29 December 2009 15:37 UTC

From: Roni Even <ron.even.tlv@gmail.com>
To: 'Dan Burnett' <dburnett@voxeo.com>, 'Roni Even' <Even.roni@huawei.com>
Date: Tue, 29 Dec 2009 17:35:29 +0200
Cc: speechsc@ietf.org, sarvi@cisco.com, oran@cisco.com, rai@ietf.org
Subject: Re: [Speechsc] [RAI] RAI review of draft-ietf-speechsc-mrcpv2-19

Looks OK

Roni

 

From: rai-bounces@ietf.org [mailto:rai-bounces@ietf.org] On Behalf Of Dan
Burnett
Sent: Tuesday, December 29, 2009 1:01 PM
To: Roni Even
Cc: speechsc@ietf.org; sarvi@cisco.com; oran@cisco.com; rai@ietf.org
Subject: Re: [RAI] RAI review of draft-ietf-speechsc-mrcpv2-19

 

Hi Roni,

 

Just to finish up on your last comments . . .

 

-- dan

 

On Aug 12, 2009, at 3:15 AM, Roni Even wrote:





Hi Dan,

I understand your explanation about all these "vendor specific" parameters. I
think that since this is a standards-track document there should be some text
explaining the usage of these parameters, as well as a note that, since
these are vendor-specific values, you cannot compare the values coming
from different vendors.

 

Thank you.  I will note this in the next draft and suggest how these
parameters may be used in light of their vendor dependence.





 

 

As for my comment number 5 on payload type 96: my comment was that if the
m-line has a payload type number of 96, you must have an a=rtpmap line mapping
96 to a specific subtype name, while for pcmu it is not mandatory to have
a=rtpmap, as you have in your examples, since payload type number 0 is a
static payload type number assigned to pcmu.

 

 

I'm sorry, I did not explain this very well.  I understood your comment.  My
reply was that of the three examples, example 2 did actually provide the
a=rtpmap line for 96.  Since the payload type of 96 should not even have
been included in the first and third examples, once I removed it from those
two examples all three contained the proper a=rtpmap lines.

Although an a=rtpmap line is not necessary for payload type 0, others had
requested it in the past, so I left it in.
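
As a rough illustration of the rtpmap point above (a sketch, not text from
the draft): the check below flags dynamic payload types (96-127) listed on an
RTP m-line that have no matching a=rtpmap line, while leaving static types
such as 0 alone. The function name and the sample SDP, including the
telephone-event mapping for 96, are assumptions made for the example.

```python
def missing_rtpmap(sdp):
    """Return {m-line: [dynamic payload types without an a=rtpmap line]}."""
    sections = []  # one (m_line, payload_types, mapped_types) tuple per m= line
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()  # media, port, proto, then the formats
            is_rtp = len(fields) > 3 and fields[2].startswith("RTP/")
            formats = [int(f) for f in fields[3:]] if is_rtp else []
            sections.append((line, formats, set()))
        elif line.startswith("a=rtpmap:") and sections:
            payload_type = line[len("a=rtpmap:"):].split()[0]
            sections[-1][2].add(int(payload_type))

    problems = {}
    for m_line, formats, mapped in sections:
        # Only dynamic payload types (96-127) need an explicit mapping.
        missing = [pt for pt in formats if 96 <= pt <= 127 and pt not in mapped]
        if missing:
            problems[m_line] = missing
    return problems


# Hypothetical SDP fragments, for illustration only.
without_map = "m=audio 49170 RTP/AVP 0 96\na=rtpmap:0 pcmu/8000\n"
with_map = without_map + "a=rtpmap:96 telephone-event/8000\n"
print(missing_rtpmap(without_map))
print(missing_rtpmap(with_map))
```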





 

Roni Even

 

From: Dan Burnett [mailto:dburnett@voxeo.com] 
Sent: Tuesday, August 11, 2009 9:22 PM
To: Roni Even
Cc: sarvi@cisco.com; oran@cisco.com; 'Eric Burger'; speechsc@ietf.org;
rai@ietf.org
Subject: Re: RAI review of draft-ietf-speechsc-mrcpv2-19

 

 

On Jul 7, 2009, at 3:40 PM, Roni Even wrote:






Hi,

I was assigned to do a RAI review of the draft.  The draft looks ready for
publication to me. I have some comments, mostly editorial.

The only issue I see that is not purely editorial concerns the
different parameters such as confidence threshold and sensitivity level (see
comments 11, 13, 15, 16 and 17). I think that some clarification of the
semantics and the scale (for example, are the values linearly spaced?), as well
as of when they are useful, would be helpful to implementers.

1.       In figure 1, expand the abbreviations TTS, ASR, SV and SI, and explain
how they are related to the media resource types in 3.1.

 

Done.  Added some text explaining Figure 1 and enhanced Figure 1 slightly
for clarification.




2.       In figure 1 there is a SIP dialog between the MRCPv2 client and the
media source/sink. What is this dialog? In section 4 I only saw a dialog
between the client and server.

Clarified in the first example of section 4.2 that the SIP dialog with the
media source/sink is not shown.

3.       In section 3.2 you have "For example:  sip:mrcpv2@example.net"
twice, one after the other.

 

Fixed.






4.       In the example in section 4.2 you have "a=cmid:1". cmid is specified
later in the document, so maybe you can add a reference to where it is
specified.

 

Done.






 

5.       In the example in section 4.2 and in the following examples you have
"m=audio 49170 RTP/AVP 0 96" but no rtpmap attribute mapping
96 (a dynamic payload type number) to a media encoding name.

 

It is not in the first or third examples (Synthesizer only), but it is in
the second example (Recognizer).  I have removed 96 as an option for the
Synthesizer-only examples but let it remain as an addition for the
Recognizer example.






 

6.       In section 4.3 "Also note that more that one media session can be
associated with a single resource if need be, but this scenario is not
useful for the current set of resources". There is a typo: the second "that"
should be "than". I am also not sure whether the current syntax in this
document can support this mode.

 

Fixed the typo.






 

7.       In section 4.3 "The formatting of the"cmid" attribute in SDP
RFC3388 [RFC4566]". I think you meant SDP grouping and need the reference to
RFC 3388.

 

I removed the reference altogether because it already exists (correctly)
earlier in the paragraph.






 

8.       In section 5.1 "The message-length field specifies the length of
the message, including the start-line". Is the length in bytes? No unit is
specified.

 

Changed "length of the message" to "length of the message in bytes".
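
To make the byte-count point concrete, here is a minimal sketch (not text
from the draft). Because the length counts the start-line, and the start-line
contains the length itself, the value has to be found as a fixed point. The
start-line layout used here (version, message-length, method, request-id) and
the sample header are assumptions made for the example.

```python
def frame_request(method, request_id, headers, body=b""):
    """Build a request whose message-length counts every byte, start-line included."""
    header_block = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    rest = header_block.encode("utf-8") + b"\r\n" + body

    length = 0
    while True:
        # Assumed start-line layout: version SP message-length SP method SP request-id CRLF
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(rest)
        if total == length:
            # The declared length now matches the real size in bytes.
            return start_line + rest
        length = total  # the digit count may have changed, so try again


# Hypothetical request, for illustration only.
message = frame_request("SPEAK", 543257, {"Channel-Identifier": "32AECB23433802@speechsynth"})
declared = int(message.split(b" ")[1])
print(declared == len(message))
```

The loop settles after a few passes because the total can only grow when the
number of digits in the length grows.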






 

9.       In section 6.3.1 there is a typo: you have "Verfication" instead of
"Verification". It appears twice in the section.

 

Fixed.






 

10.   In the example in section 7 you have "m=audio 0 RTP/AVP 0 1 3". Payload
type 1 was deleted from the IANA registry, so maybe use another payload type
number.

 

I just removed that payload type.  It is not germane to the example.






 

11.   In sections 9.4.1, 9.4.2 and 9.4.3 you specify confidence threshold,
sensitivity level and speed vs accuracy. What is the scale here? Is it
linear between 0 and 1? What is the absolute meaning of the number: if you
receive the same confidence level from two recognizers, are they the same
(e.g. when using context block to switch servers)?  For speed vs
accuracy, how does the client know the relation between the value
and the number of available sessions, since this seems to be the reason for
using this parameter?

 

The interpretation of all of these parameters is implementation-specific
because the underlying technologies used to implement them vary and can even
be proprietary.  In practice the speech recognition and synthesis and
speaker authentication communities have lived with this state of affairs for
many years, and users of other APIs for this technology are well aware of
and have built applications that accommodate this variability in
interpretation.  It is outside the scope of this specification to attempt to
standardize interpretations of these values.






12.   In 9.4.9, 10.4.8 and 11.4.11, what are the values for
media-type-value? You also mention audio and video, but it looks to me that
this document only discusses voice.

 

Yes.  Although the original intent was to record speech, application authors
today are beginning to look at ways to incorporate other audio or video.
The intent of the sentences in these sections is to clarify that the
specification itself imposes no restriction on the types of media that are
allowed.






 

13.   In 9.4.35 and 9.4.36, what is the scale for the consistency? How
does one know what "close" means? How consistent are the values across
different recognizers?

 

The answer to question 11, above, applies here as well.






 

14.   In section 9.6.3.3, in the example (figure 2), confidence should be 0.75
and not 75.

 

Fixed.
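
A small sketch of the kind of check that would catch a value such as 75
where 0.75 was meant, assuming (as comments 11 and 14 suggest) that
confidence values lie between 0 and 1. The function name is hypothetical.

```python
def parse_unit_interval(text):
    """Parse a value expected to lie in [0.0, 1.0]; raise ValueError otherwise."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a value between 0.0 and 1.0, got {text!r}")
    return value


print(parse_unit_interval("0.75"))
try:
    parse_unit_interval("75")
except ValueError as error:
    print("rejected:", error)
```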






 

15.   In section 10.4.1 it is not clear how you measure the sensitivity in
order to specify it. Is it based on some SNR translated to a 0 to 1 scale?

 

The answer to question 11, above, applies here as well.






 

16.   In 11.4.6 there is the same issue with the scale: how does the client
know how to set a value when working with different speaker verification
servers?

 

Ditto.  I should point out that in all of these cases the parameters are
typically passed directly to the engine, and their interpretations are
defined (and described) in the vendors' documentation.  The most common
MRCPv2 server implementations are by the technology vendors themselves (the
providers of the synthesis, recognition, and verification engines).  This is
commonly understood in this technology industry (meaning those who use this
technology regularly).






 

17.   In 11.5.2.9 you state that the verification-score is not a
probability, so what is it? How can the client decide whether, for example, 0
is a good score to specify as the threshold?  I also noticed that the values
in the example in section 11.5.2.10 are very precise, like 0.98514. Is this
the expected precision? The examples here and in section 11.11 do not show
the threshold. If the threshold is required for this flow, why not show it in
the example?

 

This parameter, as others mentioned above, has only a vendor-specific
interpretation.  In practice authors interpret these values based both on
guidance from the technology vendors and via experimentation on large sets
of recorded data.

 

The Min-Verification-Score threshold is not required to be set.  In many
cases the technology vendor has a fairly good understanding of what the
default threshold should be.  The verification-score is returned, however,
in case the application author determines (through experimentation, as
described above) that the default threshold is not producing optimal results
for the application.  In that case the author can set the threshold to a
different value or can set it to -1 and make the determination within the
application itself based on the verification-score values.
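
The application-side determination described above could look roughly like
the sketch below. It picks a threshold from labelled trial scores and then
applies it to a returned verification-score. The function names and the score
lists are made up for illustration, and since the scores are vendor-specific,
a threshold chosen this way would only be meaningful for the engine that
produced the trial scores.

```python
def pick_threshold(genuine_scores, impostor_scores):
    """Pick the candidate threshold with the fewest total errors on labelled data."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best, best_errors = None, None
    for threshold in candidates:
        false_rejects = sum(1 for score in genuine_scores if score < threshold)
        false_accepts = sum(1 for score in impostor_scores if score >= threshold)
        errors = false_rejects + false_accepts
        if best_errors is None or errors < best_errors:
            best, best_errors = threshold, errors
    return best


def accept_speaker(verification_score, app_threshold):
    """Application-side decision, used when the server-side threshold is disabled."""
    return verification_score >= app_threshold


# Made-up trial scores from a single engine, for illustration only.
genuine = [0.9, 0.8, 0.7]
impostor = [0.1, 0.2, 0.75]
threshold = pick_threshold(genuine, impostor)
print(threshold, accept_speaker(0.98514, threshold))
```

Minimizing the plain error count is only one possible criterion. An
application that cares more about false accepts than false rejects would
weight the two differently.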






 

18.   In section 12.3 the suggestion is to use SRTP as the mandatory
interoperability mode. If the reason for mandating SRTP is for a common mode
you should also decide on a key exchange mechanism. I suggest you look at
http://tools.ietf.org/html/draft-ietf-avt-srtp-not-mandatory-02 for
discussion on media security.

 

Based on the discussion between you and Dan York on the list, I will change
this:

 

12.3. Media session protection 
Sensitive data is also carried on media sessions terminating on MRCPv2
servers (the other end of a media channel may or may not be on the MRCPv2
client). This data includes the user's spoken utterances and the output of
text-to-speech operations. MRCPv2 servers MUST support SRTP for protection
of audio media sessions. MRCPv2 clients that originate or consume audio
similarly MUST support SRTP. Alternative media channel protection MAY be
used if desired (e.g. IPSEC).

 

to this:

 

12.3. Media session protection 
Sensitive data is also carried on media sessions terminating on MRCPv2
servers (the other end of a media channel may or may not be on the MRCPv2
client). This data includes the user's spoken utterances and the output of
text-to-speech operations. MRCPv2 servers MUST support a security mechanism
for protection of audio media sessions. MRCPv2 clients that originate or
consume audio similarly MUST support a security mechanism for protection of
the audio. If appropriate, usage of the Secure Real-time Transport Protocol
(SRTP) [RFC3711] is recommended.

 

19.   In section 13.7.2 you specify the resource attribute as session-level,
yet in the example in section 4.2 it is a media-level attribute. The same
goes for the channel attribute.

 

I have corrected both in section 13.7.2 to be media-level.
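
For readers less familiar with the distinction: in SDP, a= lines that appear
before the first m= line are session-level, and those after an m= line are
media-level for that m= line. The sketch below classifies attributes on that
basis. The function name and the SDP fragment are assumptions made for the
example, not text from the draft.

```python
def attribute_levels(sdp):
    """Map each a= attribute name to the set of levels where it appears."""
    levels = {}
    level = "session"  # everything before the first m= line is session-level
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            level = "media"
        elif line.startswith("a="):
            name = line[2:].split(":", 1)[0]
            levels.setdefault(name, set()).add(level)
    return levels


# Hypothetical SDP fragment, for illustration only.
example = "\n".join([
    "v=0",
    "s=-",
    "t=0 0",
    "m=application 9 TCP/MRCPv2 1",
    "a=resource:speechsynth",
    "a=cmid:1",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 pcmu/8000",
    "a=mid:1",
])
print(attribute_levels(example)["resource"])
```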






 

Thanks

 

Roni Even