Re: [Speechsc] RAI review of draft-ietf-speechsc-mrcpv2-19

Dan Burnett <dburnett@voxeo.com> Tue, 11 August 2009 18:35 UTC

From: Dan Burnett <dburnett@voxeo.com>
To: Roni Even <Even.roni@huawei.com>
Cc: speechsc@ietf.org, sarvi@cisco.com, oran@cisco.com, rai@ietf.org
Date: Tue, 11 Aug 2009 14:21:48 -0400
Subject: Re: [Speechsc] RAI review of draft-ietf-speechsc-mrcpv2-19

On Jul 7, 2009, at 3:40 PM, Roni Even wrote:

> Hi,
>
> I was assigned to do a RAI review of the draft. The draft looks
> ready for publication to me. I have some comments, mostly editorial.
>
> The only issue I see that is not purely editorial concerns the
> parameters such as confidence threshold and sensitivity level
> (see comments 11, 13, 15, 16 and 17). I think that some
> clarification of the semantics and the scale (for example, are the
> values linearly spaced?), as well as of when they are useful, would
> be helpful to implementers.
>
> 1.       In Figure 1, expand the abbreviations TTS, ASR, SV and SI, and
> explain how they relate to the media resource types in 3.1.
>

Done.  Added some text explaining Figure 1 and enhanced Figure 1  
slightly for clarification.
> 2.       In Figure 1 there is a SIP dialog between the MRCPv2 client
> and the media source/sink. What is this dialog? In section 4 I only
> saw a dialog between the client and server.
>
Clarified in the first example of section 4.2 that the SIP dialog with  
the media source/sink is not shown.
> 3.       In section 3.2 you have “For example:  
> sip:mrcpv2@example.net” twice one after the other.
>
Fixed.

> 4.       In the example in section 4.2 you have “a=cmid:1”. cmid is
> specified later in the document, so maybe you can add a reference
> to where it is specified.

Done.

>
> 5.       In the example in section 4.2 and in the following examples you
> have “m=audio 49170 RTP/AVP 0 96” but do not have an rtpmap
> parameter for mapping 96 (dynamic payload type number) to a media
> encoding name.

It is not in the first or third examples (Synthesizer only), but it is  
in the second example (Recognizer).  I have removed 96 as an option  
for the Synthesizer-only examples but let it remain as an addition for  
the Recognizer example.
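
The kind of check behind comment 5 can be sketched in a few lines of Python. This is only an illustration, not text from the draft: the function name is hypothetical, and it assumes the usual SDP convention that payload types 96-127 are dynamic and need an a=rtpmap line in the same media section.

```python
def unmapped_dynamic_payload_types(sdp: str) -> list[int]:
    """Return dynamic payload types (96-127) listed on an m= line
    that have no a=rtpmap line in the same media section."""
    missing: list[int] = []
    listed: list[int] = []   # dynamic payload types of the current m= section
    mapped: set[int] = set()  # payload types given an rtpmap in that section

    def close_section() -> None:
        # Record every dynamic type of the finished section that was never mapped.
        missing.extend(pt for pt in listed if pt not in mapped)

    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            close_section()
            fields = line[2:].split()  # media, port, proto, then formats
            listed = [int(f) for f in fields[3:]
                      if f.isdigit() and 96 <= int(f) <= 127]
            mapped = set()
        elif line.startswith("a=rtpmap:"):
            pt = line[len("a=rtpmap:"):].split()[0]
            if pt.isdigit():
                mapped.add(int(pt))
    close_section()
    return missing
```

Run against a media section containing only "m=audio 49170 RTP/AVP 0 96", this reports 96 as unmapped; adding an rtpmap line for 96 makes the result empty.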

>
> 6.       In section 4.3: “Also note that more that one media session
> can be associated with a single resource if need be, but this
> scenario is not useful for the current set of resources”. There is a
> typo: the second “that” should be “than”. I am also not sure whether
> the current syntax in this document can support this mode.
>
Fixed the typo.

>
> 7.       In section 4.3 “The formatting of the"cmid" attribute in  
> SDP RFC3388 [RFC4566]”. I think you meant SDP grouping and need the  
> reference to RFC 3388.
>
I removed the reference altogether because it already exists  
(correctly) earlier in the paragraph.

>
> 8.       In section 5.1: “The message-length field specifies the
> length of the message, including the start-line”. Is the length in
> bytes? No unit is specified.

Changed "length of the message" to "length of the message in bytes".
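
Because the length counts the start-line, which itself contains the length, a sender has to settle the digit count before it can write the field. A minimal Python sketch of that, assuming a start-line of the form "MRCP/2.0 message-length method-name request-id" and CRLF line endings (the function name and the header used in the usage note are illustrative only):

```python
def frame_request(method: str, request_id: int,
                  headers: dict[str, str], body: bytes = b"") -> bytes:
    """Build a request whose message-length field counts every byte of
    the message, including the start-line that carries the field."""
    header_block = "".join(f"{name}: {value}\r\n"
                           for name, value in headers.items()).encode("utf-8")
    header_block += b"\r\n"  # blank line ending the header section
    rest_len = len(header_block) + len(body)

    length = rest_len  # first guess: ignores the start-line entirely
    while True:
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + rest_len
        if total == length:
            # The declared length now matches the real byte count.
            return start_line + header_block + body
        length = total  # retry; the digit count may have grown
```

For example, frame_request("SPEAK", 543257, {"Channel-Identifier": "32AECB23433802@speechsynth"}) yields a message whose second start-line token equals the total number of bytes returned. The loop settles after at most a few passes because only the number of digits in the length can change between passes.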

>
> 9.       In section 6.3.1, typo you have “Verfication “ instead of  
> verification. It appears twice in the section.

Fixed.

>
> 10.   In the example in section 7 you have “m=audio 0 RTP/AVP 0 1 3”.
> Payload type 1 was deleted from the IANA registry, so maybe use
> another payload type number.

I just removed that payload type.  It is not germane to the example.

>
> 11.   In sections 9.4.1, 9.4.2 and 9.4.3 you specify confidence
> threshold, sensitivity level and speed vs accuracy. What is the
> scale here? Is it linear between 0 and 1? What is the absolute value
> of the number: if you receive the same confidence level from two
> recognizers, are they the same (e.g. when using a context block to
> switch servers)? For speed vs accuracy, how does the client
> know the relation between the value and the number of
> available sessions, since this seems to be the reason for using this
> parameter?
>
The interpretation of all of these parameters is implementation- 
specific because the underlying technologies used to implement them  
vary and can even be proprietary.  In practice the speech recognition  
and synthesis and speaker authentication communities have lived with  
this state of affairs for many years, and users of other APIs for this  
technology are well aware of and have built applications that  
accommodate this variability in interpretation.  It is outside the  
scope of this specification to attempt to standardize interpretations  
of these values.

> 12.   In 9.4.9, 10.4.8 and 11.4.11, what are the values for
> media-type-value? You also mention audio and video, but it looks to
> me that this document only discusses voice.

Yes.  Although the original intent was to record speech, application  
authors today are beginning to look at ways to incorporate other audio  
or video.  The intent of the sentences in these sections is to clarify  
that the specification itself imposes no restriction on the types of  
media that are allowed.

>
> 13.   In 9.4.35 and 9.4.36, what is the scale for the consistency
> here? How does one know what close means? What is the consistency
> between different recognizers?

The answer to question 11, above, applies here as well.

>
> 14.   In section 9.6.3.3 in the example (figure 2) confidence should  
> be 0.75 and not 75

Fixed.

>
> 15.   In section 10.4.1 it is not clear how you measure the
> sensitivity in order to specify it. Is it based on some SNR
> translated to a 0 to 1 scale?

The answer to question 11, above, applies here as well.

>
> 16.   In 11.4.6 there is the same issue with the scale: how does the
> client know how to set a value when working with different speaker
> verification servers?

Ditto.  I should point out that in all of these cases the parameters  
are typically passed directly to the engine, and their interpretations  
are defined (and described) in the vendors' documentation.  The most  
common MRCPv2 server implementations are by the technology vendors  
themselves (the providers of the synthesis, recognition, and  
verification engines).  This is commonly understood in this technology  
industry (meaning those who use this technology regularly).

>
> 17.   In 11.5.2.9 you state that the verification-score is not a
> probability, so what is it? How can the client decide if, for
> example, 0 is a good score for specifying the threshold? I also
> noticed that the values in the example in section 11.5.2.10 are very
> precise, like 0.98514. Is this the expected precision? The examples
> here and in section 11.11 do not show the threshold. If the
> threshold is required for this flow, why not show it in the example?

This parameter, as others mentioned above, has only a vendor-specific  
interpretation.  In practice authors interpret these values based both  
on guidance from the technology vendors and via experimentation on  
large sets of recorded data.

The Min-Verification-Score threshold is not required to be set.  In  
many cases the technology vendor has a fairly good understanding of  
what the default threshold should be.  The verification-score is  
returned, however, in case the application author determines (through  
experimentation, as described above) that the default threshold is not  
producing optimal results for the application.  In that case the  
author can set the threshold to a different value or can set it to -1  
and make the determination within the application itself based on the  
verification-score values.
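
As a rough illustration of that last option, here is a Python sketch of an application choosing its own threshold from labeled trial scores and then applying it to returned verification-score values. The function names and the error-counting rule are my assumptions for the example; nothing here is defined by the specification, and real tuning would weigh false accepts and false rejects according to the application.

```python
def pick_threshold(genuine: list[float], impostor: list[float]) -> float:
    """Return the lowest candidate threshold with the fewest total errors
    on scores from known-genuine and known-impostor recordings."""
    candidates = sorted(set(genuine) | set(impostor))
    if not candidates:
        raise ValueError("need at least one recorded score")

    def errors(threshold: float) -> int:
        false_rejects = sum(1 for s in genuine if s < threshold)
        false_accepts = sum(1 for s in impostor if s >= threshold)
        return false_rejects + false_accepts

    # min() keeps the first (lowest) candidate among ties.
    return min(candidates, key=errors)


def accept_speaker(verification_score: float, app_threshold: float) -> bool:
    """Application-side decision when the server-side threshold is disabled."""
    return verification_score >= app_threshold
```

The scores fed to pick_threshold would come from the same vendor engine the application uses in production, since the values are not comparable across engines.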

>
> 18.   In section 12.3 the suggestion is to use SRTP as the mandatory  
> interoperability mode. If the reason for mandating SRTP is for a  
> common mode you should also decide on a key exchange mechanism. I  
> suggest you look at http://tools.ietf.org/html/draft-ietf-avt-srtp-not-mandatory-02 
>  for discussion on media security.

Based on the discussion between you and Dan York on the list, I will  
change this:

12.3. Media session protection
Sensitive data is also carried on media sessions terminating on MRCPv2  
servers (the other end of a media channel may or may not be on the  
MRCPv2 client). This data includes the user's spoken utterances and  
the output of text-to-speech operations. MRCPv2 servers MUST support  
SRTP for protection of audio media sessions. MRCPv2 clients that  
originate or consume audio similarly MUST support SRTP. Alternative  
media channel protection MAY be used if desired (e.g. IPSEC).

to this:

12.3. Media session protection
Sensitive data is also carried on media sessions terminating on MRCPv2  
servers (the other end of a media channel may or may not be on the  
MRCPv2 client). This data includes the user's spoken utterances and  
the output of text-to-speech operations. MRCPv2 servers MUST support a  
security mechanism for protection of audio media sessions. MRCPv2  
clients that originate or consume audio similarly MUST support a  
security mechanism for protection of the audio. If appropriate, usage  
of the Secure Real-time Transport Protocol (SRTP) [RFC3711] is  
recommended.
>
> 19.   In section 13.7.2 you specify the attribute resource as session
> level, yet in the example in section 4.2 it is a media-level
> attribute. The same goes for the channel attribute.

I have corrected both in section 13.7.2 to be media-level.
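
For anyone checking their own SDP against this, a small Python sketch that reports whether the resource and channel attributes appear before the first m= line (session level) or after one (media level). The function name is hypothetical and the parsing is deliberately simplistic.

```python
def attribute_levels(sdp: str,
                     names: tuple[str, ...] = ("resource", "channel")) -> dict[str, set[str]]:
    """Map each attribute name to the SDP levels ("session" or "media")
    at which it occurs."""
    levels: dict[str, set[str]] = {name: set() for name in names}
    level = "session"  # everything before the first m= line is session level
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            level = "media"
        elif line.startswith("a="):
            name = line[2:].split(":", 1)[0]
            if name in levels:
                levels[name].add(level)
    return levels
```

An offer that places "a=resource:speechsynth" under its "m=application" line reports {"media"} for resource, which matches the corrected registration.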

>
> Thanks
>
> Roni Even
>
>