Re: [Speechsc] RAI review of draft-ietf-speechsc-mrcpv2-19

Dan Burnett <dburnett@voxeo.com> Tue, 29 December 2009 11:01 UTC

From: Dan Burnett <dburnett@voxeo.com>
To: Roni Even <Even.roni@huawei.com>
Date: Tue, 29 Dec 2009 06:00:50 -0500
Cc: speechsc@ietf.org, sarvi@cisco.com, oran@cisco.com, rai@ietf.org
Subject: Re: [Speechsc] RAI review of draft-ietf-speechsc-mrcpv2-19

Hi Roni,

Just to finish up on your last comments . . .

-- dan

On Aug 12, 2009, at 3:15 AM, Roni Even wrote:

> Hi Dan,
> I understand your explanation about all these "vendor specific"
> parameters. I think that since this is a standards-track document
> there should be some text explaining the usage of these parameters,
> as well as a note that since these values are vendor specific, they
> cannot be compared across different vendors.

Thank you.  I will note this in the next draft and suggest how these  
parameters may be used in light of their vendor dependence.

>
>
> As for my comment number 5 on payload type 96: my point was that
> if the m-line has a payload type number of 96, you must have an
> a=rtpmap line mapping 96 to a specific subtype name, while for pcmu
> an a=rtpmap line is not mandatory (though your examples include
> one), since payload type number 0 is a static payload type number
> assigned to pcmu.
>

I'm sorry, I did not explain this very well.  I understood your  
comment.  My reply was that of the three examples, example 2 did  
actually provide the a=rtpmap line for 96.  Since the payload type of  
96 should not even have been included in the first and third examples,  
once I removed it from those two examples all three contained the  
proper a=rtpmap lines.
Although an a=rtpmap line is not necessary for payload type 0,
others had requested it in the past, so I left it in.
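
To make the rule concrete, here is a rough sketch of the check being discussed: every dynamic payload type (96 to 127) listed on an RTP m= line needs an a=rtpmap line in the same media section, while static types such as 0 (pcmu) do not. The function name and the sample SDP are illustrative assumptions, not text from the draft, and the parsing is deliberately minimal.

```python
import re

DYNAMIC_MIN = 96  # payload types 96-127 are dynamic and need an a=rtpmap line


def missing_rtpmaps(sdp: str) -> list[int]:
    """Return dynamic payload types offered on m= lines without an a=rtpmap."""
    sections = []  # one (offered, mapped) pair of sets per m= line
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()  # media, port, proto, then format list
            is_rtp = len(fields) > 3 and fields[2].startswith("RTP/")
            formats = fields[3:] if is_rtp else []
            sections.append(({int(f) for f in formats if f.isdigit()}, set()))
        elif line.startswith("a=rtpmap:") and sections:
            match = re.match(r"a=rtpmap:(\d+) ", line)
            if match:
                sections[-1][1].add(int(match.group(1)))
    missing = []
    for offered, mapped in sections:
        missing.extend(
            sorted(pt for pt in offered if pt >= DYNAMIC_MIN and pt not in mapped)
        )
    return missing


# Hypothetical SDP fragments: the first omits the mapping for 96.
without_map = "m=audio 49170 RTP/AVP 0 96\r\na=rtpmap:0 pcmu/8000\r\n"
with_map = without_map + "a=rtpmap:96 telephone-event/8000\r\n"
print(missing_rtpmaps(without_map), missing_rtpmaps(with_map))
```

The first fragment is flagged because 96 has no mapping; the second is not. A fragment that offers only payload type 0 passes with or without its a=rtpmap line.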

>
> Roni Even
>
> From: Dan Burnett [mailto:dburnett@voxeo.com]
> Sent: Tuesday, August 11, 2009 9:22 PM
> To: Roni Even
> Cc: sarvi@cisco.com; oran@cisco.com; 'Eric Burger';  
> speechsc@ietf.org; rai@ietf.org
> Subject: Re: RAI review of draft-ietf-speechsc-mrcpv2-19
>
>
> On Jul 7, 2009, at 3:40 PM, Roni Even wrote:
>
>
> Hi,
>
> I was assigned to do a RAI review of the draft.  The draft looks
> ready for publication to me. I have some comments, mostly editorial.
>
> The only issue I see that is not purely editorial concerns the
> different parameters like confidence threshold and sensitivity level
> (see comments 11, 13, 15, 16 and 17). I think that some
> clarification of the semantics and the scale (for example, are the
> values linearly spaced?), as well as of when they are useful, would
> be helpful to implementers.
>
> 1.       In figure 1, expand the abbreviations TTS, ASR, SV and SI,
> and explain how they relate to the media resource types in 3.1.
>
>
> Done.  Added some text explaining Figure 1 and enhanced Figure 1  
> slightly for clarification.
>
> 2.       In figure 1 there is a SIP dialog between the MRCPv2 client
> and the media source/sink. What is this dialog? In section 4 I only
> saw a dialog between the client and server.
>
> Clarified in the first example of section 4.2 that the SIP dialog  
> with the media source/sink is not shown.
> 3.       In section 3.2 you have “For example:  
> sip:mrcpv2@example.net” twice one after the other.
>
> Fixed.
>
>
> 4.       In the example in section 4.2 you have “a=cmid:1”. cmid is
> specified later in the document, so maybe you can add a reference
> to where it is specified.
>
> Done.
>
>
>
> 5.       In the example in section 4.2 and in following examples you
> have “m=audio 49170 RTP/AVP 0 96” but do not have an rtpmap
> parameter for mapping 96 (dynamic payload type number) to a media
> encoding name.
>
> It is not in the first or third examples (Synthesizer only), but it  
> is in the second example (Recognizer).  I have removed 96 as an  
> option for the Synthesizer-only examples but let it remain as an  
> addition for the Recognizer example.
>
>
>
> 6.       In section 4.3 “Also note that more that one media session  
> can be associated with a single resource if need be, but this  
> scenario is not useful for the current set of resources”. There is a
> typo: the second “that” should be “than”. I am also not sure whether
> the current syntax in this document can support this mode.
>
> Fixed the typo.
>
>
>
> 7.       In section 4.3 “The formatting of the"cmid" attribute in  
> SDP RFC3388 [RFC4566]”. I think you meant SDP grouping and need the  
> reference to RFC 3388.
>
> I removed the reference altogether because it already exists  
> (correctly) earlier in the paragraph.
>
>
>
> 8.       In section 5.1 “The message-length field specifies the
> length of the message, including the start-line”. Is the length in
> bytes? No unit is specified.
>
> Changed "length of the message" to "length of the message in bytes".
>
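
One consequence of counting bytes and including the start-line is that the length field counts its own digits. The sketch below shows one way a sender might compute it. It assumes a request start-line of the form "MRCP/2.0 message-length method-name request-id"; the function name, header, and request-id values are illustrative only.

```python
def build_request(method: str, request_id: int, headers: dict, body: bytes = b"") -> bytes:
    """Assemble a request whose message-length covers the whole message in bytes."""
    header_block = "".join(f"{k}: {v}\r\n" for k, v in headers.items()).encode("utf-8")
    rest = header_block + b"\r\n" + body
    length = 0
    while True:
        # The length appears inside the start-line, so recompute until the
        # declared value matches the actual total (a fixed point).
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(rest)
        if total == length:
            return start_line + rest
        length = total


# Hypothetical request; the body contains a two-byte UTF-8 character.
msg = build_request(
    "SPEAK",
    543257,
    {"Channel-Identifier": "32AECB23433802@speechsynth"},
    "caf\u00e9".encode("utf-8"),
)
declared = int(msg.split(b" ")[1])
print(declared == len(msg))
```

Measuring the encoded bytes rather than the character count matters as soon as the body contains non-ASCII text, as in the sample above.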
>
>
> 9.       In section 6.3.1 there is a typo: you have “Verfication”
> instead of “Verification”. It appears twice in the section.
>
> Fixed.
>
>
>
> 10.   In the example in section 7 you have “m=audio 0 RTP/AVP 0 1 3”.
> Payload type 1 was deleted from the IANA registry; maybe use
> another payload type number.
>
> I just removed that payload type.  It is not germane to the example.
>
>
>
> 11.   In sections 9.4.1, 9.4.2 and 9.4.3 you specify confidence
> threshold, sensitivity level and speed vs accuracy. What is the
> scale here? Is it linear between 0 and 1? What is the absolute
> meaning of the number: if you receive the same confidence level from
> two recognizers, are they the same (e.g. when using context block to
> switch servers)?  For speed vs accuracy, how does the client
> know the relation between the value and the number of
> available sessions, since this seems to be the reason for using this
> parameter?
>
> The interpretation of all of these parameters is implementation- 
> specific because the underlying technologies used to implement them  
> vary and can even be proprietary.  In practice the speech  
> recognition and synthesis and speaker authentication communities  
> have lived with this state of affairs for many years, and users of  
> other APIs for this technology are well aware of and have built  
> applications that accommodate this variability in interpretation.   
> It is outside the scope of this specification to attempt to  
> standardize interpretations of these values.
>
>
> 12.   In 9.4.9, 10.4.8 and 11.4.11, what are the values for
> media-type-value? You also mention audio and video, but it looks to
> me like this document only discusses voice.
>
> Yes.  Although the original intent was to record speech, application  
> authors today are beginning to look at ways to incorporate other  
> audio or video.  The intent of the sentences in these sections is to  
> clarify that the specification itself imposes no restriction on the  
> types of media that are allowed.
>
>
>
> 13.   In 9.4.35 and 9.4.36, what is the scale for the consistency
> here? How does one know what "close" means? What is the consistency
> between different recognizers?
>
> The answer to question 11, above, applies here as well.
>
>
>
> 14.   In section 9.6.3.3, in the example (figure 2), confidence
> should be 0.75 and not 75.
>
> Fixed.
>
>
>
> 15.   In section 10.4.1 it is not clear how you measure the
> sensitivity in order to specify it. Is it based on some SNR
> translated to a 0 to 1 scale?
>
> The answer to question 11, above, applies here as well.
>
>
>
> 16.   In 11.4.6 there is the same issue with the scale: how does the
> client know how to set a value when working with different speaker
> verification servers?
>
> Ditto.  I should point out that in all of these cases the parameters  
> are typically passed directly to the engine, and their  
> interpretations are defined (and described) in the vendors'  
> documentation.  The most common MRCPv2 server implementations are by  
> the technology vendors themselves (the providers of the synthesis,  
> recognition, and verification engines).  This is commonly understood  
> in this technology industry (meaning those who use this technology  
> regularly).
>
>
>
> 17.   In 11.5.2.9 you state that the verification-score is not a
> probability, so what is it? How can the client decide whether, for
> example, 0 is a good score for specifying the threshold?  I also
> noticed that the values in the example in section 11.5.2.10 are very
> precise, like 0.98514. Is this the expected precision? The examples
> here and in section 11.11 do not show the threshold. If the
> threshold is required for this flow, why not show it in the example?
>
> This parameter, as others mentioned above, has only a vendor- 
> specific interpretation.  In practice authors interpret these values  
> based both on guidance from the technology vendors and via  
> experimentation on large sets of recorded data.
>
> The Min-Verification-Score threshold is not required to be set.  In  
> many cases the technology vendor has a fairly good understanding of  
> what the default threshold should be.  The verification-score is  
> returned, however, in case the application author determines  
> (through experimentation, as described above) that the default  
> threshold is not producing optimal results for the application.  In  
> that case the author can set the threshold to a different value or  
> can set it to -1 and make the determination within the application  
> itself based on the verification-score values.
>
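
As a rough illustration of that last option, the sketch below picks an application-level threshold from recorded impostor scores and then applies it to a returned verification-score. It assumes that a higher score means a better match, which the vendor's documentation would need to confirm. The function names, sample scores, and target false-accept rate are all hypothetical.

```python
def pick_threshold(impostor_scores, max_false_accept_rate):
    """Lowest observed score that keeps impostor acceptances within the limit."""
    for candidate in sorted(set(impostor_scores)):
        accepted = sum(1 for s in impostor_scores if s >= candidate)
        if accepted / len(impostor_scores) <= max_false_accept_rate:
            return candidate
    return None  # no observed score is strict enough for the requested rate


def accept(verification_score, threshold):
    """Application-side decision, used when the engine threshold is set to -1."""
    return verification_score >= threshold


# Made-up scores from recorded impostor attempts against one engine.
impostors = [-0.8, -0.5, -0.2, 0.1, 0.4]
threshold = pick_threshold(impostors, max_false_accept_rate=0.2)
print(threshold, accept(0.98514, threshold), accept(0.1, threshold))
```

A threshold tuned this way is only meaningful for the engine that produced the scores; it would have to be re-derived after switching vendors.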
>
>
> 18.   In section 12.3 the suggestion is to use SRTP as the mandatory
> interoperability mode. If the reason for mandating SRTP is to have a
> common mode, you should also decide on a key exchange mechanism. I
> suggest you look at http://tools.ietf.org/html/draft-ietf-avt-srtp-not-mandatory-02
>  for discussion on media security.
>
> Based on the discussion between you and Dan York on the list, I will  
> change this:
>
> 12.3. Media session protection
> Sensitive data is also carried on media sessions terminating on  
> MRCPv2 servers (the other end of a media channel may or may not be  
> on the MRCPv2 client). This data includes the user's spoken  
> utterances and the output of text-to-speech operations. MRCPv2  
> servers MUST support SRTP for protection of audio media sessions.  
> MRCPv2 clients that originate or consume audio similarly MUST  
> support SRTP. Alternative media channel protection MAY be used if  
> desired (e.g. IPSEC).
>
> to this:
>
> 12.3. Media session protection
> Sensitive data is also carried on media sessions terminating on  
> MRCPv2 servers (the other end of a media channel may or may not be  
> on the MRCPv2 client). This data includes the user's spoken  
> utterances and the output of text-to-speech operations. MRCPv2  
> servers MUST support a security mechanism for protection of audio  
> media sessions. MRCPv2 clients that originate or consume audio  
> similarly MUST support a security mechanism for protection of the  
> audio. If appropriate, usage of the Secure Real-time Transport  
> Protocol (SRTP) [RFC3711] is recommended.
>
> 19.   In section 13.7.2 you specify the attribute resource as
> session-level, yet in the example in section 4.2 it is a media-level
> attribute. The same goes for the channel attribute.
>
> I have corrected both in section 13.7.2 to be media-level.
>
>
>
> Thanks
>
> Roni Even
>
>
>