[Speechsc] apps-team review of draft-ietf-speechsc-mrcpv2-24
Larry Masinter <masinter@adobe.com> Sun, 22 May 2011 23:50 UTC
From: Larry Masinter <masinter@adobe.com>
To: "eburger@standardstrack.com" <eburger@standardstrack.com>, "draft-ietf-speechsc-mrcpv2@tools.ietf.org" <draft-ietf-speechsc-mrcpv2@tools.ietf.org>, "dburnett@voxeo.com" <dburnett@voxeo.com>, "sarvi@cisco.com" <sarvi@cisco.com>, "rjsparks@nostrum.com" <rjsparks@nostrum.com>, "speechsc@ietf.org" <speechsc@ietf.org>, "apps-review@ietf.org" <apps-review@ietf.org>, "apps-discuss@ietf.org" <apps-discuss@ietf.org>, "iesg@ietf.org" <iesg@ietf.org>
Date: Sun, 22 May 2011 16:22:13 -0700
(Sorry for delay in sending this)

I was selected as the Applications Area Review Team reviewer for this draft (for background on apps-review, please see http://www.apps.ietf.org/content/applications-area-review-team).

Please resolve these comments along with any other Last Call comments you may receive. Please wait for direction from your document shepherd or AD before posting a new version of the draft.

Document: apps-team review of draft-ietf-speechsc-mrcpv2-24
Title: Media Resource Control Protocol Version 2 (MRCPv2)
Reviewer: Larry Masinter
Review Date: 4/25/2011
IETF Last Call Date: (unknown)
IESG Telechat Date: (unknown)

Summary: I'm very reluctant to recommend this document for publication as an RFC as standards track or even informational. The document is in need of significant editorial work to make it technically reviewable. The most serious technical problems I noted are in what seem like untestable normative requirements (MUST).

I have a lot of sympathy for the editors of this document. It has been under development for many years (it is version -24 and version -00 was published in 2004, 7 years ago), and I know that many of the difficulties of the document come from trying to address hard fought contentious issues. The document is over 221 pages long.... nearly impossible for anyone to get their head around. And given how difficult it must have been to come to agreement on many issues, I'm sure there is a great reluctance to engage in a heavy massive editing job.

I also understand that the protocol described here is widely implemented with several interoperable implementations deployed. As a measure of protocol quality, then, implementations should attest to the value of the document.

I confess to have only spent 6-8 hours trying to review the document before giving up, and I really only tried to carefully review the first 20 pages; perhaps the remaining 191 pages are of far superior quality to the first 20 or so I was able to review in detail.
However, a scan of the rest isn't encouraging.

EDITORIAL

The context and operation of this protocol in the space of other protocols is really unclear. The relationship between MRCPv2, MRCP, SPEECHSC, RFC4313, VoiceXML, RTSP, SIP, SDP, SSML, and other protocols and formats cannot be teased out from the introductory material. The relationship to HTTP vs. SIP is really unclear.

The simple requirement that terms be defined on first use, and be given references as either normative or informative ... is not met. Many of the referenced items are mysterious or are not defined. "SIP URI", "re-INVITE", "channel identifier". Many technical terms are mis-named, e.g., "HTTP" and "HTTPS" are not URI schemes.

While essential information is missing, there seem to be many instances of redundant information, explained slightly differently, and perhaps inconsistently.

While the RFC editor might normally be expected to fix up some of the references, I think doing so here would be nearly impossible without significant work from the editors.

NORMATIVE REQUIREMENTS

The document seems to be full of MUST and MAY requirements which do not make sense, are missing essential context, or are provided with preconditions which cannot be readily computed or known.

> The client MAY then open a new TCP connection with the server on
> this port number.

which client, which server, under what circumstances?

> A recorder MUST provide some endpointing capabilities for
> suppressing silence at the beginning and end of a recording,

What are "some" endpointing capabilities? Which roles of participants in this protocol are not in conformance with the specification if the recorder's endpointing capabilities are poor? Is this really a normative requirement?

> and MAY also suppress silence in the middle of a recording. If such
> suppression is done, the recorder MUST maintain timing metadata to
> indicate the actual time stamps of the recorded media.
How does the maintenance of timing metadata manifest itself in any visible effect in the protocol?

Examples are called "architectural diagrams".

> (DTMF Recognizer) could also do a semantic interpretation based on
> semantic tags in the grammar.

Is this a normative requirement? Just a description?

> MRCPv2 employs a session establishment and management protocol such
> as SIP in conjunction with SDP.

Is this a "such as" or in fact that SIP and SDP are mandatory and other protocols are allowed?

> The client needs a separate MRCPv2 resource control channel to
> control each media processing resource under the SIP dialog.

The concept of 'channel' isn't clear, where is it defined? What does it entail?

....

> All servers MUST support TLS. Servers MAY support TCP without TLS in
> physically secure environments.

Is it really "physically" secure that is the requirement? How does the server know that it is in a "secure" environment? Are there other situations where TCP without TLS MAY be supported?

...

SUMMARY

I've really been unable to do an extensive review of the actual protocol, because of what I think is a problem with the document quality. I'm confident that a careful overview plus removal of redundant or misleading material would make the document shorter and clearer, and that a very careful review of the language describing normative requirements would improve implementability and interoperability.

==========================

(original review, more details than above but more tentative)

This protocol references RFC 2616. But the convention of references [HX.Y] to reference sections of RFC 2616 seems problematic. I will try to review where RFC2616's definitions have been changed or obsoleted.

The RFC editor will probably fix up the references style and minor editorial details, but they're annoying.

2.2: I think you mean that the "State-Machine Diagrams" are not normative but are there for illustrative purposes only? Otherwise why are they incomplete?

2.3.
URI Schemes: "HTTP" and "HTTPS" are protocols, not URI schemes. You mean the "http" and "https" scheme names. I don't understand the MAY supporting other schemes... either they are allowed or not. The "provided they have addressed any security considerations" condition doesn't seem like a reasonable request for conformance.

3. Architecture

Maybe this belongs earlier? Or in the introduction? With references to SIP and SDP? Do these requirements appear in any other document? SIP URIs needs reference.

> The server, through the SDP exchange, provides the client with an
> unambiguous channel identifier and a TCP port number.

What is a "channel identifier"?

> The client MAY then open a new TCP connection with the server on
> this port number.

This doesn't make too much sense, it's missing some context. When MAY the client open a new connection?

> Multiple MRCPv2 channels can share a TCP connection between the
> client and the server.

How does this work... aren't they asynchronous?

> All MRCPv2 messages exchanged between the client and the server
> carry the specified channel identifier that the server MUST ensure
> is unambiguous among all MRCPv2 control channels that are active on
> that server. The client uses this channel identifier to indicate the
> media processing resource associated with that channel. For
> information on message framing, see Section 5.

Still not clear what a channel identifier is.

> The session initiation protocol (SIP) also establishes the media
> sessions between the client (or other source/sink of media) and the
> MRCPv2 server using SDP m-lines. One or more media processing
> resources may share a media session under a SIP session, or each
> media processing resource may have its own media session. The
> following diagram shows the general architecture of a system that
> uses MRCPv2. To simplify the diagram only a few resources are shown.

How is this a simplification? Is this a "general architecture" or is it an example diagram?

Figure 1: Where did RTP come in?
Why is the TCP/IP stack even shown?

3.1. MRCPv2 Media Resource Types

Are these examples? Exhaustive? Normative requirements? Are these resource types actually named?

"Basic Synthesizer". This is the first mention of "SSML". Is "full SSML support" actually defined? (Haven't cross-checked the W3C reference.)

> Recorder
> A resource capable of recording audio and providing a URI pointer to
> the recording.

Any other requirements, like access control?

> A recorder MUST provide some endpointing capabilities for
> suppressing silence at the beginning and end of a recording,

Doesn't make sense for a "MUST" requirement to be described as "some".

> and MAY also suppress silence in the middle of a recording. If such
> suppression is done, the recorder MUST maintain timing metadata to
> indicate the actual time stamps of the recorded media.

Unclear why this is a normative requirement at all.

> DTMF Recognizer
> A recognition resource capable of extracting and interpreting DTMF
> digits in a media stream and matching them against a supplied digit
> grammar It could also do a semantic interpretation based on semantic
> tags in the grammar.

Unclear what "could" means here. Allowed? How would it do this?

> Speech Recognizer
> A full speech recognition resource that is capable of receiving a
> media stream containing audio and interpreting it to recognition
> results. It also has a natural language semantic interpreter to
> post-process the recognized data according to the semantic data in
> the grammar and provide semantic results along with the recognized
> input. The recognizer may also support enrolled grammars, where the
> client can enroll and create new personal grammars for use in future
> recognition operations.

Is this a normative requirement, a description of a typical Speech Recognizer, or just an example?

> Speaker Verifier
> A resource capable of verifying the authenticity of a claimed
> identity by matching a media stream containing spoken input to a
> pre-existing voiceprint.
> This may also involve matching the caller's voice against more than
> one voiceprint, also called multi-verification or speaker
> identification.

Does it matter how this works? Whether it does matching against voiceprints or something else?

> The MRCPv2 server is a generic SIP server, and is thus addressed by
> a SIP URI.

I think you mean "A" and not "The" MRCPv2 server. What does it mean "addressed by"? In what context?

4. MRCPv2 Protocol Basics

> MRCPv2 requires a connection-oriented transport layer protocol such
> as TCP or SCTP to guarantee reliable sequencing and delivery of
> MRCPv2 control messages between the client and the server. In order
> to meet the requirements for security enumerated in SpeechSC
> Requirements [RFC4313], clients and servers MUST implement TLS as
> well.

RFC 4313 now has a different name again. Normative requirements should have references. Need reference for "TLS"?

> One or more connections between the client and the server can be
> shared among different MRCPv2 channels to the server. The individual
> messages carry the channel identifier to differentiate messages on
> different channels. MRCPv2 protocol encoding is text based with
> mechanisms to carry embedded binary data. This allows arbitrary data
> like recognition grammars, recognition results, synthesizer speech
> markup etc. to be carried in MRCPv2 messages. For information on
> message framing, see Section 5.

Doesn't feel like 'protocol basics' to me.

4.1. Connecting to the Server

> MRCPv2 employs a session establishment and management protocol such
> as SIP in conjunction with SDP.

Is this really "such as"? Maybe you want to allow for some extensibility with other possibilities, but really, is that necessary? Especially since you make reference to SIP-specific operations and not arbitrary "session establishment" protocols?

I don't understand the overall flow which makes reference to 4.1.

4.2. Managing Resource Control Channels

> A unique channel identifier string identifies these resource control
> channels.
> The channel identifier is an unambiguous, opaque string followed by
> an "@", then by a string token specifying the type of resource.

So I guess the resource types listed before are exhaustive?

> The server generates the channel identifier and MUST make sure it
> does not clash with the identifier of any other MRCP channel
> currently allocated by that server.

Are they globally unique or just relative to a server? Is it clear what a "server" is?

> MRCPv2 defines the following IANA-registered types of media
> processing resources. Additional resource types, their associated
> methods/events and state machines may be added as described below in
> Section 13.
>
> +---------------+----------------------+--------------+
> | Resource Type | Resource Description | Described in |
> +---------------+----------------------+--------------+
> | speechrecog   | Speech Recognizer    | Section 9    |
> | dtmfrecog     | DTMF Recognizer      | Section 9    |
> | speechsynth   | Speech Synthesizer   | Section 8    |
> | basicsynth    | Basic Synthesizer    | Section 8    |
> | speakverify   | Speaker Verification | Section 11   |
> | recorder      | Speech Recorder      | Section 10   |
> +---------------+----------------------+--------------+
>
> Resource Types

Looks like these *are* exhaustive, described in a previous section, and then updated.

> The SIP INVITE or re-INVITE transaction and the SDP offer/answer
> exchange it carries contain m-lines describing the resource control
> channel to be allocated.

What's a "m-line"?

> There MUST be one SDP m-line for each MRCPv2 resource to be used in
> the session.

Where must there be an SDP m-line? In what context?

> This m-line MUST have a media type field of "application" and a
> transport type field of either "TCP/MRCPv2" or "TCP/TLS/MRCPv2".

Where is the "media type field"? The "transport type field"?

> (The usage of SCTP with MRCPv2 may be addressed in a future
> specification).

Future specification of what? What does SCTP have to do with anything?
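
As a concrete reading of the channel identifier text quoted above (an opaque string, then "@", then a resource type token from the table), here is a minimal Python sketch. It is an illustration only, not anything the draft specifies: the function name, the choice to split on the last "@", and the sample identifier are assumptions made for this sketch.

```python
from typing import Tuple

# Resource type tokens copied from the table quoted above.
RESOURCE_TYPES = {
    "speechrecog", "dtmfrecog", "speechsynth",
    "basicsynth", "speakverify", "recorder",
}


def parse_channel_id(channel_id: str) -> Tuple[str, str]:
    """Split a channel identifier into (opaque string, resource type).

    Assumption of this sketch: the identifier is split on its LAST "@",
    so an opaque part containing "@" would still parse. The quoted text
    does not say whether that can happen.
    """
    opaque, sep, resource = channel_id.rpartition("@")
    if not sep or not opaque or not resource:
        raise ValueError("expected <opaque-string>@<resource-type>")
    if resource not in RESOURCE_TYPES:
        raise ValueError("unknown resource type: " + resource)
    return opaque, resource


# Hypothetical identifier, made up for illustration.
print(parse_channel_id("32AECB23433801@speechsynth"))
```

Even this small sketch has to guess at things the review asks about: whether the opaque part may itself contain "@", and whether the set of resource types is closed.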
> The port number field of the m-line MUST contain the "discard" port
> of the transport protocol (port 9 for TCP) in the SDP offer from the
> client and MUST contain the TCP listen port on the server in the SDP
> answer.

This confuses me a lot? Why should the port be 9? For what?

> The client may then either set up a TCP or TLS connection to that
> server port or share an already established connection to that port.

When MAY the client do this? Are these the only two options? When is this behavior mandated?

> Since MRCPv2 allows multiple sessions to share the same TCP
> connection, multiple m-lines in a single SDP document may share the
> same port field value;

"document"? Where do documents come from?

> MRCPv2 servers MUST NOT assume any relationship between resources
> using the same port other than the sharing of the communication
> channel.

What does it mean to "assume"? How could it assume something that this "MUST NOT" do? How would you know, even?

> MRCPv2 resources do not use the port or format field of the m-line
> to distinguish themselves from other resources using the same
> channel.

This discussion of m-lines is just confusing, then.

> The client MUST specify the resource type identifier in the resource
> attribute associated with the control m-line of the SDP offer. The
> server MUST respond with the full Channel-Identifier (which includes
> the resource type identifier and an unambiguous string) in the
> "channel" attribute associated with the control m-line of the SDP
> answer. To remain backwards compatible with conventional SDP usage,
> the format field of the m-line MUST have the arbitrarily-selected
> value of "1".

Is the resource type identifier the same in the offer & response, then? Why is it important to "remain backward compatible with conventional SDP usage"? If the value is "1", how is it "arbitrarily-selected"?

> When the client wants to add a media processing resource to the
> session, it issues a SIP re-INVITE transaction.

Is this clear? Is re-INVITE a SIP operation?
Or does this just mean repeating the SIP INVITE.

> The SDP offer/answer

Looks like this uses SIP or SDP in a stylized way.

> The a=setup attribute, as described in RFC4145 [RFC4145], MUST be
> "active" for the offer from the client and MUST be "passive" for the
> answer from the MRCPv2 server. The a=connection attribute MUST have
> a value of "new" on the very first control m-line offer from the
> client to an MRCPv2 server. Subsequent control m-line offers from
> the client to the MRCP server MAY contain "new" or "existing",
> depending on whether the client wants to set up a new connection or
> share an existing connection, respectively. If the client specifies
> a value of "new", the server MUST respond with a value of "new". If
> the client specifies a value of "existing", the server MAY respond
> with a value of "existing" if it prefers to share an existing
> connection or can answer with a value of "new", in which case the
> client MUST initiate a new transport connection.
>
> When the client wants to de-allocate the resource from this session,
> it issues a SIP re-INVITE transaction with the server. The SDP MUST
> offer the control m-line with port 0. The server MUST then answer
> the control m-line with a response of port 0. This de-allocates the
> associated MRCPv2 identifier and resource. The server MUST NOT close
> the TCP, SCTP or TLS connection if it is currently being shared
> among multiple MRCP channels. When all MRCP channels that may be
> sharing the connection are released and/or the associated SIP dialog
> is terminated, the client or server terminates the connection.
>
> All servers MUST support TLS. Servers MAY support TCP without TLS in
> physically secure environments.

Is it really "physically" secure that is the requirement? How does the server know that it is in a "secure" environment?
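
To show which of the quoted m-line and a=connection requirements are mechanically testable, here is a minimal Python sketch of them as checks. It is a sketch under stated assumptions, not an implementation of the draft: the function names are hypothetical, the m-line is assumed to have exactly four space-separated fields (media, port, transport, format), and only allocation offers are covered (the port 0 de-allocation offer is deliberately left out).

```python
from typing import List, Set

# Transport values taken from the quoted requirement.
ALLOWED_TRANSPORTS = {"TCP/MRCPv2", "TCP/TLS/MRCPv2"}


def check_offer_m_line(line: str) -> List[str]:
    """Return the problems found in a client-offer control m-line.

    Hypothetical helper: assumes the form
    "m=<media> <port> <transport> <format>".
    """
    if not line.startswith("m="):
        return ["not an m-line"]
    fields = line[2:].split()
    if len(fields) != 4:
        return ["expected 4 fields: media, port, transport, format"]
    media, port, transport, fmt = fields
    problems = []
    if media != "application":
        problems.append("media type must be 'application'")
    if port != "9":
        problems.append("offer port must be the discard port (9)")
    if transport not in ALLOWED_TRANSPORTS:
        problems.append("transport must be TCP/MRCPv2 or TCP/TLS/MRCPv2")
    if fmt != "1":
        problems.append("format field must be '1'")
    return problems


def allowed_connection_answers(offer_value: str) -> Set[str]:
    """a=connection values the server may answer with, per the quoted text."""
    if offer_value == "new":
        return {"new"}              # server MUST respond "new"
    if offer_value == "existing":
        return {"existing", "new"}  # server MAY share or ask for a new one
    raise ValueError("a=connection must be 'new' or 'existing'")


print(check_offer_m_line("m=application 9 TCP/MRCPv2 1"))
print(sorted(allowed_connection_answers("existing")))
```

Requirements of this kind can be checked from the messages alone. By contrast, "MUST NOT assume any relationship" and "physically secure environments" have no corresponding check one could write.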