[Speechsc] apps-team review of draft-ietf-speechsc-mrcpv2-24

Larry Masinter <masinter@adobe.com> Sun, 22 May 2011 23:50 UTC

From: Larry Masinter <masinter@adobe.com>
To: "eburger@standardstrack.com" <eburger@standardstrack.com>, "draft-ietf-speechsc-mrcpv2@tools.ietf.org" <draft-ietf-speechsc-mrcpv2@tools.ietf.org>, "dburnett@voxeo.com" <dburnett@voxeo.com>, "sarvi@cisco.com" <sarvi@cisco.com>, "rjsparks@nostrum.com" <rjsparks@nostrum.com>, "speechsc@ietf.org" <speechsc@ietf.org>, "apps-review@ietf.org" <apps-review@ietf.org>, "apps-discuss@ietf.org" <apps-discuss@ietf.org>, "iesg@ietf.org" <iesg@ietf.org>
Date: Sun, 22 May 2011 16:22:13 -0700

(Sorry for the delay in sending this.)

I was selected as the Applications Area Review Team reviewer for this draft (for background on apps-review, please see http://www.apps.ietf.org/content/applications-area-review-team).

Please resolve these comments along with any other Last Call comments you may receive. Please wait for direction from your document shepherd or AD before posting a new version of the draft.

Document: draft-ietf-speechsc-mrcpv2-24
Title: Media Resource Control Protocol Version 2 (MRCPv2)
Reviewer:  Larry Masinter
Review Date: 4/25/2011

IETF Last Call Date: (unknown)
IESG Telechat Date: (unknown)

Summary:  I'm very reluctant to recommend this document for publication as an RFC, whether on the standards track or even as informational.  The document needs significant editorial work before it is technically reviewable. The most serious technical problems I noted are normative requirements (MUST) that appear to be untestable.

I have a lot of sympathy for the editors of this document. It has been under development for many years (it is version -24 and version -00 was published in 2004, 7 years ago), and I know that many of the difficulties of the document come from trying to address hard-fought, contentious issues.  The document is over 221 pages long, which makes it nearly impossible for anyone to get their head around.  And given how difficult it must have been to come to agreement on many issues, I'm sure there is great reluctance to take on a massive editing job.

I also understand that the protocol described here is widely implemented with several interoperable implementations deployed. As a measure of protocol quality, then, implementations should attest to the value of the document.

I confess to having spent only 6-8 hours trying to review the document before giving up, and I really only tried to carefully review the first 20 pages; perhaps the remaining 191 pages are of far superior quality to the first 20 or so I was able to review in detail. However, a scan of the rest isn't encouraging.

EDITORIAL

The context and operation of this protocol in the space of other protocols is really unclear. The relationship between MRCPv2, MRCP, SPEECHSC, RFC4313, VoiceXML, RTSP, SIP, SDP, SSML, and other protocols and formats cannot be teased out from the introductory material.  In particular, the relationship to HTTP vs. SIP is not explained.

The simple requirement that terms be defined on first use, and be given references as either normative or informative, is not met. Many of the referenced items are mysterious or are not defined, for example "SIP URI", "re-INVITE", and "channel identifier".

Many technical terms are mis-named, e.g., "HTTP" and "HTTPS" are not URI schemes.

While essential information is missing, there seem to be many instances of redundant information, explained slightly differently, and perhaps inconsistently.

While the RFC editor might normally be expected to fix up some of the references, I think doing so here would be nearly impossible without significant work from the editors.


NORMATIVE REQUIREMENTS

The document seems to be full of MUST and MAY requirements which do not make sense, are missing essential context, or are provided with preconditions which cannot be readily computed or known.

   The client MAY
   then open a new TCP connection with the server on this port number.

Which client, which server, under what circumstances?

                   A recorder MUST provide
                  some endpointing capabilities for suppressing silence
                  at the beginning and end of a recording,

What are "some" endpointing capabilities? Which roles of participants in this protocol are not in conformance with the specification if the recorder's endpointing capabilities are poor? Is this really a normative requirement?

                   and MAY also
                  suppress silence in the middle of a recording.  If
                  such suppression is done, the recorder MUST maintain
                  timing metadata to indicate the actual time stamps of
                  the recorded media.

How does the maintenance of timing metadata manifest itself in any visible effect in the protocol?

Examples are called "architectural diagrams".

                 (DTMF Recorder)
                  could also do a semantic interpretation based on
                  semantic tags in the grammar.

Is this a normative requirement? Just a description?

>     MRCPv2 employs a session establishment and management protocol such
>     as SIP in conjunction with SDP.

Is this really "such as", or are SIP and SDP in fact mandatory? Are other protocols allowed?


>   The client needs a separate MRCPv2 resource control channel to
>   control each media processing resource under the SIP dialog.

The concept of 'channel' isn't clear. Where is it defined? What does
it entail?

....

   All servers MUST support TLS.  Servers MAY support TCP without TLS in
   physically secure environments.

Is it really "physically" secure that is the requirement? How does
the server know that it is in a "secure" environment? Are there other
situations where TCP without TLS MAY be supported?
...


SUMMARY


I've really been unable to do an extensive review of the actual protocol, because of what I think is a problem with the document quality.  I'm confident that a careful overview plus removal of redundant or misleading material would make the document shorter and clearer, and that a very careful review of the language describing normative requirements would improve implementability and interoperability.

==========================
(original review, more details than above but more tentative)

This protocol references RFC 2616. But the convention of references
[HX.Y] to reference sections of RFC 2616 seems problematic.
I will try to review where RFC2616's definitions have been changed
or obsoleted.

The RFC editor will probably fix up the references style and minor
editorial details, but they're annoying.

2.2: I think you mean that the "State-Machine Diagrams" are not
normative but are there for illustrative purposes only? Otherwise why
are they incomplete?

2.3.  URI Schemes:

"HTTP" and "HTTPS" are protocols, not URI schemes. You mean the
"http" and "https" scheme names. I don't understand the MAY for
supporting other schemes... either they are allowed or not.
The condition "provided they have addressed any security
considerations" doesn't seem like a reasonable request for
conformance.

3.  Architecture
Maybe this belongs earlier? Or in the introduction?
With references to SIP and SDP? Do these requirements appear
in any other document?

SIP URIs need a reference.

   The server, through the SDP exchange, provides the client with an
   unambiguous channel identifier and a TCP port number.

What is a "channel identifier"?


   The client MAY
   then open a new TCP connection with the server on this port number.

This doesn't make much sense; it's missing some context.
When MAY the client open a new connection?

   Multiple MRCPv2 channels can share a TCP connection between the
   client and the server.

How does this work... aren't they asynchronous?

    All MRCPv2 messages exchanged between the
   client and the server carry the specified channel identifier that the
   server MUST ensure is unambiguous among all MRCPv2 control channels
   that are active on that server.

   The client uses this channel
   identifier to indicate the media processing resource associated with
   that channel.  For information on message framing, see Section 5.

Still not clear what a channel identifier is.

   The session initiation protocol (SIP) also establishes the media
   sessions between the client (or other source/sink of media) and the
   MRCPv2 server using SDP m-lines.  One or more media processing
   resources may share a media session under a SIP session, or each
   media processing resource may have its own media session.

   The following diagram shows the general architecture of a system that
   uses MRCPv2.  To simplify the diagram only a few resources are shown.

How is this a simplification? Is this a "general architecture" or
is it an example diagram?



Figure 1:
Where did RTP come in? Why is the TCP/IP stack even shown?


3.1.  MRCPv2 Media Resource Types

Are these examples? Exhaustive? Normative requirements? Are these
resource types actually named?

"Basic Synthesizer". This is the first mention of "SSML". Is
"full SSML support" actually defined? (Haven't cross-checked
the W3C reference.)

   Recorder
                  A resource capable of recording audio and providing a
                  URI pointer to the recording.

Any other requirements, like access control?

                   A recorder MUST provide
                  some endpointing capabilities for suppressing silence
                  at the beginning and end of a recording,

Doesn't make sense for a "MUST" requirement to be described as "some".

                   and MAY also
                  suppress silence in the middle of a recording.  If
                  such suppression is done, the recorder MUST maintain
                  timing metadata to indicate the actual time stamps of
                  the recorded media.

Unclear why this is a normative requirement at all.


   DTMF Recognizer
                  A recognition resource capable of extracting and
                  interpreting DTMF digits in a media stream and
                  matching them against a supplied digit grammar It
                  could also do a semantic interpretation based on
                  semantic tags in the grammar.

Unclear what "could" means here. Allowed? How would it do this?

   Speech Recognizer
                  A full speech recognition resource that is capable of
                  receiving a media stream containing audio and
                  interpreting it to recognition results.  It also has a
                  natural language semantic interpreter to post-process
                  the recognized data according to the semantic data in
                  the grammar and provide semantic results along with
                  the recognized input.  The recognizer may also support
                  enrolled grammars, where the client can enroll and
                  create new personal grammars for use in future
                  recognition operations.

Is this a normative requirement, a description of a typical Speech
Recognizer, or just an example?

   Speaker Verifier
                  A resource capable of verifying the authenticity of a
                  claimed identity by matching a media stream containing
                  spoken input to a pre-existing voiceprint.  This may
                  also involve matching the caller's voice against more
                  than one voiceprint, also called multi-verification or
                  speaker identification.

Does it matter how this works? Whether it does matching against
voiceprints or something else?

   The MRCPv2 server is a generic SIP server, and is thus addressed by a
   SIP URI.

I think you mean "A" and not "The" MRCPv2 server. What does it mean
"addressed by"? In what context?

4.  MRCPv2 Protocol Basics

   MRCPv2 requires a connection-oriented transport layer protocol such
   as TCP or SCTP to guarantee reliable sequencing and delivery of
   MRCPv2 control messages between the client and the server.  In order
   to meet the requirements for security enumerated in SpeechSC
   Requirements [RFC4313], clients and servers MUST implement TLS as
   well.

RFC 4313 now has a different name again.
Normative requirements should have references. Does "TLS" need a reference?

   One or more connections between the client and the server can
   be shared among different MRCPv2 channels to the server.  The
   individual messages carry the channel identifier to differentiate
   messages on different channels.  MRCPv2 protocol encoding is text
   based with mechanisms to carry embedded binary data.  This allows
   arbitrary data like recognition grammars, recognition results,
   synthesizer speech markup etc. to be carried in MRCPv2 messages.  For
   information on message framing, see Section 5.

Doesn't feel like 'protocol basics' to me.

4.1.  Connecting to the Server
   MRCPv2 employs a session establishment and management protocol such
   as SIP in conjunction with SDP.

Is this really "such as"? Maybe you want to allow for some
extensibility with other possibilities, but really, is that
necessary? Especially since you make reference to SIP-specific
operations and not arbitrary "session establishment" protocols?

I don't understand the overall flow which makes reference to
4.1.

4.2.  Managing Resource Control Channels

    A
   unique channel identifier string identifies these resource control
   channels.  The channel identifier is an unambiguous, opaque string
   followed by an "@", then by a string token specifying the type of
   resource.

So I guess the resource types listed before are exhaustive?

    The server generates the channel identifier and MUST make
   sure it does not clash with the identifier of any other MRCP channel
   currently allocated by that server.

Are they globally unique or just relative to a server? Is it clear
what a "server" is?


  MRCPv2 defines the following
   IANA-registered types of media processing resources.  Additional
   resource types, their associated methods/events and state machines
   may be added as described below in Section 13.

          +---------------+----------------------+--------------+
          | Resource Type | Resource Description | Described in |
          +---------------+----------------------+--------------+
          | speechrecog   | Speech Recognizer    | Section 9    |
          | dtmfrecog     | DTMF Recognizer      | Section 9    |
          | speechsynth   | Speech Synthesizer   | Section 8    |
          | basicsynth    | Basic Synthesizer    | Section 8    |
          | speakverify   | Speaker Verification | Section 11   |
          | recorder      | Speech Recorder      | Section 10   |
          +---------------+----------------------+--------------+

Looks like these *are* exhaustive, described in a previous section, and
then updated.

                              Resource Types
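
For what it's worth, here is how I read the channel identifier text, as a minimal Python sketch. The function name and the example identifier are hypothetical, not taken from the draft; the resource type tokens are copied from the table above, and treating that table as a closed set is my assumption.

```python
# Resource type tokens copied from the table quoted above.
RESOURCE_TYPES = {
    "speechrecog", "dtmfrecog", "speechsynth",
    "basicsynth", "speakverify", "recorder",
}


def parse_channel_identifier(channel_id: str) -> tuple[str, str]:
    """Split '<opaque-string>@<resource-type>' into its two parts."""
    # Assumption: split on the last "@", so an opaque part that itself
    # contained "@" would still parse.
    opaque, sep, resource_type = channel_id.rpartition("@")
    if not sep or not opaque:
        raise ValueError(f"not a channel identifier: {channel_id!r}")
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    return opaque, resource_type
```

If this reading is right, a hypothetical identifier such as "32AECB23433801@speechsynth" splits into an opaque part and the token "speechsynth". Whether a type outside the table is an error depends on the extension mechanism of Section 13.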

   The SIP INVITE or re-INVITE transaction and the SDP offer/answer
   exchange it carries contain m-lines describing the resource control
   channel to be allocated.

What's an "m-line"?


   There MUST be one SDP m-line for each
   MRCPv2 resource to be used in the session.

Where must there be an SDP m-line? In what context?

   This m-line MUST have a
   media type field of "application" and a transport type field of
   either "TCP/MRCPv2" or "TCP/TLS/MRCPv2".


Where is the "media type field"? The "transport type field"?

  (The usage of SCTP with
   MRCPv2 may be addressed in a future specification).

Future specification of what? What does SCTP have to do with anything?

  The port number
   field of the m-line MUST contain the "discard" port of the transport
   protocol (port 9 for TCP) in the SDP offer from the client and MUST
   contain the TCP listen port on the server in the SDP answer.

This confuses me a lot. Why should the port be 9? For what?


  The
   client may then either set up a TCP or TLS connection to that server
   port or share an already established connection to that port.

When MAY the client do this? Are these the only two options? When
is this behavior mandated?


   Since
   MRCPv2 allows multiple sessions to share the same TCP connection,
   multiple m-lines in a single SDP document may share the same port
   field value;

"document"? Where do documents come from?

   MRCPv2 servers MUST NOT assume any relationship between
   resources using the same port other than the sharing of the
   communication channel.

What does it mean to "assume"? How could it assume something that
this "MUST NOT" do? How would you know, even?

   MRCPv2 resources do not use the port or format field of the m-line to
   distinguish themselves from other resources using the same channel.

This discussion of m-lines is just confusing, then.


   The client MUST specify the resource type identifier in the resource
   attribute associated with the control m-line of the SDP offer.  The
   server MUST respond with the full Channel-Identifier (which includes
   the resource type identifier and an unambiguous string) in the
   "channel" attribute associated with the control m-line of the SDP
   answer.  To remain backwards compatible with conventional SDP usage,
   the format field of the m-line MUST have the arbitrarily-selected
   value of "1".

Is the resource type identifier the same in the offer & response, then?
Why is it important to "remain backward compatible with conventional
SDP usage"?  If the value is "1", how is it "arbitrarily-selected"?
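
My best reconstruction of the control m-line, pieced together from the fragments quoted above, is sketched below in Python. It assumes the generic SDP layout "m=<media> <port> <proto> <fmt>"; the function names and the listen port in the usage note are hypothetical, and the resource/channel attributes are left out because I am not sure of their exact syntax.

```python
DISCARD_PORT = 9  # "discard" port for TCP, required in the client's offer


def control_m_line(port: int, use_tls: bool = False) -> str:
    """Build a control m-line from the fields the draft names."""
    # Media type is "application"; transport is one of the two quoted values.
    transport = "TCP/TLS/MRCPv2" if use_tls else "TCP/MRCPv2"
    # The format field is fixed at "1" according to the quoted text.
    return f"m=application {port} {transport} 1"


def offer_m_line(use_tls: bool = False) -> str:
    """Client offer: the port field carries the discard port."""
    return control_m_line(DISCARD_PORT, use_tls)


def answer_m_line(listen_port: int, use_tls: bool = False) -> str:
    """Server answer: the port field carries the server's TCP listen port."""
    return control_m_line(listen_port, use_tls)
```

Under these assumptions the offer is "m=application 9 TCP/MRCPv2 1", and an answer for a hypothetical listen port 32416 with TLS is "m=application 32416 TCP/TLS/MRCPv2 1". An example of this kind next to the normative text would have answered several of my questions.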

   When the client wants to add a media processing resource to the
   session, it issues a SIP re-INVITE transaction.

Is this clear? Is re-INVITE a SIP operation? Or does this just mean
repeating the SIP INVITE?

  The SDP offer/answer

Looks like this uses SIP or SDP in a stylized way.

   The a=setup attribute, as described in RFC4145 [RFC4145], MUST be
   "active" for the offer from the client and MUST be "passive" for the
   answer from the MRCPv2 server.  The a=connection attribute MUST have
   a value of "new" on the very first control m-line offer from the
   client to an MRCPv2 server.  Subsequent control m-line offers from
   the client to the MRCP server MAY contain "new" or "existing",
   depending on whether the client wants to set up a new connection or
   share an existing connection, respectively.  If the client specifies
   a value of "new", the server MUST respond with a value of "new".  If
   the client specifies a value of "existing", the server MAY respond
   with a value of "existing" if it prefers to share an existing
   connection or can answer with a value of "new", in which case the
   client MUST initiate a new transport connection.
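
The a=connection rules in the paragraph above are easier to check when written as a decision procedure. This is only a sketch of my reading, in Python; the function names and the `server_prefers_sharing` flag are hypothetical and do not come from the draft.

```python
def check_offer(connection: str, first_offer: bool) -> None:
    """Validate the client's a=connection value for a control m-line offer."""
    if connection not in ("new", "existing"):
        raise ValueError(f"unexpected a=connection value: {connection!r}")
    if first_offer and connection != "new":
        raise ValueError("the very first control m-line offer MUST use 'new'")


def answer_connection(offered: str, server_prefers_sharing: bool) -> str:
    """Choose the server's a=connection value for the answer."""
    if offered == "new":
        return "new"  # the server MUST mirror "new"
    if offered == "existing":
        # The server MAY agree to share, or may insist on a new connection.
        return "existing" if server_prefers_sharing else "new"
    raise ValueError(f"unexpected a=connection value: {offered!r}")


def client_must_open_connection(answered: str) -> bool:
    """An answer of "new" means the client initiates a new transport connection."""
    return answered == "new"
```

Written this way, the only real choice is the server's answer to "existing", and the text does not say what that choice may depend on.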

   When the client wants to de-allocate the resource from this session,
   it issues a SIP re-INVITE transaction with the server.  The SDP MUST
   offer the control m-line with port 0.  The server MUST then answer
   the control m-line with a response of port 0.  This de-allocates the
   associated MRCPv2 identifier and resource.  The server MUST NOT close
   the TCP, SCTP or TLS connection if it is currently being shared among
   multiple MRCP channels.  When all MRCP channels that may be sharing
   the connection are released and/or the associated SIP dialog is
   terminated, the client or server terminates the connection.
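
The de-allocation paragraph seems to describe reference counting on the shared connection. A minimal Python sketch of that reading follows; the class and method names are hypothetical, and it models only the bookkeeping, not SIP or the transport itself.

```python
class SharedConnection:
    """Tracks which MRCPv2 channels currently share one transport connection."""

    def __init__(self) -> None:
        self.channels: set[str] = set()
        self.open = True

    def allocate(self, channel_id: str) -> None:
        """A channel was set up on this connection."""
        self.channels.add(channel_id)

    def release(self, channel_id: str) -> None:
        """A port-0 offer/answer de-allocated this channel."""
        self.channels.discard(channel_id)
        # The connection stays open while any other channel still shares it.
        if not self.channels:
            self.open = False
```

For example, with two channels allocated, releasing the first leaves `open` true, and releasing the second sets it to false. The text leaves open which side closes the connection at that point ("the client or server").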

   All servers MUST support TLS.  Servers MAY support TCP without TLS in
   physically secure environments.

Is it really "physically" secure that is the requirement? How does
the server know that it is in a "secure" environment?