Re: [Dmsp] Comments on draft-engelsma-dmsp-01.txt

Chris Cross <> Tue, 21 March 2006 20:24 UTC

In-Reply-To: <>
Subject: Re: [Dmsp] Comments on draft-engelsma-dmsp-01.txt
X-Mailer: Lotus Notes Release 7.0 HF85 November 04, 2005
Message-ID: <>
From: Chris Cross <>
Date: Tue, 21 Mar 2006 15:24:44 -0500
List-Id: Distributed Multimodal Synchronization Protocol <>

Thanks for your comments. It takes a bit of work to wade through a spec
this size and I appreciate the effort.

"Burger, Eric" <> wrote on 03/20/2006 08:48:35 PM:

> Section 3.
> Binary encoding - blech. Will anyone use XML if all of the normative
> text describes the binary encoding? Conversely, given how much easier it
> is to generate, parse, and debug XML, would it not be better to have the
> normative text use XML, and have the mapping of tags to binary values in
> the appendix?

This is just the first draft. There are different approaches to organizing
the spec, e.g., whether we create separate chapters or RFCs for each
encoding. It is not our intent to have only one encoding serve as the
normative specification for the others. This first draft springs from the
fact that we have implemented the binary encoding but not the XML one.
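To make the tradeoff concrete, here is a minimal sketch (in Python, with an invented event tag and field layout, not taken from the draft) of the same logical event in a packed binary form and as self-describing XML:

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical DMSP event: a tag, a correlation id, and a UTF-8 payload.
# Tag value and field widths are illustrative only.
TAG_EVT_EXAMPLE = 0x21

def encode_binary(correlation: int, payload: str) -> bytes:
    """Pack tag (1 byte), correlation (4 bytes), then a length-prefixed payload."""
    data = payload.encode("utf-8")
    return struct.pack("!BIH", TAG_EVT_EXAMPLE, correlation, len(data)) + data

def encode_xml(correlation: int, payload: str) -> bytes:
    """The same logical message as self-describing XML."""
    elem = ET.Element("event", name="EVT_EXAMPLE", correlation=str(correlation))
    elem.text = payload
    return ET.tostring(elem)

binary = encode_binary(7, "hello")
xml = encode_xml(7, "hello")
print(len(binary), len(xml))
```

The binary frame is a dozen bytes where the XML serialization of the same data is several times larger, which is the argument for a binary encoding on a constrained link; the XML form is the one you can read in a debugger.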

> Seems very VoiceXML-centric, to the point it may only work with
> VoiceXML. Is that OK?

That's an accurate observation. We are looking to support a "dialog level"
programming model. The alternative is to work at the level of speech
engines, which is covered by MRCP. I'm open to generalizing as long as we
can hang on to a dialog level abstraction and continue to support a
VoiceXML server as an endpoint.

> User-Agent field in SIG_INIT: says for advertising capabilities, but it
> is just a string identifying the GUA. A better mechanism is to advertise
> capabilities.

Open to suggestions here. The intent is to provide an efficient one-turn
init event.
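For instance, one compact possibility, purely illustrative and not from the draft, is a fixed-width capability bitmask in the init message, which stays one-turn and costs only a few bytes; the flag names here are invented:

```python
import struct

# Hypothetical capability flags for a SIG_INIT-style advertisement.
CAP_ASR        = 1 << 0  # local speech recognition
CAP_TTS        = 1 << 1  # local speech synthesis
CAP_INK        = 1 << 2  # ink/stylus input
CAP_KEY_EVENTS = 1 << 3  # fine-grained DOM key events

def pack_caps(*caps: int) -> bytes:
    """OR the flags together and pack them as a 4-byte network-order mask."""
    mask = 0
    for c in caps:
        mask |= c
    return struct.pack("!I", mask)

def has_cap(buf: bytes, cap: int) -> bool:
    """Test one flag in a received capability mask."""
    (mask,) = struct.unpack("!I", buf)
    return bool(mask & cap)

adv = pack_caps(CAP_ASR, CAP_TTS)
assert has_cap(adv, CAP_TTS) and not has_cap(adv, CAP_INK)
```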

> RESULT: is there any reason not to simply tunnel EMMA or NLSML?

See the section "Extended Recognition Result Type": the interpretation is
carried in the payload of the EVT_RECO_RESULTEX event. EMMA is anticipated
to be one of the supported types.
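A sketch of what that tunneling could look like, assuming a hypothetical framing in which a content-type string labels an opaque result body (the field layout is invented, not from the draft):

```python
import struct

def pack_resultex(content_type: str, body: bytes) -> bytes:
    """Frame an opaque result: 2-byte content-type length, content-type,
    4-byte body length, then the body bytes (e.g. an EMMA document)."""
    ct = content_type.encode("utf-8")
    return struct.pack("!H", len(ct)) + ct + struct.pack("!I", len(body)) + body

def unpack_resultex(buf: bytes):
    """Recover (content_type, body) from a framed result."""
    (ct_len,) = struct.unpack_from("!H", buf, 0)
    content_type = buf[2:2 + ct_len].decode("utf-8")
    (body_len,) = struct.unpack_from("!I", buf, 2 + ct_len)
    body = buf[6 + ct_len:6 + ct_len + body_len]
    return content_type, body

# The VUA never interprets the body; it hands it to the GUA as-is.
emma = b'<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"/>'
frame = pack_resultex("application/emma+xml", emma)
assert unpack_resultex(frame) == ("application/emma+xml", emma)
```

Because the body is opaque, nothing DMSP-specific has to be invented for each result format, which addresses the translation concern raised above.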

> Translating the real result into a DMSP result will be error-prone and
> is guaranteed to not supply what the application desires. What is the
> use case? It is not a VoiceXML browser in the handset; that is what
> MRCPv2 is for. It is inconceivable that it is a network-based VoiceXML
> browser using a handset ASR engine; if the handset has the power to run
> ASR, it most likely has the power to run a VoiceXML browser.
> For that matter, what does the GUA do with recognition results? Is it to
> populate fields or to help in low-confidence situations? If the former,
> then it isn't worth having confidence scores - there should not be more
> than one value. If the latter, what does the interaction look like? I am
> asking, because presumably the VoiceXML interpreter will go into its "I
> did not get that" portion of the form. I am assuming that the goal is to
> allow the user to visually pick from a list of results. I was thinking
> that it might be more compact to have the GUA send the VUA the correct
> pick by reference, but that is too much state to carry around (which
> pick of which result are we referring to ). Thus the current model where
> the GUA pushes down the result string is a good way to go.

Don't assume that the application author will want to handle n-best
results only in the voice modality. He may prompt the user with "what did
you say?" and pop up a list to choose from. The same argument goes for the
interpretation and/or recognition results. There are all kinds of creative
things the GUA can do with that information.

MRCP by definition does not support dialog level application programming.
So your assertion that there won't be VoiceXML in a handset is incorrect.
DMSP is designed to support a couple of broad use cases: Interaction
Manager and peer-to-peer configurations. The latter includes an X+V
multimodal browser where the VoiceXML is rendered by a remote VoiceXML
server. Turn your assertion around: are there devices that could support a
VoiceXML interpreter but not ASR/TTS?

> SIG_VXML_START: which is not really going to be used, SIG_INIT or

SIG_VXML_START was an optimization developed for a low bandwidth link. From
the description:
      "The SIG_INIT message is used when fine-grained control of which
events the client will listen is needed, and latency is not an issue."

> Can Dispatch: Which is more likely, a series of "can you do this?" or
> "what can you do?" If the latter, then it would be better to have a
> single OPTIONS message. If the former, then the mechanism as described
> is OK.

OK :)

> Get/Set Cookies security and privacy considerations

Understood; there is work to do here. The point is that we'll need to
propagate session information when distributing to a VoiceXML server (or
other dialog level VUA).

> Strings: most of the strings are or will need to be Unicode. For
> example, arbitrary form text data can easily be non-Western. Likewise,
> expect International URI's to end up as Unicode or UTF-16. If every byte
> counts, then I would offer selecting the charset in SIG_INIT or
> SIG_VXML_START, with a default to UTF-8.

Every byte counts, so UTF-8 is probably the right default. Maybe string
encoding should be part of the initial session negotiation?
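As a sketch, assuming hypothetical length-prefixed string fields with the charset fixed once at session setup and defaulting to UTF-8:

```python
import struct

def pack_string(s: str, charset: str = "utf-8") -> bytes:
    """Encode a string in the session charset, prefixed with a 2-byte length."""
    data = s.encode(charset)
    return struct.pack("!H", len(data)) + data

def unpack_string(buf: bytes, charset: str = "utf-8") -> str:
    """Decode a length-prefixed string using the session charset."""
    (n,) = struct.unpack_from("!H", buf, 0)
    return buf[2:2 + n].decode(charset)

# Non-Western text survives the round trip under the UTF-8 default.
assert unpack_string(pack_string("こんにちは")) == "こんにちは"
# A session could instead negotiate UTF-16 where that is ever smaller.
assert unpack_string(pack_string("abc", "utf-16"), "utf-16") == "abc"
```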

> DOM keydown, keyup, keypress events: I don't have the DOM reference
> handy. Do these refer to actual keyboard presses or ink strokes? If so,
> who would use a key-by-key protocol for a distributed, web-oriented
> stimulus protocol?

Others in the multimodal community, such as some OMA members, have pressed
for this level of granularity (no pun intended). I don't think a key-by-key
protocol is practical on a real network, and it is generally not necessary
in dialog level interaction.

> General: Much easier to build parsers that have all of the fixed-length
> data items up front. Take Table 36, for example. Having the Error Code
> follow Correlation means I can immediately figure out the status without
> having to parse the Node and Location fields. I might not care,
> depending on the error. If I do care, there is no harm in having the
> Error Code up front.

Good suggestion.
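To illustrate the benefit with a hypothetical error-message layout (the field widths are invented, not Table 36's actual ones): once the fixed-width Error Code precedes the variable-length Node and Location fields, a receiver can check status at a fixed offset without parsing the strings at all:

```python
import struct

def pack_error(correlation: int, code: int, node: str, location: str) -> bytes:
    """Fixed-width Correlation (4 bytes) and Error Code (2 bytes) first,
    then the variable-length, length-prefixed Node and Location strings."""
    n, loc = node.encode("utf-8"), location.encode("utf-8")
    return (struct.pack("!IH", correlation, code)
            + struct.pack("!H", len(n)) + n
            + struct.pack("!H", len(loc)) + loc)

def peek_error_code(buf: bytes) -> int:
    """Read the status at its fixed offset; Node/Location are never touched."""
    return struct.unpack_from("!H", buf, 4)[0]

msg = pack_error(42, 503, "vxml:field", "line 17")
assert peek_error_code(msg) == 503
```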

> Need to explain how a loop could occur (Section 4.4)

We have a use case that illustrates this; I will dig it up.

_______________________________________________
Dmsp mailing list