[Speechsc] Re: Comments on draft-ietf-speechsc-reqts-00.txt

"David R. Oran" <oran@cisco.com> Fri, 20 September 2002 15:22 UTC

Received: from www1.ietf.org (ietf.org [132.151.1.19] (may be forged)) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA25897 for <speechsc-archive@odin.ietf.org>; Fri, 20 Sep 2002 11:22:13 -0400 (EDT)
Received: (from mailnull@localhost) by www1.ietf.org (8.11.6/8.11.6) id g8KFNX505119 for speechsc-archive@odin.ietf.org; Fri, 20 Sep 2002 11:23:33 -0400
Received: from ietf.org (odin.ietf.org [132.151.1.176]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id g8KFNXv05116 for <speechsc-web-archive@optimus.ietf.org>; Fri, 20 Sep 2002 11:23:33 -0400
Received: from www1.ietf.org (ietf.org [132.151.1.19] (may be forged)) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA25874 for <speechsc-web-archive@ietf.org>; Fri, 20 Sep 2002 11:21:42 -0400 (EDT)
Received: from www1.ietf.org (localhost.localdomain [127.0.0.1]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id g8KFN1v05071; Fri, 20 Sep 2002 11:23:01 -0400
Received: from ietf.org (odin.ietf.org [132.151.1.176]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id g8KFMPv05039 for <speechsc@optimus.ietf.org>; Fri, 20 Sep 2002 11:22:25 -0400
Received: from sj-msg-core-1.cisco.com (sj-msg-core-1.cisco.com [171.71.163.11]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA25835 for <speechsc@ietf.org>; Fri, 20 Sep 2002 11:20:34 -0400 (EDT)
Received: from mira-sjc5-9.cisco.com (IDENT:mirapoint@mira-sjc5-9.cisco.com [171.71.163.32]) by sj-msg-core-1.cisco.com (8.12.2/8.12.2) with ESMTP id g8KFLp1X019799; Fri, 20 Sep 2002 08:21:51 -0700 (PDT)
Received: from ORANLT.cisco.com ([161.44.238.52]) by mira-sjc5-9.cisco.com (Mirapoint Messaging Server MOS 3.1.0.66-GA) with ESMTP id AAC08854; Fri, 20 Sep 2002 08:22:07 -0700 (PDT)
Date: Fri, 20 Sep 2002 11:23:14 -0400
From: "David R. Oran" <oran@cisco.com>
To: Salvador Neto <salvadon@Exchange.Microsoft.com>, Eric Burger <eburger@snowshore.com>, speechsc@ietf.org
cc: Hsiao-Wuen Hon <hon@microsoft.com>, Stephen Potter <spotter@microsoft.com>
Message-ID: <155109385.1032520994@ORANLT.cisco.com>
In-Reply-To: <2A499EF0C1943544B1350C6D2593897E01196CB0@DF-BOWWOW.platinum.corp.microsoft.com>
References: <2A499EF0C1943544B1350C6D2593897E01196CB0@DF-BOWWOW.platinum.corp.microsoft.com>
X-Mailer: Mulberry/3.0.0b3 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format="flowed"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Subject: [Speechsc] Re: Comments on draft-ietf-speechsc-reqts-00.txt
Sender: speechsc-admin@ietf.org
Errors-To: speechsc-admin@ietf.org
X-BeenThere: speechsc@ietf.org
X-Mailman-Version: 2.0.12
Precedence: bulk
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/speechsc>, <mailto:speechsc-request@ietf.org?subject=unsubscribe>
List-Id: Speech Services Control Working Group <speechsc.ietf.org>
List-Post: <mailto:speechsc@ietf.org>
List-Help: <mailto:speechsc-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/speechsc>, <mailto:speechsc-request@ietf.org?subject=subscribe>

Thanks for your careful and comprehensive review. This is exactly the kind 
of feedback and discussion that last call is intended for (although earlier 
would have been better :-)

Responses, comments, and requests embedded.

WG - please continue this discussion on the list - we want to see if we can 
reach closure quickly and get this to the IESG so as not to hold up the 
protocol analysis work.


--On Thursday, September 19, 2002 5:26 PM -0700 Salvador Neto 
<salvadon@Exchange.Microsoft.com> wrote:

> A number of people in Microsoft's speech group have reviewed the
> SpeechSC requirements draft in detail, and would like to raise the
> following comments and questions.
>
> Please forgive us if these have been discussed in the group already, as
> we approach this document as relative newcomers to the group.
>
> And also, please make sure you include Hsiao-Wuen Hon
> (hon@microsoft.com) and Stephen Potter (spotter@microsoft.com) in your
> replies, as they are still in the process of adding themselves to the
> list.
>
Done.

> Thanks,
> Salvador Assumpcao Neto
> Program Manager - .Net Speech group
> Microsoft Corporation
>
> Comments follow...
>
> Section 3 (and throughout)
> 1. The requirements document is clearly focused on telephony gateway
> scenarios. We suggest that an equally compelling scenario is mobile
> multimodal scenarios where wireless mobile devices can attain ASR/SR/TTS
> services via network. If the group agrees that this is an important
> scenario to address, we would suggest modification or addition of the
> requirements to reflect this.
I, for one, agree, and suspect the WG would not have any problem with this.

> Specifically, we are concerned that saying
> SpeechSC depends on RTP may mean that it depends on UDP.
My understanding is that it does not. It is my understanding that RTP can 
be carried over TCP. The concern over broadening the coverage beyond RTP is 
that RTP is *the* media carriage standard of the IETF and we are producing 
an IETF standard here. In particular, we may wish to depend on 
control<->media protocol correlation and synchronization based on RTP 
timestamps and SSRCs. If we were to incorporate other media carriage 
protocols this might substantially complicate the protocol.
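For concreteness, the timestamp and SSRC fields mentioned above live in the 
fixed 12-byte RTP header defined by RFC 3550. A minimal illustrative sketch 
(Python; this is just an aid to the discussion, not part of any SPEECHSC 
proposal) of pulling out the fields a control protocol could correlate on:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550) and return the
    fields a control protocol could use for control<->media correlation."""
    if len(packet) < 12:
        raise ValueError("packet shorter than fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,          # should be 2
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,      # media clock; basis for synchronization
        "ssrc": ssrc,                # identifies the media source/stream
    }

# Example: a hand-built header - version 2, PT 0 (PCMU), seq 1,
# timestamp 160 (one 20 ms G.711 frame), SSRC 0xDEADBEEF
pkt = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0xDEADBEEF)
h = parse_rtp_header(pkt)
```

Note that the timestamp and SSRC are present regardless of whether the RTP 
packets ride over UDP or TCP, which is part of why correlating on them does 
not by itself force a UDP dependency.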

There is also the issue of barge-in and whether we can depend on the media 
protocol to have machinery that prevents or at least mitigates head-of-line 
blocking.

We could conceivably just write requirements that any media carriage 
protocol have certain characteristics, but I would be inclined to retain 
the RTP dependency for the reasons cited above.

Rapid WG and AD feedback on this particular issue is important, I think.

> If so, we
> believe that for ASR and SR services guarantee of delivery, such as
> provided by TCP, may be better suited (rather than real-time
> requirement), especially in the case of mobile multimodal scenarios. So
> we hope SpeechSC will be open to contemplate such alternatives.
>
> 2. The term 'speaker recognition' and its acronym SR are rather
> confusing, given the widespread use of SR to mean 'speech recognition'
> (i.e. ASR), and the use of the word 'recognition' in both concepts. We
> understand the meaning of the term to embrace both 'speaker
> verification' and 'speaker identification'. We would therefore suggest a
> different phrase for these capabilities: Speaker
> Identification/Verification (SIV).
>
I like this suggestion (despite my proposing a different acronym in my note 
responding to another set of comments). If there are no objections, I will 
adopt this acronym in the next revision of the draft.

> Section 4
> 3. Reference is made to VoiceXML as an example platform - we would also
> suggest naming a SALT platform for the same purposes - particularly in a
> mobile architecture - and we can suggest some wording here if you would
> like.
>
Please do. ASAP.

> Section 5.4
> 4. The requirement to be compatible with the IAB OPES framework is
> somewhat unspecific, and appears to require more digestion of a third
> party document which many readers will not undertake. Is it possible to
> make this requirement more tangible without sending the reader off to
> the OPES specification?
>
Another comment asked for an example of how one might run afoul of OPES. 
Would that be satisfactory?

On a more philosophical note, these requirements are meant primarily to be 
consumed by the people undertaking the analysis of existing protocols, and 
as a measure against which to evaluate any protocol the WG proposes to the 
IETF for adoption as a standard. As WG chair I would expect and indeed 
demand that at least those folks in fact read and absorb the OPES documents 
since they are relevant to the design of a SPEECHSC protocol and point out 
many pits others have fallen into.

> Section 5.5
> 5. It's unclear whether this requirement implies that SpeechSC needs to
> be aware of Load Balancing mechanisms in order to perform Load Balancing
> or not. We hope this is not the case.
>
Could you restate this comment, please? If SPEECHSC itself performs load 
balancing, it by definition needs to be aware of load balancing mechanisms. 
If it does not perform load balancing itself, it may still need to be aware 
that load balancing may occur, because there are known cases in which 
protocols break, or perform in a highly suboptimal way, when load balanced 
unless this was considered in the design. The intent of the requirement is 
to say that SPEECHSC may take into account load balancing, but not replicate 
the functions of load balancing protocols as part of its design.

We'd appreciate any attempt you could make to capture this better than the 
existing text.

> Section 5.6
> 6. The MUST requirement to permit multiple operations on a single stream
> may significantly affect SpeechSC design, but we do not believe the
> scenarios which are enabled by this are sufficiently compelling to make
> it a must-have. So we suggest this requirement should be a SHOULD.
>
In earlier list discussion and at the Yokohama IETF meeting, the consensus 
was that this is a MUST requirement. Let's revisit this given we have a 
counter view.

WG members, please weigh in here.

> Section 6
> 7. It is unclear what 'voicing' means in this section (in the world of
> phonology it refers to a specific phenomenon of pronunciation which is
> clearly not intended). Does it mean something along the lines of 'output
> voice selection'?
>
In fact, my understanding is that it was precisely the pronunciation 
details, in addition to the gross aspect of selecting a particular 
speaker's voice, that were intended. I could of course be wrong. WG - help 
us out here - which do you think was intended? I'll reword as needed once 
we have consensus.

> 8. Also the term 'reading' is somewhat confusing in this section. How
> about 'playback' or 'synthesis'?
>
Agree, and another set of comments pointed this out. I will change to 
"playback", and note that "synthesis" is not quite right either because the 
output may be just the result of matching a word or phrase in the text to a 
pre-canned audio file rather than actually doing synthesis.

> 9. The case of mixed TTS and audio files in the same output is not
> addressed explicitly anywhere in this section. This is a common use case
> (eg 'please speak after the beep <beep>', or 'here's your voice mail
> <message>, what do you want to do now?'). The text seems to assume a
> disjunction, but especially since SSML can carry references to remote
> audio files, it may be useful to address this scenario more explicitly.
>
Agree. Could you please SUPPLY TEXT. Thanks.

> 10. (paras 2, 3, 4) The subtle switch to explicit requirements on the
> TTS server is confusing. Surely we should be defining requirements only
> on the framework. The abilities of the components of the framework
> should be called out separately either as (a) conformance issues, or (b)
> assumptions on individual component capabilities. They should certainly
> be in a different section. In any case, we believe the requirements in
> these paragraphs should be requirements on the framework itself.
>
Hmmm, I suppose we could do a better job here. The intent is to highlight 
how requirements on the protocol relate to things the servers are assumed 
to be doing or not doing. While separating these out lexically might 
ameliorate this problem, it might cause other problems in understanding how 
the various requirements relate to one another. I could go either way on 
this. How strongly do you feel that the existing text/structure needs to 
change?

> 11. (para 6) The loose definitions in this paragraph render it hard to
> understand. We do not understand what is actually intended by the SHOULD
> and MAY capabilities here.
>
Ouch. You'll have to help me figure out what is wrong, since it's crystal 
clear to me (I know this isn't too helpful). What we are trying to say is 
that it is pretty important (SHOULD) that the protocol be able to keep its 
control channel, and hence its basic control synchronization state, for 
longer than one session (where session is loosely correlated with a "call", 
SIP dialog, or the session state of the associated voice application). We 
then say that it would be perfectly OK, but not required (MAY), for the 
protocol to be capable of a complete {setup/do work/shut down} cycle on a 
shorter scale, down to a single interaction for one utterance.
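To make the two granularities concrete, here is an illustrative sketch 
(Python; the class and method names are invented for this example and are 
not proposed protocol elements) contrasting a long-lived control channel 
with a per-utterance {setup/do work/shut down} cycle:

```python
class ControlChannel:
    """Toy model of a SPEECHSC control channel, for illustration only.
    The SHOULD case keeps one channel alive across many interactions;
    the MAY case runs a full lifecycle around a single utterance."""

    def __init__(self):
        self.open = False
        self.interactions = 0

    def setup(self):
        self.open = True

    def do_work(self, utterance: str) -> str:
        assert self.open, "channel must be set up first"
        self.interactions += 1
        return f"processed: {utterance}"

    def shutdown(self):
        self.open = False

# SHOULD case: one setup, many interactions (spans multiple "calls")
ch = ControlChannel()
ch.setup()
for u in ("first utterance", "second utterance"):
    ch.do_work(u)
ch.shutdown()

# MAY case: complete setup/do work/shutdown cycle for one utterance
one_shot = ControlChannel()
one_shot.setup()
result = one_shot.do_work("single utterance")
one_shot.shutdown()
```

The point the requirement tries to capture is that the first pattern must 
be well supported, while the second is permitted but not mandatory.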

> 12. (para 7) Again, point 10 on the TTS Server requirements. We also
> believe that such capabilities are not MUSTs in either case, but rather
> SHOULD for the framework (and correspondingly MAY for Server
> conformance).
>
The WG consensus was otherwise, but now you are the second set of folks 
commenting that we ought to water this down to SHOULD. I am personally 
inclined to keep this as a MUST, since adding this to the protocol later, if 
not accommodated in the design up front, could be very difficult. On a 
practical basis, a number of the candidate protocols (MRCP, RTSP) already 
do this, so meeting the MUST may not in practice be all that difficult.

Need WG feedback on this point so we can reach closure!

> 13. (penultimate para)
> The term 'prosody' is not defined, but this is a crucial requirement on
> the framework (see also point 7 on 'voicing').
>
Good point. Could you please SUPPLY TEXT.

> Section 7
> 14. Again, TCP seems to be ruled out in this section. A useful
> requirement may be to address 'guarantee of delivery' for ASR.
>
Where do you think TCP is ruled out in section 7?  I can't find it, sorry.

> 15. (para 5) The requirement to accommodate "all of the protocol
> parameters" of MRCP needs to be justified. We believe it is too strong,
> since it assumes that MRCP is an exact fit to SpeechSC, whereas we
> believe MRCP may be over-specialized for use in VoiceXML environments.
Au contraire. If MRCP met the requirements we likely would not even be 
having this WG or this discussion; we would be having an MRCP WG or just 
asking the IESG to bless MRCP as a proposed standard.

To address the substance of your comment, there was WG consensus that none 
of the parameters in MRCP could be eliminated without losing some 
functionality that was deemed essential. This is not to say that the set in 
MRCP is complete by any means. We know it's not. It would be helpful if you 
could enumerate the parameters in MRCP that you do not believe are 
essential, keeping in mind that we want the speechsc protocol to have wide 
applicability which *includes* but is *not limited to* VoiceXML 
environments.

> This paragraph is also another case where crucial requirements are made
> by reference to another spec; we think these should be explicit in this
> document.
>
Would you suggest we do the same for everything in SMIL as well, which we 
incorporate by reference? I'm sympathetic to the pain it causes a reader to 
have to chase separate documents, but am reluctant to mass-import text that 
is easily available elsewhere and may be more understandable in context.

If the WG wants to import this, I'll do it.

> 16. (para 7) Again, this is a requirement on a component, and it is
> unclear what it adds to the requirement on the framework in para 3.
>
I'm somewhat sympathetic to this view, but on balance I'm inclined to keep 
these things because they do very much inform the protocol design, in this 
case how important it is to presume caching of grammar definitions when 
there is a long-lived control channel.

> Section 8
> 17. (para 1) Again, assumption of RTP only.
>
> 18. (para 3) It is unclear what burden this requirement places on the
> framework.
>
It says that if the protocol is purely transactional, with no 
cross-transaction state, it may run afoul of SV servers that fail to work 
right if the protocol resets their state on each interaction. The 
requirement is stated this way so that the protocol evaluators/designers 
think about the underlying operation of the SV servers when considering the 
interaction models that the protocol needs to support.

Would you like to take a crack at expressing this better than the existing 
text?

> Section 9
> 19. It took a couple of readings to decipher that there are two very
> different definitions of 'dual mode' at play here. Suggest they are
> called out in advance of the rationale.
>
Right. We'll take a crack at this based on this comment and a similar 
comment from someone else. One good suggestion is to pitch the term 
"dual-mode" and say "full-duplex" instead, which may go a long way toward 
helping get across the substance.

> Section 10
> 20. (para 2) 'Investigating' is a rather vague term for a
> non-requirement. Is the intention to say that the WG will not be
> 'defining' DSR coding schemes?
>
Yes. In addition, we won't even discuss them or do any 
investigation/evaluation of various DSR schemes. They're just another codec 
to us :-)

[fin]
Dave.
------------------------
David R. Oran
7 Ladyslipper Lane
Acton, MA 01720
Office: +1 978 264 2048
VoIP: +1 408 571 4576
Email: oran@cisco.com
_______________________________________________
Speechsc mailing list
Speechsc@ietf.org
https://www1.ietf.org/mailman/listinfo/speechsc