Re: [speechsc] Hotword Recognition and Timers
"Dave Burke" <david.burke@voxpilot.com> Thu, 13 July 2006 15:56 UTC
From: Dave Burke <david.burke@voxpilot.com>
To: Andrew Wahbe <Andrew.Wahbe@genesyslab.com>, "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
References: <911B89A9FD71E649AA624FF24790D76F51AA40@GIMLI.us.int.genesyslab.com>
Subject: Re: [speechsc] Hotword Recognition and Timers
Date: Thu, 13 Jul 2006 16:55:58 +0100
That works for me. And just to clarify: this means that if the Recognition-Timer fired in hotword and there was a match, then 008 success-maxtime would be returned - right?

Dave

----- Original Message -----
From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
To: "Dave Burke" <david.burke@voxpilot.com>; "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
Sent: Thursday, July 13, 2006 3:13 PM
Subject: RE: [speechsc] Hotword Recognition and Timers

One thing: if everyone is comfortable with these changes, then I wonder what the purpose of the 003 hotword-maxtime completion cause code is. It seems that throwing a 015 "no-match-maxtime" would not only work but also make the most sense as the rest of the hotword behavior (from the client's perspective) is more or less identical to the normal case.

What is the rationale for having a 003 hotword-maxtime completion cause code at this point? If there is none, I would like to suggest that it be removed. We can "reserve" the numeric code for future use to avoid renumbering everything else if that is a concern.

Andrew

-----Original Message-----
From: Dave Burke [mailto:david.burke@voxpilot.com]
Sent: July 2, 2006 5:41 PM
To: Andrew Wahbe; IETF SPEECHSC (E-mail)
Subject: Re: [speechsc] Hotword Recognition and Timers

Andrew's proposals/clarifications make sense to me. One interesting result, however, is that Andrew's definition for Recognition-Timeout coincides with Hotword-Max-Duration except that the former terminates the recognition when it fires. I don't think this is necessarily a problem.

It seems (if I understand this thread properly) that the VoiceXML world needs a maxspeechtimeout to terminate hotword but the MRCP protocol also might need a safety net to prevent a RECOGNIZE going IN-PROGRESS forever. For normal recognition, the Recognition-Timeout gets you both the safety net and the maxspeechtimeout. Since the MRCP client can STOP a recognition at any point this safety net is not crucial.
In short - I'm fine with Andrew's suggested changes.

Dave

----- Original Message -----
From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
To: "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
Sent: Monday, June 19, 2006 8:30 PM
Subject: RE: [speechsc] Hotword Recognition and Timers

The thing is that nowhere in your explanation do you mention the prompt and its completion (i.e. the START-INPUT-TIMERS message). The main use case and reason for hotword recognition/recognition-based barge-in is to prevent accidental barge-in on audio content such as a voicemail, TTS email, etc. The scenario you describe below requires that the client knows how long the content is when the RECOGNIZE is started; this is definitely not an assumption you can make. The client won't know how long it will take to TTS a chunk of text or how long the set of audio files (prompts) is or even if they end at all (it could be a continuous stream).

My proposal is that hotword recognition should "work" in a similar manner to normal recognition from the client's perspective:

* RECOGNIZE is sent with the start-input-timers header set to "false". The recognition-mode is set to "hotword". Prompt playback starts at this point as well.
* START-INPUT-TIMERS is sent when the prompt completes. The no-input-timer starts at this point.

The above two points are identical to the normal case except that the recognition-mode is "hotword". My proposal is that the general meaning of the recognition and no-input timers is also the same as in the normal case. Namely:

* The no-input timer is the max amount of time after the prompt completes that we are willing to wait for input. This is equivalent to the "timeout" property in VoiceXML. It is usually on the order of a few seconds.
* The recognition timer is the max amount of time that we will run recognition on a single "utterance".
  This is basically a safety net protecting against noise (say the user left the phone off the hook next to the radio) keeping the recognizer occupied for an unreasonable amount of time. This only applies when speech is detected since the no-input-timer will take effect (once the prompt is done) to terminate the recognition. This is equivalent to the "maxspeechtimeout" property in VoiceXML. This is usually quite a bit longer than the no-input timeout, say 10 to 30 seconds.

Note that the definitions of the timeout and maxspeechtimeout properties in VoiceXML apply to both normal and hotword recognition, which is part of the rationale for keeping the high-level meaning the same for both modes in MRCP. At the end of the day, the developer has to answer two questions regardless of what mode they are using:

* How long after the end of a prompt do I want to wait for input? (no-input timeout)
* How much continuous noise am I willing to process before aborting a recognition? (recognition timeout)

What makes things a little complicated is that in hotword recognition:

1) the detection of speech does not mean that "input" was detected -- we don't have "input" until we have a match;
2) we can go from a state of processing speech/sound back to a state where there is silence and we are waiting for speech.

The behaviors that were specified in the original email were an attempt to keep the same high-level meanings for the timers while taking into account the two points above. These special behaviors for hotword mode were:

a) the no-input timer is not cancelled until there is a recognition result.
b) the recognition timer is reset and turned off when an utterance that doesn't match anything "ends" as determined by the incomplete timeout firing. The recognition timer is re-enabled when subsequent speech is detected.
Another behavior that the VoiceXML Forum MRCP Liaison Committee has discussed recently is as follows:

c) if the no-input timer fires while speech is being processed, then the recognition will not be aborted until the recognizer makes a decision on that segment of speech (e.g. complete timeout, incomplete timeout, recognition timeout, or early no-match). A no-match on the utterance at this point would cause "no-input-timeout" to be returned for the recognition.

This last behavior would prevent the no-input timeout from cutting off recognition in the middle of an utterance, which might happen if we followed (a) above.

To address your use cases below:

1. If you say nothing, the no-input timer will eventually fire (at the specified number of milliseconds after the prompt is completed) and end the recognition.
2. If you say something unintelligible, the no-input timer is not stopped as that does not correspond to a recognition result in hotword. Note that the no-input timer may not even be enabled if the prompt is still playing. At the end of the unintelligible speech, the recognition timer is stopped and turned off. When you later say something intelligible, the recognition timer is turned back on while you are speaking. Assuming your speech was short, the recognition timer is turned back off when you are done speaking. Since you have now generated a match, the no-input timer is also cancelled (if the prompt had finished) and the result is returned.

Thanks,
Andrew Wahbe

-----Original Message-----
From: Saravanan Shanmugham (sarvi) [mailto:sarvi@cisco.com]
Sent: June 16, 2006 2:46 PM
To: Dan Burnett; IETF SPEECHSC (E-mail)
Subject: RE: [speechsc] Hotword Recognition and Timers

I can see that both No-Input-Timeout and Recognition-Timeout values will be useful for Hotword recognition. But saying that the Recognition-Timer is started after speech is detected bothers me. Also, what do you expect typical values for these timers to be, based on your proposed definitions?
Hotword recognition is very often used to issue commands. So let's take the following scenario and look at possible cases. When the system is reading out a long email, you should be able to issue commands like "speed up" or "slow down" or "repeat" etc.

1. But then I might never say any command at all. So defining the Recognition-Timer as starting after speech is detected makes no sense in this case. The No-Input-Timer, if defined to be applicable to Hotword recognition, might make sense in this case.
2. Then I might say something unintelligible in the middle, which should technically be ignored. And then a little later I might actually speak a command, "speed up". Here, when I said something unintelligible, the No-Input-Timer would be stopped. If we went with the definition proposed, the Recognition-Timer would be started here.

If you assume the No-Input-Timer is sufficiently large and the Recognition-Timer is relatively small, this means that once we say something not matching a hotword (which should technically be ignored), the RECOGNIZE would complete due to Recognition-Timeout.

If we assume the No-Input-Timer to be short and the Recognition-Timer to be long, then we are requiring that the user MUST say something intelligible or unintelligible reasonably quickly, or the RECOGNIZE would terminate due to No-Input-Timeout.

If we assume the No-Input-Timer to be large and the Recognition-Timer to be large as well, then, depending on whether I say something unintelligible or not, the overall timeout could be pretty large, up to a maximum of No-Input-Timer + Recognition-Timer.

The way I would expect this to work is that the No-Input-Timer and Recognition-Timer are started at the beginning of a hotword RECOGNIZE and both are reasonably large values, the No-Input-Timer most likely being equal to or smaller than the Recognition-Timer. Now, if I said nothing at all and the No-Input-Timer expired, the RECOGNIZE completes with no-input-timeout.
The moment I say something, unintelligible or intelligible, the No-Input-Timer is stopped. The Recognition-Timer continues on. If the current speech or a future command matches a hotword grammar, the RECOGNIZE command completes with success. If nothing matches and the Recognition-Timer expires, the RECOGNIZE completes with recognition-timeout.

This way, for hotword, the Recognition-Timer is the max recognition time for the RECOGNIZE, while the No-Input-Timer would only be equal or smaller.

Thx,
Sarvi

-----Original Message-----
From: Dan Burnett [mailto:dan_burnett2000@yahoo.com]
Sent: Thursday, June 08, 2006 5:06 AM
To: IETF SPEECHSC (E-mail)
Subject: Re: [speechsc] Hotword Recognition and Timers

This email is a result of discussions by the MRCP subgroup of the VoiceXML Forum, in which I participated, so I already agree with the proposals given here. However, I would like to hear comments from others before applying these changes to the spec draft, preferably from those who did not participate in the VoiceXML Forum discussions.

This has been added to the issue tracker (http://www.softarmor.com/roundup/speechsc) as issue 88.

-- dan

--- Andrew Wahbe <awahbe@voicegenie.com> wrote:

> The description of how timers (no-input and recognition) are used
> during hotword recognition is inconsistent. In section 9.4.7, it is
> stated that "For a hotword recognition mode, this timer is started
> when the user begins speaking. Note that for Hotword mode recognition
> the START-OF-INPUT event is not generated." However, section 9.9
> states that for the hotword case: "The Recognition-Timer gets started
> at the beginning of RECOGNIZE."
>
> It seems that section 9.9 is incorrect (or at least is inconsistent
> with VoiceXML).
>
> Section 9.9 omits any mention of the no-input timer for the hotword
> mode recognition case; however, none of the sections that deal with
> the no-input timer make a distinction between the hotword and
> non-hotword cases.
> VoiceXML also does not make this distinction. It would seem that
> section 9.9 should be changed to indicate that no-input timers are
> started in the hotword case and that no-input-timeout is a valid
> completion cause for a hotword recognition.
>
> A related question worth considering is if the recognition timer is
> reset at any point, for example, on the detection of silence. Consider
> the case when maxspeech has a value of say 20 seconds (a
> typical/reasonable value) and hotword barge-in is being used on a
> prompt that is 30 seconds long. This would mean that a user that spoke
> briefly 2 seconds into the prompt (and was silent for the remainder of
> the prompt) would experience a maxspeech timeout at about 22 seconds
> into the prompt. They would not hear the whole prompt, which seems
> inappropriate. The reason for maxspeech timeout is to catch continuous
> noise and keep it from occupying a recognizer; but what should happen
> in periods of silence in the hotword case?
>
> Similarly, when is the no-input timer canceled in the hotword case? Is
> it when speech (not necessarily matching) is detected? Or is it only
> upon a match?
>
> The correct behavior in my opinion is that the no-input timer is
> canceled only on a match, and that the recognition timer should be
> reset if silence (determined by complete timeout and incomplete
> timeout) is detected. If we are just processing intermittent noise,
> the no-input timer will eventually expire. Continuous noise is handled
> by the recognition timer. Of course there are other possibilities as
> well; this is just one option that I think fits with VoiceXML.
>
> begin:vcard
> fn:Andrew Wahbe
> n:Wahbe;Andrew
> org:VoiceGenie Technologies INC.
> adr:8th Floor;;1120 Finch Avenue W.;Toronto;ON;M3J 3H7;Canada
> email;internet:awahbe@voicegenie.com
> title:Senior Architect
> tel;work:(416) 736-0905 ext. 258
> tel;fax:(416) 736-1551
> x-mozilla-html:TRUE
> url:http://www.voicegenie.com
> version:2.1
> end:vcard
>
> _______________________________________________
> Speechsc mailing list
> Speechsc@ietf.org
> https://www1.ietf.org/mailman/listinfo/speechsc
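
For readers following the thread, the hotword timer behavior proposed in Andrew Wahbe's June 19 message (points a, b and c), together with the completion causes discussed on July 13, can be sketched roughly as below. This is only an illustrative Python model, not text from the MRCP draft: the Utterance type, the hotword_outcome function, its parameter names, and the "in-progress" placeholder are hypothetical, and the model assumes each speech segment is known in advance along with whether it matched the hotword grammar.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    """One segment of detected speech (hypothetical type for this sketch)."""
    start: float   # seconds after RECOGNIZE when speech is detected
    end: float     # when the recognizer would decide on the segment by itself
    matched: bool  # assumed known: does the segment match the hotword grammar?


def hotword_outcome(
    utterances: List[Utterance],
    prompt_end: Optional[float],   # time of START-INPUT-TIMERS, None if never sent
    no_input_timeout: float,
    recognition_timeout: float,
) -> Tuple[str, Optional[float]]:
    """Return (completion cause, time) under the proposed hotword rules."""
    # The no-input timer starts when the prompt completes, not at RECOGNIZE.
    deadline = None if prompt_end is None else prompt_end + no_input_timeout

    for u in sorted(utterances, key=lambda x: x.start):
        # The no-input timer fires during silence, before this utterance begins.
        if deadline is not None and deadline <= u.start:
            return ("no-input-timeout", deadline)

        # The recognition timer is (re)enabled when speech is detected.
        max_end = u.start + recognition_timeout
        if u.end > max_end:
            # It fires mid-utterance and terminates the recognition.
            cause = "success-maxtime" if u.matched else "no-match-maxtime"
            return (cause, max_end)

        if u.matched:
            # Only a match counts as "input"; both timers are cancelled.
            return ("success", u.end)

        # (b) No match: the recognition timer is reset and turned off.
        # (a) The no-input timer keeps running through unmatched speech.
        # (c) If it expired during this utterance, report it only now.
        if deadline is not None and deadline <= u.end:
            return ("no-input-timeout", u.end)

    if deadline is not None:
        return ("no-input-timeout", deadline)
    # Placeholder (not an MRCP cause): prompt still playing, nothing matched.
    return ("in-progress", None)
```

Taking the 30-second prompt and 20-second maxspeech example from the original message, and assuming a 5-second no-input timeout, hotword_outcome([Utterance(2, 3, False)], 30, 5, 20) returns ("no-input-timeout", 35): the brief unmatched speech at 2 seconds no longer cuts the prompt off at 22 seconds.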
- [speechsc] Hotword Recognition and Timers Andrew Wahbe
- Re: [speechsc] Hotword Recognition and Timers Dan Burnett
- RE: [speechsc] Hotword Recognition and Timers Saravanan Shanmugham (sarvi)
- RE: [speechsc] Hotword Recognition and Timers Andrew Wahbe
- Re: [speechsc] Hotword Recognition and Timers Dave Burke
- RE: [speechsc] Hotword Recognition and Timers Andrew Wahbe
- Re: [speechsc] Hotword Recognition and Timers Dave Burke
- RE: [speechsc] Hotword Recognition and Timers Andrew Wahbe
- Re: [Speechsc] [speechsc] Hotword Recognition and… Joe Wong
- Re: [Speechsc] [speechsc] Hotword Recognition and… Dan Burnett