RE: [speechsc] Hotword Recognition and Timers

"Andrew Wahbe" <Andrew.Wahbe@genesyslab.com> Mon, 19 June 2006 19:30 UTC


The thing is that nowhere in your explanation do you mention the
prompt and its completion (i.e., the START-INPUT-TIMERS message). The
main use case and reason for hotword recognition/recognition-based
barge-in is to prevent accidental barge-in on audio content such as a
voicemail, TTS email, etc. The scenario you describe below requires that
the client know how long the content is when the RECOGNIZE is started;
this is definitely not an assumption you can make. The client won't know
how long it will take to TTS a chunk of text, how long the set of
audio files (prompts) is, or even whether they end at all (it could be a
continuous stream).

My proposal is that hotword recognition should "work" in a similar
manner to normal recognition from the client's perspective:

* RECOGNIZE is sent with the start-input-timers header set to "false".
The recognition-mode is set to "hotword". Prompt playback starts at this
point as well.
* START-INPUT-TIMERS is sent when the prompt completes. The
no-input-timer starts at this point.
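
As a rough illustration of the ordering in the two points above, here is
a minimal Python sketch. RecordingChannel and both function names are
hypothetical stand-ins (this is not a real MRCP library); the channel
only records what would be sent. The header spellings
Start-Input-Timers and Recognition-Mode are assumed to follow the usual
MRCP header capitalization.

```python
class RecordingChannel:
    """Hypothetical stand-in for an MRCP control channel; it only records requests."""

    def __init__(self):
        self.sent = []

    def send(self, method, headers=None):
        # Store the method name and a copy of its headers, in send order.
        self.sent.append((method, dict(headers or {})))


def start_hotword_recognition(channel, start_prompt):
    """Send RECOGNIZE with the input timers held back, then start the prompt."""
    channel.send("RECOGNIZE", {
        "Start-Input-Timers": "false",   # timers wait for START-INPUT-TIMERS
        "Recognition-Mode": "hotword",
    })
    start_prompt()  # prompt playback starts at this point as well


def on_prompt_complete(channel):
    """The prompt has finished: the no-input timer starts from here."""
    channel.send("START-INPUT-TIMERS")
```

The client never needs to know the prompt length up front; it only
reacts to the prompt-complete notification, whenever that arrives.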

The above two points are identical to the normal case except that the
recognition-mode is "hotword". My proposal is that the general meaning
of the recognition and no-input timers are also the same as the normal
case. Namely:

* The no-input timer is the max amount of time after the prompt
completes that we are willing to wait for input. This is equivalent to
the "timeout" property in VoiceXML. It is usually on the order of a few
seconds.
* The recognition timer is the max amount of time that we will run
recognition on a single "utterance". This is basically a safety net
protecting against noise (say the user left the phone off the hook next
to the radio) keeping the recognizer occupied for an unreasonable amount
of time. This only applies when speech is detected; when there is no
speech, the no-input timer will take effect (once the prompt is done) to
terminate the recognition. This is equivalent to the "maxspeechtimeout"
property in VoiceXML. This is usually quite a bit longer than the
no-input timeout, say 10 to 30 seconds.

Note that the definitions of the timeout and maxspeechtimeout properties in
VoiceXML apply to both normal and hotword recognition, which is part of
the rationale for keeping the high-level meaning the same for both modes
in MRCP. At the end of the day, the developer has to answer two
questions regardless of what mode they are using:
* How long after the end of a prompt do I want to wait for
input? (no-input timeout)
* How much continuous noise am I willing to process before aborting a
recognition? (recognition timeout)
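
To make the mapping concrete, here is a small Python sketch that turns
the two VoiceXML properties into the corresponding MRCP timeout headers.
It assumes the MRCP values are carried in milliseconds and that the
VoiceXML values are time designations such as "5s" or "500ms". The
function names and example values are only illustrative.

```python
import re

# VoiceXML property -> MRCP header, per the equivalence described above.
VXML_TO_MRCP = {
    "timeout": "No-Input-Timeout",
    "maxspeechtimeout": "Recognition-Timeout",
}

# A non-negative number immediately followed by "ms" or "s".
_TIME = re.compile(r"(\d+(?:\.\d+)?|\.\d+)(ms|s)")


def vxml_time_to_ms(value):
    """Convert a time designation such as "5s" or "500ms" to whole milliseconds."""
    m = _TIME.fullmatch(value.strip())
    if m is None:
        raise ValueError(f"not a time designation: {value!r}")
    number, unit = m.groups()
    factor = 1000 if unit == "s" else 1
    return round(float(number) * factor)


def mrcp_timer_headers(vxml_properties):
    """Build the timeout headers from whichever timer properties are present."""
    return {
        VXML_TO_MRCP[name]: str(vxml_time_to_ms(value))
        for name, value in vxml_properties.items()
        if name in VXML_TO_MRCP
    }
```

The same two headers come out regardless of recognition mode, which is
the point: the developer sets one no-input value and one recognition
value, and does not need a separate vocabulary for hotword.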

What makes things a little complicated is that in hotword recognition:
1) the detection of speech does not mean that "input" was detected -- we
don't have "input" until we have a match;
2) we can go from a state of processing speech/sound back to a state
where there is silence and we are waiting for speech.

The behaviors that were specified in the original email were an attempt
to keep the same high-level meanings for the timers while taking into
account the two points above. These special behaviors for hotword mode
were:
a) the no-input timer is not cancelled until there is a recognition
result.
b) the recognition timer is reset and turned off when an utterance that
doesn't match anything "ends" as determined by the incomplete timeout
firing. The recognition timer is re-enabled when subsequent speech is
detected.

Another behavior that the VoiceXML Forum MRCP Liaison Committee has
discussed recently is as follows:
c) if the no-input timer fires while speech is being processed, then the
recognition will not be aborted until the recognizer makes a decision on
that segment of speech (e.g., complete timeout, incomplete timeout,
recognition timeout, or early no-match). A no-match on the utterance at
this point would cause "no-input-timeout" to be returned for the
recognition.

This last behavior would prevent the no-input timeout from cutting off
recognition in the middle of an utterance, which might happen if we
followed (a) above.
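
Behaviors (a), (b) and (c) can be sketched as a small event-driven model
in Python. This is only an illustration under stated assumptions: the
class and method names are hypothetical, times are seconds on a
caller-supplied clock, and I assume that continuous sound running past
the recognition timer ends the recognition with "recognition-timeout".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HotwordTimers:
    """Model of the hotword timer rules; names are illustrative, not MRCP API."""
    no_input_timeout: float
    recognition_timeout: float
    no_input_deadline: Optional[float] = None     # None = timer not running
    recognition_deadline: Optional[float] = None  # None = timer not running
    in_speech: bool = False
    result: Optional[str] = None                  # completion cause, once known

    def _check(self, now):
        """Apply any timer expiry that has happened by 'now'."""
        if self.result is not None:
            return
        if self.in_speech:
            # Assumption: continuous sound past the recognition timer aborts.
            if now >= self.recognition_deadline:
                self.result = "recognition-timeout"
            # (c) an expired no-input timer is deferred while speech is processed.
        elif self.no_input_deadline is not None and now >= self.no_input_deadline:
            self.result = "no-input-timeout"

    def start_input_timers(self, now):
        """START-INPUT-TIMERS received: the prompt has completed."""
        self._check(now)
        if self.result is None:
            self.no_input_deadline = now + self.no_input_timeout

    def speech_start(self, now):
        """Speech detected: (re-)enable the recognition timer."""
        self._check(now)
        if self.result is None and not self.in_speech:
            self.in_speech = True
            # (a) the no-input timer is deliberately left untouched here.
            self.recognition_deadline = now + self.recognition_timeout

    def utterance_end(self, now, matched):
        """The recognizer has decided on the current segment of speech."""
        self._check(now)
        if self.result is not None:
            return self.result
        self.in_speech = False
        self.recognition_deadline = None      # (b) reset and turned off
        if matched:
            self.no_input_deadline = None     # (a) cancelled only on a result
            self.result = "success"
        elif self.no_input_deadline is not None and now >= self.no_input_deadline:
            self.result = "no-input-timeout"  # (c) deferred expiry applies now
        return self.result

    def poll(self, now):
        """Return the completion cause if the recognition has ended, else None."""
        self._check(now)
        return self.result
```

For example, with a 5 second no-input timeout and a 20 second
recognition timeout, a no-match that ends 7 seconds after
START-INPUT-TIMERS (speech having started at 4 seconds) yields
"no-input-timeout" only when the utterance ends, not at the 5 second
mark.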

To address your use cases below:

1. If you say nothing, the no-input timer will eventually fire (at the
specified number of milliseconds after the prompt is completed) and end
the recognition.

2. If you say something unintelligible, the no-input timer is not
stopped, as that does not correspond to a recognition result in hotword
mode. Note that the no-input timer may not even be enabled if the prompt
is still playing. At the end of the unintelligible speech, the
recognition timer is stopped and turned off. When you later say something
intelligible, the recognition timer is turned back on while you are
speaking. Assuming your speech was short, the recognition timer is
turned back off when you are done speaking. Since you have now generated
a match, the no-input timer is also cancelled (if the prompt had
finished) and the result is returned.

Thanks,

Andrew Wahbe 

-----Original Message-----
From: Saravanan Shanmugham (sarvi) [mailto:sarvi@cisco.com] 
Sent: June 16, 2006 2:46 PM
To: Dan Burnett; IETF SPEECHSC (E-mail)
Subject: RE: [speechsc] Hotword Recognition and Timers 

 
I can see that both No-Input-Timeout and Recognition-Timeout values will
be useful for hotword recognition.
But saying that the Recognition-Timer is started after speech is detected
bothers me.
Also, what do you expect typical values for these timers to be, based on
your proposed definitions?

Hotword recognition is very often used to issue commands.
So let's take the following scenario and look at possible cases.

When the system is reading out a long email, you should be able to issue
commands like "speed up", "slow down", or "repeat".

1. But then I might never say any command at all. So defining the
Recognition-Timer as starting after speech is detected makes no sense in
this case. The No-Input-Timer, if defined to be applicable to hotword
recognition, might make sense in this case.

2. Then I might say something unintelligible in the middle, which should
technically be ignored. And then a little later I might actually speak a
command, "speed up". Here, when I said something unintelligible, the
No-Input-Timer would be stopped. If we went with the definition
proposed, the Recognition-Timer would be started here.

If you assume the No-Input-Timer would be sufficiently large and the
Recognition-Timer relatively small, this means that once we say
something not matching a hotword (which should technically be
ignored), the RECOGNIZE would complete due to Recognition-Timeout.

If we assume the No-Input-Timer to be short and the Recognition-Timer to
be long, then we are requiring that the user MUST say something,
intelligible or unintelligible, reasonably quickly, or the RECOGNIZE
would terminate due to No-Input-Timeout.

If we assume the No-Input-Timer to be large and the Recognition-Timer to
be large as well, then depending on whether I say something
unintelligible or not, the overall timeout could be pretty large, up to
a max of No-Input-Timer + Recognition-Timer.

The way I would expect this to work is that the No-Input-Timer and
Recognition-Timer are started at the beginning of a hotword RECOGNIZE and
both have reasonably large values, the No-Input-Timer most likely being
equal to or smaller than the Recognition-Timer.

Now, if I said nothing at all and the No-Input-Timer expired, the
RECOGNIZE completes with no-input-timeout. The moment I say something,
unintelligible or intelligible, the No-Input-Timer is stopped. The
Recognition-Timer continues on. If the current speech or a future
command matches a hotword grammar, the RECOGNIZE completes
with success.
If nothing matches and the Recognition-Timer expires, the RECOGNIZE
completes with recognition-timeout.

This way, for hotword, the Recognition-Timer is the max recognition time
for the RECOGNIZE, while the No-Input-Timer would only be equal or
smaller.
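
The model described in the last three paragraphs can be sketched in a
few lines of Python. This is only an illustration: the function and
parameter names are hypothetical, all times are seconds measured from
the start of the RECOGNIZE, and match_at (when given) is assumed to be
no earlier than first_speech_at.

```python
from typing import Optional, Tuple


def completion_at_start_model(
    no_input_timeout: float,
    recognition_timeout: float,
    first_speech_at: Optional[float] = None,
    match_at: Optional[float] = None,
) -> Tuple[str, float]:
    """Return (completion cause, completion time) when both timers
    start at the beginning of the RECOGNIZE."""
    if no_input_timeout > recognition_timeout:
        raise ValueError("No-Input-Timer is expected to be <= Recognition-Timer")
    # Nothing said before the No-Input-Timer expires.
    if first_speech_at is None or first_speech_at >= no_input_timeout:
        return ("no-input-timeout", no_input_timeout)
    # Any speech stopped the No-Input-Timer; only the Recognition-Timer remains.
    if match_at is not None and match_at < recognition_timeout:
        return ("success", match_at)
    return ("recognition-timeout", recognition_timeout)
```

Note that in this model the Recognition-Timer bounds the whole
RECOGNIZE, so its value has to be chosen relative to when the RECOGNIZE
starts rather than relative to any single utterance.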

Thx,
Sarvi

     -----Original Message-----
     From: Dan Burnett [mailto:dan_burnett2000@yahoo.com] 
     Sent: Thursday, June 08, 2006 5:06 AM
     To: IETF SPEECHSC (E-mail)
     Subject: Re: [speechsc] Hotword Recognition and Timers 
     
     This email is a result of discussions by the MRCP subgroup 
     of the VoiceXML Forum, in which I participated, so I 
     already agree with the proposals given here.
     
     However, I would like to hear comments from others before 
     applying these changes to the spec draft, preferably from 
     those who did not participate in the VoiceXML Forum discussions.
     
     This has been added to the issue tracker
     (http://www.softarmor.com/roundup/speechsc) as issue 88.
     
     -- dan
     
     
     
     --- Andrew Wahbe <awahbe@voicegenie.com> wrote:
     
     > The description of how timers (no-input and recognition) are used
     > during hotword recognition is inconsistent. In section 9.4.7, it is
     > stated that "For a hotword recognition mode, this timer is started
     > when the user begins speaking. Note that for Hotword mode recognition
     > the START-OF-INPUT event is not generated." However, section 9.9
     > states that for the hotword case: "The Recognition-Timer gets started
     > at the beginning of RECOGNIZE."
     >
     > It seems that section 9.9 is incorrect (or at least is inconsistent
     > with VoiceXML).
     >
     > Section 9.9 omits any mention of the no-input timer for the hotword
     > mode recognition case; however, none of the sections that deal with
     > the no-input timer make a distinction between the hotword and
     > non-hotword cases. VoiceXML also does not make this distinction.
     > It would seem that section 9.9 should be changed to indicate that
     > no-input timers are started in the hotword case and that
     > no-input-timeout is a valid completion cause for a hotword
     > recognition.
     >
     > A related question worth considering is if the recognition timer is
     > reset at any point, for example, on the detection of silence. Consider
     > the case when maxspeech has a value of say 20 seconds (a
     > typical/reasonable value) and hotword barge-in is being used on a
     > prompt that is 30 seconds long. This would mean that a user that spoke
     > briefly 2 seconds into the prompt (and was silent for the remainder of
     > the prompt) would experience a maxspeech timeout at about 22 seconds
     > into the prompt. They would not hear the whole prompt which seems
     > inappropriate. The reason for maxspeech timeout is to catch continuous
     > noise and keep it from occupying a recognizer; but what should happen
     > in periods of silence in the hotword case?
     >
     > Similarly, when is the no-input timer canceled in the hotword case? Is
     > it when speech (not necessarily matching) is detected? Or is it only
     > upon a match?
     >
     > The correct behavior in my opinion is that the no-input timer is
     > canceled only on a match, and that the recognition timer should be
     > reset if silence (determined by complete timeout and incomplete
     > timeout) is detected. If we are just processing intermittent noise,
     > the no-input timer will eventually expire. Continuous noise is handled
     > by the recognition timer. Of course there are other possibilities as
     > well; this is just one option that I think fits with VoiceXML.
     > > begin:vcard
     > fn:Andrew Wahbe
     > n:Wahbe;Andrew
     > org:VoiceGenie Technologies INC.
     > adr:8th Floor;;1120 Finch Avenue W.;Toronto;ON;M3J 3H7;Canada 
     > email;internet:awahbe@voicegenie.com
     > title:Senior Architect
     > tel;work:(416) 736-0905 ext. 258
     > tel;fax:(416) 736-1551
     > x-mozilla-html:TRUE
     > url:http://www.voicegenie.com
     > version:2.1
     > end:vcard
     > 
     > > _______________________________________________
     > Speechsc mailing list
     > Speechsc@ietf.org
     > https://www1.ietf.org/mailman/listinfo/speechsc
     > 
     
     

_______________________________________________
Speechsc mailing list
Speechsc@ietf.org
https://www1.ietf.org/mailman/listinfo/speechsc
