Re: [speechsc] Hotword Recognition and Timers

"Dave Burke" <david.burke@voxpilot.com> Thu, 13 July 2006 15:56 UTC

From: Dave Burke <david.burke@voxpilot.com>
To: Andrew Wahbe <Andrew.Wahbe@genesyslab.com>, "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
References: <911B89A9FD71E649AA624FF24790D76F51AA40@GIMLI.us.int.genesyslab.com>
Subject: Re: [speechsc] Hotword Recognition and Timers
Date: Thu, 13 Jul 2006 16:55:58 +0100

That works for me. And just to clarify: this means that if the 
Recognition-Timer fired in hotword and there was a match, then 008 
success-maxtime would be returned - right?

Dave

----- Original Message ----- 
From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
To: "Dave Burke" <david.burke@voxpilot.com>; "IETF SPEECHSC (E-mail)" 
<speechsc@ietf.org>
Sent: Thursday, July 13, 2006 3:13 PM
Subject: RE: [speechsc] Hotword Recognition and Timers


One thing: if everyone is comfortable with these changes, then I wonder
what the purpose of the 003 hotword-maxtime completion cause code is. It
seems that throwing a 015 "no-match-maxtime" would not only work but
also make the most sense as the rest of the hotword behavior (from the
client's perspective) is more or less identical to the normal case.

What is the rationale for having a 003 hotword-maxtime completion cause
code at this point? If there is none, I would like to suggest that it be
removed. We can "reserve" the numeric code for future use to avoid
renumbering everything else if that is a concern.

Andrew

-----Original Message-----
From: Dave Burke [mailto:david.burke@voxpilot.com]
Sent: July 2, 2006 5:41 PM
To: Andrew Wahbe; IETF SPEECHSC (E-mail)
Subject: Re: [speechsc] Hotword Recognition and Timers

Andrew's proposals/clarifications make sense to me.

One interesting result, however, is that Andrew's definition for
Recognition-Timeout coincides with Hotword-Max-Duration except that the
former terminates the recognition when it fires. I don't think this is
necessarily a problem.

It seems (if I understand this thread properly) that the VoiceXML world
needs a maxspeechtimeout to terminate hotword but the MRCP protocol also
might need a safety net to prevent a RECOGNIZE going IN-PROGRESS
forever. For normal recognition, the Recognition-Timeout gets you both
the safety net and the maxspeechtimeout. Since the MRCP client can STOP
a recognition at any point this safety net is not crucial. In short - I'm
fine with Andrew's suggested changes.

Dave

----- Original Message ----- 
From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
To: "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
Sent: Monday, June 19, 2006 8:30 PM
Subject: RE: [speechsc] Hotword Recognition and Timers


The thing is that nowhere in your explanation are you mentioning the
prompt and its completion (i.e. the START-INPUT-TIMERS message). The
main use case and reason for hotword recognition/recognition-based
barge-in is to prevent accidental barge-in on audio content such as a
voicemail, tts email, etc. The scenario you describe below requires that
the client knows how long the content is when the RECOGNIZE is started;
this is definitely not an assumption you can make. The client won't know
how long it will take to TTS a chunk of text or how long the set of
audio files (prompts) are or even if they end at all (it could be a
continuous stream).

My proposal is that hotword recognition should "work" in a similar
manner to normal recognition from the client's perspective:

* RECOGNIZE is sent with the start-input-timers header set to "false".
The recognition-mode is set to "hotword". Prompt playback starts at this
point as well.
* START-INPUT-TIMERS is sent when the prompt completes. The
no-input-timer starts at this point.

The above two points are identical to the normal case except that the
recognition-mode is "hotword". My proposal is that the general meaning
of the recognition and no-input timers are also the same as the normal
case. Namely:

* The no-input timer is the max amount of time after the prompt
completes that we are willing to wait for input. This is equivalent to
the "timeout" property in VoiceXML. It is usually on the order of a few
seconds.
* The recognition timer is the max amount of time that we will run
recognition on a single "utterance". This is basically a safety net
protecting against noise (say the user left the phone off the hook next
to the radio) keeping the recognizer occupied for an unreasonable amount
of time. This only applies when speech is detected since the
no-input-timer will take effect (once the prompt is done) to terminate
the recognition. This is equivalent to the "maxspeechtimeout" property
in VoiceXML. This is usually quite a bit longer than the no-input
timeout, say 10 to 30 seconds.

Note that the definitions of timeout and maxspeechtimeout properties in
VoiceXML apply to both normal and hotword recognition, which is part of
the rationale for keeping the high-level meaning the same for both modes
in MRCP. At the end of the day, the developer has to answer two
questions regardless of what mode they are using:
* How long after the end of a prompt do I want to wait for
input? (no-input timeout)
* How much continuous noise am I willing to process before aborting a
recognition? (recognition timeout)

What makes things a little complicated is that in hotword recognition:
1) the detection of speech does not mean that "input" was detected -- we
don't have "input" until we have a match;
2) we can go from a state of processing speech/sound back to a state
where there is silence and we are waiting for speech.

The behaviors that were specified in the original email were an attempt
to keep the same high-level meanings for the timers while taking into
account the two points above. These special behaviors for hotword mode
were:
a) the no-input timer is not cancelled until there is a recognition
result.
b) the recognition timer is reset and turned off when an utterance that
doesn't match anything "ends" as determined by the incomplete timeout
firing. The recognition timer is re-enabled when subsequent speech is
detected.

Another behavior that the VoiceXML Forum MRCP Liaison Committee has
discussed recently is as follows:
c) if the no-input timer fires while speech is being processed, then the
recognition will not be aborted until the recognizer makes a decision on
that segment of speech (e.g. complete timeout, incomplete timeout,
recognition timeout, or early no-match). A no-match on the utterance at
this point would cause "no-input-timeout" to be returned for the
recognition.

This last behavior would prevent the no-input timeout from cutting off
recognition in the middle of an utterance, which might happen if we
followed (a) above.

To address your use cases below:

1. If you say nothing, the no-input timer will eventually fire (at the
specified number of milliseconds after the prompt is completed) and end
the recognition.

2. If you say something unintelligible, the no-input timer is not
stopped as that does not correspond to a recognition result in hotword.
Note that the no-input timer may not even be enabled if the prompt is
still playing. At the end of the unintelligible speech, the recognition
timer is stopped and turned off. When you later say something
intelligible, the recognition timer is turned back on while you are
speaking. Assuming your speech was short, the recognition timer is
turned back off when you are done speaking.  Since you now generated a
match, the no-input timer is also cancelled (if the prompt had finished)
and the result is returned.
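
The hotword rules (a) to (c) and the two use cases above can be sketched
as a small event-driven simulation. This is only an illustration under
stated assumptions, not text from the spec: the function name, the event
names ("prompt_done", "speech_start", "speech_end_nomatch", "match") and
the outcome labels are hypothetical, and the label returned when the
recognition timer fires is a placeholder, since the completion cause for
that case is exactly what this thread is still discussing.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (ms since RECOGNIZE was sent, hypothetical event name)


def simulate_hotword(events: List[Event],
                     no_input_timeout_ms: int,
                     recognition_timeout_ms: int) -> Optional[Tuple[int, str]]:
    """Return (time_ms, outcome), or None if the RECOGNIZE is still IN-PROGRESS."""
    no_input_deadline: Optional[int] = None  # armed by START-INPUT-TIMERS
    recog_deadline: Optional[int] = None     # armed only while speech is processed
    no_input_deferred = False                # rule (c): fired during an utterance

    def expire(now: float) -> Optional[Tuple[int, str]]:
        """Fire any timer whose deadline is at or before `now`."""
        nonlocal no_input_deferred
        in_speech = recog_deadline is not None
        no_input_due = (no_input_deadline is not None
                        and not no_input_deferred
                        and no_input_deadline <= now)
        if no_input_due and not in_speech:
            return (no_input_deadline, "no-input-timeout")
        if in_speech and recog_deadline <= now:
            # Assumption: placeholder label, the cause code is under discussion.
            return (recog_deadline, "recognition-timeout")
        if no_input_due:
            no_input_deferred = True  # rule (c): wait for the recognizer's decision
        return None

    for t, kind in sorted(events):
        result = expire(t)  # a timer due at time t fires before the event at t
        if result is not None:
            return result
        if kind == "prompt_done":        # client sends START-INPUT-TIMERS
            no_input_deadline = t + no_input_timeout_ms
        elif kind == "speech_start":     # recognition timer is (re-)enabled
            recog_deadline = t + recognition_timeout_ms
        elif kind == "speech_end_nomatch":
            recog_deadline = None        # rule (b): reset and turned off
            if no_input_deferred:
                return (t, "no-input-timeout")  # rule (c)
        elif kind == "match":
            return (t, "success")        # rule (a): only a match cancels no-input
        else:
            raise ValueError(f"unknown event: {kind}")
    return expire(float("inf"))


# Use case 1: nothing is said during or after a 30 s prompt.
print(simulate_hotword([(30000, "prompt_done")], 5000, 20000))
# Use case 2: unintelligible speech, then a command, all during the prompt.
print(simulate_hotword([(2000, "speech_start"), (3000, "speech_end_nomatch"),
                        (10000, "speech_start"), (11000, "match")], 5000, 20000))
```

With a 5 s no-input timeout and a 20 s recognition timeout, the first
call ends with the no-input timeout 5 s after the prompt completes, and
the second returns the match as soon as it occurs, without the earlier
unintelligible speech ending the recognition.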

Thanks,

Andrew Wahbe

-----Original Message-----
From: Saravanan Shanmugham (sarvi) [mailto:sarvi@cisco.com]
Sent: June 16, 2006 2:46 PM
To: Dan Burnett; IETF SPEECHSC (E-mail)
Subject: RE: [speechsc] Hotword Recognition and Timers


I can see that both No-Input-Timeout and Recognition-Timeout values will
be useful for Hotword recognition.
But saying that Recognition-Timer is started after speech is detected
bothers me.
Also, what do you expect typical values for these timers to be, based on
your proposed definitions?

Hotword recognition is very often used to issue commands.
So let's take the following scenario and look at possible cases.

When the system is reading out a long email, you should be able to issue
commands like "speed up" or "slow down" or "repeat" etc.

1. But then I might never say any command at all. So defining
Recognition-Timer as starting after speech is detected makes no sense in
this case. No-Input-Timer, if defined to be applicable to Hotword
recognition might make sense in this case.

2. Then I might say something unintelligible in the middle, which should
technically be ignored. And then a little later I might actually speak a
command, "speed up". Here when I said something unintelligible, the
No-Input-Timer would be stopped. If we went with the definition
proposed, the Recognition-Timer would be started here.

If you assume No-Input-Timer would be sufficiently large and
Recognition-Timer relatively small, this means that once we say
something not matching a hotword (which should technically be expected
to be ignored), the RECOGNIZE would complete due to Recognition-Timeout.

If we assume No-Input-Timer to be short and Recognition-Timer to be
long, then we are requiring that the user MUST say something,
intelligible or unintelligible, reasonably quickly, or the RECOGNIZE
would terminate due to No-Input-Timeout.

If we assume No-Input-Timer to be large and Recognition-Timer to be
large as well, then depending on whether I say something unintelligible
or not, the overall timeout could be pretty large, up to a max of
No-Input-Timer + Recognition-Timer.

The way I would expect this to work is that No-Input-Timer and
Recognition-Timer are started at the beginning of a hotword RECOGNIZE
and both are reasonably large values, with the No-Input-Timer most
likely equal to or smaller than the Recognition-Timer.

Now, if I said nothing at all and the No-Input-Timer expired, the
RECOGNIZE completes with no-input-timeout. The moment I say something,
unintelligible or intelligible, the No-Input-Timer is stopped.
Recognition-Timer continues on. If the current speech or a future
command matches a hotword grammar, the RECOGNIZE command completes
with success.
If nothing matches and the Recognition-Timer expires, the RECOGNIZE
completes with recognition-timeout.

This way, for hotword, Recognition-Timer is the max recognition time for
the RECOGNIZE, while No-Input-Timer would only be equal or smaller.
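
As a rough illustration of the model described above (both timers start
when the hotword RECOGNIZE begins, any speech stops the No-Input-Timer,
and the Recognition-Timer bounds the whole request), here is a minimal
sketch. The function name, the event names ("speech", "match") and the
outcome labels are made up for the example; they are not MRCP
completion-cause codes.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (ms since RECOGNIZE started, hypothetical event name)


def simulate_fixed_window(events: List[Event],
                          no_input_timeout_ms: int,
                          recognition_timeout_ms: int) -> Optional[Tuple[int, str]]:
    """Both timers run from time 0; return (time_ms, outcome)."""
    speech_heard = False  # any speech, intelligible or not, stops No-Input-Timer

    def expire(now: float) -> Optional[Tuple[int, str]]:
        """Fire whichever timer is due at or before `now`."""
        if (not speech_heard
                and no_input_timeout_ms <= now
                and no_input_timeout_ms <= recognition_timeout_ms):
            return (no_input_timeout_ms, "no-input-timeout")
        if recognition_timeout_ms <= now:
            return (recognition_timeout_ms, "recognition-timeout")
        return None

    for t, kind in sorted(events):
        result = expire(t)
        if result is not None:
            return result
        if kind == "speech":
            speech_heard = True      # No-Input-Timer stopped for good
        elif kind == "match":
            return (t, "success")    # hotword grammar matched
        else:
            raise ValueError(f"unknown event: {kind}")
    return expire(float("inf"))


# Case 1: nothing is ever said.
print(simulate_fixed_window([], 60000, 120000))
# Case 2: something unintelligible, then a command.
print(simulate_fixed_window([(10000, "speech"), (50000, "match")], 60000, 120000))
# Something unintelligible and then silence: the Recognition-Timer ends it.
print(simulate_fixed_window([(10000, "speech")], 60000, 120000))
```

In this sketch the Recognition-Timer is never reset, so its value is the
upper bound on how long the RECOGNIZE can stay in progress.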

Thx,
Sarvi

     -----Original Message-----
     From: Dan Burnett [mailto:dan_burnett2000@yahoo.com]
     Sent: Thursday, June 08, 2006 5:06 AM
     To: IETF SPEECHSC (E-mail)
     Subject: Re: [speechsc] Hotword Recognition and Timers

     This email is a result of discussions by the MRCP subgroup
     of the VoiceXML Forum, in which I participated, so I
     already agree with the proposals given here.

     However, I would like to hear comments from others before
     applying these changes to the spec draft, preferably from
     those who did not participate in the VoiceXML Forum discussions.

     This has been added to the issue tracker
     (http://www.softarmor.com/roundup/speechsc) as issue 88.

     -- dan



     --- Andrew Wahbe <awahbe@voicegenie.com> wrote:

     > The description of how timers (no-input and recognition) are used
     > during hotword recognition is inconsistent. In section 9.4.7, it is
     > stated that "For a hotword recognition mode, this timer is started
     > when the user begins speaking. Note that for Hotword mode recognition
     > the START-OF-INPUT event is not generated." However, section 9.9
     > states that for the hotword case: "The Recognition-Timer gets started
     > at the beginning of RECOGNIZE."
     >
     > It seems that section 9.9 is incorrect (or at least is inconsistent
     > with VoiceXML).
     >
     > Section 9.9 omits any mention of the no-input timer for the hotword
     > mode recognition case; however, none of the sections that deal with
     > the no-input timer make a distinction between the hotword and
     > non-hotword cases. VoiceXML also does not make this distinction. It
     > would seem that section 9.9 should be changed to indicate that
     > no-input timers are started in the hotword case and that
     > no-input-timeout is a valid completion cause for a hotword
     > recognition.
     >
     > A related question worth considering is if the recognition timer is
     > reset at any point, for example, on the detection of silence. Consider
     > the case when maxspeech has a value of say 20 seconds (a
     > typical/reasonable value) and hotword barge-in is being used on a
     > prompt that is 30 seconds long. This would mean that a user that spoke
     > briefly 2 seconds into the prompt (and was silent for the remainder of
     > the prompt) would experience a maxspeech timeout at about 22 seconds
     > into the prompt. They would not hear the whole prompt which seems
     > inappropriate. The reason for maxspeech timeout is to catch continuous
     > noise and keep it from occupying a recognizer; but what should happen
     > in periods of silence in the hotword case?
     >
     > Similarly, when is the no-input timer canceled in the hotword case? Is
     > it when speech (not necessarily matching) is detected? Or is it only
     > upon a match?
     >
     > The correct behavior in my opinion is that the no-input timer is
     > canceled only on a match, and that the recognition timer should be
     > reset if silence (determined by complete timeout and incomplete
     > timeout) is detected. If we are just processing intermittent noise,
     > the no-input timer will eventually expire. Continuous noise is handled
     > by the recognition timer. Of course there are other possibilities as
     > well, this is just one option that I think fits with VoiceXML.
     >
     > _______________________________________________
     > Speechsc mailing list
     > Speechsc@ietf.org
     > https://www1.ietf.org/mailman/listinfo/speechsc
     >

