Re: [Speechsc] [speechsc] Hotword Recognition and Timers

Dan Burnett <dburnett@voxeo.com> Tue, 05 May 2009 20:27 UTC


Joe,

Issue 88 was resolved in Draft 13.  The change notes say

======
- Issue 88: Moved text about the No-Input and Recognition timers in
9.9 from the beginning of the Non-Hotword mode section to before the two
modes, to make clear that it applies to both modes.  Added bullet in 9.9
hotword mode section about what to do when the No-Input timer
expires.  Added bullet in 9.9 hotword mode section about cancelling
and restarting the Recognition timer.

- Issue 88: Aligned text in 9.4.7 and 9.9 by doing the following:
-- removed text in 9.4.7 describing how the timer behaves, since that
is already described in section 9.9 where it belongs.  This
section should only describe how to set the header value.
-- removed first bullet point in hotword description in 9.9.  Added  
new paragraph there noting that the START-OF-INPUT event is not  
generated when speech/DTMF is detected.
======

These specific changes were made after much discussion in order to  
better align hotword mode in MRCP with hotword mode in VoiceXML.  The  
key in both cases is that hotword mode doesn't allow a no-match/reject  
to be returned, because any failure to match is treated as if the  
person has not spoken.
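
As a rough illustration of that last point, here is a minimal Python
sketch of how an utterance that fails to match might be handled in the
two modes. The names Mode and utterance_outcome are hypothetical and do
not come from the MRCP draft or from VoiceXML.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "normal"
    HOTWORD = "hotword"


def utterance_outcome(mode: Mode, matched: bool) -> Optional[str]:
    """Return a completion outcome, or None if recognition keeps running.

    Hypothetical helper for illustration only.
    """
    if matched:
        return "success"
    if mode is Mode.HOTWORD:
        # A failure to match is treated as if nobody had spoken: nothing
        # is returned and the recognizer keeps listening.
        return None
    return "no-match"
```

In this sketch only the timers can end a hotword recognition that never
matches; the unmatched utterance itself produces no result.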

-- dan

On Jan 9, 2009, at 1:43 AM, Joe Wong wrote:

> What was the resolution of Andrew's proposal (July 13) to remove 003
> hotword-maxtime completion cause code and use 015 no-match-maxtime
> instead?
>
> In http://www.ietf.org/internet-drafts/draft-ietf-speechsc-mrcpv2-17.txt
> section 9.4.11, I still see 003 hotword-maxtime.
>
> Thanks,
> Joe
>
> -----Original Message-----
> From: Andrew Wahbe [mailto:Andrew.Wahbe@genesyslab.com]
> Sent: Thursday, July 13, 2006 9:35 AM
> To: Dave Burke; IETF SPEECHSC (E-mail)
> Subject: RE: [speechsc] Hotword Recognition and Timers
>
> yup
>
> -----Original Message-----
> From: Dave Burke [mailto:david.burke@voxpilot.com]
> Sent: July 13, 2006 11:56 AM
> To: Andrew Wahbe; IETF SPEECHSC (E-mail)
> Subject: Re: [speechsc] Hotword Recognition and Timers
>
> That works for me. And just to clarify: this means that if the
> Recognition-Timer fired in hotword and there was a match, then 008
> success-maxtime would be returned - right?
>
> Dave
>
> ----- Original Message -----
> From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
> To: "Dave Burke" <david.burke@voxpilot.com>; "IETF SPEECHSC (E-mail)"
> <speechsc@ietf.org>
> Sent: Thursday, July 13, 2006 3:13 PM
> Subject: RE: [speechsc] Hotword Recognition and Timers
>
>
> One thing: if everyone is comfortable with these changes, then I wonder
> what the purpose of the 003 hotword-maxtime completion cause code is. It
> seems that throwing a 015 "no-match-maxtime" would not only work but
> also make the most sense as the rest of the hotword behavior (from the
> client's perspective) is more or less identical to the normal case.
>
> What is the rationale for having a 003 hotword-maxtime completion cause
> code at this point? If there is none, I would like to suggest that it be
> removed. We can "reserve" the numeric code for future use to avoid
> renumbering everything else if that is a concern.
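
The proposal above can be sketched as a small Python function that
picks the completion cause when the Recognition timer fires. This is
only an illustration: the function name and the use_proposal flag are
hypothetical, and the cause codes are the ones quoted in this thread
(the matched case follows the 008 success-maxtime reading given in the
replies).

```python
# Completion cause codes as quoted in this thread.
HOTWORD_MAXTIME = "003 hotword-maxtime"
SUCCESS_MAXTIME = "008 success-maxtime"
NO_MATCH_MAXTIME = "015 no-match-maxtime"


def cause_on_recognition_timeout(hotword: bool, matched: bool,
                                 use_proposal: bool = True) -> str:
    """Pick a completion cause when the Recognition timer fires.

    Hypothetical helper: use_proposal=True models the suggestion to
    drop 003 and report 015 in hotword mode as well.
    """
    if matched:
        return SUCCESS_MAXTIME
    if hotword and not use_proposal:
        return HOTWORD_MAXTIME
    return NO_MATCH_MAXTIME
```

With use_proposal=True the client sees the same cause in both modes,
which is the point of the suggestion.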
>
> Andrew
>
> -----Original Message-----
> From: Dave Burke [mailto:david.burke@voxpilot.com]
> Sent: July 2, 2006 5:41 PM
> To: Andrew Wahbe; IETF SPEECHSC (E-mail)
> Subject: Re: [speechsc] Hotword Recognition and Timers
>
> Andrew's proposals/clarifications make sense to me.
>
> One interesting result, however, is that Andrew's definition for
> Recognition-Timeout coincides with Hotword-Max-Duration except that the
> former terminates the recognition when it fires. I don't think this is
> necessarily a problem.
>
> It seems (if I understand this thread properly) that the VoiceXML world
> needs a maxspeechtimeout to terminate hotword but the MRCP protocol also
> might need a safety net to prevent a RECOGNIZE going IN-PROGRESS
> forever. For normal recognition, the Recognition-Timeout gets you both
> the safety net and the maxspeechtimeout. Since the MRCP client can STOP
> a recognition at any point this safety net is not crucial. In short -
> I'm fine with Andrew's suggested changes.
>
> Dave
>
> ----- Original Message -----
> From: "Andrew Wahbe" <Andrew.Wahbe@genesyslab.com>
> To: "IETF SPEECHSC (E-mail)" <speechsc@ietf.org>
> Sent: Monday, June 19, 2006 8:30 PM
> Subject: RE: [speechsc] Hotword Recognition and Timers
>
>
> The thing is that nowhere in your explanation are you mentioning the
> prompt and its completion (i.e. the START-INPUT-TIMERS message). The
> main use case and reason for hotword recognition/recognition-based
> barge-in is to prevent accidental barge-in on audio content such as a
> voicemail, TTS email, etc. The scenario you describe below requires that
> the client knows how long the content is when the RECOGNIZE is started;
> this is definitely not an assumption you can make. The client won't know
> how long it will take to TTS a chunk of text or how long the set of
> audio files (prompts) are or even if they end at all (it could be a
> continuous stream).
>
> My proposal is that hotword recognition should "work" in a similar
> manner to normal recognition from the client's perspective:
>
> * RECOGNIZE is sent with the start-input-timers header set to "false".
> The recognition-mode is set to "hotword". Prompt playback starts at this
> point as well.
> * START-INPUT-TIMERS is sent when the prompt completes. The
> no-input-timer starts at this point.
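
The two requests described above might look roughly like the output of
the Python sketch below. It assumes the MRCPv2 start-line layout
(version, message length, method, request-id) and "Name:value" header
lines. The helper build_mrcp_request, the channel identifier, the
grammar URI, and the timeout values of 5000 ms and 20000 ms are all
assumptions made for illustration.

```python
from typing import Mapping


def build_mrcp_request(method: str, request_id: int,
                       headers: Mapping[str, str], body: str = "") -> str:
    """Assemble a request whose declared length covers the whole message."""
    lines = [f"{name}:{value}" for name, value in headers.items()]
    payload = body.encode("utf-8")
    if payload:
        lines.append(f"Content-Length:{len(payload)}")
    rest = "\r\n".join(lines) + "\r\n\r\n" + body
    # The length field counts its own digits, so iterate until it is stable.
    length = 0
    while True:
        message = f"MRCP/2.0 {length} {method} {request_id}\r\n{rest}"
        actual = len(message.encode("utf-8"))
        if actual == length:
            return message
        length = actual


CHANNEL = "32AECB23433801@speechrecog"  # placeholder channel identifier

# Sent when prompt playback starts; the timers are held back.
recognize = build_mrcp_request("RECOGNIZE", 1, {
    "Channel-Identifier": CHANNEL,
    "Recognition-Mode": "hotword",
    "Start-Input-Timers": "false",
    "No-Input-Timeout": "5000",       # assumed value, in milliseconds
    "Recognition-Timeout": "20000",   # assumed value, in milliseconds
    "Content-Type": "text/uri-list",
}, body="http://example.com/commands.grxml\r\n")  # hypothetical grammar

# Sent when the prompt completes; the no-input timer starts here.
start_timers = build_mrcp_request("START-INPUT-TIMERS", 2,
                                  {"Channel-Identifier": CHANNEL})
```

Apart from the Recognition-Mode header, nothing in these two requests
differs from the normal recognition case.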
>
> The above two points are identical to the normal case except that the
> recognition-mode is "hotword". My proposal is that the general meaning
> of the recognition and no-input timers are also the same as the normal
> case. Namely:
>
> * The no-input timer is the max amount of time after the prompt
> completes that we are willing to wait for input. This is equivalent to
> the "timeout" property in VoiceXML. It is usually on the order of a few
> seconds.
> * The recognition timer is the max amount of time that we will run
> recognition on a single "utterance". This is basically a safety net
> protecting against noise (say the user left the phone off the hook next
> to the radio) keeping the recognizer occupied for an unreasonable amount
> of time. This only applies when speech is detected, since otherwise the
> no-input-timer will take effect (once the prompt is done) to terminate
> the recognition. This is equivalent to the "maxspeechtimeout" property
> in VoiceXML. This is usually quite a bit longer than the no-input
> timeout, say 10 to 30 seconds.
>
> Note that the definitions of timeout and maxspeechtimeout properties in
> VoiceXML apply to both normal and hotword recognition, which is part of
> the rationale for keeping the high-level meaning the same for both modes
> in MRCP. At the end of the day, the developer has to answer two
> questions regardless of what mode they are using:
> * How long after the end of a prompt do I want to wait for
> input? (no-input timeout)
> * How much continuous noise am I willing to process before aborting a
> recognition? (recognition timeout)
>
> What makes things a little complicated is that in hotword recognition:
> 1) the detection of speech does not mean that "input" was detected -- we
> don't have "input" until we have a match;
> 2) we can go from a state of processing speech/sound back to a state
> where there is silence and we are waiting for speech.
>
> The behaviors that were specified in the original email were an attempt
> to keep the same high-level meanings for the timers while taking into
> account the two points above. These special behaviors for hotword mode
> were:
> a) the no-input timer is not cancelled until there is a recognition
> result.
> b) the recognition timer is reset and turned off when an utterance that
> doesn't match anything "ends" as determined by the incomplete timeout
> firing. The recognition timer is re-enabled when subsequent speech is
> detected.
>
> Another behavior that the VoiceXML Forum MRCP Liaison Committee has
> discussed recently is as follows:
> c) if the no-input timer fires while speech is being processed, then the
> recognition will not be aborted until the recognizer makes a decision on
> that segment of speech (e.g. complete timeout, incomplete timeout,
> recognition timeout, or early no-match). A no-match on the utterance at
> this point would cause "no-input-timeout" to be returned for the
> recognition.
>
> This last behavior would prevent the no-input timeout from cutting off
> recognition in the middle of an utterance, which might happen if we
> followed (a) above.
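
Behaviors (a), (b) and (c) can be sketched as a small timer model in
Python. This is a simplified illustration under stated assumptions, not
the draft's wording: the class and method names are hypothetical, times
are plain seconds supplied by the caller, and the result strings are
informal labels rather than completion cause codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HotwordTimers:
    no_input_timeout: float
    recognition_timeout: float
    no_input_deadline: Optional[float] = None
    recognition_deadline: Optional[float] = None
    no_input_pending: bool = False
    result: Optional[str] = None

    def start_input_timers(self, now: float) -> None:
        # The prompt has completed: the no-input timer starts.
        self.no_input_deadline = now + self.no_input_timeout

    def speech_start(self, now: float) -> None:
        # Speech or noise detected: the recognition timer is (re)enabled.
        self.recognition_deadline = now + self.recognition_timeout

    def tick(self, now: float) -> None:
        if self.result is not None:
            return
        in_speech = self.recognition_deadline is not None
        if in_speech and now >= self.recognition_deadline:
            self.result = "recognition-timeout"
        elif self.no_input_deadline is not None and now >= self.no_input_deadline:
            if in_speech:
                self.no_input_pending = True  # (c) defer until a decision
            else:
                self.result = "no-input-timeout"

    def speech_end(self, now: float, matched: bool) -> None:
        self.tick(now)
        if self.result is not None:
            return
        self.recognition_deadline = None  # (b) reset and turned off
        if matched:
            self.no_input_deadline = None  # (a) cancelled only on a result
            self.result = "success"
        elif self.no_input_pending:
            self.result = "no-input-timeout"  # (c) no-match after expiry


# Unmatched speech during the prompt, then a match after it.
timers = HotwordTimers(no_input_timeout=5.0, recognition_timeout=20.0)
timers.speech_start(1.0)
timers.speech_end(3.0, matched=False)   # ignored; recognition continues
timers.start_input_timers(10.0)
timers.speech_start(12.0)
timers.speech_end(13.0, matched=True)
```

The model keeps only one deadline per timer, which is enough to show
how the no-input timer survives unmatched speech while the recognition
timer is switched off between utterances.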
>
> To address your use cases below:
>
> 1. If you say nothing, the no-input timer will eventually fire (at the
> specified number of milliseconds after the prompt is completed) and end
> the recognition.
>
> 2. If you say something unintelligible, the no-input timer is not
> stopped as that does not correspond to a recognition result in hotword.
> Note that the no-input timer may not even be enabled if the prompt is
> still playing. At the end of the unintelligible speech, the recognition
> timer is stopped and turned off. When you later say something
> intelligible, the recognition timer is turned back on while you are
> speaking. Assuming your speech was short, the recognition timer is
> turned back off when you are done speaking.  Since you now generated a
> match, the no-input timer is also cancelled (if the prompt had finished)
> and the result is returned.
>
> Thanks,
>
> Andrew Wahbe
>
> -----Original Message-----
> From: Saravanan Shanmugham (sarvi) [mailto:sarvi@cisco.com]
> Sent: June 16, 2006 2:46 PM
> To: Dan Burnett; IETF SPEECHSC (E-mail)
> Subject: RE: [speechsc] Hotword Recognition and Timers
>
>
> I can see that both No-Input-Timeout and Recognition-Timeout values will
> be useful for Hotword recognition.
> But saying that Recognition-Timer is started after speech is detected
> bothers me.
> Also, what do you expect typical values for these timers to be, based on
> your proposed definitions?
>
> Hotword recognition is very often used to issue commands.
> So let's take the following scenario and look at possible cases.
>
> When the system is reading out a long email, you should be able to issue
> commands like "speed up" or "slow down" or "repeat" etc.
>
> 1. But then I might never say any command at all. So defining
> Recognition-Timer as starting after speech is detected makes no sense in
> this case. No-Input-Timer, if defined to be applicable to Hotword
> recognition, might make sense in this case.
>
> 2. Then I might say something unintelligible in the middle, which should
> technically be ignored. And then a little later I might actually speak a
> command, "speed up". Here when I said something unintelligible, the
> No-Input-Timer would be stopped. If we went with the definition
> proposed, the Recognition-Timer would be started here.
>
> If you assume No-Input-Timer would be sufficiently large and
> Recognition-Timer will be relatively small, this means that once we say
> something not matching a hotword (which should technically be expected
> to be ignored), the RECOGNIZE would complete due to Recognition-Timeout.
>
> If we assume No-Input-Timer to be short and Recognition-Timer to be
> long, then we are requiring that the user MUST say something
> intelligible or unintelligible reasonably quickly, or the RECOGNIZE would
> terminate due to No-Input-Timeout.
>
> If we assume No-Input-Timer to be large and Recognition-Timer to be
> large as well, then depending on whether I say something unintelligible
> or not, the overall timeout could be pretty large, up to a max of
> No-Input-Timer + Recognition-Timer.
>
> The way I would expect this to work is that No-Input-Timer and
> Recognition-Timer are started at the beginning of a hotword RECOGNIZE and
> both are reasonably large values, the No-Input-Timer most likely being
> equal to or smaller than Recognition-Timer.
>
> Now, if I said nothing at all and the No-Input-Timer expired, the
> RECOGNIZE completes with no-input-timeout. The moment I say something,
> unintelligible or intelligible, the No-Input-Timer is stopped.
> Recognition-Timer continues on.  If the current speech or a future
> command matches a hotword grammar, the RECOGNIZE command completes
> with success.
> If nothing matches and the Recognition-Timer expires, the RECOGNIZE
> completes with recognition-timeout.
>
> This way for hotword, Recognition-Timer is the max recognition time for
> the RECOGNIZE, while No-Input-Timer would only be equal or smaller.
>
> Thx,
> Sarvi
>
>     -----Original Message-----
>     From: Dan Burnett [mailto:dan_burnett2000@yahoo.com]
>     Sent: Thursday, June 08, 2006 5:06 AM
>     To: IETF SPEECHSC (E-mail)
>     Subject: Re: [speechsc] Hotword Recognition and Timers
>
>     This email is a result of discussions by the MRCP subgroup
>     of the VoiceXML Forum, in which I participated, so I
>     already agree with the proposals given here.
>
>     However, I would like to hear comments from others before
>     applying these changes to the spec draft, preferably from
>     those who did not participate in the VoiceXML Forum discussions.
>
>     This has been added to the issue tracker
>     (http://www.softarmor.com/roundup/speechsc) as issue 88.
>
>     -- dan
>
>
>
>     --- Andrew Wahbe <awahbe@voicegenie.com> wrote:
>
>> The description of how timers (no-input and recognition) are used
>> during hotword recognition is inconsistent. In section 9.4.7, it is
>> stated that "For a hotword recognition mode, this timer is started when
>> the user begins speaking. Note that for Hotword mode recognition the
>> START-OF-INPUT event is not generated." However, section 9.9 states
>> that for the hotword case: "The Recognition-Timer gets started at the
>> beginning of RECOGNIZE."
>>
>> It seems that section 9.9 is incorrect (or at least is inconsistent
>> with VoiceXML).
>>
>> Section 9.9 omits any mention of the no-input timer for the hotword
>> mode recognition case; however, none of the sections that deal with
>> the no-input timer make a distinction between the hotword and
>> non-hotword cases. VoiceXML also does not make this distinction.
>> It would seem that section 9.9 should be changed to indicate that
>> no-input timers are started in the hotword case and that
>> no-input-timeout is a valid completion cause for a hotword recognition.
>>
>> A related question worth considering is if the recognition timer is
>> reset at any point, for example, on the detection of silence. Consider
>> the case when maxspeech has a value of say 20 seconds (a
>> typical/reasonable value) and hotword barge-in is being used on a
>> prompt that is 30 seconds long. This would mean that a user that spoke
>> briefly 2 seconds into the prompt (and was silent for the remainder of
>> the prompt) would experience a maxspeech timeout at about 22 seconds
>> into the prompt. They would not hear the whole prompt which seems
>> inappropriate. The reason for maxspeech timeout is to catch continuous
>> noise and keep it from occupying a recognizer; but what should happen
>> in periods of silence in the hotword case?
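
The arithmetic in this example can be checked with a few lines of
Python. The function name is hypothetical, and the 1-second utterance
length is an assumption standing in for "spoke briefly".

```python
from typing import Optional


def recognition_timer_expiry(speech_start: float, speech_end: float,
                             maxspeech: float,
                             reset_on_silence: bool) -> Optional[float]:
    """Return when the recognition timer fires, or None if it never does."""
    if reset_on_silence and speech_end - speech_start < maxspeech:
        # The utterance ended before the limit, so the timer was reset.
        return None
    return speech_start + maxspeech


PROMPT_LENGTH = 30.0  # seconds, from the example above

# User speaks from 2 s to 3 s into the prompt; maxspeech is 20 s.
without_reset = recognition_timer_expiry(2.0, 3.0, 20.0, reset_on_silence=False)
with_reset = recognition_timer_expiry(2.0, 3.0, 20.0, reset_on_silence=True)

# Without a reset the timer fires before the prompt has finished.
prompt_cut_off = without_reset is not None and without_reset < PROMPT_LENGTH
```

Without a reset the timer fires 22 seconds into the 30-second prompt;
with a reset on silence it never fires for this utterance.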
>>
>> Similarly, when is the no-input timer canceled in the hotword case? Is
>> it when speech (not necessarily matching) is detected? Or is it only
>> upon a match?
>>
>> The correct behavior in my opinion is that the no-input timer is
>> canceled only on a match, and that the recognition timer should be
>> reset if silence (determined by complete timeout and incomplete
>> timeout) is detected. If we are just processing intermittent noise,
>> the no-input timer will eventually expire. Continuous noise is handled
>> by the recognition timer. Of course there are other possibilities as
>> well; this is just one option that I think fits with VoiceXML.
>>
>>> _______________________________________________
>> Speechsc mailing list
>> Speechsc@ietf.org
>> https://www1.ietf.org/mailman/listinfo/speechsc
>>
>