Well, one thing is clear: stopping the system manually causes all sorts of insanity in the callbacks. But let's take a closer look at the documentation for stopListening(), mentioned above, by focusing on this part:
Note that in the default case, this does not need to be called, as the speech endpointer will automatically stop the recognizer listening when it determines speech has completed. However, you can manipulate endpointer parameters directly using the intent extras defined in RecognizerIntent, in which case you may sometimes want to manually call this method to stop listening sooner.
The endpointer seems to be the part of the system that points to the end of the speech; when it determines the end has been reached, it stops the system (internally). stopListening() can end it prematurely, but that caused us some issues. The documentation then mentions a third way of stopping, via an extra in the Intent that started the process. Let's take a look at that.
RecognizerIntent is a set of constants to be used in the Intent when SpeechRecognizer is told to start listening. In the examples above, it is passed just ACTION_RECOGNIZE_SPEECH, which tells the system to just recognize the speech, as opposed to ACTION_WEB_SEARCH, for example.
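The constants that are relevant to our quest here are EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS, and EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS.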
The first one seems to be exactly what we want, but it comes with a warning:
The amount of time that it should take after we stop hearing speech to consider the input complete. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously!
We've seen undesired and unexpected results with stop. I wonder if this one is the same.
The second constant is about the minimum to listen, not exactly what we want, unless we need to shorten it to make things work. It comes with a similar warning though:
The minimum length of an utterance. We will not stop recording before this amount of time. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously!
The third constant wants to keep the system listening, by setting a minimum for pauses, and not mistakenly recognize them as the end of speech. It too has the same warning:
The amount of time that it should take after we stop hearing speech to consider the input possibly complete. This is used to prevent the endpointer cutting off during very short mid-speech pauses. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously!
And now that we know how dangerous it is, i'd say it's time to get started.
For this test, the code is rolled back from the last change, keeping the first two code blocks, the original program and the expanded onResults(), plus the expanded log1() and its supporting variable, so timing is easier to see. But that's not all.
Ideally, we would time the end of the speech until the end of the listening, which can be done by timing the start-beep to the end of talking, and from there until the end-beep. The beeps are seemingly an accurate sign of a start and a stop. But that would be prone to error, being hard to match that up with the logs in the code, so we'll skip that. Instead, a second thing to test would be to say the word at the same speed each time (as timing is all-important here). That can be controlled via the TextToSpeech system. That is, we're going to have Android talk to itself. (The speed of the talking is actually controlled by the user, for me in Android 5, it is under Settings->Language and input->Text-to-speech options->Speech rate, which i now have set to Normal.) The system should take the same amount of time each time to say the same word(s), though, there does not seem to be a way to know when it is done speaking. TextToSpeech.isSpeaking() seems to be misnamed, as it only checks if speech is still being queued.
Through a little testing not shown here, i found the TextToSpeech system engages faster than the SpeechToText system. As such, we really need to wait a moment, like one-quarter to one-half a second, before starting to talk. Thread.sleep() will lock up the program preventing the listener from starting, defeating the entire purpose of the sleep. Therefore, a second thread is employed to both sleep and speak.
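The sleep-then-speak idea can be sketched in plain Java so it runs off-device. This is only a sketch under assumptions: speakAfterDelay() is a hypothetical helper name, and the Runnable stands in for the actual tts.speak() call, which is not reproduced here. The point it shows is that the sleep blocks only the worker thread, so the caller returns immediately and is free to start the listener.

```java
public class DelayedSpeak {
    /**
     * Starts a worker thread that waits delayMillis and then runs speakAction.
     * Hypothetical helper; speakAction stands in for the real tts.speak() call.
     */
    static Thread speakAfterDelay(final long delayMillis, final Runnable speakAction) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(delayMillis); // blocks only this worker thread
                speakAction.run();
            } catch (InterruptedException e) {
                // interrupted before speaking: restore the flag and give up
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // 300 ms sits inside the quarter-to-half-second window mentioned above
        Thread worker = speakAfterDelay(300, () -> System.out.println("speak: test"));
        long callerFreeMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("caller free after " + callerFreeMs + " ms");
        worker.join(); // only so this demo does not exit before the worker speaks
    }
}
```

On Android the caller would be the UI thread, which would go on to start the recognizer right after speakAfterDelay() returns.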
Speaking of speak(), the original TextToSpeech.speak() was deprecated in API 21 in favor of a newer version, and Studio will warn about that, so i changed build.gradle (Module: app) minSdkVersion to 21 (targetSdkVersion is 23) and resynced when it asked me to.
Note also, the two systems do not "talk" to each other. The phone will not magically hear what is said. The microphone will have to hear the speaker to register the words, so the volume must be loud enough. So test that and check the logs for feedback. (Okay, you come up with a better joke...)
Here we are trying to have one word recognized.
This wasn't the only test i did, but the results are about the same. The first word causes the begin, that is, the ready happens, the word is said, and the begin starts, and then it just hangs. It does not time out or anything, even when left to run for many minutes. In this example, i decided to test it some seconds (30?) later to see if it actually was still listening. I said the word in a near whisper by chance, and when nothing happened decided it was likely too low to be registered, so i said it louder, and was pleasantly surprised that it caught both.
To make the case, i'll try "code ranch," which only adds one syllable to the original test, and also "test" and "testing."
Again, one syllable is not recognized but two are, including the one that triggered the begin. Almost as magical as a Star Trek communicator doing translations on the fly. Note that "testing" worked fine, yet a real world example didn't work as expected.
I want to try that once more, because testing is two syllables in the same word. What about different words? "code ranch" might be a problem because the words are not normally related, depriving the recognizer of context. How about something simple like, "one two":
I stand corrected. Perhaps it's just temperamental, or based on how well it thinks it understood itself. That is, if it is certain about the word, it is good on its own, otherwise, it requires another word for context. Let's try a few more examples:
It's hard to say when it will or will not recognize what i said, but it seems that one word has less of a chance and two words have more. And in some cases, such as with "code," "who," or even "who what," it doesn't even stop listening, ostensibly ignoring the timeout altogether. In any case, we see that the system takes a while to time out, usually a little over five seconds. That allows for short pauses, which is a good thing for normal speech, as people pause when they are talking. But what if you are trying to record single words? That's an eternity.
One thing that might be a problem is isSpeaking(), which simply checks the queue. That does not seem to be terribly accurate. I used it based on its name, though upon reflection, the documentation makes it clear what it is. That doesn't excuse the naming though. Anyway, we'll add UtteranceProgressListener, which provides callbacks from the engine itself as to when the speaking started and finished, and if there were any errors. It is an abstract class, not an interface like TextToSpeech.OnUtteranceCompletedListener, which was deprecated in API 18. To help it stand out in the log, "(upl)" is prefixed in its log message.
speak() allows for certain parameters and also a string for the utterance id. We don't seem to need to pass any parameters, but the id is passed to the upl methods, so we might as well use it. For convenience, we'll just use the words to be spoken as the id. And just for fun, change QUEUE_FLUSH to QUEUE_ADD, not that it should matter here.
To do this, in onCreate(), right after tts is instantiated, we'll add the upl method:
speak(), with its simple changes:
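The shape of that arrangement can be sketched in plain Java so it runs without a device. Everything here is a stand-in and an assumption, not the program's actual code: ProgressListener only mirrors the three callbacks of Android's UtteranceProgressListener (onStart, onDone, onError, each receiving the utterance id), and FakeTts is a hypothetical engine that fires them synchronously. On Android the listener would be registered with tts.setOnUtteranceProgressListener(), and the id would be the last argument of the API 21 speak(text, queueMode, params, utteranceId).

```java
import java.util.ArrayList;
import java.util.List;

public class UplSketch {
    /** Stand-in mirroring the callbacks of Android's UtteranceProgressListener. */
    abstract static class ProgressListener {
        abstract void onStart(String utteranceId);
        abstract void onDone(String utteranceId);
        abstract void onError(String utteranceId);
    }

    /** Hypothetical fake engine: reports start and done synchronously. */
    static class FakeTts {
        private ProgressListener listener;

        void setListener(ProgressListener l) { listener = l; }

        void speak(String text, String utteranceId) {
            if (listener == null) return;
            listener.onStart(utteranceId);
            // a real engine would produce the audio here, on its own thread
            listener.onDone(utteranceId);
        }
    }

    /** Builds a listener that records "(upl)"-prefixed messages into the log. */
    static ProgressListener loggingListener(final List<String> log) {
        return new ProgressListener() {
            @Override void onStart(String id) { log.add("(upl) start: " + id); }
            @Override void onDone(String id)  { log.add("(upl) done: " + id); }
            @Override void onError(String id) { log.add("(upl) error: " + id); }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        FakeTts tts = new FakeTts();
        tts.setListener(loggingListener(log));
        String words = "code ranch";
        tts.speak(words, words); // the spoken words double as the utterance id
        for (String line : log) System.out.println(line);
    }
}
```

Because the id comes back in every callback, using the spoken words as the id means each "(upl)" log line says which utterance it belongs to.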
I had done some testing first, and the results were interesting, especially as things seemed to be logged out of order on the first run. So after a little testing, i rebooted the phone to run the following on a freshly started engine, to see if it is repeatable. Perhaps it is worthwhile to keep in mind that, as we have seen earlier, first runs can produce unexpected results.
| # | start | no match | ready | speak | upl start | begin | upl done | queued | end | results | words |
The first run was out of order, though here it may just be a delay to allow the system to start the first time around. The no match makes no sense, as we have seen before. But it is clear that after the ready, when the voice starts speaking, begin is triggered and the end happens about 6 seconds after the voice is done speaking. So now we can try to lessen that time, specifically because we want to test just a word or two, or, more importantly, not a sentence replete with pauses that have to be waited for.
Of the 3 options mentioned above, only 1 or 2 would seem to be related to our goals. However, it is possible the other option(s) will hinder us, if honoring whatever one option is set to overrides the other. For example, we might tell it to make any quarter-second pause an end of speech, but if the minimum listening time is 3 seconds, it won't even bother with those pauses. And since we do not know what the defaults are, it is perhaps best to set all the options to make them known, by bringing them under our control. Perhaps related, a little searching found people complaining about Android not honoring these options. Let's see what we can accomplish.
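The quarter-second versus 3-second worry can be put into numbers with a toy model. This is purely an assumption for illustration: stopTimeMillis() is a hypothetical function, and nothing here claims to be how Android's endpointer actually decides. It simply supposes that listening ends at the later of the minimum utterance length and the end of speech plus the required trailing silence.

```java
public class EndpointModel {
    /**
     * Toy model (an assumption, not Android's real logic): listening stops at
     * the later of (a) the minimum utterance length and (b) the end of speech
     * plus the required trailing silence. All times are milliseconds measured
     * from the start of listening.
     */
    static long stopTimeMillis(long speechEndMillis,
                               long completeSilenceMillis,
                               long minimumLengthMillis) {
        long silenceSatisfiedAt = speechEndMillis + completeSilenceMillis;
        return Math.max(minimumLengthMillis, silenceSatisfiedAt);
    }

    public static void main(String[] args) {
        // A word that ends 500 ms in, with a quarter-second silence threshold
        // and no minimum length, would stop at 750 ms.
        System.out.println(stopTimeMillis(500, 250, 0));    // prints 750
        // With a 3-second minimum length, the short pause no longer matters.
        System.out.println(stopTimeMillis(500, 250, 3000)); // prints 3000
    }
}
```

Under that model a short silence setting is wasted unless the minimum length is lowered too, which is the reason for setting every option rather than just one.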
The options are passed as extras in an Intent, so we'll add it in onCreate(), just under the option that is already there:
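As a rough sketch of what gets set, here are the three extras collected in a plain Java map so the snippet runs off-device. The assumptions: endpointerExtras() is a hypothetical helper, the integer value type is a guess (the quoted documentation does not state one), and 1000 ms is just one of the values tried. The key strings are the values of the corresponding RecognizerIntent constants; on Android each entry would be an intent.putExtra(RecognizerIntent.EXTRA_..., value) call instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EndpointerExtras {
    // String values of the three RecognizerIntent endpointer constants.
    static final String COMPLETE_SILENCE =
            "android.speech.extras.SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS";
    static final String MINIMUM_LENGTH =
            "android.speech.extras.SPEECH_INPUT_MINIMUM_LENGTH_MILLIS";
    static final String POSSIBLY_COMPLETE_SILENCE =
            "android.speech.extras.SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS";

    /**
     * Hypothetical helper: gathers all three settings so none is left at an
     * unknown default. On Android, each put() would be intent.putExtra(key, value).
     */
    static Map<String, Integer> endpointerExtras(int completeSilenceMs,
                                                 int minimumLengthMs,
                                                 int possiblyCompleteMs) {
        Map<String, Integer> extras = new LinkedHashMap<>();
        extras.put(COMPLETE_SILENCE, completeSilenceMs);
        extras.put(MINIMUM_LENGTH, minimumLengthMs);
        extras.put(POSSIBLY_COMPLETE_SILENCE, possiblyCompleteMs);
        return extras;
    }

    public static void main(String[] args) {
        // 1000 ms for each, one of the values tried in this test
        for (Map.Entry<String, Integer> e : endpointerExtras(1000, 1000, 1000).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```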
I have set the values and tried 100, 1000, and 10000, or just minimum length alone at 10k, and the end comes in about 6 seconds later no matter what. It seems the system simply does not honor these values, regardless of whether they are more than or less than the default values.
Android has issues 16638, the former has been marked obsolete, but the problem persists.
So, it would seem that short recognition is simply not supported.