
Code Sample: RecognitionListener (Speech to text) sample with logs

 
Brian Tkatch
Bartender
Posts: 567
25
Linux Notepad Oracle
I've been having some issues with a project i am working on regarding the Speech Recognizer. The issues have to do with turning it on and off and running the Text to Speech engine at the same time. In order to test if the issue was my code or if the system did not support what i wanted to do, i created a simple example app that just logged the actions without really doing anything. This one is just the Speech Recognizer and doesn't even display the results. It's just to show the logs and how it works.

Notes: onRmsChanged() (RMS) is not reliable (it is not always called in ICS+), and when it does get called, it outputs a wealth of information, making the log nearly impossible to read. So, it is logged under its own tag and can be filtered separately. onBufferReceived() is not called in ICS+, so don't expect to see it either.

1. A very simple UI, with just two buttons that call two methods:

2. The code:

3. The speech recognizer requires the record audio permission (android.permission.RECORD_AUDIO), which needs to be added to AndroidManifest.xml as a uses-permission element (in manifest, but outside application). On Marshmallow (API 23) and beyond, an app targeting API 23+ also needs the runtime permission request in the code itself. Anyway, leave this out or comment it out at first to see the log report the missing permission as an error and simply not run.

To see the log, filter on "moo". To see the RMS log, filter on "cow". But note, it may not show anything at all.
 
Brian Tkatch
Now for some testing. Removing the record audio permission will block the service from working:
Note that even trying to stop it generates the permissions error. Which is odd, as stopping it doesn't want to start recording, but wants to stop it. Even so, in this test nothing was started to begin with, so the stop shouldn't be attempting to do anything at all. Most likely, it's just a default error when the permission does not exist.

Adding the permission back in is obviously required. So, let's add it in and see the process cycle.

At first glance, i would think the (basic) cycle should be ready->beginning->end->results. That is, the system readies itself, the speech begins and ends, and the results are shown. We're not trying to retrieve results in this simple example though. Nonetheless, when does end happen? By default, when the user stops talking, but it can also be stopped manually. If the user stops talking, it should time out, making the order: ready->beginning->error(timeout)->end->results. Let's see what really happens. For this example, only the start button will be clicked:
The system readied itself and then just timed out as nothing was said. That's as expected. Now to test it again, but say a word.
This is interesting. The system readied itself twice! One just .11 seconds after the other. That can cause issues if there is code in that method.

To see if it happens again, i did the same exact test.
This one only has one ready, but look at the no-match error before it. That's got to be a relic from the last run. But why does it show up now? Also, the error method would need to watch out for this, should any decisions be based on what error occurs. The error might be from the last run!

Now to run again, but not say anything:
The error before ready shows up again, but this time it says no match also at the end, even though i did not say anything. Weird. Maybe it picked up something in the background. So, i tried again (5,6) and it seemed to work better. Note that on try 6, again not saying anything after not saying anything in 5 causes no error for no match, but the ready happens twice again.

Reran the code. Let's time out a few times without saying anything, then a few times saying things (i'm using the word, "testing," creative as i am), then again not saying anything. 3 of each, just for fun:
Here it can be seen that before it readies itself, there is another ready or an error, based on whether nothing or something was said the last time, respectively. 1-4 had nothing said in a prior run. Conversely, 5-7 show the error for the word spoken in 4-6. 8-9 return to the erstwhile behavior of readying twice.

But what happened the first time (top of the post)? There's only one ready. The only difference is, the system had not been engaged yet. Perhaps, the double ready only happens when the system has been previously run (and is in the background?) To test this theory, i rebooted the phone and ran again without saying anything. Sure enough, there was only one ready. I rebooted and ran again, this time saying something, and there was no speech timeout before the ready either. This means, perhaps, the double ready or error only happens on subsequent runs.

All in all, simply running the listener and letting it timeout has some unexpected results. Ready can be called twice, and a no match error can be generated out of place. The latter can be coded around by making sure the error is generated between a ready and an end or timeout. The former would be handled in much the same way, by ignoring it if ready was the last event. Otherwise, code run in these methods might end up doing the wrong thing!
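
The workaround described above (ignore a ready that follows a ready, and only trust an error that arrives after a ready) can be sketched as a small state machine. This is a plain Java sketch under stated assumptions, not code from the test app: CallbackGate, its Event values, and accept() are hypothetical names, and the listener callbacks would have to feed it by hand. Note the trade-off: an error that arrives before any ready is dropped, and that includes genuine ones such as the missing permission error.

```java
// Hypothetical helper, not part of the Android API. It tracks where we are in
// one recognition cycle and reports whether a callback belongs to that cycle.
final class CallbackGate {
    enum Event { START, READY, BEGIN, END, ERROR, RESULTS }

    private enum State { IDLE, STARTED, READY, SPEAKING, ENDED }

    private State state = State.IDLE;

    /** Returns true if the event fits the current cycle and should be acted on. */
    boolean accept(Event event) {
        switch (event) {
            case START:
                state = State.STARTED;
                return true;
            case READY:
                // A ready that does not directly follow a start is the duplicate.
                if (state != State.STARTED) return false;
                state = State.READY;
                return true;
            case BEGIN:
                if (state != State.READY) return false;
                state = State.SPEAKING;
                return true;
            case END:
                if (state != State.SPEAKING) return false;
                state = State.ENDED;
                return true;
            case ERROR:
                // An error while idle, or between start and ready, is assumed
                // to be a leftover from the previous run.
                if (state == State.IDLE || state == State.STARTED) return false;
                state = State.IDLE;
                return true;
            case RESULTS:
                if (state != State.ENDED) return false;
                state = State.IDLE;
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        CallbackGate gate = new CallbackGate();
        Event[] log = { Event.START, Event.ERROR, Event.READY, Event.READY, Event.ERROR };
        for (Event e : log) {
            System.out.println(e + " -> " + (gate.accept(e) ? "handle" : "ignore"));
        }
    }
}
```

In the demo sequence, the error right after start and the second ready are ignored, while the error after the ready (the timeout case) is handled.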
 
Brian Tkatch
Now to test the stop button.

First, let's run it all on its own to see what it does:
It does absolutely nothing, since nothing was started. Similarly, if a start was left to time out, a stop does nothing:
The double ready appears because i have not rebooted since the last post. The timeout should stop everything, but the stop now gives me a client error, ostensibly because the system is not running. But why didn't it do that the first time? Ah, but it did. Before the double ready but after the start, the client error was reported. These errors are a strange beast.

Let's try to tackle that by stopping twice, starting with a timeout, and stopping twice.
If there are any stops before a start, the stops show no error. But after a subsequent start, the errors are generated. Conversely, for stops that come after a start, the stops themselves generate the error immediately. And the subsequent start can do the double ready without error.

Let's try this once more, with ten stops and one start, to see if the errors are queued for the start:
Forsooth! The daemon laughs at us, verily.

The start did give us the double ready though. That first ready-sometimes-error seems to be reserved for no-match.

Now to test stop before the timeout. How well does it stop the engine, and when exactly is it processed. First, a simple start/stop/start test, without saying anything:
The first start gave us the double ready, and stop generated a timeout. So, stopping makes the engine stop listening, and since nothing was said, a timeout is generated. Not exactly what i expected, but not too bad. The second start doesn't seem to do anything: No readys at all. But the stop still generates a timeout. That seems odd. The third then works as expected.

This time, we'll say something, but again, start/stop/start:
Even though the app was restarted, the engine remembered the no match error from last time. The word was spoken after the start, then it was stopped, and this time it generated an error client. It seems stop always generates an error client, even when it is a legitimate stop, unless it comes right after a timeout, in which case it just generates that timeout. It then goes to end, which makes sense. The second start times out, and generates the no match from the prior run.

What about start/stop in rapid succession? Let's do 5 rounds:
Rapid succession does not seem to be a problem. Even though we saw that happen in the third listing above. But i can't seem to reproduce it.

On second thought, i can reproduce it. If i hit the stop button .15 seconds or less from the start, the start does nothing. Here's a couple runs i tried to determine and test this theory:
I was afraid the first start would always generate a ready, so i tried again:
.15 seconds is pretty short, but it's long enough to mess up some "quick" testing.
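
One way to keep a too-early stop from swallowing a start is to hold the stop back until a minimum gap has passed. The following is a minimal plain Java sketch of that idea. StopGuard and its methods are hypothetical names, the 150 ms figure is only the value observed on this one phone, and whether delaying the stop actually avoids the problem has not been tested here.

```java
// Hypothetical helper, not part of the Android API.
final class StopGuard {
    private final long minGapMillis;
    private long startedAtMillis = -1; // -1 means start was never recorded

    StopGuard(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** Records the moment start was issued. */
    void onStart(long nowMillis) {
        startedAtMillis = nowMillis;
    }

    /** How long the caller should still wait before issuing stop; 0 means stop right away. */
    long delayBeforeStop(long nowMillis) {
        if (startedAtMillis < 0) return 0;
        long elapsed = nowMillis - startedAtMillis;
        return Math.max(0L, minGapMillis - elapsed);
    }

    public static void main(String[] args) {
        StopGuard guard = new StopGuard(150);
        long t0 = System.currentTimeMillis();
        guard.onStart(t0);
        System.out.println("wait " + guard.delayBeforeStop(t0 + 100) + " ms before stopping");
    }
}
```

The clock value is passed in rather than read inside the class, so the guard can be exercised without a device. In an app, the stop button handler would ask the guard for the remaining delay and postpone stopListening() by that much.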
 
Brian Tkatch
So, we've seen how start works when left to time out, with and without words. We then saw how stop works, first after a timeout, and then after a start. But how does a stop work when words are spoken, but before the timeout? Such as when the user wants to quickly record a note and then stop it manually, so as not to pick up anything else. Let's test that.

First, a simple test. Start, say a word, and stop when finished speaking the word:
I don't think i rebooted the phone since the run yesterday, so i expected a double ready, but there's only one. Perhaps because the system was unloaded? Anyway, the stop occurred before the end event, which would happen if the speech timed out (and something had been said), and the results were called even though i hit stop. So, there does not seem to be an issue with stopping it manually.

Let's try again, to see how a subsequent run works.
This is interesting. It generated an initial no-match error even though there were results! I would have expected the double ready. And the stop caused a client error too, which is usually generated when a stop is issued when the system is not running.

Let's try again, a few times in a row to see if that continues:
Something is definitely going on here. The third run is just like the prior run, except there are no results. Well, not right away at least. Perhaps we need to investigate how long it takes to produce results.

In the first case, the system was ready .25 seconds after it was told to start. I began to speak .43 seconds later, hit stop .62 seconds after that, with the system denoting the end .03 seconds after that. The results came in .14 seconds after that, and it was a few minutes before i started the second trial. Here's a table for comparison.

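Columns, in order: #, start, error no match found, error recognizer busy, onReadyForSpeech, onBeginningOfSpeech, onResults, stop, error client, onEndOfSpeech, onResults. Only the non-empty cells of each run are listed, left to right.
1, .25, .43, .62, .03, .14
2, .07, .14, .59, .72, .01, 1.28, .86
3, .11, .07, .58, .83, .01
4, .89, .09, *.03/.38, .48, .03
5, .61, .01, .23, .82, .01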

Note that run #4 has two begins: one before the ready, which we can ask a cesium atom to explain, and another after the ready, which is the normal case. Though, as the first of the double ready we have seen can sometimes be an error, maybe it can also be a begin.

There is another option though, as run #5 has no begin or end. Perhaps the callbacks were out of order, with #4 seeing the begin as the system was confused. But that does not seem to be the case, as #5 errors with recognizer busy, meaning #5 is not live, which would also explain the client error after the stop, which happens when the system is not running. #5's results, therefore, come from #4, and would explain why the system was still busy. It hadn't finished processing yet!

But i will suggest yet another solution, which may elucidate the entire log. Forget #4, perhaps run #3 was the one that never finished, and caused issues for both #4 and #5. Note that #3 does not have an end; that is contained in #4. Nor does it have results; those are in #5. The first begin in #4 was not a glitch in the matrix, but the equivalent of a recognizer busy error or the end of #3 as a new start was already issued. (One can imagine a flag inside their code causing begin to be called instead of end, because a new start was issued.) The begin and end of #4 would actually belong to #4, and the recognizer busy error of #5 is because #3 hadn't finished yet, or, allowing one to queue, the queue is full. The results would then be for #3. This would be so much easier if the recognizer allowed for an id to be passed back and forth.

Well, this is testable, of course, by using different words in each run and checking the results. But for that, onResults needs to be modified to show those results:
For the test, there will be 5 runs: "does," "anyone," "care," "about this thread," and "just say something, anything".
I would say this completely disproves my theory.

So, perhaps it has to do with the time involved. That is, maybe a minimum of time needs to pass after a start or stop for the system to function as intended.

To test this theory, we'll add code to automatically stop the listener, and lessen the time waited between iterations. While we're at it, we'll also add a timing element to log1(), as a convenience for this review.
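
A timing element of that kind might look like the sketch below, which prefixes each message with the seconds elapsed since the previous message, matching the per-event deltas quoted in this thread. It is plain Java under assumptions: ElapsedLog and stamp() are hypothetical names, the thread's actual log1() is not reproduced here, and on Android the returned string would presumably be handed to the logger under the "moo" tag.

```java
import java.util.Locale;

// Hypothetical helper: formats a log line with the time since the last line.
final class ElapsedLog {
    private long lastMillis = -1; // -1 until the first message is stamped

    /** Prefixes the message with the seconds elapsed since the previous message. */
    String stamp(String message, long nowMillis) {
        long delta = (lastMillis < 0) ? 0 : nowMillis - lastMillis;
        lastMillis = nowMillis;
        return String.format(Locale.ROOT, "%6.2f %s", delta / 1000.0, message);
    }

    public static void main(String[] args) {
        ElapsedLog log = new ElapsedLog();
        System.out.println(log.stamp("start", System.currentTimeMillis()));
        System.out.println(log.stamp("onReadyForSpeech", System.currentTimeMillis()));
    }
}
```

Taking the clock value as a parameter keeps the helper testable. The first message always shows 0.00 because there is nothing to measure against yet.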

In order to automatically stop the listener, we'll use a second thread to do the waiting, so the main thread is not blocked (or left actively waiting) after calling the listener. To be thread safe, the thread cannot stop the listener itself, and instead must call a handler in the main thread, which stops the listener and then starts the process over again, recursively. (Android Studio produces a warning about the Handler, something i will have to learn about another time.)
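
The shape of that loop can be sketched in plain Java, without any Android classes. Everything here is an assumption made for illustration: AutoStopLoop, Recognizer, and RecordingRecognizer are hypothetical names, a single-thread executor stands in for the main thread, and handing a task to that executor stands in for the Handler message. It is not the listing from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: start, sleep on a second thread, stop on the "main"
// thread, then repeat with a shorter sleep until the sleep would go negative.
final class AutoStopLoop {
    interface Recognizer {
        void start();
        void stop();
    }

    /** Stand-in recognizer that only records the calls it receives. */
    static final class RecordingRecognizer implements Recognizer {
        final List<String> calls = new ArrayList<>();
        @Override public void start() { calls.add("start"); }
        @Override public void stop() { calls.add("stop"); }
    }

    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    private final CountDownLatch finished = new CountDownLatch(1);
    private final Recognizer recognizer;

    AutoStopLoop(Recognizer recognizer) {
        this.recognizer = recognizer;
    }

    /** Runs all cycles and waits for them; returns false on timeout or interrupt. */
    boolean runAndWait(long firstSleepMillis, long stepMillis, long timeoutMillis) {
        if (stepMillis <= 0) throw new IllegalArgumentException("step must be positive");
        mainThread.execute(() -> cycle(firstSleepMillis, stepMillis));
        try {
            return finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            mainThread.shutdown();
        }
    }

    // Always runs on the stand-in main thread.
    private void cycle(long sleepMillis, long stepMillis) {
        recognizer.start();
        Thread waiter = new Thread(() -> {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            try {
                // Hand control back to the main thread instead of stopping here.
                mainThread.execute(() -> {
                    recognizer.stop();
                    long next = sleepMillis - stepMillis;
                    if (next >= 0) {
                        cycle(next, stepMillis);
                    } else {
                        finished.countDown();
                    }
                });
            } catch (RejectedExecutionException ignored) {
                // The loop was abandoned after a timeout; nothing left to do.
            }
        });
        waiter.setDaemon(true);
        waiter.start();
    }

    public static void main(String[] args) {
        RecordingRecognizer recognizer = new RecordingRecognizer();
        boolean completed = new AutoStopLoop(recognizer).runAndWait(20, 10, 5000);
        System.out.println(completed + " " + recognizer.calls);
    }
}
```

With a first sleep of 20 ms and a step of 10 ms the demo runs three cycles (20, 10, and 0 ms). Every start and stop happens on the same thread, which is the point of routing the stop through the main thread rather than calling it from the sleeping thread.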

Here is the new code listing:
No words were spoken during the test, as just timing is being tested. This produces the following results:
That looks strange, so, just to make sure, we'll run it again. I rebooted the phone first.
To be really sure though, maybe the time was too short, one more time but starting with 9 seconds:
We'll put those into tables to aid comparison. (Let's hope i did that without error.) The sleep column represents how long the sleep was, the start column will have the time element from sleep (which grabbed it because its log1() came right before it):
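Columns, in order: #, sleep, start, error, error, ready, begin, end, error, stop, error. Only the non-empty cells of each run are listed, left to right.
01, 5000, 8.12, 5.04
02, 4000, 5.00, 4.02
03, 3000, 4.00, 3.02
04, 2000, 3.00, busy, busy, 0.00, 0.00, 0.00, no match, 2.00
05, 1000, 2.00, client, no match, 0.00, 0.00, 0.00, no match, 1.00
06, 0900, 1.00, client, no match, 0.00, 0.00, 0.00, no match, 0.91
07, 0800, 0.90, client, no match, 0.00, 0.80
08, 0700, 0.80, client, ready, 0.00, 0.71
09, 0600, 0.70, busy, 0.00, no match, 0.61
10, 0500, 0.60, client, no match, 0.01, 0.00, 0.51
11, 0400, 0.50, client, begin, 0.00, 0.02, 0.42
12, 0300, 0.40, end, busy, no match, 0.32
13, 0200, 0.30, client, no match, 0.00, 0.20
14, 0100, 0.20, client, ready, 0.00, 0.10
15, 0000, 0.10, busy, timeout, client, 0.01, timeout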

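Columns, in order: #, sleep, start, error, error, ready, begin, end, error, stop, error. Only the non-empty cells of each run are listed, left to right.
01, 5000, 31.17, 5.02
02, 4000, 5.00, 4.01
03, 3000, 4.00, busy, ready, timeout, 3.00
04, 2000, 3.00, client, ready, 0.00, timeout, 2.01
05, 1000, 2.00, client, ready, 0.00, 0.00, 0.00, no match, 1.00
06, 0900, 1.00, client, no match, 0.00, 0.90
07, 0800, 0.90, client, ready, 0.00, 0.00, 0.00, no match, 0.80
08, 0700, 0.80, client, no match, 0.00, 0.70
09, 0600, 0.70, client, ready, 0.00, 0.00, 0.60
10, 0500, 0.60, busy, 0.00, no match, 0.50
11, 0400, 0.50, client, no match, 0.00, 0.40
12, 0300, 0.40, client, ready, 0.00, 0.00, 0.30
13, 0200, 0.30, busy, 0.20
14, 0100, 0.20, client, no match, 0.00, client, 0.10
15, 0000, 0.10, ready, 0.00, 0.01, busy and timeout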

Columns, in order: #, sleep, start, error, error, ready, begin, end, error, stop. Only the non-empty cells of each run are listed, left to right.
01, 9000, 13.97, 9.05
02, 8000, 9.00, 8.01
03, 7000, 8.00, 7.00
04, 6000, 7.00, busy, busy, 0.00, 0.00, 0.00, no match, 6.00
05, 5000, 6.00, client, no match, 0.00, 0.00, 0.00, no match, 5.00
06, 4000, 5.00, client, no match, 0.00, 0.00, 0.00, no match, 4.00
07, 3000, 4.00, client, no match, 0.00, 0.00, 0.00, no match, 3.00
08, 2000, 3.00, client, no match, 0.00, 0.00, 0.00, no match and client, 2.01
09, 1000, 2.00, no match, 0.00, 0.00, 0.00, no match, 1.00
10, 0900, 1.00, client, no match, 0.00, 0.00, 0.00, no match, 0.90
11, 0800, 0.90, client, no match, 0.00, 0.00, no match, 0.0
12, 0700, 0.80, client, begin, 0.00, 0.00, 0.71
13, 0600, 0.70, busy, 0.00, no match, 0.50
14, 0500, 0.60, client, no match, 0.00, 0.00, client, 0.50
15, 0400, 0.50, begin, 0.00, 0.00, 0.40
16, 0300, 0.40, busy, 0.00, no match, 0.30
17, 0200, 0.30, client, no match, 0.00, client, 0.21
18, 0100, 0.20, ready, 0.00, 0.10
19, 0000, 0.10, busy, timeout, 0.17*, 0.75*, 1.02*, no match*(0.70), 0.00
In 19, the events happened after the stop was issued.

First thing to notice is the system does not seem to start right away, and the amount of time seems less relevant than the number of starts. In the first test, the system waited until the fourth start, which was ~24 seconds later. The second test picked up on the third try, just ~18 seconds later. The third test also waited until the fourth attempt, 58 seconds later. In all the cases though, when it did start, it gave two busy errors.

My guess is that it is a mixture of not talking and hitting stop. From all these tests, the system does not seem to like being told to stop. The documentation states about stopListening():
Stops listening for speech. Speech captured so far will be recognized as if the user had stopped speaking at this point. Note that in the default case, this does not need to be called, as the speech endpointer will automatically stop the recognizer listening when it determines speech has completed. However, you can manipulate endpointer parameters directly using the intent extras defined in RecognizerIntent, in which case you may sometimes want to manually call this method to stop listening sooner.

It does say it is supported, but it seems that the system is tuned for naturally timing out.

I'm still wondering about the double error that comes after the start but before the begin, that is, why there are two, if any. The first error seems to be from the prior run (which itself was noted in the second post in this thread). Error client, we have seen (in the third post), shows up after a start that follows a stop if there was no intervening timeout (for most of the cases, at least). No match is also a relic of the last run. Busy is a little more interesting. If there's just one busy error, it seems to be for the current run. But if there are two of them, the first one seems to be from the prior run. Though, when given almost no time to run, busy can be followed by timeout, and it is not obvious for which run they were generated. There are also client errors followed by no match, usually when the last run produced a no match. In some of the cases, the errors seem to be generated in the wrong place.

The double ready, however, is the most perplexing of all. When everything runs without issue, on nearly all runs after the initial run of the system (as noted above), onReadyForSpeech() gets called twice. This was likely one of the issues i bumped into (when attempting to record the duration). But the real issue comes from error generation after a start. To which run does it belong? I think that is the main culprit behind the woes i was having that sent me on this investigation in the first place.

It ought to be noted that the system clearly beeps before recording, so, perhaps, some of these runs can be matched to the beeps. That is, only the runs that cause the system to beep are actually recording. Whether that is easy to keep track of on such short runs is another question.

In any case, as noted, the system does not like to be told to stop. It'll listen to you (no pun intended) but will throw errors in strange places. Perhaps stop is best left for when you are interested in neither error nor result.
 
Brian Tkatch
Well, one thing is clear, stopping the system manually causes all sorts of insanity in the callbacks. But let's take a closer look at the documentation for stopListening(), mentioned above, by focusing on:
Note that in the default case, this does not need to be called, as the speech endpointer will automatically stop the recognizer listening when it determines speech has completed. However, you can manipulate endpointer parameters directly using the intent extras defined in RecognizerIntent, in which case you may sometimes want to manually call this method to stop listening sooner.

The endpointer seems to be the part of the system that points to the end of the speech, and when it determines the end has been reached, it stops the system (internally). stopListening() can end it prematurely, but that caused us some issues. Then it lists a third method for stopping, via an extra in the Intent that started the process. Let's take a look at that.

RecognizerIntent is a set of constants to be used in the Intent when SpeechRecognizer is told to start listening. In the examples above, it is passed just ACTION_RECOGNIZE_SPEECH, which tells the system to just recognize the speech, as opposed to ACTION_WEB_SEARCH, for example.

The constants that are relevant to our quest here are EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS, and EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS. The first one seems to be exactly what we want, but it comes with a warning:
The amount of time that it should take after we stop hearing speech to consider the input complete. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously!

We've seen undesired and unexpected results with stop. I wonder if this one is the same.

The second constant is about the minimum to listen, not exactly what we want, unless we need to shorten it to make things work. It comes with a similar warning though:
The minimum length of an utterance. We will not stop recording before this amount of time. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously!

The third constant wants to keep the system listening, by setting a minimum for pauses, and not mistakenly recognize them as the end of speech. It too has the same warning:
The amount of time that it should take after we stop hearing speech to consider the input possibly complete. This is used to prevent the endpointer cutting off during very short mid-speech pauses. Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously

And now that we know how dangerous it is, i'd say it's time to get started.

For this test, the code is rolled back from the last change, keeping the first two code blocks, the original program and the expanded onResults(), plus the expanded log1() and its supporting variable, so timing is easier to see. But that's not all.

Ideally, we would time the end of the speech until the end of the listening, which can be done by timing the start-beep to the end of talking, and from there until the end-beep. The beeps are seemingly an accurate sign of a start and a stop. But that would be prone to error, being hard to match that up with the logs in the code, so we'll skip that. Instead, a second thing to test would be to say the word at the same speed each time (as timing is all-important here). That can be controlled via the TextToSpeech system. That is, we're going to have Android talk to itself. (The speed of the talking is actually controlled by the user, for me in Android 5, it is under Settings->Language and input->Text-to-speech options->Speech rate, which i now have set to Normal.) The system should take the same amount of time each time to say the same word(s), though, there does not seem to be a way to know when it is done speaking. TextToSpeech.isSpeaking() seems to be misnamed, as it only checks if speech is still being queued.

Through a little testing not shown here, i found the TextToSpeech system engages faster than the SpeechToText system. As such, we really need to wait a moment, like one-quarter to one-half a second, before starting to talk. Thread.sleep() will lock up the program preventing the listener from starting, defeating the entire purpose of the sleep. Therefore, a second thread is employed to both sleep and speak.

Speaking of speak(), the original TextToSpeech.speak() was deprecated in API 21 in favor of a newer version, and Studio will warn about that, so i changed build.gradle (Module: app) minSdkVersion to 21 (targetSdkVersion is 23) and resynced when it asked me to.

Note also, the two systems do not "talk" to each other. The phone will not magically hear what is said. The microphone will have to hear the speaker to register the words, so the volume must be loud enough. So test that and check the logs for feedback. (Okay, you come up with a better joke...)
Here we are trying to have one word recognized.
This wasn't the only test i did, but the results are about the same. The first word causes the begin, that is, the ready happens, the word is said, and the begin starts, and then it just hangs. It does not time out or anything, even when left to run for many minutes. In this example, i decided to test it some seconds (30?) later to see if it actually was still listening. I said the word in a near whisper by chance, and when nothing happened decided it was likely too low to be registered, so i said it louder, and was pleasantly surprised that it caught both.

To make the case, i'll try "code ranch," which is only adding one syllable to the original test, and also "test" and "testing."
Again, one syllable is not recognized but two are, including the one that triggered the begin. Almost as magical as a Star Trek communicator doing translations on the fly. Note that "testing" worked fine, yet a real world example didn't work as expected.

I want to try that once more, because "testing" is two syllables in the same word. What about different words? "code ranch" might be a problem because the words are not normally related, depriving the recognizer of context. How about something simple like, "one two":
I stand corrected. Perhaps it's just temperamental, or based on how well it thinks it understood itself. That is, if it is certain about the word, it is good on its own, otherwise, it requires another word for context. Let's try a few more examples:
It's hard to say when it will or will not recognize what i said, but it seems that one word has less of a chance and two words have more. And in some cases, such as with "code," "who," or even "who what," it doesn't even stop listening, ostensibly, ignoring the timeout altogether. In any case, we see that the system takes a while to time out, usually a little over five seconds. That allows for short pauses, which is a good thing for normal speech, as people pause when they are talking. But what if you are trying to record single words? That's an eternity.

One thing that might be a problem is isSpeaking(), which simply checks the queue. That does not seem to be terribly accurate. I used it based on its name, though upon reflection, the documentation makes it clear what it is. That doesn't excuse the naming though. Anyway, we'll add UtteranceProgressListener, which provides callbacks from the engine itself as to when the speaking started and finished, and if there were any errors. It is an abstract class, not an interface like TextToSpeech.OnUtteranceCompletedListener, which was deprecated in API 18. To help it stand out in the log, "(upl)" is prefixed in its log message.

speak() allows for certain parameters and also a string for utterance id. We don't seem to need to pass any parameters, but the id is passed to the upl methods, so we might as well use it. For convenience, we'll just use the words to be spoken as the id. And just for fun, change QUEUE_FLUSH to QUEUE_ADD, not that it should matter here.

To do this, in onCreate(), right after tts is instantiated, we'll add the upl method:
speak(), with its simple changes:
I had done some testing first, and the results were interesting, especially as things seemed to be logged out of order on the first run. So after a little testing, i rebooted the phone to run the following on a freshly started engine, to see if it is repeatable. Perhaps it is worthwhile to keep in mind that, as we have seen earlier, first runs can produce unexpected results.
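Columns, in order: #, start, no match, ready, speak, upl start, begin, upl done, queued, end, results, words. Only the non-empty cells of each run are listed, left to right.
1, 5.62, 0.52, 0, 0.08*, 0.32, 0.74, 0, 5.85, 0.31, 0.01
2, 3.57, 0.08, 0.15, 0.30, 0.02, 0.41, 0.21, 0, 6.07, 0.12, 0
3, 0.67, 0.06, 0.13, 0.32, 0.01, 0.40, 0.19, 0.01, 6.10, 0.12, 0.00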

The first run was out of order, though here it may just be a delay to allow the system to start the first time around. The no match makes no sense, as we have seen before. But it is clear that after the ready, when the voice starts speaking, begin is triggered and the end happens about 6 seconds after the voice is done speaking. So now we can try to lessen that time, specifically because we want to test just a word or two, or, more importantly, not a sentence replete with pauses that have to be waited for.

Of the 3 options mentioned above, only 1 or 2 would seem to be related to our goals. However, it is possible the other option(s) will hinder us, if honoring whatever one option is set to overrides the other. For example, we might tell it to make any quarter-second pause an end of speech, but if the minimum listening time is 3 seconds, it won't even bother with those pauses. And since we do not know what the defaults are, it is perhaps best to set all the options to make them known, by bringing them under our control. Perhaps related, a little searching found people complaining about Android not honoring these options. Let's see what we can accomplish.

The options are passed as extras in an Intent, so we'll add it in onCreate(), just under the option that is already there:
I have set the values and tried 100, 1000, and 10000, or just minimum length alone at 10k, and the end comes in about 6 seconds later no matter what. It seems the system simply does not honor these values, regardless of whether they are more than or less than the default values.

Android has issues 16638 and 76130, the former has been marked obsolete, but the problem persists.

So, it would seem that short recognition is simply not supported.
 
Brian Tkatch
Okay, i really ought to show the numbers instead of just saying it didn't work:

Value | First Run | Second Run | Third Run
new Long(0) | 6.01 | 6.05 |
new Long(1) | 6.06 | 6.02 |
new Long(10) | 6.00 | 6.00 |
new Long(100) | 6.03 | 6.13 |
new Long(1000) | 6.03 | 6.05 |
new Long(10000) | 6.11 | 6.04 |
10000 | 9.43 | 6.01 | 5.96

In these cases, all three values were changed. I tested the raw number just in case, and did indeed get excited. But as the next two tests show, that was merely a fluke.
 