Implementing Real Time Speech to Text in SUSI Android App

Android has a very interesting feature built in called Speech to Text. As the name suggests, it converts user’s speech into text. So, android provides this feature in form of a recognizer intent. Just the below 4 line code is needed to start recognizer intent to capture speech and convert into text. This was implemented earlier in SUSI Android App too.  Simple, isn’t it?

Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
       RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
intent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE,
       "com.domain.app");
startActivityForResult(intent, REQ_CODE_SPEECH_INPUT);

But there is a problem here. It popups an alert dialog box to take a speech input like this :

Although this is great and working but there were many disadvantages of using this :

  1. It breaks connection between user and app.
  2. When user speaks, it first listens to complete sentence and then convert it into text. So, user was not able to see what the first word of sentence was until he says the complete sentence.
  3. There is no way for user to check if the sentence he is speaking is actually converting correctly to speech or not. He only gets to know about it when the sentence is completed.
  4. User can not cancel the conversion of Speech to text if some word is converted wrongly because the user doesn’t even get to know if the word is converted wrongly. It is kind of suspense that a whether the conversion is right or not.

So, now the main challenge was how can we improve the Speech to Text and implement it something like Google now?  So, what google now does is, it converts speech to text alongside taking input making it a real time speech to text convertor. So, basically if you speak second word of sentence, the first word is already converted and displayed. You can also cancel the conversion of rest of the sentence if the first word of sentence is converted wrongly.

I searched this problem a lot on internet, read various blogs and stackoverflow answers, read Documentation of Speech to Text until I finally came across SpeechRecognizer class and RecognitionListener. The plan was to use object of SpeechRecognizer class with RecognizerIntent and RecognitionListener to generate partial results.

So, I just added this line to the the above code snippet of recognizer intent.

intent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS,true);

And introduced SpeechRecognizer class and RecognitionListener like this

 SpeechRecognizer recognizer = SpeechRecognizer
   .createSpeechRecognizer(this.getApplicationContext());

RecognitionListener listener = new RecognitionListener() {
   @Override
   public void onResults(Bundle results) {
   }
 
   @Override
   public void onError(int error) {
   }
 
   @Override
   public void onPartialResults(Bundle partialResults) {
    //The main method required. Just keep printing results in this.
   }
};

Now combine SpeechRecognizer class with RecognizerIntent and RecognitionListener with these two lines.

recognizer.setRecognitionListener(listener);
recognizer.startListening(intent);

Everything is now set up. You can now use result in onPartialResults method and keep displaying partial results. So, if you want to speak “My name is…” , “My” will be displayed on screen by the time you speak “name” and “My name” will be displayed on screen by the time you speak “is” . Isn’t it cool?

This is how Realtime Speech to Text is implemented in SUSI Android App. If you want to implement this in your app too, just follow the above steps and you are good to go. So, the results look like these.

  

Resources