Implementation of Text to Speech alongside Hotword Detection in SUSI Android App

In this blog post, we’ll learn how to implement text to speech (TTS). You may wonder what is so difficult about that: there are plenty of tutorials, and the official TTS documentation is easy to find. The catch is that in the SUSI Android app, text to speech has to work alongside hotword detection, and that is what this post is about.

Let me give you a rough idea of how hotword detection works in the SUSI Android app. For more details, read my other blog post on hotword detection (linked under Resources). A recording thread runs constantly in the background and listens for the hotword. You may be wondering why we need to stop that thread for text to speech. There are two reasons:

  1. Recording while audio is playing causes problems with the mic and may crash the app.
  2. Even if recording and playback could run together, the answer itself may contain the word “susi”. The recording thread would pick up the speech output and detect our hotword in it.

So, to avoid these problems, we had to come up with a way to stop hotword detection only while SUSI is giving speech output and to resume it as soon as the speech output has finished.
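
The idea can be sketched in plain Java, without any Android classes. Everything in this sketch (`HotwordDetector`, `hear`, `pause`, `resume`, `speakSafely`) is a hypothetical stand-in and not the app's actual recording thread; it only illustrates why the detector has to be off while the reply is spoken.

```java
import java.util.Locale;

/** Hypothetical stand-in for the hotword recording thread; not the app's real class. */
class HotwordDetector {
    private final String hotword;
    private boolean listening = true;
    private int detections = 0;

    HotwordDetector(String hotword) {
        this.hotword = hotword.toLowerCase(Locale.ROOT);
    }

    void pause()  { listening = false; }
    void resume() { listening = true; }

    /** Simulates audio reaching the microphone, from the user or from the speaker. */
    void hear(String audio) {
        if (listening && audio.toLowerCase(Locale.ROOT).contains(hotword)) {
            detections++;
        }
    }

    int detections() { return detections; }
}

class HotwordGateDemo {
    /** Plays a reply with the detector paused, then resumes detection. */
    static void speakSafely(HotwordDetector detector, String reply) {
        detector.pause();
        try {
            detector.hear(reply); // the mic picks up the app's own voice
        } finally {
            detector.resume();    // always resume, even if speaking fails
        }
    }

    public static void main(String[] args) {
        HotwordDetector detector = new HotwordDetector("susi");

        // Without pausing, the reply itself triggers the hotword.
        detector.hear("Hi, I am Susi");
        System.out.println(detector.detections()); // 1

        // With pausing, the same reply is ignored.
        speakSafely(detector, "Hi, I am Susi");
        System.out.println(detector.detections()); // still 1

        // A real "susi" from the user is detected again afterwards.
        detector.hear("susi");
        System.out.println(detector.detections()); // 2
    }
}
```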

Let’s see how we did that.

Implementation

Check out this video to see how this works in the app:

https://youtu.be/V9N6K4SzpXw

Initiating the TTS engine

The first task is to initialize the text-to-speech engine. This takes some time, so it is done at app startup by posting the work to a new Handler.

// Create the engine early; onInit is called once it is ready to use.
new Handler().post(new Runnable() {
   @Override
   public void run() {
       textToSpeech = new TextToSpeech(getApplicationContext(), new TextToSpeech.OnInitListener() {
           @Override
           public void onInit(int status) {
               if (status != TextToSpeech.ERROR) {
                   // Keep the language the engine is currently using.
                   Locale locale = textToSpeech.getLanguage();
                   textToSpeech.setLanguage(locale);
               }
           }
       });
   }
});

Check Audio Focus

The next step is to request audio focus and check whether it was granted. Suppose something else, such as background music, currently holds the audio output; in that case we may not be able to give voice output. So, we request audio focus using the code below and speak only if the request is granted.

final AudioManager audiofocus = (AudioManager) getSystemService(Context.AUDIO_SERVICE);
int result = audiofocus.requestAudioFocus(afChangeListener, AudioManager.STREAM_MUSIC, AudioManager.AUDIOFOCUS_GAIN);
if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {
    // DO WORK HERE
}

Using an OnAudioFocusChangeListener, we keep track of when we are allowed to give speech output and when we are not.

private AudioManager.OnAudioFocusChangeListener afChangeListener =
       new AudioManager.OnAudioFocusChangeListener() {
           @Override
           public void onAudioFocusChange(int focusChange) {
               if (focusChange == AudioManager.AUDIOFOCUS_LOSS_TRANSIENT) {
                   // Focus lost for a short time: stop speaking
                   textToSpeech.stop();
               } else if (focusChange == AudioManager.AUDIOFOCUS_GAIN) {
                   // Resume playback
               } else if (focusChange == AudioManager.AUDIOFOCUS_LOSS) {
                   // Focus lost for good: stop speaking
                   textToSpeech.stop();
               }
           }
       };
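
The listener's decisions can also be isolated in a small pure function, which makes them easy to check off-device. This is only a sketch under assumptions: the `FocusPolicy` class, the `FocusAction` enum and the `onFocusChange` helper are hypothetical names, and the integer constants are local copies that mirror the documented `AudioManager` values so that the example compiles without the Android SDK.

```java
class FocusPolicy {
    // Local copies mirroring android.media.AudioManager constants (assumption:
    // duplicated here only so the sketch compiles without the Android SDK).
    static final int AUDIOFOCUS_GAIN = 1;
    static final int AUDIOFOCUS_LOSS = -1;
    static final int AUDIOFOCUS_LOSS_TRANSIENT = -2;

    /** Hypothetical result type: what the listener should do with the TTS engine. */
    enum FocusAction { STOP_SPEECH, MAY_SPEAK, IGNORE }

    static FocusAction onFocusChange(int focusChange) {
        switch (focusChange) {
            case AUDIOFOCUS_LOSS:
            case AUDIOFOCUS_LOSS_TRANSIENT:
                // Both kinds of loss stop the current speech output.
                return FocusAction.STOP_SPEECH;
            case AUDIOFOCUS_GAIN:
                return FocusAction.MAY_SPEAK;
            default:
                // Any other focus change is left alone in this sketch.
                return FocusAction.IGNORE;
        }
    }

    public static void main(String[] args) {
        System.out.println(onFocusChange(AUDIOFOCUS_LOSS_TRANSIENT)); // STOP_SPEECH
        System.out.println(onFocusChange(AUDIOFOCUS_GAIN));           // MAY_SPEAK
    }
}
```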

Converting the given text to speech

Now that we have audio focus, we just have to convert the given text to speech using the textToSpeech.speak() method.

private void voiceReply(final String reply) {
    Handler handler = new Handler();
    handler.post(new Runnable() {
        @Override
        public void run() {
            // ttsParams is the HashMap carrying the utterance ID (defined in the hotword section below).
            textToSpeech.speak(reply, TextToSpeech.QUEUE_FLUSH, ttsParams);
        }
    });
}

Abandon Audio Focus

Now that we are done with speech output, it’s time to abandon audio focus.

audiofocus.abandonAudioFocus(afChangeListener);
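
Because speech output is asynchronous, one way to pair the request with the abandon call is to abandon focus from a completion callback. The sketch below models that in plain Java. The `AudioFocus` and `Speaker` interfaces and the `FocusedSpeech` class are hypothetical abstractions introduced only for illustration; in the app they would wrap `AudioManager` and `TextToSpeech`, and this is not necessarily how the app wires it up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over the AudioManager focus calls. */
interface AudioFocus {
    boolean request();
    void abandon();
}

/** Hypothetical abstraction over the TTS engine; onFinished runs on done or error. */
interface Speaker {
    void speak(String text, Runnable onFinished);
}

class FocusedSpeech {
    private final AudioFocus focus;
    private final Speaker speaker;

    FocusedSpeech(AudioFocus focus, Speaker speaker) {
        this.focus = focus;
        this.speaker = speaker;
    }

    /** Returns false (and stays silent) when audio focus is not granted. */
    boolean say(String text) {
        if (!focus.request()) {
            return false;
        }
        // Abandon focus only once the utterance has finished.
        speaker.speak(text, focus::abandon);
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        AudioFocus focus = new AudioFocus() {
            public boolean request() { log.add("request"); return true; }
            public void abandon() { log.add("abandon"); }
        };
        Speaker speaker = (text, onFinished) -> {
            log.add("speak:" + text);
            onFinished.run(); // a real engine would call this later, asynchronously
        };
        new FocusedSpeech(focus, speaker).say("hello");
        System.out.println(log); // [request, speak:hello, abandon]
    }
}
```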

TTS alongside Hotword Detection

Now for the major part. How do we know when to stop the hotword detection thread and when to resume it? How do we know when the speech output has finished?

The answer to these questions is textToSpeech.setOnUtteranceProgressListener. An UtteranceProgressListener implementation overrides three methods:

  1. onStart: Called when the engine starts speaking the utterance, which means it’s time to stop the hotword detection thread.
  2. onDone: Called when the whole utterance has been spoken. So, we simply resume hotword detection.
  3. onError: Called when there is an error and the text is not converted to speech. We need to resume hotword detection here too.

textToSpeech.setOnUtteranceProgressListener(new UtteranceProgressListener() {
    @Override
    public void onStart(String s) {
        // Speech output is starting: pause hotword detection.
        if (recordingThread != null && isDetectionOn) {
            recordingThread.stopRecording();
            isDetectionOn = false;
        }
    }

    @Override
    public void onDone(String s) {
        // Speech output finished: resume hotword detection if the user has it enabled.
        if (recordingThread != null && !isDetectionOn && checkHotwordPref()) {
            recordingThread.startRecording();
            isDetectionOn = true;
        }
    }

    @Override
    public void onError(String s) {
        // Speech output failed: resume hotword detection as well.
        if (recordingThread != null && !isDetectionOn && checkHotwordPref()) {
            recordingThread.startRecording();
            isDetectionOn = true;
        }
    }
});

// The listener callbacks are tied to an utterance ID, so one is passed to speak() through ttsParams.
HashMap<String, String> ttsParams = new HashMap<String, String>();
ttsParams.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID,
        MainActivity.this.getPackageName());
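
Since onDone and onError do the same thing, the resume logic can live in one helper. The sketch below models the three callbacks in plain Java so that the transitions can be checked without a device. `Recorder`, `HotwordToggle` and the `hotwordEnabled` supplier (standing in for `checkHotwordPref()`) are hypothetical names, not the app's classes. It uses an `AtomicBoolean` instead of a plain boolean because the Android documentation notes that these callbacks can be called from multiple threads.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the app's recording thread. */
interface Recorder {
    void startRecording();
    void stopRecording();
}

class HotwordToggle {
    private final Recorder recorder;
    private final BooleanSupplier hotwordEnabled; // stands in for checkHotwordPref()
    private final AtomicBoolean detectionOn;

    HotwordToggle(Recorder recorder, BooleanSupplier hotwordEnabled, boolean initiallyOn) {
        this.recorder = recorder;
        this.hotwordEnabled = hotwordEnabled;
        this.detectionOn = new AtomicBoolean(initiallyOn);
    }

    void onStart(String utteranceId) {
        // Flip on -> off exactly once, even if callbacks race.
        if (detectionOn.compareAndSet(true, false)) {
            recorder.stopRecording();
        }
    }

    void onDone(String utteranceId)  { resume(); }

    void onError(String utteranceId) { resume(); }

    /** Shared by onDone and onError: restart detection only if the user wants it. */
    private void resume() {
        if (hotwordEnabled.getAsBoolean() && detectionOn.compareAndSet(false, true)) {
            recorder.startRecording();
        }
    }

    boolean isDetectionOn() { return detectionOn.get(); }
}

class HotwordToggleDemo {
    public static void main(String[] args) {
        Recorder recorder = new Recorder() {
            public void startRecording() { System.out.println("recording started"); }
            public void stopRecording()  { System.out.println("recording stopped"); }
        };
        HotwordToggle toggle = new HotwordToggle(recorder, () -> true, true);
        toggle.onStart("utterance-1"); // prints "recording stopped"
        toggle.onError("utterance-1"); // prints "recording started"
    }
}
```

In the app, the bodies of the real onStart, onDone and onError overrides would simply delegate to an object like this.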

Summary

So, the main requirement for implementing text to speech alongside hotword detection is a way to stop hotword detection while text to speech is in progress and to resume it afterwards. For that we used the UtteranceProgressListener of the TextToSpeech class, which makes the task much easier. You may follow the same approach, or if you have a better one, open an issue.

Resources

  1. Official Documentation of TextToSpeech https://developer.android.com/reference/android/speech/tts/TextToSpeech.html
  2. Documentation of UtteranceProgressListener https://developer.android.com/reference/android/speech/tts/UtteranceProgressListener.html
  3. Blog link to Hotword Detection https://docs.google.com/document/d/1auTyuk32i15Rw94TOkrSruRJ9LZVtjcThoWVJkvnAz8/edit?usp=sharing