Implementation of Text to Speech alongside Hotword Detection in SUSI Android App

In this blog post, we’ll be learning about how to implement Text to speech. Now you may be wondering that what is so difficult in implementing text to speech. One can easily find many tutorials on that and can easily look at the official documentation of TTS but there’s a catch here. In this blog post I’ll be telling about how to implement Text to Speech alongside Hotword Detection.

Let me give you a rough idea about how hotword detection works in SUSI Android App. For more details, read my other blog here on Hotword Detection. So, there is a constantly running background recording thread which detects when hotword is detected. Now, you may be thinking why do we need to stop that thread for text to speech. Well there are 2 reasons to do that:

  1. Recording while playing causing problems with mic and may crash the app.
  2. Suppose we even implement that but what will happen if the answer contains word “susi” in it. Now, the hotword will be detected because the speech output contained word “susi” in it (which is our hotword).

So, to avoid these problems we had to come up a way to stop hotword detection only for that particular time when SUSI is giving speech output and resume it back immediately when speech output is finished.

Let’s see how we did that.

Implementation

Check out this video to see how this work in the app

https://youtu.be/V9N6K4SzpXw

Initiating the TTS engine

The first task is to initiate the Text to speech engine. This process takes some time. So, it is done in the starting of app in a new handler.

new Handler().post(new Runnable() {
   @Override
   public void run() {
       textToSpeech = new TextToSpeech(getApplicationContext(), new TextToSpeech.OnInitListener() {
           @Override
           public void onInit(int status) {
               if (status != TextToSpeech.ERROR) {
                   Locale locale = textToSpeech.getLanguage();
                   textToSpeech.setLanguage(locale);
               }
           }
       });
   }
});

Check Audio Focus

The next step is to check whether audio focus is granted. Suppose there is some music playing in the background, in that case we won’t be able to give voice output. So, we check audio focus using below code.

final AudioManager audiofocus = (AudioManager) getSystemService(Context.AUDIO_SERVICE);
 int result = audiofocus.requestAudioFocus(afChangeListener, AudioManager.STREAM_MUSIC, AudioManager.AUDIOFOCUS_GAIN);
if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {
//DO WORK HERE
}

Using OnAudioFocusChangeListener, we keep a track of when we have access to give speech output and when we don’t.

private AudioManager.OnAudioFocusChangeListener afChangeListener =
       new AudioManager.OnAudioFocusChangeListener() {
           public void onAudioFocusChange(int focusChange) {
               if (focusChange == AUDIOFOCUS_LOSS_TRANSIENT) {
                   textToSpeech.stop();
               } else if (focusChange == AudioManager.AUDIOFOCUS_GAIN) {
                   // Resume playback
               } else if (focusChange == AudioManager.AUDIOFOCUS_LOSS) {
                   textToSpeech.stop();
               }
           }
       };

Converting the given text to speech

Now we have audio focus, we just have to convert given text to speech. Use method textToSpeech.speak().

private void voiceReply(final String reply) {
       Handler handler = new Handler();
       handler.post(new Runnable() {
           @Override
           public void run() {
                   textToSpeech.speak(spokenReply, TextToSpeech.QUEUE_FLUSH, ttsParams);                  
               }
           }
       });
   }
}

Abandon Audio Focus

Now we are done with speech output, it’s time we abandon audio focus.

audiofocus.abandonAudioFocus(afChangeListener);

TTS alongside Hotword Detection

Okay so now the major part. How do we check when to stop hotword detection thread and when to resume it? How do we check if Speech output is finished?

Answer to these questions is textToSpeech.setOnUtteranceProgressListener. The UtteranceProgressListener overrides 3 methods:

  1. onStart: Indicates starting of text to speech conversion. Which means it’s time to stop hotword detection thread.
  2. onDone: Called when every word of the provided text is converted to speech. So, simply resume hotword detection
  3. onError: Called when there is an error and text is not converted to speech. Anyway, we need to resume hotword detection here too.
textToSpeech.setOnUtteranceProgressListener(new UtteranceProgressListener() {
                       @Override
                       public void onStart(String s) {
                           if(recordingThread !=null && isDetectionOn){
                               recordingThread.stopRecording();
                               isDetectionOn = false;
                           }
                       }

                       @Override
                       public void onDone(String s) {
                           if(recordingThread != null && !isDetectionOn && checkHotwordPref()) {
                               recordingThread.startRecording();
                               isDetectionOn = true;
                           }
                       }

                       @Override
                       public void onError(String s) {
                           if(recordingThread != null && !isDetectionOn && checkHotwordPref()) {
                               recordingThread.startRecording();
                               isDetectionOn = true;
                           }
                       }
                   });

                   HashMap<String,String> ttsParams = new HashMap<String, String>();
                   ttsParams.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID,
                           MainActivity.this.getPackageName());

Summary

So, the main thing required for implementation of Text to Speech alongside Hotword detection is a way to control stopping and resuming hotword detection when Text to speech is in process. For that we used UtteranceProgressListener of TextToSpeech class which makes it so easier to do the task we required. You may follow this same approach as well or if you have a better approach, open an issue here.

Resources

  1. Official Documentation of TextToSpeech https://developer.android.com/reference/android/speech/tts/TextToSpeech.html
  2. Documentation of UtteranceProgressListener https://developer.android.com/reference/android/speech/tts/UtteranceProgressListener.html
  3. Blog link to Hotword Detection https://docs.google.com/document/d/1auTyuk32i15Rw94TOkrSruRJ9LZVtjcThoWVJkvnAz8/edit?usp=sharing
Continue ReadingImplementation of Text to Speech alongside Hotword Detection in SUSI Android App

Implementing Text-to-Speech (TTS) in SUSI Android

Mobile assistants are designed to perform tasks that the user “commands” through by chat UI or speech. The Android OS already provides Text to speech (TTS) and Speech to text (STT) features. This feature is available from Android version 1.6 onward. In this blog post I will show how tts is implemented in SUSI Android and how I fix the issue ‘delay in speech response’.

TextToSpeech class controls the tts engine. To use TextToSpeech class import it in the activity where you want to use text to speech feature.

import android.speech.tts.TextToSpeech;

After you import TextToSpeech class now we need to initialize TextToSpeech

TextToSpeech tts = new TextToSpeech(this,this);

Here first parameter is the Context and the other one is the listener. The listener is  use  to  inform our app that the engine is ready to use. In order to be notified we have to  implement  TextToSpeech.OnInitListener.

TextToSpeech.OnInitListener listener = new  TextToSpeech.OnInitListener {
@Override
public void onInit(int status) {
if (status == TextToSpeech.SUCCESS)
tts.setLanguage(Locale.UK/* set the default language*/);
}
}

Hence the engine can be initialized asIf status is success then, it means that TTS is initialized successfully and now we can use it. Otherwise, we can’t use it. setLanguage method is used to set language in which we want reply.

TextToSpeech tts = new TextToSpeech(getApplicationContext,listener)

When you use TTS one thing you have to remember that TTS run  on main thread so sometimes it may cause delays in text to speech conversion or it may block UI for a while. It is better to wrap it like below code.

new Handler().post(new Runnable() {
      @Override
      public void run() {
         tts = new TextToSpeech(getApplicationContext(), listener);
        }
    });

Now our engine is ready to speak, we need simply pass the string we want to read.

tts.speak(text to read,TextToSpeech.QUEUE_FLUSH, null, null);

But before tts.speak, it is important to check for the audio focus change request. It is important because only one audio source can have focus at a time. You can check it using below code.

private AudioManager.OnAudioFocusChangeListener afChangeListener =
           new AudioManager.OnAudioFocusChangeListener() {
                 public void onAudioFocusChange(int focusChange) {
                                                        //check for focus
                                                   }
                                           };

OnAudioFocusChangeListener is called when audio focus of the system is changed and according to value of focusChange either we stop TTS or keep using it.

AudioManager audiofocus = (AudioManager)                                    getSystemService(Context.AUDIO_SERVICE);

audiofocus is instance of AudioManager class. We need it to call requestAudioFocus method of AudioManager class. requestAudioFocus method returns the status of request for audio focus change. This method requires three parameter  instance of AudioManager.OnAudioFocusChangeListener, stream type and duration hint. If request is granted only then we can we can use tts.speak .

int result = audiofocus.requestAudioFocus(afChangeListener,AudioManager.STREAM_MUSIC, AudioManager.AUDIOFOCUS_GAIN);

if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {

tts.speak(text to read,TextToSpeech.QUEUE_FLUSH, null, null);

}

We were continuously facing issue ‘delay in speech response’ because voiceReply method implementation was wrong. We were initializing TextToSpeech on each call of voiceReply method and since onInit method runs on main thread causing delay in voice response. So I removed it and instead of initializing tts each time I used the tts instance already initialized when activity create.

 String spoken = reply;

textToSpeech.speak(spoken, TextToSpeech.QUEUE_FLUSHnull);

You can also control how the engine read text. Like we can modify pitch and speech rate.

tts.setPitch((float)pitch);

tts.setSpeechRate((float)speed);

Resource

Continue ReadingImplementing Text-to-Speech (TTS) in SUSI Android