STT | blog.fossasia.org

Full-Screen Speech Input Implementation in SUSI Android App

Post author:amitiwary999
Post published:August 28, 2017
Post category:FOSSASIA GSoC

One of SUSI Android's most useful features is speech input: when the user speaks, the app detects the speech and converts it to text. This feature is implemented in the SUSI Android app using Android’s built-in speech-to-text functionality. You can implement this functionality in two ways: using only the RecognizerIntent class, or using the SpeechRecognizer class together with the RecognitionListener interface and the RecognizerIntent class. The former method has some disadvantages:

During speech input, it shows a dialog box, which breaks the connection between the user and the app.
We can’t show partial results (the text recognized so far) to the user, whereas the latter method allows it.

For these reasons, we used the SpeechRecognizer class, the RecognitionListener interface and the RecognizerIntent class to implement Android’s built-in speech-to-text functionality in SUSI Android. In this blog post, I will show you how I implemented this feature in SUSI Android with a new UI.

Layout design

You can give speech input to SUSI Android either by clicking the mic button or by using the ‘Hi SUSI’ hotword. In both cases, a screen appears where you can give speech input.

Two important parts of this layout are:

<TextView
    android:id="@+id/txtchat"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    …

TextView: It is used to show the partial result of the speech input, i.e. the text recognized so far from your speech.

<org.fossasia.susi.ai.speechinputanimation.SpeechProgressView
    android:id="@+id/speechprogress"
    android:layout_width="match_parent"
    android:layout_height="50dp"
    android:layout_margin="8dp"
    android:layout_gravity="center"/>

SpeechProgressView: It is a custom view that shows an animation while the user gives speech input. When the user starts speaking, the animation starts. This custom view contains five bars, which animate according to the user's input.

Full-screen speech input implementation

When the user clicks on the mic button or uses the ‘Hi SUSI’ hotword, a screen appears where the user can give speech input. As already mentioned, I used the SpeechRecognizer class, the RecognitionListener interface and the RecognizerIntent class to implement speech-to-text functionality in SUSI Android. The RecognizerIntent class is used to build an intent, with the ACTION_RECOGNIZE_SPEECH action, that asks for speech input:

val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
intent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, "com.domain.app")
intent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)

This intent is then sent to the speech recognizer. The SpeechRecognizer class provides access to the speech recognition service, and whenever a recognition-related event occurs, the RecognitionListener receives a notification from the SpeechRecognizer.

recognizer = SpeechRecognizer
        .createSpeechRecognizer(activity.applicationContext)
val listener = object : RecognitionListener {
    // implement all override methods
}

When the user starts speaking, the height of the bars changes with the sound level. Whenever the sound level changes, the onRmsChanged method is called; there we call the onRmsChanged method of the SpeechProgressView class, which animates the bars according to the change in sound level.

override fun onRmsChanged(rmsdB: Float) {
    if (speechprogress != null)
        speechprogress.onRmsChanged(rmsdB)
}

When the user finishes speaking, the onEndOfSpeech method is called; there we call the onEndOfSpeech method of the SpeechProgressView class, which starts the rotating animation. The rotation shows that SUSI Android has finished listening and is now processing your input.

override fun onEndOfSpeech() {
    if (speechprogress != null)
        speechprogress.onEndOfSpeech()
}

In case of an error, the onError method is called, and in case of successful speech input, the onResults method is called. In both cases, we reset the bars to their initial position and show the chat activity to the user. The user can then give speech input again, either by clicking on the mic or by using the ‘Hi SUSI’ hotword.

override fun onResults(results: Bundle) {
    if (speechprogress != null)
        speechprogress.onResultOrOnError()
    activity.supportFragmentManager.popBackStackImmediate()
}

Reference

RecognitionListener reference: https://developer.android.com/reference/android/speech/RecognitionListener.html
SpeechRecognizer reference: https://developer.android.com/reference/android/speech/SpeechRecognizer.html
RecognizerIntent reference: https://developer.android.com/reference/android/speech/RecognizerIntent.html
Tutorial on animation by Google: https://developer.android.com/training/animation/index.html

Implementing Speech to Text for Chrome in SUSI Web Chat

Post author:rishiraj824
Post published:July 23, 2017
Post category:FOSSASIA GSoC Open Event SUSI.AI

SUSI Web Chat now replies to voice inputs. To achieve this, I made use of the Web Speech API. Voice input saves the user the pain of typing; it is a much-needed feature for the Web Chat and keeps it consistent with the SUSI Android and SUSI iOS clients.

To test the feature out in SUSI Web Chat, click on the microphone icon beside the text area on chat.susi.ai.

Say the message once the dialog appears, and you will see it sent to the chat list, rendered as text.

Let’s achieve the same result following the steps below.

First, initialize the VoiceRecognition class with defaults for speech recognition. For that, we create a file VoiceRecognition.js.

We first look up the Speech Recognition API on the window object.
We warn the user with a console message if no Speech Recognition API is available.
If it is available, we call the createRecognition function using the following line:

this.recognition = this.createRecognition(SpeechRecognition)

constructor (props) {
  super(props)

  // Initialise the Speech Recognition API
  const SpeechRecognition = window.SpeechRecognition
    || window.webkitSpeechRecognition
    || window.mozSpeechRecognition
    || window.msSpeechRecognition
    || window.oSpeechRecognition

  // Warn the user if the API is not available; otherwise call createRecognition
  if (SpeechRecognition != null) {
    this.recognition = this.createRecognition(SpeechRecognition)
  } else {
    console.warn('The current browser does not support the SpeechRecognition API.');
  }
}
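The lookup can also be tried outside a browser. The following is a minimal sketch, not the SUSI Web Chat code: findSpeechRecognition is a hypothetical helper that takes the global object as a parameter, and the two fake "browser" objects are assumptions made only for illustration.

```javascript
// Returns the first SpeechRecognition constructor found on the given global
// object, or null if none is available. The prefixed names mirror the snippet above.
function findSpeechRecognition(globalObject) {
  return globalObject.SpeechRecognition
    || globalObject.webkitSpeechRecognition
    || globalObject.mozSpeechRecognition
    || globalObject.msSpeechRecognition
    || globalObject.oSpeechRecognition
    || null;
}

// Hypothetical global objects standing in for two different browsers.
class PrefixedRecognition {}
const prefixedBrowser = { webkitSpeechRecognition: PrefixedRecognition };
const unsupportedBrowser = {};

const found = findSpeechRecognition(prefixedBrowser);
const missing = findSpeechRecognition(unsupportedBrowser);
console.log(found === PrefixedRecognition, missing === null);
```

In a real page the helper would be called as findSpeechRecognition(window).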

Then we write the createRecognition function

We first set our defaults: continuous – true, interimResults – false, and lang – ‘en-US’.
We merge these defaults with the props, apply the resulting options to a new recognition object, and finally return that object.

createRecognition = (SpeechRecognition) => {
  const defaults = {
    continuous: true,
    interimResults: false,
    lang: 'en-US'
  }

  const options = Object.assign({}, defaults, this.props)

  let recognition = new SpeechRecognition()

  recognition.continuous = options.continuous
  recognition.interimResults = options.interimResults
  recognition.lang = options.lang

  return recognition
}
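To see how the defaults and the props combine, here is a minimal sketch that runs outside the browser. FakeSpeechRecognition is a hypothetical stand-in for the browser's SpeechRecognition constructor, and createRecognition is rewritten as a plain function that takes the props as a second argument; neither is part of the SUSI Web Chat code.

```javascript
// Hypothetical stand-in for the browser's SpeechRecognition constructor,
// used only so that this sketch runs outside a browser.
class FakeSpeechRecognition {
  constructor() {
    this.continuous = false;
    this.interimResults = false;
    this.lang = '';
  }
}

// Same merging logic as createRecognition above, written as a plain function.
function createRecognition(SpeechRecognition, props) {
  const defaults = { continuous: true, interimResults: false, lang: 'en-US' };
  // Later sources win, so any option present in props overrides the default.
  const options = Object.assign({}, defaults, props);

  const recognition = new SpeechRecognition();
  recognition.continuous = options.continuous;
  recognition.interimResults = options.interimResults;
  recognition.lang = options.lang;
  return recognition;
}

const withDefaults = createRecognition(FakeSpeechRecognition, {});
const withOverrides = createRecognition(FakeSpeechRecognition, {
  lang: 'de-DE',
  interimResults: true
});
console.log(withDefaults.lang, withOverrides.lang);
```

Because Object.assign copies every own property of the props, a prop that is present but set to undefined also replaces the default, which is worth keeping in mind when passing optional props through.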

Initialize all the helper functions to be passed as props.

start – This method starts the recognition and activates the browser's Mic. The browser also checks whether it has access to the user’s Mic.
stop – This method stops listening and attempts to return a result from the audio captured so far.
abort – This method stops the SpeechRecognition service without returning a result.
onspeechend – This method is called when no more voice input is detected, and it stops the recognition service.
componentWillReceiveProps – This lifecycle method checks the incoming props and calls the stop method when the stop prop is true.
componentWillUnmount – This method is invoked just before the component unmounts, so its job is to abort the Speech Recognition service.
render – We return null, as this component renders nothing; all the text converted from the captured speech is sent to the parent element.

start = () => {
  this.recognition.start()
}

stop = () => {
  this.recognition.stop()
}

abort = () => {
  this.recognition.abort()
}

onspeechend = () => {
  console.log('no sound detected');
  this.recognition.stop()
}

componentWillReceiveProps ({ stop }) {
  if (stop) {
    this.stop()
  }
}

componentWillUnmount () {
  this.abort()
}

render () {
  return null
}

Add event listeners for the start and end events inside componentDidMount(), so that every action we want to perform from the parent element happens after the component has successfully mounted.

start – The start event is bound to the onStart action that the parent passes to the VoiceRecognition component as a prop.
end – Similarly, the end event is bound to the onEnd action.
After setting up the actions, we register the bindResult function as the listener for the result event and finally start the recognition.

componentDidMount () {
  const events = [
    { name: 'start', action: this.props.onStart },
    { name: 'end', action: this.props.onEnd },
    // addEventListener takes the event type without the "on" prefix
    { name: 'speechend', action: this.props.onspeechend }
  ]

  events.forEach(event => {
    this.recognition.addEventListener(event.name, event.action)
  })

  this.recognition.addEventListener('result', this.bindResult)

  this.start()
}
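The listener wiring can be tried in isolation. The sketch below is an illustration under stated assumptions, not the component itself: a plain EventTarget (available globally in browsers and in recent Node.js versions) stands in for the recognition object, and the two actions are hypothetical callbacks that only record that they ran. It also shows that addEventListener matches on the event type exactly, so a type written with an "on" prefix does not receive the unprefixed event.

```javascript
// A plain EventTarget is a hypothetical stand-in for the recognition object.
const recognition = new EventTarget();
const calls = [];

// Hypothetical actions; in the component these come from the props.
const events = [
  { name: 'start', action: () => calls.push('onStart') },
  { name: 'end', action: () => calls.push('onEnd') }
];

events.forEach(event => {
  recognition.addEventListener(event.name, event.action);
});

// Simulate the events the browser would dispatch.
recognition.dispatchEvent(new Event('start'));
recognition.dispatchEvent(new Event('onstart')); // no listener for this type, nothing runs
recognition.dispatchEvent(new Event('end'));

console.log(calls);
```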

Bind the result and send it as props to the parent element.
Combine the results of the recognition and send them to the onResult function as interimTranscript and finalTranscript.
bindResult – This function concatenates the transcripts of the results we received, collecting final results in finalTranscript and interim ones in interimTranscript.
Lastly, we add the prop validations to ensure the correct props are being passed to our VoiceRecognition component.

// bindResult function
bindResult = (event) => {
  let interimTranscript = ''
  let finalTranscript = ''
  // Collect final results in finalTranscript and interim ones in interimTranscript
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      finalTranscript += event.results[i][0].transcript
    } else {
      interimTranscript += event.results[i][0].transcript
    }
  }

  this.props.onResult({ interimTranscript, finalTranscript })
}

// Add Prop Validations
VoiceRecognition.propTypes = {
  onStart: PropTypes.func,
  onEnd: PropTypes.func,
  onResult: PropTypes.func,
  onspeechend: PropTypes.func, // same casing as this.props.onspeechend
  continuous: PropTypes.bool,
  lang: PropTypes.string,
  stop: PropTypes.bool
};

// Finally export the VoiceRecognition Component
export default VoiceRecognition
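To make the loop in bindResult concrete, here is a minimal sketch that feeds it a hand-built event. collectTranscripts and fakeResult are hypothetical names introduced for this example; the fake event only mimics the parts of a real result event that the loop reads (resultIndex, results, isFinal and the first alternative's transcript).

```javascript
// Same loop as bindResult above, as a standalone function that returns the transcripts.
function collectTranscripts(event) {
  let interimTranscript = '';
  let finalTranscript = '';
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    const result = event.results[i];
    if (result.isFinal) {
      finalTranscript += result[0].transcript;
    } else {
      interimTranscript += result[0].transcript;
    }
  }
  return { interimTranscript, finalTranscript };
}

// Hypothetical helper that mimics the shape of one recognition result:
// an array-like list of alternatives plus an isFinal flag.
function fakeResult(transcript, isFinal) {
  return Object.assign([{ transcript }], { isFinal });
}

const fakeEvent = {
  resultIndex: 0,
  results: [fakeResult('hello ', true), fakeResult('wor', false)]
};

const transcripts = collectTranscripts(fakeEvent);
console.log(transcripts);
```

With interimResults left at its default of false, the browser delivers only final results, so interimTranscript would normally stay empty in that configuration.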

Lastly, call the VoiceRecognition component and pass the props to it from the MessageComposer section in the following way.

Initialize the default state in the constructor inside this.state

this.state = {
  text: '',
  start: false, // Starting the VoiceRecognition
  stop: false, // Stop the VoiceRecognition
  open: false, // Maintain the modal state
  result: '' // Maintain the result state
};

onStart – used to call the VoiceRecognition component only when the Mic button is pressed.
onEnd – used to end the Speech Recognition service.
onResult – used to send the message through the Actions.createMessage() function.

onResult = ({ interimTranscript, finalTranscript }) => {
  let result = interimTranscript;
  let voiceResponse = false;
  this.setState({ result: result });
  if (finalTranscript) {
    result = finalTranscript;
    this.setState({
      start: false,
      result: result,
      stop: false,
      open: false,
      animate: false
    });
    if (this.props.speechOutputAlways || this.props.speechOutput) {
      voiceResponse = true;
    }
    Actions.createMessage(result, this.props.threadID, voiceResponse);
    setTimeout(() => this.setState({ result: '' }), 400);
    this.Button = <Mic />
  }
}

Fire the component based on the value of start variable and pass the requisite props as given below in the code.

// Only when start is true, call the VoiceRecognition component
{this.state.start && (
  <VoiceRecognition
    onStart={this.onStart}
    onEnd={this.onEnd}
    onResult={this.onResult}
    continuous={true}
    lang="en-US"
    stop={this.state.stop}
  />
)}

Update the text in the “Speak Now” dialog to show the user the speech-to-text conversion. The text in the modal changes as soon as we set the state of the result variable, i.e. when speech has been converted to text.

{this.state.result !=='' ? this.state.result :
          'Speak Now...'}

To get access to the full code, go to the repository https://github.com/fossasia/chat.susi.ai

Resources

Mozilla Web Speech Recognition API
Web Speech API Concepts
For the component of the Dialog – www.material-ui.com/#/components/dialog
Tutorial from Google – https://developers.google.com/web/updates/2013/01/Voice-Driven-Web-Apps-Introduction-to-the-Web-Speech-API
HTML5/JS Speech Recognition – https://shapeshed.com/html5-speech-recognition-api/
Follow up tutorial – https://www.labnol.org/software/add-speech-recognition-to-website/19989/