Implementing Speech To Text in SUSI iOS

SUSI, being an intelligent bot, lets the user provide input hands-free by talking, without even having to lift the phone to type. The speech-to-text feature in SUSI iOS is built on the Speech framework, which was released alongside iOS 10 and enables continuous speech detection and transcription. Detection is fast and supports around 50 languages and dialects, from Arabic to Vietnamese. The speech recognition API does the heavy work of detection on Apple’s servers, so it requires an internet connection. The API is not always available, even on newer devices, so it also provides a way to check whether recognition for a specific language is available at a particular time.
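The availability check mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from SUSI iOS: the helper name `canRecognizeSpeech(for:)` is hypothetical, while `SFSpeechRecognizer(locale:)` and `isAvailable` are the framework calls it relies on.

```swift
import Speech

// Hypothetical helper: reports whether speech recognition can be used
// for the given locale identifier at this moment.
func canRecognizeSpeech(for identifier: String) -> Bool {
    let locale = Locale(identifier: identifier)

    // The initializer is failable and returns nil for unsupported locales.
    guard let recognizer = SFSpeechRecognizer(locale: locale) else {
        return false
    }

    // isAvailable reflects the current state, e.g. it can be false
    // when there is no network connection.
    return recognizer.isAvailable
}

// Example call; the result depends on the device and the network state.
let englishIsUsable = canRecognizeSpeech(for: "en-US")
print("en-US recognition available: \(englishIsUsable)")
```

The full list of locales the framework knows about can be read from `SFSpeechRecognizer.supportedLocales()`.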

How to use the Speech to Text feature?

1) Go to the view controller and import the Speech framework.

2) Because the speech is transmitted over the internet and processed on Apple’s servers, we need to ask the user for permission to use the microphone and speech recognition. Add the following two keys to the Info.plist file; each one makes the system show an alert asking for the corresponding permission. Give each key a string value that explains why access is needed, since this text is displayed to the user in the alert.
    • NSSpeechRecognitionUsageDescription
    • NSMicrophoneUsageDescription

The prompts appear automatically the first time the functionality is used in the app. Since hotword recognition is already enabled, the microphone alert shows up right after login, and the speech recognition alert shows up when the microphone button is tapped.
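An app that uses the microphone or speech recognition without the matching usage description is terminated by iOS, so it can help to verify the two keys during development. The following is a minimal sketch; `requiredUsageKeys` and `missingUsageDescriptions(in:)` are hypothetical names introduced for illustration and are not part of SUSI iOS.

```swift
import Foundation

// The two usage-description keys named in the step above.
let requiredUsageKeys = [
    "NSSpeechRecognitionUsageDescription",
    "NSMicrophoneUsageDescription"
]

// Hypothetical helper: returns the required keys that are absent from the
// given Info.plist dictionary or whose value is an empty string.
func missingUsageDescriptions(in info: [String: Any]) -> [String] {
    return requiredUsageKeys.filter { key in
        guard let text = info[key] as? String else { return true }
        return text.trimmingCharacters(in: .whitespaces).isEmpty
    }
}

// Possible use at launch in a debug build:
// let missing = missingUsageDescriptions(in: Bundle.main.infoDictionary ?? [:])
// assert(missing.isEmpty, "Add to Info.plist: \(missing)")
```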

3) To request authorization from the user for speech recognition, we use the SFSpeechRecognizer.requestAuthorization method.

func configureSpeechRecognizer() {
    speechRecognizer?.delegate = self

    SFSpeechRecognizer.requestAuthorization { (authStatus) in
        var isEnabled = false

        switch authStatus {
        case .authorized:
            print("Authorized speech")
            isEnabled = true
        case .denied:
            print("Denied speech")
            isEnabled = false
        case .restricted:
            print("speech restricted")
            isEnabled = false
        case .notDetermined:
            print("not determined")
            isEnabled = false
        }

        // The callback may arrive off the main thread, so update the UI on the main queue
        OperationQueue.main.addOperation {
            // handle button enable/disable
            self.sendButton.tag = isEnabled ? 0 : 1
            self.addTargetSendButton()
        }
    }
}
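Because the method above sets `speechRecognizer?.delegate = self`, the controller can also react when recognition becomes unavailable later, for example when the network drops. Below is a minimal sketch of the `SFSpeechRecognizerDelegate` callback. The class name `ChatViewController` and the standalone `sendButton` are assumptions standing in for the real controller, which is not shown in this article.

```swift
import Speech
import UIKit

// Assumption: a stand-in for the controller that owns `sendButton`.
class ChatViewController: UIViewController, SFSpeechRecognizerDelegate {
    let sendButton = UIButton()

    // Called by the Speech framework whenever availability changes.
    func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer,
                          availabilityDidChange available: Bool) {
        // Hop to the main queue before touching UI state.
        DispatchQueue.main.async {
            // Same tag convention as configureSpeechRecognizer():
            // 0 means enabled, 1 means disabled.
            self.sendButton.tag = available ? 0 : 1
        }
    }
}
```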

4)  Now, we create the AVAudioEngine and SFSpeechRecognizer instances and declare optional properties for the SFSpeechAudioBufferRecognitionRequest and the SFSpeechRecognitionTask, which are assigned later.

let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
var recognitionTask: SFSpeechRecognitionTask?
let audioEngine = AVAudioEngine()

5)  Create a method called `readAndRecognizeSpeech`. It holds all the recognition logic. We first check whether a recognitionTask is already running and, if it is, cancel it.

if recognitionTask != nil {
  recognitionTask?.cancel()
  recognitionTask = nil
}

6)  Now, get the shared AVAudioSession instance to prepare the audio recording: we set the category of the session to recording, set the mode, and activate the session. Since these calls can throw an error, they are placed inside a do-catch block.

let audioSession = AVAudioSession.sharedInstance()

do {
    try audioSession.setCategory(AVAudioSessionCategoryRecord)
    try audioSession.setMode(AVAudioSessionModeMeasurement)
    try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
} catch {
    print("audioSession properties weren't set because of an error.")
}
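The constants above are the Swift 3 era names. From Swift 4.2 onward the same configuration is spelled with `AVAudioSession.Category` and `AVAudioSession.Mode` values. A minimal sketch of that variant follows; the function name `configureAudioSessionForRecording` is hypothetical.

```swift
import AVFoundation

// Hypothetical helper: the same session setup with the Swift 4.2+ names.
func configureAudioSessionForRecording() {
    let audioSession = AVAudioSession.sharedInstance()

    do {
        // Category and mode are set in a single call.
        try audioSession.setCategory(.record, mode: .measurement, options: [])
        // `with:` was renamed to `options:`.
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("audioSession properties weren't set because of an error: \(error)")
    }
}
```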

7)  Instantiate the recognitionRequest.

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

8) Check that the device has an audio input; otherwise, stop with a fatal error.

guard let inputNode = audioEngine.inputNode else {
    fatalError("Audio engine has no input node")
}

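9)  Enable recognitionRequest to report partial results and start the recognitionTask. Since recognitionRequest is declared as an optional, we unwrap it first.

guard let recognitionRequest = recognitionRequest else {
    fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
}

recognitionRequest.shouldReportPartialResults = true

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
    var isFinal = false // set once the final transcription arrives

    if let result = result {
        // Show the best transcription so far in the input field
        self.inputTextView.text = result.bestTranscription.formattedString
        isFinal = result.isFinal
    }

    if error != nil || isFinal {
        // Recognition finished or failed: stop recording and release the request and task
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
    }
})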

10) Next, we write the part that performs the actual speech recognition. It records the speech and processes it continuously.

  • First, we use the audio engine's singleton input node for the incoming audio, obtained through .inputNode in step 8.
  • .installTap configures the node and sets up the buffer size and the format. Each captured buffer is appended to the recognitionRequest.

let recordingFormat = inputNode.outputFormat(forBus: 0)

inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
    self.recognitionRequest?.append(buffer)
}

11)  Next, we prepare and start the audio engine.

audioEngine.prepare()

do {
    try audioEngine.start()
} catch {
    print("audioEngine couldn't start because of an error.")
}
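Steps 4 to 11 can be read together as one method. The following is a condensed sketch, written against current SDKs, where `inputNode` is non-optional and the audio session constants have newer names. The wrapper class `SpeechTranscriber` and the `onText` callback are hypothetical; in SUSI iOS these members live on the view controller and the text goes to `inputTextView`.

```swift
import AVFoundation
import Speech

// Hypothetical wrapper used only to keep the sketch self-contained.
final class SpeechTranscriber {
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()

    // Calls `onText` with every partial and final transcription.
    func readAndRecognizeSpeech(onText: @escaping (String) -> Void) {
        // Step 5: cancel a task that is still running.
        recognitionTask?.cancel()
        recognitionTask = nil

        // Step 6: prepare the audio session for recording.
        let audioSession = AVAudioSession.sharedInstance()
        do {
            try audioSession.setCategory(.record, mode: .measurement, options: [])
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        } catch {
            print("audioSession properties weren't set because of an error.")
        }

        // Steps 7 and 9: create the request and ask for partial results.
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        recognitionRequest = request

        // Step 8: on current SDKs the input node is non-optional.
        let inputNode = audioEngine.inputNode

        // Step 9: start the recognition task.
        recognitionTask = speechRecognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                onText(result.bestTranscription.formattedString)
            }
            if error != nil || result?.isFinal == true {
                // Finished or failed: stop recording and release the request and task.
                self?.audioEngine.stop()
                inputNode.removeTap(onBus: 0)
                self?.recognitionRequest = nil
                self?.recognitionTask = nil
            }
        }

        // Step 10: feed microphone buffers to the request.
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
            request.append(buffer)
        }

        // Step 11: prepare and start the audio engine.
        audioEngine.prepare()
        do {
            try audioEngine.start()
        } catch {
            print("audioEngine couldn't start because of an error.")
        }
    }
}
```

The closure uses `[weak self]` so that a long-running recognition task does not keep the owning object alive.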

12)  Create a method that stops the Speech recognition.

func stopSTT() {
    print("audioEngine stopped")

    audioEngine.inputNode?.removeTap(onBus: 0)
    audioEngine.stop()
    recognitionRequest?.endAudio()
    indicatorView.removeFromSuperview()

    if inputTextView.text.isEmpty {
        self.sendButton.setImage(UIImage(named: ControllerConstants.mic), for: .normal)
    } else {
        self.sendButton.setImage(UIImage(named: ControllerConstants.send), for: .normal)
    }

    self.inputTextView.isUserInteractionEnabled = true
}
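The listening indicator in the next step uses a `startSTT` selector as its tap action, but the body of that method is not shown in this article. One plausible shape is a toggle that stops recognition when the engine is running and starts it otherwise. The sketch below rests on that assumption; the class `SpeechToggle` and its closure parameters are hypothetical stand-ins for the view controller and its methods.

```swift
import AVFoundation

// Hypothetical stand-in for the view controller: it is handed the audio
// engine plus the start and stop routines as closures.
final class SpeechToggle: NSObject {
    private let audioEngine: AVAudioEngine
    private let beginListening: () -> Void   // e.g. readAndRecognizeSpeech
    private let endListening: () -> Void     // e.g. stopSTT

    init(audioEngine: AVAudioEngine,
         beginListening: @escaping () -> Void,
         endListening: @escaping () -> Void) {
        self.audioEngine = audioEngine
        self.beginListening = beginListening
        self.endListening = endListening
        super.init()
    }

    // Tap action: stop if the engine is recording, otherwise start.
    @objc func startSTT() {
        if audioEngine.isRunning {
            endListening()
        } else {
            beginListening()
        }
    }
}
```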

13)  Update the view while speech recognition is running to show the user its status. Add the code below right after the audio engine preparation.

// Listening indicator
self.indicatorView.frame = self.sendButton.frame
self.indicatorView.isUserInteractionEnabled = true

let gesture: UITapGestureRecognizer = UITapGestureRecognizer(target: self, action: #selector(startSTT))
gesture.numberOfTapsRequired = 1
self.indicatorView.addGestureRecognizer(gesture)

self.sendButton.setImage(UIImage(), for: .normal)
indicatorView.startAnimating()
self.sendButton.addSubview(indicatorView)
self.sendButton.addConstraintsWithFormat(format: "V:|[v0(24)]|", views: indicatorView)
self.sendButton.addConstraintsWithFormat(format: "H:|[v0(24)]|", views: indicatorView)

self.inputTextView.isUserInteractionEnabled = false
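Note that `addConstraintsWithFormat(format:views:)` is not a UIKit method, so the project presumably defines it as a `UIView` extension. A minimal sketch of such a helper is given below, assuming it maps the passed views to the names `v0`, `v1`, and so on in the visual format string. The actual implementation in SUSI iOS may differ.

```swift
import UIKit

extension UIView {
    // Assumed helper: adds visual-format constraints, exposing the given
    // views to the format string as v0, v1, v2, ...
    func addConstraintsWithFormat(format: String, views: UIView...) {
        var viewsDictionary = [String: UIView]()

        for (index, view) in views.enumerated() {
            // Auto Layout needs autoresizing-mask translation switched off.
            view.translatesAutoresizingMaskIntoConstraints = false
            viewsDictionary["v\(index)"] = view
        }

        addConstraints(NSLayoutConstraint.constraints(withVisualFormat: format,
                                                      options: [],
                                                      metrics: nil,
                                                      views: viewsDictionary))
    }
}
```

With this helper, `"V:|[v0(24)]|"` pins the indicator to the top and bottom of the button with a height of 24 points, and the `H:` variant does the same horizontally.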

The screenshot of the implementation is below:

       
