Hotword Recognition in SUSI iOS

Hotword recognition is a feature that triggers a specific action each time a specific word is spoken. A service called Snowboy helps us achieve this on various clients (for example iOS, Android and Raspberry Pi). It is a DNN-based hotword recognition toolkit.

In this blog, we will learn how to integrate the Snowboy hotword detection wrapper into the SUSI iOS client. The service can be used in any open source project, but commercial use requires a commercial license.

The following files, provided by the service itself, need to be added to the project: snowboy-detect.h, libsnowboy-detect.a and a trained model file, which can be created using their online service at snowboy.kitt.ai. For the sake of this blog, we will be using the hotword “Susi”; the model file can be found here.

Snowboy works by recording speech for a few seconds and checking the recording against a model already trained for a specific hotword. If Snowboy returns `1`, the hotword was detected; otherwise it was not.
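
The integer result can be turned into something more readable before acting on it. The sketch below assumes the result codes described in the comments of snowboy-detect.h (`-2` for silence, `-1` for an error, `0` for no hotword and a positive value for the index of the detected hotword); check this against the version of the header you ship. The `DetectionOutcome` type is a hypothetical helper and not part of Snowboy or the wrapper.

```swift
/// Hypothetical helper: a readable form of the integer returned by a detection run.
/// The code values are an assumption taken from the snowboy-detect.h comments.
enum DetectionOutcome: Equatable {
    case silence              // -2: no voice activity in the audio
    case error                // -1: the detector reported an error
    case noHotword            //  0: audio was processed, no hotword found
    case hotword(index: Int)  // >0: index of the detected hotword

    init(resultCode: Int) {
        switch resultCode {
        case -2:
            self = .silence
        case 0:
            self = .noHotword
        case let code where code > 0:
            self = .hotword(index: code)
        default:
            // -1 and any other unexpected negative value are treated as errors
            self = .error
        }
    }
}
```

With a single model such as the “Susi” one, a successful detection would map to `.hotword(index: 1)`. If the wrapper returns an `Int32`, convert it with `Int(result)` first.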

We start by creating a wrapper class in Objective-C, plus a bridging header in case the wrapper needs to be added to a Swift project. The wrapper is built on top of the snowboy-detect.h header file and contains methods for setting the sensitivity, setting the audio gain and running the detection on a buffer.

Let’s initialize the service and run it. Below are the steps followed to enable hotword recognition and print out whether it successfully detected the hotword or not:

  • Create a ViewController class that conforms to the following protocols, since we will be recording speech:
    • AVAudioRecorderDelegate
    • AVAudioPlayerDelegate

  • Import AVFoundation
  • Create a basic layout containing a label that shows whether the hotword was detected and a button that starts and stops recognition, and create the corresponding `IBOutlet`s in the ViewController.
  • Create the following variables:
    • let WAKE_WORD = "Susi" // hotword used
    • let RESOURCE = Bundle.main.path(forResource: "common", ofType: "res") // path of the resource file
    • let MODEL = Bundle.main.path(forResource: "susi", ofType: "umdl") // path where the model file is stored
    • var wrapper: SnowboyWrapper! = nil // wrapper instance for running detection
    • var audioRecorder: AVAudioRecorder! // audio recorder instance
    • var audioPlayer: AVAudioPlayer!
    • var soundFileURL: URL! // stores the URL of the temp recording file
    • var timer: Timer! // timer to fire a function after an interval
    • var isStarted = false // variable to check if audio recorder already started
  • In `viewDidLoad`, initialize the wrapper and set the sensitivity and audio gain. According to the docs, recognition works best when sensitivity is set to `0.5` and audio gain is set to `1.0`.
override func viewDidLoad() {
    super.viewDidLoad()
    wrapper = SnowboyWrapper(resources: RESOURCE, modelStr: MODEL)
    wrapper.setSensitivity("0.5")
    wrapper.setAudioGain(1.0)
}
  • Create an `IBAction` for the button to start or stop recognition. The action toggles on the `isStarted` variable: when it is true, recording is stopped and the timer is invalidated; otherwise a timer is started that calls the `startRecording` method every 4 seconds.
@IBAction func onClickBtn(_ sender: Any) {
  if (isStarted) {
    stopRecording()
    timer.invalidate()
    btn.setTitle("Start", for: .normal)
    isStarted = false
  } else {
    timer = Timer.scheduledTimer(timeInterval: 4, target: self,
                                 selector: #selector(startRecording),
                                 userInfo: nil, repeats: true)
    timer.fire()
    btn.setTitle("Stop", for: .normal)
    isStarted = true
  }
}
  • Next, we add the start and stop recording methods.
    • First, a temp file is created to store the recorded audio.
    • Next, the recording settings are configured, such as the sampling rate.
    • The recording is then started and the output is stored in the temp file.
// @objc so that the method can be referenced with #selector by the timer
@objc func startRecording() {
  do {
    // Temp file in the documents directory that receives the recording
    let fileMgr = FileManager.default
    let dirPaths = fileMgr.urls(for: .documentDirectory, in: .userDomainMask)
    soundFileURL = dirPaths[0].appendingPathComponent("temp.wav")
    // Mono audio sampled at 16 kHz
    let recordSettings = [AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue,
                          AVEncoderBitRateKey: 128000,
                          AVNumberOfChannelsKey: 1,
                          AVSampleRateKey: 16000.0] as [String : Any]
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(AVAudioSessionCategoryRecord)
    audioRecorder = try AVAudioRecorder(url: soundFileURL, settings: recordSettings)
    audioRecorder.delegate = self
    audioRecorder.prepareToRecord()
    audioRecorder.record(forDuration: 2.0) // record for 2 seconds
    instructionLabel.text = "Speak wake word: \(WAKE_WORD)"
    print("Started recording...")
  } catch let error {
    print("Audio session error: \(error.localizedDescription)")
  }
}
  • The `stopRecording` method stops the `audioRecorder` instance and updates the instruction label accordingly.
func stopRecording() {
  if (audioRecorder != nil && audioRecorder.isRecording) {
    audioRecorder.stop()
  }
  instructionLabel.text = "Stop"
  print("Stopped recording...")
}
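
One detail from the steps above is worth guarding against: `Bundle.main.path(forResource:ofType:)` returns an optional, so `RESOURCE` and `MODEL` are `nil` if the files were not added to the app target. A minimal sketch of checking this before creating the wrapper is shown below. The `SnowboyFileError` type and `resolvePaths` function are hypothetical names used only for illustration.

```swift
import Foundation

/// Hypothetical error raised when a bundled Snowboy file cannot be located.
enum SnowboyFileError: Error, Equatable {
    case missing(String)
}

/// Unwraps the optional paths obtained from `Bundle.main.path(forResource:ofType:)`.
/// Throws with the name of the first file that could not be found.
func resolvePaths(resource: String?, model: String?) throws -> (resource: String, model: String) {
    guard let resource = resource else {
        throw SnowboyFileError.missing("common.res")
    }
    guard let model = model else {
        throw SnowboyFileError.missing("susi.umdl")
    }
    return (resource, model)
}
```

In `viewDidLoad`, the unwrapped paths could then be passed to `SnowboyWrapper(resources:modelStr:)` inside a `do`/`catch`, with the error case shown in the instruction label instead of crashing.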

The final recognition is done in the `audioRecorderDidFinishRecording` delegate method, which runs the Snowboy detection function. This function reads the audio recording from the temp file into a buffer and passes the buffer to the wrapper, which processes it and returns `1` if the hotword was successfully detected.

func runSnowboy() {
  // Read the recorded temp file into a PCM buffer
  let file = try! AVAudioFile(forReading: soundFileURL)
  let format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16000.0,
                             channels: 1, interleaved: false)
  let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(file.length))
  try! file.read(into: buffer)
  // Copy the samples of the single channel into a Swift array
  let array = Array(UnsafeBufferPointer(start: buffer.floatChannelData![0],
                                        count: Int(buffer.frameLength)))
  // Run the detection and print the output
  let result = wrapper.runDetection(array, length: Int32(buffer.frameLength))
  print("Result: \(result)")
}
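
As a rough sanity check on the `length` passed to `runDetection`: the recorder is configured for mono audio at 16 kHz and records for 2 seconds, so the buffer should hold roughly 16000 × 2 frames. The helper below is a hypothetical illustration of that arithmetic; the actual `frameLength` of a recording may differ slightly.

```swift
/// Hypothetical helper: number of PCM frames expected for a mono recording
/// of the given duration (in seconds) at the given sample rate (in Hz).
func expectedFrameCount(duration: Double, sampleRate: Double) -> Int {
    return Int((duration * sampleRate).rounded())
}

// Values taken from the recorder settings used in this post
let frames = expectedFrameCount(duration: 2.0, sampleRate: 16000.0)
print("Expected frames: \(frames)") // prints "Expected frames: 32000"
```

If the printed `frameLength` in `runSnowboy` is far from this value, the recording settings and the read format probably do not match.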

To test this out, click the start button and speak different words. You will notice that once the hotword is spoken, a log with `Result: 1` is printed.

Snowboy also lets you train a personalized model through its REST APIs, for which the docs can be found here. The complete project implementation can be found here.

Sources: