
stt: add speech-to-text capability with whisper.cpp

Gabriel Lafond-Thenaille requested to merge live.stt.7 into master

This is a new version of the Speech-To-Text implementation.

Differences from the last version:

  • Removed the core interface of STT.
  • Used an SPU decoder instead of an STT decoder and audio filters.

What's new:

  • A new SPU fourcc VLC_CODEC_STT.
  • A new SPU decoder used as a subdecoder to decode audio frames to SPU.
  • A core interface to load an STT model asynchronously, because loading the model and, if needed, initializing accelerators can take several seconds.

How it works:


  • Added functions in es_out to enable/disable STT.
  • Automatically creates an SPU ES when an audio track is selected.
  • When the new SPU track is selected, a new thread is started to load the model. The track selection takes effect once the model is loaded. Meanwhile, pts_delay is used to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
  • Each audio frame is sent to the audio track's STT sub-decoder just before it is sent to the AOUT.
  • When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
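
To illustrate the asynchronous model loading described above, here is a minimal sketch using plain POSIX threads. It is not the code from this merge request: every name in it (stt_model, stt_loader, stt_track, stt_loader_start, and so on) is hypothetical, and the slow model load is replaced by a placeholder allocation. It only shows the general pattern: start a loader thread, return immediately, and mark the track as selected from a callback once the model is ready.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a loaded STT model (not a VLC or whisper.cpp type). */
struct stt_model { const char *path; };

/* Callback invoked from the loader thread once loading has finished.
 * model is NULL if loading failed. */
typedef void (*stt_ready_cb)(void *opaque, struct stt_model *model);

/* Hypothetical loader state; it must outlive the loader thread. */
struct stt_loader {
    pthread_t    thread;
    const char  *path;
    stt_ready_cb on_ready;
    void        *opaque;
};

/* Hypothetical SPU track state: the selection takes effect only once
 * the model is available. */
struct stt_track {
    struct stt_model *model;
    int               selected;
};

static void stt_track_on_ready(void *opaque, struct stt_model *model)
{
    struct stt_track *track = opaque;
    track->model = model;
    track->selected = (model != NULL);
}

static void *stt_loader_run(void *data)
{
    struct stt_loader *loader = data;

    /* Placeholder for the slow part: reading the model file and
     * initializing accelerators would happen here. */
    struct stt_model *model = malloc(sizeof(*model));
    if (model != NULL)
        model->path = loader->path;

    loader->on_ready(loader->opaque, model);
    return NULL;
}

/* Start loading in a separate thread; returns 0 on success, -1 on error.
 * The caller is not blocked while the model loads. */
static int stt_loader_start(struct stt_loader *loader, const char *path,
                            stt_ready_cb on_ready, void *opaque)
{
    assert(loader != NULL && on_ready != NULL);
    loader->path = path;
    loader->on_ready = on_ready;
    loader->opaque = opaque;
    if (pthread_create(&loader->thread, NULL, stt_loader_run, loader) != 0)
        return -1;
    return 0;
}

/* Wait for the loader thread to finish (for example on teardown). */
static void stt_loader_join(struct stt_loader *loader)
{
    pthread_join(loader->thread, NULL);
}
```

A caller would invoke stt_loader_start() when the SPU track is requested, keep buffering audio in the meantime, and call stt_loader_join() before releasing the loader. In this sketch the callback runs on the loader thread, so the caller reads the track state only after joining; a real implementation would need proper synchronization.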

Stream Output:

  • Created a new type of stream output, STT, to be used without transcoding.
  • As STT requires PCM, when a new audio track is added, it first loads the model, then creates an audio decoder and the STT SPU decoder, chaining them together.
  • Each new frame is first sent to the audio decoder and then to the SPU decoder.
  • As the SPU decoder returns SPUs asynchronously, it converts SPU to frames in VLC_CODEC_TEXT format before sending them to the next SOUT module in the chain.
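
The chaining described above can be sketched as follows. This is a simplified model under stated assumptions, not the stream output code from this merge request: all types and functions (text_frame, sout_next, stt_chain, chain_send, and so on) are hypothetical, the audio decoder is a stub that treats each input byte as one PCM sample, and the STT stage is a stub that emits a fixed string once STT_WINDOW samples are buffered. The point is only the data flow: input frame, then PCM, then STT, then a text frame forwarded to the next module from a callback rather than as a return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical text frame handed to the next module in the chain. */
struct text_frame {
    long pts;
    char text[32];
};

/* Hypothetical stand-in for the next stream output module. */
struct sout_next {
    int  received;
    long last_pts;
};

static void sout_next_send(struct sout_next *next, const struct text_frame *frame)
{
    next->received++;
    next->last_pts = frame->pts;
}

/* Hypothetical chain state: audio decoder -> STT SPU decoder -> next module. */
struct stt_chain {
    struct sout_next *next;
    size_t            pending_samples; /* PCM samples buffered by the STT stub */
    long              first_pts;       /* PTS of the first buffered sample */
};

/* Number of samples the STT stub needs before it emits text (arbitrary). */
#define STT_WINDOW 4

/* Called when the STT stage produces a subtitle: convert it to a text
 * frame and forward it to the next module. */
static void chain_on_spu(struct stt_chain *chain, long pts, const char *text)
{
    struct text_frame frame;
    frame.pts = pts;
    snprintf(frame.text, sizeof(frame.text), "%s", text);
    sout_next_send(chain->next, &frame);
}

/* STT stub: buffers PCM and reports a subtitle once enough audio is available. */
static void stt_decode(struct stt_chain *chain, long pts, size_t nb_samples)
{
    if (chain->pending_samples == 0)
        chain->first_pts = pts;
    chain->pending_samples += nb_samples;
    if (chain->pending_samples >= STT_WINDOW) {
        chain_on_spu(chain, chain->first_pts, "(transcribed text)");
        chain->pending_samples = 0;
    }
}

/* Entry point: each input frame goes to the audio decoder stub first,
 * and the resulting PCM goes to the STT stub. */
static void chain_send(struct stt_chain *chain, long pts,
                       const unsigned char *data, size_t size)
{
    assert(chain != NULL && chain->next != NULL);
    (void)data;                   /* stub: one input byte == one PCM sample */
    stt_decode(chain, pts, size);
}
```

Because text is delivered through chain_on_spu() instead of being returned by chain_send(), the same structure still works when the STT stage produces its output later and from another thread, which is the situation the SPU decoder creates.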


  • For now, the user has to obtain the model they want to use themselves.
  • On macOS, iOS, and tvOS, there is currently a problem compiling Whisper with some frameworks; I have not found a fix yet.
  • The WIP commit exists because Whisper relies on features that require macOS 13.0 or later. I don't know whether there is a better solution than what I've done.
Edited by Gabriel Lafond-Thenaille
