Draft: stt: add speech-to-text capability with whisper

Gabriel Lafond-Thenaille requested to merge gabriel_lt/vlc:live.stt.10 into master

This is a new version of the Speech-To-Text implementation.

Supersedes !4705 (closed)

Differences from the last version:

  • Removed the stream_output stt support.

What's new:

  • Store the STT context in the input resource so that it stays loaded between media.
  • Added a downloader to the STT module, built only when gcrypt is present, that fetches the model if it is missing (currently from huggingface.co).
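
To illustrate the idea of keeping the context loaded between media, here is a minimal sketch in plain C. It is not the code from this MR: `stt_context`, `resource`, `resource_hold_stt` and `resource_release` are hypothetical names, the loader is a stub, and reusing the context only when the model path is unchanged is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a loaded model; not the type used in VLC. */
typedef struct {
    char model_path[256];
} stt_context;

/* Hypothetical stand-in for an object that outlives a single media. */
typedef struct {
    stt_context *stt;   /* NULL until the first load */
    unsigned     loads; /* how many times a model was actually loaded */
} resource;

/* Stub loader: a real one would read and parse the model file here. */
static stt_context *stt_context_load(const char *path)
{
    assert(path != NULL);
    stt_context *ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL)
        return NULL;
    strncpy(ctx->model_path, path, sizeof(ctx->model_path) - 1);
    return ctx;
}

/* Return the cached context if it matches the requested model (assumed
 * policy), otherwise load a new one and replace the cached one. */
static stt_context *resource_hold_stt(resource *res, const char *path)
{
    if (res->stt != NULL && strcmp(res->stt->model_path, path) == 0)
        return res->stt;

    stt_context *ctx = stt_context_load(path);
    if (ctx == NULL)
        return NULL;

    free(res->stt);
    res->stt = ctx;
    res->loads++;
    return ctx;
}

/* Drop the cached context, e.g. when the resource itself is destroyed. */
static void resource_release(resource *res)
{
    free(res->stt);
    res->stt = NULL;
}
```

With this shape, the second media that asks for the same model gets the already loaded context back and no second load happens.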

How it works:

Live:

  • Added functions in es_out to enable/disable STT.
  • Automatically creates an SPU ES when an audio track is selected.
  • When the new SPU track is selected, a new thread is started to load the model and store it in the input resource. The track is only marked as selected once the model is loaded. Meanwhile, pts_delay is used to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
  • Each audio frame is passed to the track's STT sub-decoder just before it is sent to the AOUT.
  • When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
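
The buffering step above can be sketched as a small accumulator that collects audio frames until one full chunk is available for the STT. This is only an illustration under assumptions: `stt_accumulator`, `acc_push`, `acc_take` and `chunk_delay_us` are hypothetical names, the capacity is tiny for readability, and the actual chunk size and the way the MR feeds pts_delay are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_CAPACITY 64 /* illustrative only; real chunks are far larger */

/* Hypothetical accumulator for mono float samples. */
typedef struct {
    float  samples[CHUNK_CAPACITY];
    size_t fill;       /* samples currently stored */
    size_t chunk_size; /* samples handed to the STT at once */
} stt_accumulator;

static void acc_init(stt_accumulator *acc, size_t chunk_size)
{
    assert(chunk_size > 0 && chunk_size <= CHUNK_CAPACITY);
    acc->fill = 0;
    acc->chunk_size = chunk_size;
}

/* Extra buffering, in microseconds, needed to hold one full chunk. */
static int64_t chunk_delay_us(size_t chunk_size, unsigned rate)
{
    assert(rate > 0);
    return (int64_t)chunk_size * 1000000 / rate;
}

/* Copy as many samples as still fit; return how many were consumed. */
static size_t acc_push(stt_accumulator *acc, const float *in, size_t n)
{
    size_t room = acc->chunk_size - acc->fill;
    size_t take = n < room ? n : room;
    memcpy(acc->samples + acc->fill, in, take * sizeof(*in));
    acc->fill += take;
    return take;
}

/* True once a full chunk is waiting. */
static int acc_ready(const stt_accumulator *acc)
{
    return acc->fill == acc->chunk_size;
}

/* Hand the full chunk to the caller and start over. */
static void acc_take(stt_accumulator *acc, float *out)
{
    assert(acc_ready(acc));
    memcpy(out, acc->samples, acc->chunk_size * sizeof(*out));
    acc->fill = 0;
}
```

A caller would push each decoded frame, call `acc_take` whenever `acc_ready` is true, and push the unconsumed remainder of the frame afterwards. `chunk_delay_us` gives the amount of audio, as a duration, that has to be buffered before the first chunk can be sent.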

Info:

  • On macOS, iOS, and tvOS, Whisper fails to compile in the CI because some frameworks aren't found.
  • The WIP commit exists because Whisper relies on features that need at least macOS 13.0. I don't know if there is a better solution than what I've done.

Acceleration:

  • On Apple silicon, Core ML and Metal are used.

Not done yet:

  • Create an SPU decoder that takes audio as input.
  • OpenVINO support.
  • Run the whisper module in another process.
Edited by Gabriel Lafond-Thenaille