Draft: stt: add speech-to-text capability with whisper (!5155) · Merge requests · VideoLAN / VLC

This is a new version of the Speech-To-Text implementation.

Differences from the last version:

What's new:

The STT context is now stored in the input resource, so the model stays loaded between media.
Added a downloader to the stt module, built only if gcrypt is present, which downloads the model when it is missing (currently from huggingface.co).
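To illustrate the "download only if missing" decision, here is a minimal sketch in plain C. The helper name `stt_model_is_missing` and the rule that an empty file counts as missing are assumptions made for this example. They are not taken from the merge request, and the actual download step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper (not from the merge request): reports whether the
 * model file at `path` would have to be fetched before the STT can start. */
static bool stt_model_is_missing(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return true;            /* cannot be opened: treat as missing */

    /* Assumption: an empty file (e.g. an interrupted download) is also
     * treated as missing, so it would be fetched again. */
    bool empty = (fgetc(f) == EOF);
    fclose(f);
    return empty;
}
```

A caller would run this check before loading the model and start the downloader only when it returns true.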

How it works:

Live:

Added functions in es_out to enable/disable STT.
Automatically creates an SPU ES when an audio track is selected.
When the new SPU track is selected, a new thread is started to load the model and store it in the input resource; the track is selected once the model is loaded. Meanwhile, pts_delay is used to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
Each audio frame is passed to the track's STT sub-decoder just before it is sent to the AOUT.
When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
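The buffering and hand-off described above can be pictured with a small standalone sketch. It is not the code from this merge request: the 3-second window, the `tick_t` type (standing in for `vlc_tick_t`, in microseconds) and every `stt_*` name are assumptions made for illustration. The one thing taken from whisper.cpp is that it consumes 16 kHz mono float PCM. Each decoded frame is appended to a window; once the window is full, the function returns the timing the generated SPU would carry. A real sub-decoder would run Whisper at that point and queue the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STT_SAMPLE_RATE    16000                  /* whisper.cpp input rate */
#define STT_WINDOW_SAMPLES (STT_SAMPLE_RATE * 3)  /* hypothetical 3 s window */

typedef int64_t tick_t;        /* microseconds, standing in for vlc_tick_t */

struct stt_window {
    float  samples[STT_WINDOW_SAMPLES];
    size_t count;              /* samples currently buffered */
    tick_t start_pts;          /* PTS of the first buffered sample */
};

struct stt_cue { tick_t start, stop; };  /* timing of one generated SPU */

/* Feed one decoded audio frame. Returns true and fills `cue` when a full
 * window has been collected; samples beyond the window are carried over
 * into the next one. */
static bool stt_window_push(struct stt_window *w, const float *pcm,
                            size_t n, tick_t pts, struct stt_cue *cue)
{
    assert(n <= STT_WINDOW_SAMPLES);

    if (w->count == 0)
        w->start_pts = pts;

    size_t room = STT_WINDOW_SAMPLES - w->count;
    size_t take = n < room ? n : room;
    memcpy(w->samples + w->count, pcm, take * sizeof(*pcm));
    w->count += take;

    if (w->count < STT_WINDOW_SAMPLES)
        return false;          /* keep buffering */

    /* Window complete: this is where the model would be run. */
    cue->start = w->start_pts;
    cue->stop  = w->start_pts
               + (tick_t)STT_WINDOW_SAMPLES * 1000000 / STT_SAMPLE_RATE;

    /* Carry the remainder of the frame into the next window. */
    size_t left = n - take;
    memcpy(w->samples, pcm + take, left * sizeof(*pcm));
    w->count = left;
    w->start_pts = cue->stop;
    return true;
}
```

With 1024-sample frames, the window fills on the 47th push. This is the kind of latency the extra pts_delay buffering has to absorb before the first subtitle can be shown.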

Info:

On macOS, iOS, and tvOS, compiling Whisper fails in the CI because some frameworks aren't found.
The WIP commit is there because Whisper relies on features that require at least macOS 13.0. I don't know if there is a better solution than what I've done.

Acceleration:

Not done yet:
