
Draft: stt: add speech-to-text capability with whisper.cpp

Gabriel Lafond-Thenaille requested to merge gabriel_lt/vlc:whisper.cpp.6 into master

Add a Speech-To-Text API and module type to convert audio to text.

Create a new stt module named whisper that uses whisper.cpp to create subtitles from audio.

One problem is that whisper.cpp needs a model to work with, and the models are huge (hundreds of megabytes for the smaller ones). Some models are also fine-tuned for specific languages, so several models may be needed.

Create a new sout module named stt that uses the core stt capability to perform speech-to-text. The whisper stt module is asynchronous while the sout is synchronous, so the sout stt module waits until whisper is ready to process more audio. During that wait, all controls except closing are blocked.
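To illustrate the blocking behaviour described above, here is a minimal sketch of a "ready gate" between a synchronous caller and an asynchronous worker. It is not the code from this merge request: the `stt_gate` type and its functions are hypothetical names, and it uses plain POSIX threads rather than VLC's own threading wrappers so that it stays self-contained.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate shared by the synchronous side and the async worker. */
struct stt_gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool ready;    /* worker can accept another block of audio */
    bool closing;  /* the stream is being closed */
};

void stt_gate_init(struct stt_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
    g->closing = false;
}

/* Called by the asynchronous worker once it can take more audio. */
void stt_gate_set_ready(struct stt_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Called on close: wakes up a waiter so that closing is never blocked. */
void stt_gate_close(struct stt_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->closing = true;
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Called by the synchronous side before sending audio. Blocks until the
 * worker is ready or the gate is closing. Returns true if audio may be
 * sent (and consumes the ready state), false if the stream is closing. */
bool stt_gate_wait(struct stt_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->ready && !g->closing)
        pthread_cond_wait(&g->cond, &g->lock);
    bool ok = g->ready && !g->closing;
    if (ok)
        g->ready = false;
    pthread_mutex_unlock(&g->lock);
    return ok;
}
```

In this sketch the synchronous side would call `stt_gate_wait()` before each send, which is where the other controls would stall; only `stt_gate_close()` can interrupt the wait.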

Add an srt (SubRip Text) mux.
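For reference, a SubRip cue consists of a sequence number, a timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the text, and a blank line. The following is a rough sketch of how one cue could be formatted. The function names and the millisecond timestamps are assumptions made for the example and are not taken from the mux in this merge request.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a millisecond timestamp as HH:MM:SS,mmm.
 * Negative timestamps are clamped to zero. */
int srt_format_time(char *buf, size_t size, int64_t ms)
{
    if (ms < 0)
        ms = 0;
    int64_t h    = ms / 3600000;
    int64_t m    = (ms / 60000) % 60;
    int64_t s    = (ms / 1000) % 60;
    int64_t frac = ms % 1000;
    return snprintf(buf, size,
                    "%02" PRId64 ":%02" PRId64 ":%02" PRId64 ",%03" PRId64,
                    h, m, s, frac);
}

/* Hypothetical helper: writes one cue (index line, timing line, text,
 * blank line) into buf. Returns the snprintf() result, i.e. the length
 * the cue needs, which may exceed size if buf was too small. */
int srt_format_cue(char *buf, size_t size, unsigned index,
                   int64_t start_ms, int64_t end_ms, const char *text)
{
    char start[32], end[32];
    srt_format_time(start, sizeof(start), start_ms);
    srt_format_time(end, sizeof(end), end_ms);
    return snprintf(buf, size, "%u\n%s --> %s\n%s\n\n",
                    index, start, end, text);
}
```

A real mux would also need to handle multi-line text and keep the cue index increasing across the stream; both are left out here.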


Future Evolution:

To work with whisper, you need at least 10 seconds of audio. To add live stt, whisper therefore needs to be at least 10 seconds ahead of the played audio. One possibility is to cache the input, decode the audio, and send it to whisper. Then, after some delay, get back the generated subtitles and use the cached input to play normally. The advantage is that only the input is cached, so it is not as heavy on RAM; the drawback is that the audio has to be decoded twice.
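As a rough illustration of the 10-second requirement, the sketch below accumulates decoded samples until a full window is available. It assumes 16 kHz mono float input, which is what whisper.cpp works with; the `stt_window` type and its functions are hypothetical and are not part of this merge request.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STT_SAMPLE_RATE 16000  /* assumption: 16 kHz mono float samples */
#define STT_MIN_SECONDS 10     /* minimum window from the description */
#define STT_MIN_SAMPLES ((size_t)STT_SAMPLE_RATE * STT_MIN_SECONDS)

/* Hypothetical fixed-size window: 160000 floats, about 640 kB.
 * A zero-initialized struct is an empty window. */
struct stt_window {
    float  samples[STT_MIN_SAMPLES];
    size_t count;
};

/* Appends up to n decoded samples and returns how many were consumed.
 * Samples that do not fit in the window are left to the caller. */
size_t stt_window_push(struct stt_window *w, const float *in, size_t n)
{
    size_t room = STT_MIN_SAMPLES - w->count;
    if (n > room)
        n = room;
    memcpy(w->samples + w->count, in, n * sizeof(*in));
    w->count += n;
    return n;
}

/* True once a full 10-second window is available for whisper. */
bool stt_window_ready(const struct stt_window *w)
{
    return w->count == STT_MIN_SAMPLES;
}
```

Under these assumptions, playback would have to lag the decoder by at least one full window, which is the delay the cached input is meant to absorb.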
