Draft: stt: add speech-to-text capability with whisper
This is a new version of the Speech-To-Text implementation.
Supersedes !4705 (closed)
Differences from the last version:
- Removed stream_output STT support.
What's new:
- Store the STT context inside the input resource to keep it loaded between media.
- Added a downloader to the stt module (enabled when gcrypt is present) that fetches the model if it is missing (currently from huggingface.co).
How it works:
Live:
- Added functions in `es_out` to enable/disable STT.
- Automatically creates an SPU ES when an audio track is selected.
- When the new SPU track is selected, a new thread loads the model and stores it in the input resource. Once loaded, the track is selected. Simultaneously, `pts_delay` is increased so that enough audio is buffered to feed the STT directly.
- Each audio frame is sent to the audio track's STT sub-decoder just before being sent to the AOUT.
- When the STT creates the new SPUs, it sends them to the SPU queue (`decoder_QueueSPU`).
Info:
- For macOS, iOS, and tvOS, there is a problem compiling Whisper because some frameworks aren't found in the CI.
- The WIP commit exists because Whisper depends on features that require macOS 13.0 or later. I don't know if there is a better solution than what I've done.
Acceleration:
- On Apple silicon, CoreML and Metal are used.
Not done yet:
- Create an SPU decoder that takes audio as input.
- Support for OpenVINO.
- Make the whisper module run in another process.
Edited by Gabriel Lafond-Thenaille