Draft: stt: add speech-to-text capability with whisper
8 unresolved threads
This is a new version of the Speech-To-Text implementation.
Supersedes !4705 (closed)
Differences from the last version:
- Removed the stream_output stt support.
What's new:
- Store the STT context inside the input resource to keep it loaded between media.
- Added a downloader to the STT module, built only when gcrypt is present, that downloads the model if it is missing (currently from huggingface.co).
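The downloader checks the downloaded model against a stored checksum (see the `stt_whisper.c` excerpt further down, which hex-encodes a 32-byte hash and compares it with `strcmp`). A minimal standalone sketch of that comparison step follows. `hex_encode` and `digest_matches` are hypothetical helpers, not part of VLC or gcrypt; computing the digest itself is left out, and the stored checksum is assumed to be lowercase hex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the lowercase hex form of `len` bytes into `out`.
 * `out` must hold at least 2 * len + 1 characters. */
static void hex_encode(const uint8_t *bytes, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[bytes[i] >> 4];
        out[2 * i + 1] = digits[bytes[i] & 0x0f];
    }
    out[2 * len] = '\0';
}

/* Hypothetical helper: true when a 32-byte digest matches the expected
 * 64-character lowercase hex checksum. */
static bool digest_matches(const uint8_t digest[32], const char *expected_hex)
{
    char hex[65];
    assert(expected_hex != NULL);
    hex_encode(digest, 32, hex);
    return strcmp(hex, expected_hex) == 0;
}
```

A caller would compute the digest over the downloaded file, then reject the file when `digest_matches` returns false.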
How it works:
Live:
- Added functions in `es_out` to enable/disable STT.
- Automatically creates an SPU ES when an audio track is selected.
- When the new SPU track is selected, it starts a new thread to load the model and store it in the input resource. Once loaded, the track is selected. Simultaneously, it uses `pts_delay` to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
- Each audio frame is sent to the track's STT sub-decoder just before it is sent to the AOUT.
- When the STT creates the new SPUs, it sends them to the SPU queue (`decoder_QueueSPU`).
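The background model load described above can be sketched with plain POSIX threads. This is only an illustration under stated assumptions: `stt_loader_t`, `load_model`, `stt_loader_start` and `stt_loader_wait` are hypothetical names, the "model" is a placeholder value, and the sketch uses a blocking wait for brevity, whereas the merge request reports completion through callbacks.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared STT context kept between media. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    bool            done;   /* loader finished (successfully or not) */
    void           *model;  /* NULL until loaded, or on failure */
    pthread_t       thread;
} stt_loader_t;

/* Hypothetical placeholder for the expensive model load. */
static void *load_model(void)
{
    static int fake_model = 42;
    return &fake_model;
}

static void *loader_thread(void *opaque)
{
    stt_loader_t *l = opaque;
    void *model = load_model();   /* slow work runs off the caller's thread */

    pthread_mutex_lock(&l->lock);
    l->model = model;
    l->done = true;
    pthread_cond_signal(&l->ready);
    pthread_mutex_unlock(&l->lock);
    return NULL;
}

/* Start loading in the background; returns 0 on success. */
static int stt_loader_start(stt_loader_t *l)
{
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->ready, NULL);
    l->done = false;
    l->model = NULL;
    return pthread_create(&l->thread, NULL, loader_thread, l);
}

/* Block until the loader finishes, then return the model (NULL on failure).
 * Only call this after stt_loader_start() returned 0. */
static void *stt_loader_wait(stt_loader_t *l)
{
    pthread_mutex_lock(&l->lock);
    while (!l->done)
        pthread_cond_wait(&l->ready, &l->lock);
    void *model = l->model;
    pthread_mutex_unlock(&l->lock);
    pthread_join(l->thread, NULL);
    return model;
}
```

The caller keeps playing and buffering audio between `stt_loader_start` and the point where the model is actually needed, which is the reason for loading on a separate thread.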
Info:
- For macOS, iOS, and tvOS, Whisper fails to compile because some frameworks aren't found in the CI.
- The WIP commit exists because Whisper requires features that need at least macOS 13.0. I don't know if there is a better solution than what I've done.
Acceleration:
- On Apple silicon, Core ML and Metal are used.
Not done yet:
- Create an SPU decoder that takes audio as input.
- OpenVINO support
- Make the whisper module run in another process
Edited by Gabriel Lafond-Thenaille
Activity
```c
    }
}

static bool SubDecoderIsStt(vlc_input_decoder_t *subdec)
{
    return subdec->dec.fmt_in->i_cat == SPU_ES &&
           subdec->dec.fmt_in->i_extra == sizeof(vlc_stt_extra_t) &&
           subdec->dec.fmt_in->i_codec == VLC_CODEC_STT;
}
```
- Comment on lines +1224 to +1229 (the added `SubDecoderIsStt` function)
Remove the STT fourcc and use an SPU decoder that takes audio in and outputs SPUs (see here)
- Resolved by Gabriel Lafond-Thenaille
- modules/codec/stt_whisper.c 0 → 100644
```c
    int model_index;
    atomic_bool stop;
    vlc_thread_t thread;
} vlc_stt_loader_sys_t;

/* Describe the models */
typedef struct {
    const char *nickname; /* used to select the model */
    const char *name;     /* real name of the model */
    const char *checksum; /* checksum of the model */
    const char *url;      /* url to download the model */
} vlc_stt_whisper_model_t;

#define STT_WHISPER_NMODELS 9
/* List of all available models */
static const vlc_stt_whisper_model_t whisper_models[STT_WHISPER_NMODELS] = {
```
This should not be hardcoded in the module. It points to a website we don't control, with a fixed hash. The day they change the model, everything breaks, and we're not going to release a new VLC every time the model changes, even if we maintain it.
Also, it shouldn't be done in the decoder module. It probably belongs in a Lua extension.
- Resolved by Gabriel Lafond-Thenaille
added 13 commits
- 5bfc374d...c8fda192 - 3 earlier commits
- 9979390b - fourcc: add VLC_CODEC_STT as spu codec
- cdd9b204 - core: add stt-enable
- 3e59a306 - codec: add struct vlc_stt_extra
- 077a9904 - decoder: add STT support
- d8d9beb2 - es_out: Add STT support
- 29f5b312 - input: Add STT support
- 25c3a405 - player: Add STT support
- 7569a19f - WIP: build: Update minimal macosx version to compile whisper.cpp
- d903316c - whisper: add a contrib rule
- f66c05cb - stt: create whisper stt decoder
See !4705 (closed) and !4468 (closed)
- include/vlc_stt.h 0 → 100644
```c
struct vlc_stt_ctx {
    /**
     * The context loaded and used by the module.
     */
    void *data;

    /**
     * The module name
     */
    const char *name;
};

struct vlc_stt_ctx_callbacks {
    /**
     * Called when the context is loaded or when an error occur in the loader
```
- modules/codec/stt_whisper.c 0 → 100644
```c
    if (cache_dir == NULL) {
        ret = -ENOMEM;
        goto ret_ctxgetmodelpath;
    }
    if(asprintf(&model_dir, "%s" DIR_SEP "models", cache_dir) < 0) {
        ret = -ENOMEM;
        goto ret_ctxgetmodelpath;
    }

    /* create the models cache dir if missing */
    struct stat st;
    if (vlc_stat(model_dir, &st) == -1) {
        msg_Info(loader, "creating models folder");
        if (vlc_mkdir_parent(model_dir, 0755) != 0) {
            ret = VLC_EGENERIC;
            goto ret_ctxgetmodelpath;
```
- modules/codec/stt_whisper.c 0 → 100644
```c
            goto ret_ctxgetmodelpath;
        }
    }

    /* get the path of the model */
    char *model_path = NULL;
    ret = asprintf(&model_path, "%s" DIR_SEP "%s", model_dir,
                   whisper_models[i].name);
    if (ret > 0) {
        sys->model_index = i;
        sys->model_path = model_path;
        sys->model_dir = model_dir;
        ret = VLC_SUCCESS;
    } else {
        free(model_dir);
        ret = VLC_EGENERIC;
```
- modules/codec/stt_whisper.c 0 → 100644
```c
    vlc_hex_encode_binary(hash, 32, hash_hex);
    msg_Info(loader, "hash:\n1: %s\n2: %s", hash_hex,
             whisper_models[index].checksum);

    vlc_dialog_release(loader, dialog);
    gcry_md_close(gctx);
    if (strcmp(hash_hex, whisper_models[index].checksum)) {
        msg_Info(loader, "Error wrong hash!");
        return VLC_EGENERIC;
    }
    return VLC_SUCCESS;
#else
    VLC_UNUSED(stream);
    VLC_UNUSED(fd);
    VLC_UNUSED(index);
    return VLC_EGENERIC;
```
- modules/codec/stt_whisper.c 0 → 100644
```c
        .use_gpu = true,
    };
    ctx.data = whisper_init_from_file_with_params(sys->model_path, ctxparams);
    if (ctx.data == NULL) {
        vlc_dialog_release(loader, dialog);
        goto vlc_stt_whisper_CtxThread_error;
    }

    vlc_dialog_release(loader, dialog);

    /* Print whisper context information. */
    msg_Info(loader, "%s", info);

vlc_stt_whisper_CtxThread_error:
    if (ctx.data == NULL) {
        msg_Err(loader, "Whisper context isn't load properly speech-to-text desactivated.");
```
mentioned in merge request !6024
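For reference, the model path construction shown in the `stt_whisper.c` excerpts (cache directory, then a `models` folder, then the model file name) can be sketched without the GNU-specific `asprintf`. This is an illustrative sketch only: `model_path_build` is a hypothetical helper, and `DIR_SEP` is redefined here as a POSIX separator so that the block stands alone.

```c
#include <stdio.h>
#include <stdlib.h>

#define DIR_SEP "/"   /* assumption: POSIX separator for this sketch */

/* Hypothetical helper: return a newly allocated "<cache_dir>/models/<name>",
 * or NULL on invalid input or allocation failure. The caller frees the result. */
static char *model_path_build(const char *cache_dir, const char *model_name)
{
    if (cache_dir == NULL || model_name == NULL)
        return NULL;

    /* First call measures the required length, second call writes the string. */
    int len = snprintf(NULL, 0, "%s" DIR_SEP "models" DIR_SEP "%s",
                       cache_dir, model_name);
    if (len < 0)
        return NULL;

    char *path = malloc((size_t)len + 1);
    if (path == NULL)
        return NULL;

    snprintf(path, (size_t)len + 1, "%s" DIR_SEP "models" DIR_SEP "%s",
             cache_dir, model_name);
    return path;
}
```

Returning a single owned string keeps the error paths short: every failure returns NULL before anything is allocated, or right after the only allocation fails.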