Draft: stt: add speech-to-text capability with whisper

Open Gabriel Lafond-Thenaille requested to merge gabriel_lt/vlc:live.stt.10 into master
8 unresolved threads

This is a new version of the Speech-To-Text implementation.

Supersedes !4705 (closed)

Differences from the last version:

  • Removed the stream_output stt support.

What's new:

  • Store the stt context inside input resource to keep it loaded between media.
  • Added a downloader to the stt module, built only when gcrypt is present, that fetches the model if it is missing (currently from huggingface.co).

How it works:

Live:

  • Added functions in es_out to enable/disable STT.
  • Automatically creates an SPU ES when an audio track is selected.
  • When the new SPU track is selected, it starts a new thread to load the model and store it in input resource. Once loaded, the track is selected. Simultaneously, it uses pts_delay to increase buffering to obtain the right amount of audio ready to be sent directly to the STT.
  • Each audio frame of the track is sent to the track's STT sub-decoder just before it is sent to the AOUT.
  • When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
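
The frame routing described above can be sketched roughly as follows. This is only an illustration: `audio_frame_t`, `audio_track_t`, `stt_subdecoder_push`, `aout_play` and `track_process_frame` are hypothetical stand-ins, not VLC types or functions, and gating the STT path on a `model_loaded` flag is an assumption about how the loader thread and the sub-decoder interact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not VLC types. */
typedef struct {
    long   pts;        /* presentation timestamp of the frame */
    size_t nb_samples; /* number of audio samples in the frame */
} audio_frame_t;

typedef struct {
    bool stt_enabled;  /* toggled by the enable/disable functions */
    bool model_loaded; /* assumed to be set once the loader thread is done */
    int  stt_frames;   /* frames handed to the STT sub-decoder */
    int  aout_frames;  /* frames handed to the audio output */
} audio_track_t;

/* Stand-in for the STT sub-decoder input: only counts frames. */
static void stt_subdecoder_push(audio_track_t *track, const audio_frame_t *frame)
{
    (void)frame;
    track->stt_frames++;
}

/* Stand-in for the audio output: only counts frames. */
static void aout_play(audio_track_t *track, const audio_frame_t *frame)
{
    (void)frame;
    track->aout_frames++;
}

/* Feed the STT sub-decoder first, then the audio output. */
static void track_process_frame(audio_track_t *track, const audio_frame_t *frame)
{
    assert(track != NULL && frame != NULL);
    if (track->stt_enabled && track->model_loaded)
        stt_subdecoder_push(track, frame);
    aout_play(track, frame);
}
```

In this sketch the audio output always receives the frame, so playback does not depend on the STT path being ready.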

Info:

  • For macOS, iOS, and tvOS, Whisper fails to compile in the CI because some frameworks aren't found.
  • The WIP commit exists because Whisper relies on features that require macOS 13.0 or later. I don't know if there is a better solution than what I've done.

Acceleration:

  • On Apple silicon, CoreML and Metal are used.

Not done yet:

  • Create an SPU decoder that takes audio as input.
  • OpenVINO support.
  • Run the whisper module in another process.
Edited by Gabriel Lafond-Thenaille

Merge request reports

Members who can merge are allowed to add commits.

Merge request pipeline #453473 failed

Merge request pipeline failed for f66c05cb

Test coverage 18.36% (-0.28%) from 1 job

Merge details

  • The source branch is 5466 commits behind the target branch.
  • 19 commits and 1 merge commit will be added to master.
  • Source branch will not be deleted.

Activity

        }
    }

    static bool SubDecoderIsStt(vlc_input_decoder_t *subdec)
    {
        return subdec->dec.fmt_in->i_cat == SPU_ES &&
               subdec->dec.fmt_in->i_extra == sizeof(vlc_stt_extra_t) &&
               subdec->dec.fmt_in->i_codec == VLC_CODEC_STT;
    }
  • Denis Charmet @typx started a thread on commit 5bfc374d
  •     int model_index;
        atomic_bool stop;
        vlc_thread_t thread;
    } vlc_stt_loader_sys_t;

    /* Describe the models */
    typedef struct {
        const char *nickname; /* used to select the model */
        const char *name;     /* real name of the model */
        const char *checksum; /* checksum of the model */
        const char *url;      /* url to download the model */
    } vlc_stt_whisper_model_t;

    #define STT_WHISPER_NMODELS 9
    /* List of all available models */
    static const vlc_stt_whisper_model_t whisper_models[STT_WHISPER_NMODELS] = {
    • This should not be hardcoded in the module. It points to a website we don't control, with a fixed hash. The day they change it, everything breaks, and we're not going to release a new VLC every time the model changes, even if we maintain it.

      Also, it shouldn't be done in the decoder module. It probably belongs in a Lua extension.

  • Denis Charmet
  • Something feels wrong in the es_out part (waiting/notwaiting/stopwaiting) but I can't point it out right now and lack bandwidth to properly dive into it.

  • Gabriel Lafond-Thenaille changed the description

  • added 13 commits

  • Apart from the videolan.org download failures, some of the macOS/iOS jobs, but not all of them, fail with this error:

    Accelerate framework not found

  • include/vlc_stt.h 0 → 100644

    struct vlc_stt_ctx {
        /**
         * The context loaded and used by the module.
         */
        void *data;

        /**
         * The module name
         */
        const char *name;
    };

    struct vlc_stt_ctx_callbacks {
        /**
         * Called when the context is loaded or when an error occur in the loader
  •     if (cache_dir == NULL) {
            ret = -ENOMEM;
            goto ret_ctxgetmodelpath;
        }
        if(asprintf(&model_dir, "%s" DIR_SEP "models", cache_dir) < 0) {
            ret = -ENOMEM;
            goto ret_ctxgetmodelpath;
        }

        /* create the models cache dir if missing */
        struct stat st;
        if (vlc_stat(model_dir, &st) == -1) {
            msg_Info(loader, "creating models folder");
            if (vlc_mkdir_parent(model_dir, 0755) != 0) {
                ret = VLC_EGENERIC;
                goto ret_ctxgetmodelpath;
  •             goto ret_ctxgetmodelpath;
            }
        }

        /* get the path of the model */
        char *model_path = NULL;
        ret = asprintf(&model_path, "%s" DIR_SEP "%s", model_dir,
                       whisper_models[i].name);
        if (ret > 0) {
            sys->model_index = i;
            sys->model_path = model_path;
            sys->model_dir = model_dir;
            ret = VLC_SUCCESS;
        } else {
            free(model_dir);
            ret = VLC_EGENERIC;
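
For readers following the path construction in the two snippets above, here is a minimal standalone sketch of building `<cache_dir>/models/<model_name>` without `asprintf`, which is not part of standard C. The helper name `build_model_path` is hypothetical, and defining `DIR_SEP` as "/" is an assumption made for this example; the real value is platform-dependent in VLC.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption for this sketch: POSIX-style separator. */
#define DIR_SEP "/"

/*
 * Hypothetical helper: returns a malloc'd "<cache_dir>/models/<model_name>"
 * string, or NULL on allocation failure. The caller must free() the result.
 */
static char *build_model_path(const char *cache_dir, const char *model_name)
{
    assert(cache_dir != NULL && model_name != NULL);

    /* Size of all parts plus the terminating NUL. */
    size_t size = strlen(cache_dir) + strlen(DIR_SEP "models" DIR_SEP)
                + strlen(model_name) + 1;

    char *path = malloc(size);
    if (path == NULL)
        return NULL;

    snprintf(path, size, "%s" DIR_SEP "models" DIR_SEP "%s",
             cache_dir, model_name);
    return path;
}
```

Returning a single owned string keeps the error handling to one `NULL` check, whereas the reviewed code has to free `model_dir` by hand when the second allocation fails.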
  •     vlc_hex_encode_binary(hash, 32, hash_hex);
        msg_Info(loader, "hash:\n1: %s\n2: %s", hash_hex,
                 whisper_models[index].checksum);

        vlc_dialog_release(loader, dialog);
        gcry_md_close(gctx);
        if (strcmp(hash_hex, whisper_models[index].checksum)) {
            msg_Info(loader, "Error wrong hash!");
            return VLC_EGENERIC;
        }
        return VLC_SUCCESS;
    #else
        VLC_UNUSED(stream);
        VLC_UNUSED(fd);
        VLC_UNUSED(index);
        return VLC_EGENERIC;
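
The checksum comparison in the snippet above can be illustrated with a small standalone sketch. It does not use libgcrypt or `vlc_hex_encode_binary`; `hex_encode` and `digest_matches` are hypothetical helpers written for this example, and lowercase hex in the stored checksum is an assumption. The 32-byte digest length in the snippet is consistent with SHA-256, but the excerpt does not show which algorithm is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Largest digest this sketch accepts, in bytes. */
#define MAX_DIGEST_LEN 64

/*
 * Writes 2 * len lowercase hex characters plus a terminating NUL into out.
 * out must have room for at least 2 * len + 1 bytes.
 */
static void hex_encode(const uint8_t *bin, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";
    assert(bin != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[bin[i] >> 4];
        out[2 * i + 1] = digits[bin[i] & 0x0f];
    }
    out[2 * len] = '\0';
}

/*
 * Returns 1 when the digest, hex-encoded, equals expected_hex, 0 otherwise.
 * A length mismatch is treated as a failed verification.
 */
static int digest_matches(const uint8_t *digest, size_t len,
                          const char *expected_hex)
{
    char hex[2 * MAX_DIGEST_LEN + 1];
    assert(digest != NULL && expected_hex != NULL);
    if (len > MAX_DIGEST_LEN || strlen(expected_hex) != 2 * len)
        return 0;
    hex_encode(digest, len, hex);
    return memcmp(hex, expected_hex, 2 * len) == 0;
}
```

With the digest bytes `de ad be ef`, `digest_matches` returns 1 for "deadbeef" and 0 for any other string, including one of a different length.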
  •         .use_gpu = true,
        };
        ctx.data = whisper_init_from_file_with_params(sys->model_path, ctxparams);
        if (ctx.data == NULL) {
            vlc_dialog_release(loader, dialog);
            goto vlc_stt_whisper_CtxThread_error;
        }

        vlc_dialog_release(loader, dialog);

        /* Print whisper context information. */
        msg_Info(loader, "%s", info);

    vlc_stt_whisper_CtxThread_error:
        if (ctx.data == NULL) {
            msg_Err(loader, "Whisper context isn't load properly speech-to-text desactivated.");
  • Denis Charmet mentioned in merge request !6024
