Skip to content
Snippets Groups Projects

WIP: RFC: Avoid TLS to implement mutexes

1 unresolved thread

This MR is definitely not meant to be merged as is, but rather to start a discussion on our mutex implementations.

Also, full disclosure, I don't claim to fully understand how TLS is implemented in various architectures, but most specifically ELF & glibc as is the case for the rest of this description.

A bit of context: while profiling the media library I realized that we spend a lot of time in vlc_mutex_held. As far as I can see, most of the time is spent obtaining the thread_self address, which is declared as static _Thread_local char thread_self[1];

From there, it seems that we spend most of our time in __tls_get_addr_slow, which (again, as far as I understand) is used to rebuilt the dynamic thread vector each time the generation changes.

What I don't understand is why the DTV changes that often, after the initial module cache is populated, we almost never invoke dlopen/dlclose, which I thought was responsible for a bump in the generation. However if that was the case, we'd see a performance stabilisation after some time, which we don't, as seen in this graph:

chart

The number of run in the X axis is simply the number of full analysis of my music samples folder with the medialibrary tool to analyse a specific folder from CLI. As you can guess, the "-patched" version is the one with the submitted patch applied.

Using pthread_self instead of a thread_local variable is definitely faster since all we care about is getting a thread identifier to compare with the mutex owner, we don't actually care about what's in there. pthread_self is, at least on x86/elf, implemented by loading the value in register fs, while a TLS access will cause a call to __tls_get_addr in dynamic builds. Not relying on TLS also mean we get not to care about a new shared library being loaded and causing a DTV generation bump.

Some profiling samples with/without this patch:

without with
before after

Not relying on TLS means that we go from ≃ 3.1% of IRs spent in vlc_mutex_held to ≃ 1.34%. This also drastically reduces the amount of occurences of __tls_get_addr_slow but I'm not sure if that's something we need to care about.

Now, this patch is wrong and kludgy, but I think there's a valuable improvement to be dug further upon. I'm not sure what the correct way would be though. I feel like it's a bad idea to rely on various platform specific thread functions in misc/threads.c as it's supposed to be the other way around. Relying on vlc_thread_id in a portable way is not an option AFAICS since some implementations might return -1 which would be less thread specific.

Opinion welcome, and if somebody knows more about this and feels like explaining, I'm definitely interested!

Merge request reports

Members who can merge are allowed to add commits.

Merge request pipeline #142516 failed

Merge request pipeline failed for f928ab02

Approval is optional
Merge blocked: 4 checks failed
Unresolved discussions must be resolved.
Pipeline must succeed.
Merge conflicts must be resolved.
Merge request must be rebased, because a fast-forward merge is not possible.

Merge details

  • The source branch is 16456 commits behind the target branch.
  • 1 commit will be added to master.
  • Source branch will be deleted.

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
120 120 vlc_mutex_init_common(mtx, true);
121 121 }
122 122
123 #ifdef HAVE_PTHREAD_H
124 # define THREAD_SELF ((const void*)(uintptr_t)pthread_self())
  • This will not compile if pthread_t is not an integer or pointer type, and is not portable if pthread_t is not a pointer type.

    If it were that easy, I would have done it when I wrote the original code.

  • I tried to make it abundantly clear that I did not think this was a simple issue, nor that the current code is incorrect/incomplete/wrong. This is definitely not the message here.

    To be fair, I'd be happy with the current code if __tls_get_addr_slow wasn't invoked that often, which I don't understand why

  • AFAICT, that is either a bug in glibc or the intended behaviour in the given circumstances. But I only that function getting called from the Qt5 library code, not VLC (in production builds).

    Edited by Rémi Denis-Courmont
  • Almost all calls are indeed via vlc_mutex_assert() which is an assert(). The only production build call remaining is after an unlikely() test in vlc_mutex_trylock().

    Now if the profiling shows this is actually called within vlc_mutex_trylock(), it means the unlikely is not so unlikely.

    BTW, both vlc_mutex_held() and vlc_mutex_assert() are only called from the core. Maybe vlc_mutex_held() doesn't need to be public.

    Edited by Steve Lhomme
  • There is pthread_equal to compare pthread_t values in a portable way.

  • It doesn't help here because pthread_t has no error value for when the mutex is not held. Also _Atomic pthread_t is not going to work on some platforms.

  • In the previous implementation, asserting a lock was only possible inside the core. So modules have not had much time, so to speak, to take vlc_mutex_held() into use.

    But I don't think we should prevent them from doing so going forward. It could most likely be useful in some plugins.

  • Yeah, same thing with vlc_thread_id() it's not used by modules, but it's useful sometimes for debugging.

    BTW, even in release builds THREAD_SELF is still used to fill mtx->owner. So you will likely see some performance hit to access the TLS as well.

  • I know right. That's necessary for vlc_mutex_held() to work. We have to handle the case that a module has debugging enabled even if the core does not (especially external modules).

  • Please register or sign in to reply
    • Resolved by Rémi Denis-Courmont

      The owner field is only read if the mutex is recursive or run-time debug is enabled. We could skip writing it in the normal case.

      But the flip side is that we cannot use vlc_thread_id() unless we require a functional implementation from all platforms or we eliminate vlc_mutex_init_recursive().

      The earlier seems vain anyway, as the Linux implementation of vlc_thread_id() uses the TLS internally. The alternative of making a system call every single time would be even slower.

      Edited by Rémi Denis-Courmont
  • Steve Lhomme mentioned in merge request !767 (merged)

    mentioned in merge request !767 (merged)

  • Steve Lhomme mentioned in merge request !768 (merged)

    mentioned in merge request !768 (merged)

  • If there really is a performance problem with vlc_thread_id()on GNU/Linux, it should probably be raised to glibc at least for comments at least to clarify.

    If pthread_self() is really faster than _Thread_local, then a likely work-around consists of using pthread_key_t instead in the Linux implementation of vlc_thread_id()

  • Steve Lhomme changed milestone to %4.0

    changed milestone to %4.0

  • Please register or sign in to reply
    Loading