WIP: RFC: Avoid TLS to implement mutexes
This MR is definitely not meant to be merged as is, but rather to start a discussion on our mutex implementations.
Also, full disclosure, I don't claim to fully understand how TLS is implemented in various architectures, but most specifically ELF & glibc as is the case for the rest of this description.
A bit of context: while profiling the media library I realized that we spend a lot of time in vlc_mutex_held
. As far as I can see, most of the time is spent obtaining the thread_self
address, which is declared as static _Thread_local char thread_self[1];
From there, it seems that we spend most of our time in __tls_get_addr_slow
, which (again, as far as I understand) is used to rebuilt the dynamic thread vector each time the generation changes.
What I don't understand is why the DTV changes that often, after the initial module cache is populated, we almost never invoke dlopen/dlclose, which I thought was responsible for a bump in the generation. However if that was the case, we'd see a performance stabilisation after some time, which we don't, as seen in this graph:
The number of run in the X axis is simply the number of full analysis of my music samples folder with the medialibrary tool to analyse a specific folder from CLI. As you can guess, the "-patched" version is the one with the submitted patch applied.
Using pthread_self
instead of a thread_local
variable is definitely faster since all we care about is getting a thread identifier to compare with the mutex owner, we don't actually care about what's in there. pthread_self
is, at least on x86/elf, implemented by loading the value in register fs
, while a TLS access will cause a call to __tls_get_addr
in dynamic builds. Not relying on TLS also mean we get not to care about a new shared library being loaded and causing a DTV generation bump.
Some profiling samples with/without this patch:
without | with |
---|---|
![]() |
![]() |
Not relying on TLS means that we go from ≃ 3.1% of IRs spent in vlc_mutex_held
to ≃ 1.34%. This also drastically reduces the amount of occurences of __tls_get_addr_slow
but I'm not sure if that's something we need to care about.
Now, this patch is wrong and kludgy, but I think there's a valuable improvement to be dug further upon. I'm not sure what the correct way would be though. I feel like it's a bad idea to rely on various platform specific thread functions in misc/threads.c as it's supposed to be the other way around. Relying on vlc_thread_id
in a portable way is not an option AFAICS since some implementations might return -1
which would be less thread specific.
Opinion welcome, and if somebody knows more about this and feels like explaining, I'm definitely interested!
Merge request reports
Activity
added Component::Core label
120 120 vlc_mutex_init_common(mtx, true); 121 121 } 122 122 123 #ifdef HAVE_PTHREAD_H 124 # define THREAD_SELF ((const void*)(uintptr_t)pthread_self()) I tried to make it abundantly clear that I did not think this was a simple issue, nor that the current code is incorrect/incomplete/wrong. This is definitely not the message here.
To be fair, I'd be happy with the current code if
__tls_get_addr_slow
wasn't invoked that often, which I don't understand whyAFAICT, that is either a bug in glibc or the intended behaviour in the given circumstances. But I only that function getting called from the Qt5 library code, not VLC (in production builds).
Edited by Rémi Denis-CourmontAlmost all calls are indeed via
vlc_mutex_assert()
which is anassert()
. The only production build call remaining is after anunlikely()
test invlc_mutex_trylock()
.Now if the profiling shows this is actually called within
vlc_mutex_trylock()
, it means the unlikely is not so unlikely.BTW, both
vlc_mutex_held()
andvlc_mutex_assert()
are only called from the core. Maybevlc_mutex_held()
doesn't need to be public.Edited by Steve LhommeThere is pthread_equal to compare
pthread_t
values in a portable way.
- Resolved by Rémi Denis-Courmont
The owner field is only read if the mutex is recursive or run-time debug is enabled. We could skip writing it in the normal case.
But the flip side is that we cannot use
vlc_thread_id()
unless we require a functional implementation from all platforms or we eliminatevlc_mutex_init_recursive()
.The earlier seems vain anyway, as the Linux implementation of
vlc_thread_id()
uses the TLS internally. The alternative of making a system call every single time would be even slower.Edited by Rémi Denis-Courmont
mentioned in merge request !767 (merged)
mentioned in merge request !768 (merged)
If there really is a performance problem with
vlc_thread_id()
on GNU/Linux, it should probably be raised to glibc at least for comments at least to clarify.If
pthread_self()
is really faster than_Thread_local
, then a likely work-around consists of usingpthread_key_t
instead in the Linux implementation ofvlc_thread_id()
changed milestone to %4.0
added MRStatus::NotCompliant label