Skip to content

WIP: RFC: Avoid TLS to implement mutexes

This MR is definitely not meant to be merged as is, but rather to start a discussion on our mutex implementations.

Also, full disclosure, I don't claim to fully understand how TLS is implemented in various architectures, but most specifically ELF & glibc as is the case for the rest of this description.

A bit of context: while profiling the media library I realized that we spend a lot of time in vlc_mutex_held. As far as I can see, most of the time is spent obtaining the thread_self address, which is declared as static _Thread_local char thread_self[1];

From there, it seems that we spend most of our time in __tls_get_addr_slow, which (again, as far as I understand) is used to rebuilt the dynamic thread vector each time the generation changes.

What I don't understand is why the DTV changes that often, after the initial module cache is populated, we almost never invoke dlopen/dlclose, which I thought was responsible for a bump in the generation. However if that was the case, we'd see a performance stabilisation after some time, which we don't, as seen in this graph:

chart

The number of run in the X axis is simply the number of full analysis of my music samples folder with the medialibrary tool to analyse a specific folder from CLI. As you can guess, the "-patched" version is the one with the submitted patch applied.

Using pthread_self instead of a thread_local variable is definitely faster since all we care about is getting a thread identifier to compare with the mutex owner, we don't actually care about what's in there. pthread_self is, at least on x86/elf, implemented by loading the value in register fs, while a TLS access will cause a call to __tls_get_addr in dynamic builds. Not relying on TLS also mean we get not to care about a new shared library being loaded and causing a DTV generation bump.

Some profiling samples with/without this patch:

without with
before after

Not relying on TLS means that we go from ≃ 3.1% of IRs spent in vlc_mutex_held to ≃ 1.34%. This also drastically reduces the amount of occurences of __tls_get_addr_slow but I'm not sure if that's something we need to care about.

Now, this patch is wrong and kludgy, but I think there's a valuable improvement to be dug further upon. I'm not sure what the correct way would be though. I feel like it's a bad idea to rely on various platform specific thread functions in misc/threads.c as it's supposed to be the other way around. Relying on vlc_thread_id in a portable way is not an option AFAICS since some implementations might return -1 which would be less thread specific.

Opinion welcome, and if somebody knows more about this and feels like explaining, I'm definitely interested!

Merge request reports