Task threading mutex contention
The ttd->lock inside dav1d_worker_task() is heavily contended on systems with a large number of logical cores, which significantly limits multithreaded scaling.
Things work reasonably well up to 16 or so logical cores but start going downhill after that, and performance drops as higher thread counts are used.
The issue is that there's a single mutex for everything related to task threading and it's being held by each thread for long periods of time.
As a result it's easy to end up in a situation where there are several available tasks that can be done in parallel and plenty of available threads that can do them, but most threads can't actually begin performing any work as they're all stuck trying to grab the same mutex.
I don't have a silver bullet solution to this, but holding the mutex for a shorter time (by doing as much work as possible without holding the mutex) and/or splitting The One Big Mutex that covers everything into several mutexes that each cover smaller sections would help a lot if it's possible.
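
To illustrate the first idea (shorter hold times), here is a minimal sketch of a worker loop that takes the mutex only long enough to claim a task index and then does the actual work with the mutex released. This is not dav1d's code: TaskQueue, do_work(), worker() and run_pool() are hypothetical names made up for this example, and real task scheduling in dav1d has dependencies between tasks that this ignores.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical task queue; none of these names come from dav1d. */
typedef struct {
    pthread_mutex_t lock; /* protects only `next` */
    int next;             /* index of the next unclaimed task */
    int n_tasks;
    const int *input;     /* read-only while workers run */
    long *output;         /* each task writes only its own slot */
} TaskQueue;

/* Stand-in for the expensive per-task work. */
static long do_work(int v) {
    return (long)v * v;
}

static void *worker(void *arg) {
    TaskQueue *q = arg;
    for (;;) {
        /* Critical section: claim one task index, nothing else. */
        pthread_mutex_lock(&q->lock);
        int idx = q->next < q->n_tasks ? q->next++ : -1;
        pthread_mutex_unlock(&q->lock);
        if (idx < 0)
            break;
        /* The work itself runs without the mutex held, so other
         * threads can claim tasks in the meantime. */
        q->output[idx] = do_work(q->input[idx]);
    }
    return NULL;
}

/* Runs n_tasks tasks on up to n_threads helper threads and returns the
 * sum of all results, or -1 on allocation failure. */
long run_pool(int n_threads, int n_tasks) {
    assert(n_threads > 0 && n_tasks > 0);
    int *input = malloc((size_t)n_tasks * sizeof(*input));
    long *output = malloc((size_t)n_tasks * sizeof(*output));
    pthread_t *threads = malloc((size_t)n_threads * sizeof(*threads));
    if (!input || !output || !threads) {
        free(input); free(output); free(threads);
        return -1;
    }
    for (int i = 0; i < n_tasks; i++)
        input[i] = i;

    TaskQueue q = { .next = 0, .n_tasks = n_tasks,
                    .input = input, .output = output };
    pthread_mutex_init(&q.lock, NULL);

    int started = 0;
    for (int i = 0; i < n_threads; i++) {
        if (pthread_create(&threads[started], NULL, worker, &q) != 0)
            break;
        started++;
    }
    /* The calling thread helps too, so every task gets done even if
     * no helper thread could be created. */
    worker(&q);
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    long sum = 0;
    for (int i = 0; i < n_tasks; i++)
        sum += output[i];

    pthread_mutex_destroy(&q.lock);
    free(input); free(output); free(threads);
    return sum;
}
```

Build with -pthread. The point of the sketch is only that the locked region contains a compare and an increment, so the time each thread spends holding the mutex is independent of how long a task takes. Whether the bookkeeping in dav1d_worker_task() can be reduced this far is a separate question.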