Implement smart threading defaults based on content and system
dav1d needs smart defaults for how and when to use threading.
Related to #101 (closed) and !238 (closed): we should look into threading and how to do it right. CPUs are extremely tricky, with different power/performance curves, caching, and even different architectures within one processor (big.LITTLE, DynamIQ).
If we look at it from a mobile (battery-limited) perspective, there are even more factors to consider. Even if you only have 0.7x scaling (1.4 times the performance with double the threads), running at 30% lower clocks can easily save more than 50% power per core, which makes doubling the number of active cores the better option. Furthermore, on most systems most cores are already running background tasks at very low frequencies, which means they're already using a little power, and scaling them up a little doesn't increase power usage a lot.
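To make the arithmetic above concrete, here is a rough sketch in Python. It assumes that per-core power grows with roughly the cube of the clock (dynamic power proportional to V²f, with voltage scaling about linearly with frequency). That exponent is my assumption, not a measurement. It ignores static leakage and the fact that background cores are already awake, and the function names are made up for illustration.

```python
def clock_for_equal_throughput(speedup: float) -> float:
    """Clock ratio at which the multi-threaded run matches one full-speed thread."""
    return 1.0 / speedup


def relative_power(clock_ratio: float, cores: int, exponent: float = 3.0) -> float:
    """Power relative to one core at full clock.

    ASSUMPTION: per-core power ~ clock_ratio ** exponent (cubic by default).
    Real SoCs have their own curves, so treat this as a toy model.
    """
    return cores * clock_ratio ** exponent


# 0.7x scaling: two threads give 1.4x the performance of one.
speedup_two_threads = 1.4

clock = clock_for_equal_throughput(speedup_two_threads)  # about 0.71, i.e. ~30% lower
per_core = relative_power(clock, cores=1)                # below 0.5 of a full-speed core
total = relative_power(clock, cores=2)                   # below 1.0 for both cores together

print(f"clock ratio: {clock:.3f}")
print(f"per-core power: {per_core:.3f}")
print(f"two-core power: {total:.3f}")
```

Under this toy model, two cores at about 71% clock do the same work as one core at full clock while drawing less total power. With a gentler exponent (for example 2.0) the two-core total comes out above 1.0, so the conclusion depends heavily on the real power curve of the SoC.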
Then there's also the issue of varying workloads: some frames require more decoding operations than others (keyframes and frames with high bitrates, among others). In applications we don't want to decode as many frames per second as possible, nor merely reach a certain number of frames on average. We want to consistently decode each frame within a limited time span (16.6 ms for 60 fps, somewhat more if we maintain a buffer), so we need more performance for certain frames/segments than for others. If we're already high on clock speeds we can't turbo up any further, but if we are going wide and slow we can.
Or ideally, we try to hit just the right spot on the power/performance curve and vary the number of cores used depending on how demanding the content is.
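As a very rough illustration of "vary the number of cores depending on how demanding the content is", something like the following could pick the smallest thread count that still meets the frame deadline. Everything here is hypothetical: `pick_threads` and `speedup` are not dav1d functions, the per-frame cost estimate would have to come from somewhere (frame type, size, bitrate), and the scaling model (1.4x per doubling of threads) is just the figure from above extrapolated.

```python
import math


def speedup(threads: int, per_doubling: float = 1.4) -> float:
    """ASSUMED scaling model: every doubling of threads gives per_doubling x throughput."""
    return threads ** math.log2(per_doubling)


def pick_threads(est_single_thread_ms: float, deadline_ms: float,
                 max_threads: int, per_doubling: float = 1.4) -> int:
    """Return the smallest thread count whose predicted decode time meets the deadline.

    If no count up to max_threads is predicted to meet it, return max_threads.
    """
    for n in range(1, max_threads + 1):
        predicted_ms = est_single_thread_ms / speedup(n, per_doubling)
        if predicted_ms <= deadline_ms:
            return n
    return max_threads


FRAME_BUDGET_MS = 16.6  # 60 fps

# Hypothetical per-frame cost estimates in single-thread milliseconds.
for cost_ms in (10.0, 20.0, 30.0, 100.0):
    print(cost_ms, "->", pick_threads(cost_ms, FRAME_BUDGET_MS, max_threads=8))
```

A cheap frame stays on one thread, a heavier one spreads out, and a frame that can't meet the deadline even at the maximum gets all threads. A real implementation would also have to account for clock changes and for which cores (big or small) the threads land on, which this sketch ignores.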
Furthermore, mobile SoCs are extremely interesting: big out-of-order cores with NEON instructions are more performant but less power efficient, while small in-order cores are less performant but far more power efficient (they still have NEON, but with narrower pipelines).
Also, how do other decoders handle this? These decisions could best be made by the CPU scheduler in real time and not by dav1d. Can we throw tasks/threads at the CPU in such a way that it understands the requirements (like: this needs to be done in 10 ms) and can decide how to meet them, either by adjusting clock speeds or by adjusting core count (turbo up or spin up another core)?