Allow toggling on "frame perfect" mode for the tone mapping code

The current tone mapping / peak detection algorithm delays its results by one frame for performance/simplicity reasons. This delay is often undesirable, e.g. when transcoding offline or when processing a still image. We should have an option to turn on a "frame perfect" mode at the cost of performance.

Possible ways to achieve this:

1. Duplicate the current shader up to the tone mapping step and run the copy in an extra pass that only measures the peak (more compute time, less texture bandwidth/memory required)
2. Finalize the current shader and render it out to a cache texture while measuring the peak, then dispatch a lightweight second pass that just samples from this cache (no redundant computations, more texture sampling/bandwidth/memory required)
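
To make the trade-off concrete, here is a minimal CPU-side sketch of both strategies. It is not libplacebo code: `pre_tonemap`, `tonemap` and the two `frame_perfect_*` functions are hypothetical stand-ins, a plain float array plays the role of the frame, and an extended Reinhard curve is assumed purely for illustration. Both variants use the peak of the same frame they tone map, which is the "frame perfect" property.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for all shader stages that run before tone mapping. */
static float pre_tonemap(float x) { return 2.0f * x; }

/* Extended Reinhard curve (assumed for illustration): maps `peak` to 1.0. */
static float tonemap(float x, float peak)
{
    return x * (1.0f + x / (peak * peak)) / (1.0f + x);
}

/* Avoid dividing by zero on an all-black frame. */
static float clamp_peak(float peak) { return peak > 1e-6f ? peak : 1e-6f; }

/* Strategy 1: run the earlier stages twice. No cache is needed, but
 * pre_tonemap() is evaluated twice per pixel. */
static void frame_perfect_recompute(const float *src, float *dst, size_t n)
{
    assert(src && dst && n > 0);
    float peak = 0.0f;
    for (size_t i = 0; i < n; i++) {          /* pass 1: measure only */
        float v = pre_tonemap(src[i]);
        if (v > peak)
            peak = v;
    }
    peak = clamp_peak(peak);
    for (size_t i = 0; i < n; i++)            /* pass 2: recompute + map */
        dst[i] = tonemap(pre_tonemap(src[i]), peak);
}

/* Strategy 2: store the intermediate result while measuring, then run a
 * lightweight pass over the cache. Needs `n` extra floats of storage. */
static void frame_perfect_cached(const float *src, float *cache,
                                 float *dst, size_t n)
{
    assert(src && cache && dst && n > 0);
    float peak = 0.0f;
    for (size_t i = 0; i < n; i++) {          /* pass 1: render + measure */
        cache[i] = pre_tonemap(src[i]);
        if (cache[i] > peak)
            peak = cache[i];
    }
    peak = clamp_peak(peak);
    for (size_t i = 0; i < n; i++)            /* pass 2: sample cache + map */
        dst[i] = tonemap(cache[i], peak);
}
```

In both variants there is a hard boundary between the two loops, because the second one cannot start until the peak is known. On the GPU that boundary corresponds to a separate dispatch.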

In either case, the intermediate step would require dispatching some shader - so the use of a pl_dispatch becomes unavoidable. As such, the tone mapping shader would have to be split up into two passes, each of which gets called separately by the parent (e.g. pl_renderer).

To determine which of the two strategies above is best, we could support both and apply some heuristics in pl_renderer to guess whether the shader would be cheaper to re-execute, based either on the settings or on some property of the pl_shader itself (e.g. how many texture samples have been recorded into it).

Or we could just agree on one of the two strategies and use that always.
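
If we went the heuristic route, the decision could be as simple as the sketch below. Everything in it is hypothetical: `struct shader_cost`, its fields, the `peak_strategy` enum and the threshold are made up for illustration and do not correspond to anything pl_shader exposes today.

```c
#include <assert.h>
#include <stdbool.h>

enum peak_strategy {
    PEAK_RECOMPUTE, /* re-run the shader in a measurement pass */
    PEAK_CACHE,     /* render to a cache texture, then sample it */
};

/* Hypothetical summary of what has been recorded into a shader so far. */
struct shader_cost {
    int  tex_samples;      /* number of texture samples recorded */
    bool expensive_scaler; /* e.g. a large polar/convolution scaler */
};

/* Guess which strategy is cheaper. The threshold is an arbitrary
 * placeholder that would have to be tuned by benchmarking. */
static enum peak_strategy pick_peak_strategy(const struct shader_cost *cost)
{
    assert(cost);
    const int max_cheap_samples = 1;

    /* Re-running a heavy shader costs more than one extra texture
     * round trip, so cache it. */
    if (cost->expensive_scaler || cost->tex_samples > max_cheap_samples)
        return PEAK_CACHE;

    /* A shader that does little more than one fetch is cheap to repeat. */
    return PEAK_RECOMPUTE;
}
```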