Deband with shared memory / compute shaders
This might be faster because it avoids much of the cache thrashing caused by sampling from random locations when debanding. Notes:
- need to reimplement bilinear filtering in shmem
- could use more aggressive settings by default, because most of the extra "work" is already spent up front getting pixels into shared memory (LDS)
- could load multiple planes independently without needing to pre-merge them for performance
We'd have to test and benchmark this first, possibly in mpv.