Deband with shared memory / compute shaders

This might be faster because it avoids much of the cache-line contention caused by sampling from random locations when debanding. Considerations:

- We'd need to reimplement bilinear filtering in shared memory.
- We could use more aggressive settings by default, because most of the extra "work" is already spent up front getting pixels into shared memory (LDS).
- We could load multiple planes independently, without needing to pre-merge them for performance.
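
On the first point: once texels sit in a shared-memory tile, the texture unit no longer does the four-tap blend for us, so it has to be done by hand. Below is a rough sketch of what that means, written as plain C on the CPU rather than as an actual compute shader. The flat `tile` array stands in for the shared-memory tile, and `tile_bilinear`, `clampi`, and the clamp-to-edge behaviour are assumptions made for illustration, not existing code.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an integer to the inclusive range [lo, hi]. */
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical helper: bilinearly sample a w x h tile stored row-major in
 * `tile` (a stand-in for a workgroup's shared-memory array).
 * (x, y) are in texel units with texel centers at integer + 0.5, the usual
 * GPU convention. Taps that fall outside the tile are clamped to the edge,
 * which is an assumption; a real kernel would load an apron instead. */
float tile_bilinear(const float *tile, int w, int h, float x, float y)
{
    assert(tile != NULL && w > 0 && h > 0);

    /* Shift so that texel centers land on integer coordinates. */
    float fx = x - 0.5f;
    float fy = y - 0.5f;

    /* Floor: the cast truncates toward zero, so fix up negative values. */
    int x0 = (int) fx;
    if ((float) x0 > fx)
        x0--;
    int y0 = (int) fy;
    if ((float) y0 > fy)
        y0--;

    /* Fractional position between the two neighbouring texels. */
    float wx = fx - (float) x0;
    float wy = fy - (float) y0;

    /* Clamp all four taps to the tile bounds. */
    int x1 = clampi(x0 + 1, 0, w - 1);
    int y1 = clampi(y0 + 1, 0, h - 1);
    x0 = clampi(x0, 0, w - 1);
    y0 = clampi(y0, 0, h - 1);

    /* Blend horizontally on both rows, then vertically. */
    float top = tile[y0 * w + x0] * (1.0f - wx) + tile[y0 * w + x1] * wx;
    float bot = tile[y1 * w + x0] * (1.0f - wx) + tile[y1 * w + x1] * wx;
    return top * (1.0f - wy) + bot * wy;
}
```

Sampling exactly at a texel center returns that texel unchanged, and sampling halfway between two centers returns their average. In a real shader the same arithmetic would run per invocation after a barrier, with the tile enlarged by the maximum debanding radius.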

We'd have to test and benchmark this first, possibly in mpv.
