src/x86/looprestoration.asm · fe2bb774243bc734f39d94d2519e23c1eabb7b35 · VideoLAN / dav1d

Henrik Gramner authored Feb 10, 2021

The previous implementation did multiple passes in the horizontal
and vertical directions, with the intermediate values being stored
in buffers on the stack. This caused bad cache thrashing.

By interleaving all the different passes and using a ring buffer
that stores only a few rows at a time, performance is improved
significantly.
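
As a rough illustration of the interleaving idea, here is a minimal C
sketch, not the actual assembly from this commit. It assumes a plain 3x3
box sum with edge replication instead of the real filter, and the names
box3x3_ring, hsum3_row, RING_ROWS and MAX_W are hypothetical. Each input
row gets its horizontal pass written into a three-row ring buffer, and
the vertical pass for an output row runs as soon as its three rows are
present, so no full-size intermediate buffer is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes for this sketch only. */
enum { RING_ROWS = 3, MAX_W = 64 };

/* Horizontal pass for one row: 3-tap sum with edge replication. */
static void hsum3_row(uint16_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = (uint16_t)(src[l] + src[x] + src[r]);
    }
}

/* 3x3 box sum of a w*h image (stride == w), rows replicated at the
 * top and bottom edges. Only RING_ROWS rows of horizontal sums are
 * kept alive at any time. */
void box3x3_ring(uint16_t *dst, const uint8_t *src, int w, int h)
{
    assert(w >= 1 && w <= MAX_W && h >= 1);
    uint16_t ring[RING_ROWS][MAX_W];

    for (int i = -1; i <= h; i++) {
        /* Clamp the input row index for edge replication. */
        const int sy = i < 0 ? 0 : (i > h - 1 ? h - 1 : i);

        /* Horizontal pass into the next ring slot, overwriting the
         * oldest row. */
        hsum3_row(ring[(i + 1) % RING_ROWS], src + sy * w, w);

        /* Once rows i-2, i-1 and i are in the ring, run the vertical
         * pass for output row i-1. The sum is order-independent, so
         * the slots can be read without tracking their rotation. */
        if (i >= 1) {
            uint16_t *out = dst + (i - 1) * w;
            for (int x = 0; x < w; x++)
                out[x] = (uint16_t)(ring[0][x] + ring[1][x] + ring[2][x]);
        }
    }
}
```

The working set here is three rows of horizontal sums regardless of the
image height, which is the property that keeps the intermediate data in
cache.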

Also slightly speed up neighbor calculations by packing the a and b
values into a single 32-bit unsigned integer, which allows calculations
on both values simultaneously.
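
The packing trick can be sketched in C as follows. This is an
illustration under assumed parameters, not the layout used by the
assembly: the 16-bit field width, the 12-bit value range, the 3/4/3
weights and the names pack_ab, unpack_a, unpack_b and weighted3 are all
hypothetical. The point is that as long as each field has enough
headroom for the weighted sum, no carry crosses from the low field into
the high one, so a single 32-bit multiply-add handles a and b together.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a in the high 16 bits, b in the low 16 bits.
 * Inputs are limited to 12 bits, leaving 4 bits of headroom per field. */
enum { FIELD_BITS = 16, VALUE_BITS = 12 };

uint32_t pack_ab(uint32_t a, uint32_t b)
{
    assert(a < (1u << VALUE_BITS) && b < (1u << VALUE_BITS));
    return (a << FIELD_BITS) | b;
}

uint32_t unpack_a(uint32_t p) { return p >> FIELD_BITS; }
uint32_t unpack_b(uint32_t p) { return p & ((1u << FIELD_BITS) - 1); }

/* Weighted sum of three packed neighbors. The weights add up to 10,
 * and 10 * 4095 = 40950 < 65536, so neither field can overflow into
 * its neighbor or out of the 32-bit word. */
uint32_t weighted3(uint32_t l, uint32_t c, uint32_t r)
{
    return 3 * l + 4 * c + 3 * r;
}
```

For example, combining the packed pairs (1, 2), (10, 20) and (100, 200)
with weighted3 gives a = 343 and b = 686 after unpacking, the same
results as two separate weighted sums.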
