looprestoration: Rewrite the C version of the wiener filter
This significantly reduces the stack usage of the C versions of these functions.

These C versions aren't used on architectures that already have wiener filters implemented in assembly, but they matter both when running with assembly disabled (e.g. for sanitizer builds) and as an example of how to do a cache efficient SIMD implementation.

This roughly matches how these functions are implemented in the aarch64 assembly (although that assembly function uses a mainloop function written in assembly, and custom calling conventions between the functions).

With this in place, dav1d can run with around 76 KB of stack with assembly disabled.

This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16), unless built with (the default) -Dtrim_dsp=true. (By default, the C version of the wiener filter gets skipped entirely.)

On 32 bit arm, the assembly wiener function implementation still uses large buffers on the stack, but because other functions use less stack there, dav1d can still run with 72 KB of stack.

Unfortunately, this change also makes the functions slower; how much depends on how well the compiler was able to optimize the previous version. On GCC (which didn't manage to vectorize the functions very well before), it becomes 1.6x-2.0x slower, while on Clang (where it was very well vectorized before) it becomes 2.5x-5x slower. Most of this performance can be gained back with later changes on top, though.
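For illustration only: one common way to cut the stack usage of a separable 7-tap filter is to keep a small rolling window of horizontally filtered rows instead of an intermediate buffer for the whole block. The sketch below assumes that approach; it is not the code from this commit. The names `wiener_sketch`, `filter_h_row`, `filter_v_row` and `MAX_W`, the edge replication, and the rounding are all hypothetical simplifications.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAPS 7
#define MAX_W 64 /* hypothetical cap on the row width for this sketch */

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Horizontal 7-tap pass over one source row; edge pixels are replicated. */
static void filter_h_row(int32_t *dst, const uint8_t *src, int w,
                         const int16_t fh[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fh[k] * src[clampi(x + k - 3, 0, w - 1)];
        dst[x] = sum;
    }
}

/* Vertical 7-tap pass over the seven rows currently in the window. */
static void filter_v_row(uint8_t *dst, int32_t *const *rows, int w,
                         const int16_t fv[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fv[k] * rows[k][x];
        /* Both filters are assumed to sum to 128, hence the shift by 14. */
        const int32_t v = sum + (1 << 13);
        dst[x] = (uint8_t)(v < 0 ? 0 : clampi(v >> 14, 0, 255));
    }
}

/* dst and src must not overlap; rows above and below are replicated. */
void wiener_sketch(uint8_t *dst, const uint8_t *src, int w, int h,
                   ptrdiff_t stride, const int16_t fh[TAPS],
                   const int16_t fv[TAPS]) {
    int32_t ring[TAPS][MAX_W]; /* seven rows, not a whole block */
    int32_t *rows[TAPS];
    if (w <= 0 || w > MAX_W || h <= 0) return;

    /* Prime the window with rows -3..+3 around output row 0 (the
     * replicated top rows are simply recomputed in this sketch). */
    for (int k = 0; k < TAPS; k++) {
        rows[k] = ring[k];
        filter_h_row(rows[k], src + clampi(k - 3, 0, h - 1) * stride, w, fh);
    }
    for (int y = 0; y < h; y++) {
        filter_v_row(dst + y * stride, rows, w, fv);
        /* Rotate the window: the oldest row buffer receives row y + 4. */
        int32_t *const oldest = rows[0];
        memmove(rows, rows + 1, (TAPS - 1) * sizeof(rows[0]));
        rows[TAPS - 1] = oldest;
        filter_h_row(oldest, src + clampi(y + 4, 0, h - 1) * stride, w, fh);
    }
}
```

In this sketch the intermediate storage is seven rows of `MAX_W` values, so its size no longer depends on the block height, and each horizontally filtered row is reused while it is still likely to be in cache.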