looprestoration: Rewrite the C version of the SGR filter
This significantly reduces the stack usage of the C versions of these functions, and speeds them up by 15-40% (measured on an Apple M3, with Xcode Clang 16).
The C version of these functions does matter; even though we have assembly implementations on x86 and aarch64, those only cover the 8 and 10 bpc cases, while the C version is used as the fallback for 12 bpc.
This matches how these functions are implemented in the aarch64 assembly: they operate over a window of 3 or 5 lines (of 384 pixels each), instead of processing a full 384 x 64 block at once.
The individual functions for filtering a line each end up much simpler, and closer to how this can be implemented in assembly - but the overall business logic that drives them ends up considerably more complex.
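To illustrate the windowed structure, here is a minimal sketch, not the code from this MR: a plain 3x3 box sum that keeps only three row buffers live and rotates the pointers per output line, instead of buffering the whole block. The names (`sketch_hsum3`, `sketch_box3`, `SKETCH_W`) and the edge replication are assumptions made for the example; the real filter also tracks sums of squares and derives filter coefficients from them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_W 8 /* hypothetical row width; the real lines are 384 pixels */

/* Horizontal 3-tap sum of one row, replicating the edge pixels. */
static void sketch_hsum3(int32_t *dst, const uint16_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* 3x3 box sum over a w x h image using a rolling window of 3 row sums. */
static void sketch_box3(int32_t *dst, const uint16_t *src, int w, int h) {
    assert(w >= 1 && w <= SKETCH_W && h >= 1);
    int32_t rows[3][SKETCH_W];
    int32_t *win[3] = { rows[0], rows[1], rows[2] };

    /* Prime the window: the row above the image replicates row 0. */
    sketch_hsum3(win[0], src, w);
    memcpy(win[1], win[0], sizeof(int32_t) * (size_t)w);

    for (int y = 0; y < h; y++) {
        /* Fetch the row below, replicated at the bottom edge. */
        const int yb = y + 1 < h ? y + 1 : h - 1;
        sketch_hsum3(win[2], src + yb * w, w);

        for (int x = 0; x < w; x++)
            dst[y * w + x] = win[0][x] + win[1][x] + win[2][x];

        /* Rotate: the oldest row buffer gets reused for the next line. */
        int32_t *tmp = win[0];
        win[0] = win[1];
        win[1] = win[2];
        win[2] = tmp;
    }
}
```

The per-line function stays trivial, while the driver has to handle priming, edge rows and pointer rotation; that is the trade-off described above, in miniature.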
The main difference from the aarch64 assembly implementation is that any buffer which is of int16_t size in the assembly uses the type "coef" here, which is 32 bit in the 10/12 bpc cases. (This is required for handling the 12 bpc cases.)
This also adds a _neon suffix to the names of the preexisting functions in the aarch64 implementation, where the names would otherwise clash.
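As a rough illustration of why 16 bits can be too narrow at 12 bpc, the sketch below computes the largest possible sum of 25 pixels (a 5x5 box) at each bit depth. It assumes the buffer in question holds such pixel sums; that is an assumption for the example, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest possible sum of n pixels at a bit depth. */
static int32_t sketch_max_box_sum(int bitdepth, int n) {
    assert(bitdepth >= 1 && bitdepth <= 16 && n >= 1 && n <= 32);
    return ((1 << bitdepth) - 1) * n;
}

/* Returns 1 if v is representable as int16_t, 0 otherwise. */
static int sketch_fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

Under that assumption, 10 bpc gives at most 1023 * 25 = 25575, which fits in int16_t, while 12 bpc gives 4095 * 25 = 102375, which does not.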
With this in place, dav1d can run with around 66 KB of stack on x86_64 with assembly enabled, with around 74 KB of stack on aarch64 with assembly enabled, and with 118 KB of stack with assembly disabled.
On 32 bit arm, dav1d still requires around 270 KB of stack, as that assembly implementation of the SGR filter uses a different algorithm.
CC @ltrudeau
This fixes the biggest parts of #442.