Martin Storsjö authored
This reduces the stack usage of these functions (the C version)
significantly, and gives them a 15-40% speedup (on an Apple M3,
with Xcode Clang 16).

The C version of these functions does matter: even though we have
assembly implementations of them on x86 and aarch64, those only
cover the 8 and 10 bpc cases, while the C version is used as the
fallback for 12 bpc.

This matches how these functions are implemented in the aarch64
assembly: they operate over a window of 3 or 5 lines (of 384
pixels each), instead of processing a full 384 x 64 block.
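
As a rough illustration of the windowing idea only (this is not
the dav1d code, and not the SGR filter itself), the sketch below
computes 3x3 box sums while keeping just three line buffers alive
and rotating their pointers. The names hsum_line and
box3_windowed, the edge clamping and the int32_t buffers are all
assumptions made for this example; only the 384 pixel line width
comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_W 384 /* line width, from the commit message */
#define WINDOW 3   /* 3-line window; a 5-line one works the same way */

/* Hypothetical helper: horizontal 3-tap sum of one source row,
 * with the left/right edges clamped. */
static void hsum_line(int32_t *dst, const uint16_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Hypothetical example: 3x3 box sums of a w x h image (rows packed
 * with stride w), using only WINDOW line buffers on the stack
 * instead of a buffer covering the whole block. */
static void box3_windowed(int32_t *out, const uint16_t *src,
                          int w, int h) {
    int32_t lines[WINDOW][LINE_W];
    int32_t *win[WINDOW];

    assert(w >= 1 && w <= LINE_W && h >= 1);
    for (int i = 0; i < WINDOW; i++)
        win[i] = lines[i];

    /* Prime the window: row -1 is clamped to row 0. */
    hsum_line(win[0], src, w);
    hsum_line(win[1], src, w);

    for (int y = 0; y < h; y++) {
        /* Filter the one new line entering the window
         * (clamped at the bottom edge). */
        const int next = y + 1 < h ? y + 1 : h - 1;
        hsum_line(win[2], src + next * w, w);

        /* Combine the lines currently in the window. */
        for (int x = 0; x < w; x++)
            out[y * w + x] = win[0][x] + win[1][x] + win[2][x];

        /* Rotate the pointers; the oldest line gets overwritten
         * on the next iteration. */
        int32_t *const oldest = win[0];
        win[0] = win[1];
        win[1] = win[2];
        win[2] = oldest;
    }
}
```

In this sketch the intermediate storage is 3 x 384 values,
regardless of the block height; a full-block variant would need
one line of intermediate storage per row of the block.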

The individual functions for filtering a line each end up
much simpler, and closer to how this can be implemented in
assembly - but the overall business logic ends up much
more complex.

The main difference from the aarch64 assembly implementation
is that any buffer which is of int16_t size in the aarch64
assembly implementation uses the type "coef" here, which
is 32 bit in the 10/12 bpc cases. (This is required for handling
the 12 bpc cases.)
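
A minimal sketch of such a bit depth dependent type, and of one
way to see why 16 bits can be too narrow at 12 bpc. The BITDEPTH
macro as used here and the max_box_sum helper are assumptions for
illustration, and so is the premise that the buffers in question
hold plain box sums of pixels.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed compile-time switch for this sketch: 8 selects the 8 bpc
 * build, anything else the 10/12 bpc build. */
#ifndef BITDEPTH
#define BITDEPTH 16
#endif

#if BITDEPTH == 8
typedef int16_t coef; /* 16 bit in the 8 bpc case */
#else
typedef int32_t coef; /* 32 bit in the 10/12 bpc cases */
#endif

/* Hypothetical helper: largest possible sum of an n x n box of
 * pixels at the given bit depth. */
static int32_t max_box_sum(int bpc, int n) {
    assert(bpc >= 8 && bpc <= 12 && n >= 1 && n <= 5);
    const int32_t pixel_max = (1 << bpc) - 1;
    return n * n * pixel_max;
}
```

Under that assumption, a 5x5 box at 10 bpc sums to at most
25 * 1023 = 25575, which fits in an int16_t (max 32767), while at
12 bpc it can reach 25 * 4095 = 102375, which does not.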

With this in place, dav1d can run with around 66 KB of stack
on x86_64 with assembly enabled, with around 74 KB of stack on
aarch64 with assembly enabled, and with 118 KB of stack with
assembly disabled.

This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16).

On 32 bit arm, dav1d still requires around 270 KB of stack, as
that assembly implementation of the SGR filter uses a different
algorithm.
f32b3146