Commit b129d9f2 authored by Martin Storsjö

mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}

For the bilin cases, this seems to make things marginally faster
(measured on x86_64; 7-25% faster with compiler autovectorization).
For 8tap, it doesn't make much of a difference at all.

Before:                                      GCC   Clang
mc_scaled_8tap_regular_w128_8bpc_c:     115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:  17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:          40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:      18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:     116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:  18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:          37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:      18423.8   18031.6

(Benchmarked with seed 0; the total runtime of the scaled
benchmarks is significantly affected by the random seed.)
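
The quoted 7-25% range can be checked against the bilinear C rows of
the table. The helper below is only an illustration (the name
`speedup` is made up here, not part of dav1d or checkasm); it treats
the benchmark numbers as runtimes, so lower is better.

```c
#include <assert.h>

/* Hypothetical helper: relative speedup from two runtimes.
 * A result of 0.25 means "25% faster". */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after - 1.0;
}

/* Values copied from the mc_scaled_bilinear_w128_8bpc_c rows above. */
static double bilin_speedup_gcc(void)   { return speedup(40290.0, 37381.4); }
static double bilin_speedup_clang(void) { return speedup(51812.9, 41145.0); }
```

Calling these gives about 8% for GCC and about 26% for Clang, which
is in line with the range quoted above.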

This reduces the stack usage of these functions from around 65 KB
each, to less than 1 KB for bilin, and around 2 KB for 8tap.
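
One way to get a reduction of this size is to stop filtering every
source row into one large intermediate buffer, and instead keep only
the few horizontally filtered rows that the vertical filter needs at
any time. The sketch below is not the dav1d code: the function names,
the 1/1024 position units and the rounding are assumptions made for
illustration. The row counts are also an inference; 2 rows of 128
int16_t values is 512 bytes and 8 rows is 2 KB, which happens to
match the figures above, but the commit message does not state this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_W    128        /* widest block, as in the w128 benchmarks */
#define MAX_ROWS (256 + 1)  /* assumed worst case: 128 rows at 2x, plus 1 */

/* Horizontal pass for one source row; positions in 1/1024 pixel units. */
static void filter_row_h(int16_t *out, const uint8_t *src,
                         int w, int mx, int dx)
{
    for (int x = 0; x < w; x++) {
        const int ix = mx >> 10;          /* integer source column */
        const int fx = (mx & 0x3ff) >> 6; /* 4-bit fraction, 0..15 */
        out[x] = (int16_t)(16 * src[ix] + fx * (src[ix + 1] - src[ix]));
        mx += dx;
    }
}

/* Vertical pass: blend two filtered rows into one output row. */
static void filter_row_v(uint8_t *dst, const int16_t *a, const int16_t *b,
                         int w, int fy)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((16 * a[x] + fy * (b[x] - a[x]) + 128) >> 8);
}

/* Large-buffer variant: about 64 KB of stack for the intermediate. */
static void put_scaled_bilin_full(uint8_t *dst, ptrdiff_t dst_stride,
                                  const uint8_t *src, ptrdiff_t src_stride,
                                  int w, int h, int mx, int my,
                                  int dx, int dy)
{
    int16_t mid[MAX_ROWS * MAX_W];
    const int rows_needed = (((h - 1) * dy + my) >> 10) + 2;
    assert(w <= MAX_W && rows_needed <= MAX_ROWS);

    for (int r = 0; r < rows_needed; r++)
        filter_row_h(&mid[r * MAX_W], src + r * src_stride, w, mx, dx);
    for (int y = 0; y < h; y++) {
        const int iy = my >> 10, fy = (my & 0x3ff) >> 6;
        filter_row_v(dst, &mid[iy * MAX_W], &mid[(iy + 1) * MAX_W], w, fy);
        dst += dst_stride;
        my += dy;
    }
}

/* Rolling variant: only two filtered rows (512 bytes) live at once. */
static void put_scaled_bilin_rolling(uint8_t *dst, ptrdiff_t dst_stride,
                                     const uint8_t *src, ptrdiff_t src_stride,
                                     int w, int h, int mx, int my,
                                     int dx, int dy)
{
    int16_t rows[2][MAX_W];
    int cached[2] = { -1, -1 }; /* source row held in each slot */
    assert(w <= MAX_W);

    for (int y = 0; y < h; y++) {
        const int iy = my >> 10, fy = (my & 0x3ff) >> 6;
        /* Filter rows iy and iy + 1 only if they are not cached yet. */
        for (int k = 0; k < 2; k++) {
            const int r = iy + k;
            if (cached[r & 1] != r) {
                filter_row_h(rows[r & 1], src + r * src_stride, w, mx, dx);
                cached[r & 1] = r;
            }
        }
        filter_row_v(dst, rows[iy & 1], rows[(iy + 1) & 1], w, fy);
        dst += dst_stride;
        my += dy;
    }
}
```

Both variants produce the same output for the same arguments. The
rolling one redoes the horizontal pass only when the vertical
position moves on to a source row that is not cached, so rows are
reused when upscaling and skipped rows are never filtered when
downscaling.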

With this in place, the required stack space for dav1d should
be mostly identical across configurations; on x86_64 (both with
and without assembly), it can run with 62 KB of stack, and
on arm and aarch64, it can run with 58 KB of stack.
parent cd5bfa12
Merge request: !1779 mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}