Commit e3dbf926 authored by Martin Storsjö

arm64: looprestoration: NEON implementation of SGR for 10 bpc

This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can
be int16_t for 10 bpc, but need to be int32_t for 12 bpc.
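
As a rough illustration of the range argument for the sum buffer, here is a
minimal sketch. It assumes the buffer holds plain sums of pixels over a 3x3 or
5x5 box; the helper names are hypothetical and not part of dav1d, and the tmp
buffer is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: largest possible sum of pixels over a
 * box_side x box_side window at the given bit depth. */
static int32_t max_box_sum(int bitdepth, int box_side) {
    assert(box_side == 3 || box_side == 5);
    const int32_t pixel_max = (1 << bitdepth) - 1;
    return pixel_max * box_side * box_side;
}

/* Hypothetical helper: true if that worst-case sum fits in int16_t. */
static bool box_sum_fits_int16(int bitdepth, int box_side) {
    return max_box_sum(bitdepth, box_side) <= INT16_MAX;
}
```

Under that assumption, the 10 bpc worst case for a 5x5 box is 25 * 1023 =
25575, which fits in int16_t, while the 12 bpc worst case is 25 * 4095 =
102375, which does not.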

Make actual templates out of the functions in looprestoration_tmpl.S,
and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter
(which is passed even in 8bpc mode), and add a define to bitdepth.h for
passing such a parameter in all modes. This makes the function a few
instructions slower in 8bpc mode than it was before (the overall impact
seems to be around 1% of the total runtime of SGR), but allows using the
same function instantiation for all modes, saving a bit of code size.
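
The general pattern can be sketched in C as follows. This is only an
illustration of sharing one routine across bit depths by always passing
bitdepth_max; the macro and function names are hypothetical, and the rounding
shift is an assumed stand-in for the real per-bitdepth scaling, not the actual
dav1d code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the define added to bitdepth.h: in 8bpc mode
 * the caller passes a constant instead of a runtime value. */
#define EXAMPLE_BITDEPTH_MAX_8BPC 0xff

/* Derive the bit depth from bitdepth_max (0xff -> 8, 0x3ff -> 10). */
static int example_bitdepth_from_max(int bitdepth_max) {
    int bits = 0;
    while (bitdepth_max > 0) {
        bits++;
        bitdepth_max >>= 1;
    }
    return bits;
}

/* One shared routine for all modes: scale a value down to the 8-bit
 * range with rounding. For 8bpc the shift is 0, so the value passes
 * through unchanged at the cost of a few extra instructions. */
static int32_t example_scale_to_8bit(int32_t value, int bitdepth_max) {
    const int shift = example_bitdepth_from_max(bitdepth_max) - 8;
    return (value + ((1 << shift) >> 1)) >> shift;
}
```

An 8bpc caller would pass EXAMPLE_BITDEPTH_MAX_8BPC and a 10bpc caller would
pass 0x3ff, so both share a single instantiation of the routine.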

Examples of checkasm runtimes:
                           Cortex A53        A72        A73
selfguided_3x3_10bpc_neon:   516755.8   389412.7   349058.7
selfguided_5x5_10bpc_neon:   380699.9   293486.6   254591.6
selfguided_mix_10bpc_neon:   878142.3   667495.9   587844.6

Corresponding 8 bpc numbers for comparison:
selfguided_3x3_8bpc_neon:    491058.1   361473.4   347705.9
selfguided_5x5_8bpc_neon:    352655.0   266423.7   248192.2
selfguided_mix_8bpc_neon:    826094.1   612372.2   581943.1
parent 7cf5d753