-
Martin Storsjö authored
This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc. Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S. Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8bpc mode), add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size. Examples of checkasm runtimes: Cortex A53 A72 A73 selfguided_3x3_10bpc_neon: 516755.8 389412.7 349058.7 selfguided_5x5_10bpc_neon: 380699.9 293486.6 254591.6 selfguided_mix_10bpc_neon: 878142.3 667495.9 587844.6 Corresponding 8 bpc numbers for comparison: selfguided_3x3_8bpc_neon: 491058.1 361473.4 347705.9 selfguided_5x5_8bpc_neon: 352655.0 266423.7 248192.2 selfguided_mix_8bpc_neon: 826094.1 612372.2 581943.1
e3dbf926
Loading