
arm64: filmgrain: Add NEON implementation of the generate_grain_uv functions

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-gen-grain-uv into master

The existing functions/macros for generate_grain_y are templated for adding in a final coefficient from the y buffer, while trying to keep the binary size down.
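For readers unfamiliar with the "final coefficient", here is a minimal scalar C sketch of the idea, not the NEON code from this merge request and not dav1d's actual C implementation. It assumes the chroma autoregressive filter sums the same causal neighbourhood as the luma filter and then adds one extra term: the last coefficient multiplied by the co-located luma grain (averaged over 1, 2 or 4 samples depending on subsampling). The names `GRAIN_W`, `round2`, `luma_avg` and `ar_sum_uv`, and the tiny buffer width, are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GRAIN_W 8 /* hypothetical tiny buffer width, for illustration only */

/* Rounding right shift by `shift` bits (no-op when shift is 0). */
static int round2(int x, int shift) {
    return shift ? (x + (1 << (shift - 1))) >> shift : x;
}

/* Average of the luma grain samples co-located with chroma sample (x, y).
 * subx/suby are 1 when chroma is subsampled in that direction, else 0;
 * pad is the border width around the grain block. */
static int luma_avg(const int8_t buf_y[][GRAIN_W], int x, int y,
                    int subx, int suby, int pad) {
    const int lx = ((x - pad) << subx) + pad;
    const int ly = ((y - pad) << suby) + pad;
    int sum = 0;
    for (int i = 0; i <= suby; i++)
        for (int j = 0; j <= subx; j++)
            sum += buf_y[ly + i][lx + j];
    return round2(sum, subx + suby);
}

/* Unscaled AR sum for one chroma grain sample: the causal chroma neighbours
 * first, then the final coefficient applied to the averaged luma grain. */
static int ar_sum_uv(const int8_t buf[][GRAIN_W],
                     const int8_t buf_y[][GRAIN_W],
                     const int8_t *coeff, int lag, int pad,
                     int x, int y, int subx, int suby) {
    assert(lag >= 0 && x >= lag && y >= lag);
    int sum = 0;
    for (int dy = -lag; dy <= 0; dy++) {
        for (int dx = -lag; dx <= lag; dx++) {
            if (!dx && !dy) /* current position: the extra luma term */
                return sum + *coeff * luma_avg(buf_y, x, y, subx, suby, pad);
            sum += *coeff++ * buf[y + dy][x + dx];
        }
    }
    return sum; /* not reached: the loop always ends at dx == dy == 0 */
}
```

In this sketch the only difference from a luma-style filter is the single branch at the current position, which is the kind of small variation that can be covered by a template parameter instead of a separate copy of the whole function.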

Relative speedup over C code:

                            Cortex A53    A72    A73   Apple M1
gen_grain_uv_ar0_8bpc_420_neon:   4.62   4.55   5.27   9.08
gen_grain_uv_ar0_8bpc_422_neon:   4.81   4.90   5.33   7.25
gen_grain_uv_ar0_8bpc_444_neon:   5.05   5.17   5.69   7.04
gen_grain_uv_ar1_8bpc_420_neon:   3.61   3.09   3.68   3.92
gen_grain_uv_ar1_8bpc_422_neon:   3.71   3.22   3.64   3.46
gen_grain_uv_ar1_8bpc_444_neon:   3.59   3.40   3.67   3.11
gen_grain_uv_ar2_8bpc_420_neon:   4.77   3.85   4.81   4.55
gen_grain_uv_ar2_8bpc_422_neon:   4.88   3.96   4.85   4.15
gen_grain_uv_ar2_8bpc_444_neon:   5.18   4.65   5.18   3.83
gen_grain_uv_ar3_8bpc_420_neon:   6.14   5.25   6.14   5.64
gen_grain_uv_ar3_8bpc_422_neon:   6.27   5.27   6.28   5.42
gen_grain_uv_ar3_8bpc_444_neon:   6.84   6.40   6.79   5.18

This goes on top of !1185 (merged).
