arm64: filmgrain: Add NEON implementation of the generate_grain_uv functions
The existing functions/macros for generate_grain_y are templated to add in a final coefficient from the y buffer, while trying to keep the binary size down.
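For reference, here is a scalar C sketch of where that extra coefficient enters the chroma auto-regression. It is not the code in this patch: it assumes the filter follows the AV1 film grain process, where the last coefficient multiplies the average of the co-located luma grain samples instead of a neighbouring chroma sample. The helper names (`round2`, `colocated_luma`, `uv_ar_sum`) and the buffer layout are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Rounding right shift (hypothetical helper for this sketch). */
static int round2(int v, int shift) {
    return shift ? (v + (1 << (shift - 1))) >> shift : v;
}

/* Average the luma grain samples co-located with one chroma sample.
 * subx/suby are 1 when chroma is subsampled in that direction, else 0
 * (420: 1,1; 422: 1,0; 444: 0,0). `luma` points at the top-left
 * co-located luma sample. */
static int colocated_luma(const int8_t *luma, int stride, int subx, int suby) {
    int sum = 0;
    for (int i = 0; i <= suby; i++)
        for (int j = 0; j <= subx; j++)
            sum += luma[i * stride + j];
    return round2(sum, subx + suby);
}

/* Auto-regressive sum for one chroma sample at `cur`.
 * The first 2*lag*(lag+1) coefficients apply to previously generated
 * chroma neighbours; the final one applies to the luma average. */
static int uv_ar_sum(const int8_t *cur, int stride, const int8_t *coeff,
                     int lag, int luma) {
    int sum = 0;
    for (int dy = -lag; dy <= 0; dy++) {
        for (int dx = -lag; dx <= lag; dx++) {
            if (!dx && !dy)                 /* current position: luma term */
                return sum + luma * *coeff;
            sum += *coeff++ * cur[dy * stride + dx];
        }
    }
    return sum;
}
```

With `lag == 0` the loop reaches the current position immediately, so only the luma term contributes, which is why even the ar0 variants need the y buffer.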
Relative speedup over C code:
                                 Cortex A53    A72    A73  Apple M1
gen_grain_uv_ar0_8bpc_420_neon:        4.62   4.55   5.27      9.08
gen_grain_uv_ar0_8bpc_422_neon:        4.81   4.90   5.33      7.25
gen_grain_uv_ar0_8bpc_444_neon:        5.05   5.17   5.69      7.04
gen_grain_uv_ar1_8bpc_420_neon:        3.61   3.09   3.68      3.92
gen_grain_uv_ar1_8bpc_422_neon:        3.71   3.22   3.64      3.46
gen_grain_uv_ar1_8bpc_444_neon:        3.59   3.40   3.67      3.11
gen_grain_uv_ar2_8bpc_420_neon:        4.77   3.85   4.81      4.55
gen_grain_uv_ar2_8bpc_422_neon:        4.88   3.96   4.85      4.15
gen_grain_uv_ar2_8bpc_444_neon:        5.18   4.65   5.18      3.83
gen_grain_uv_ar3_8bpc_420_neon:        6.14   5.25   6.14      5.64
gen_grain_uv_ar3_8bpc_422_neon:        6.27   5.27   6.28      5.42
gen_grain_uv_ar3_8bpc_444_neon:        6.84   6.40   6.79      5.18
This goes on top of !1185 (merged).