arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc

This goes on top of !1217 (merged).

Relative speedup over C code:
                                  Cortex A7    A8    A9   A53   A72   A73
fguv_32x32xn_8bpc_420_csfl0_neon:      4.20  2.19  3.48  4.93  3.60  5.93
fguv_32x32xn_8bpc_420_csfl1_neon:      3.92  1.52  2.84  4.34  3.82  5.93
fguv_32x32xn_8bpc_422_csfl0_neon:      4.27  2.13  3.58  5.02  4.04  5.95
fguv_32x32xn_8bpc_422_csfl1_neon:      3.99  1.56  2.91  4.43  3.89  6.00
fguv_32x32xn_8bpc_444_csfl0_neon:      4.48  2.08  3.89  5.66  4.07  6.51
fguv_32x32xn_8bpc_444_csfl1_neon:      4.45  1.41  2.99  5.28  3.63  6.09
fgy_32x32xn_8bpc_neon:                 3.61  1.10  2.62  4.35  3.06  3.74

Absolute numbers:
                                  Cortex A7       A8       A9      A53      A72      A73
fguv_32x32xn_8bpc_420_csfl0_neon:    5318.8  11167.7   6024.6   3909.9   2945.2   2993.5
fguv_32x32xn_8bpc_420_csfl1_neon:    4351.0  10929.7   5269.5   3316.8   2166.5   2256.9
fguv_32x32xn_8bpc_422_csfl0_neon:    5387.9  11746.7   6080.0   3945.8   2988.1   3046.3
fguv_32x32xn_8bpc_422_csfl1_neon:    4396.0  11083.2   5300.8   3354.9   2216.4   2291.4
fguv_32x32xn_8bpc_444_csfl0_neon:    4347.9  10595.0   5134.4   3079.1   2277.7   2392.9
fguv_32x32xn_8bpc_444_csfl1_neon:    3295.0  10518.2   4442.6   2476.3   1716.3   1829.2
fgy_32x32xn_8bpc_neon:              12376.2  41046.9  17259.7   9153.1   6610.4   7005.3

Corresponding numbers for arm64:
                                 Cortex A53      A72      A73
fguv_32x32xn_8bpc_420_csfl0_neon:    3822.9   2920.0   2935.7
fguv_32x32xn_8bpc_420_csfl1_neon:    3209.7   2231.7   2335.4
fguv_32x32xn_8bpc_422_csfl0_neon:    3807.9   2886.5   2966.7
fguv_32x32xn_8bpc_422_csfl1_neon:    3197.1   2187.9   2355.9
fguv_32x32xn_8bpc_444_csfl0_neon:    2757.8   2227.4   2334.4
fguv_32x32xn_8bpc_444_csfl1_neon:    2244.6   1719.1   1786.7
fgy_32x32xn_8bpc_neon:               8192.2   6563.3   6969.1