arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc

Relative speedup over C code:
                                   Cortex A7     A8     A9    A53    A72    A73
fguv_32x32xn_16bpc_420_csfl0_neon:      3.47   1.72   2.99   4.18   2.68   6.19
fguv_32x32xn_16bpc_420_csfl1_neon:      3.24   1.36   2.58   3.78   2.73   5.27
fguv_32x32xn_16bpc_422_csfl0_neon:      3.57   2.07   3.05   4.32   2.74   6.20
fguv_32x32xn_16bpc_422_csfl1_neon:      3.33   1.44   2.62   3.89   2.71   5.28
fguv_32x32xn_16bpc_444_csfl0_neon:      3.48   1.69   3.06   4.48   2.97   6.69
fguv_32x32xn_16bpc_444_csfl1_neon:      3.06   1.16   2.36   3.85   2.75   5.19
fgy_32x32xn_16bpc_neon:                 2.89   1.05   2.29   3.49   2.49   3.15

Absolute numbers:
                                   Cortex A7       A8       A9      A53      A72      A73
fguv_32x32xn_16bpc_420_csfl0_neon:    6237.3  12701.0   6687.1   4525.8   3220.8   3195.4
fguv_32x32xn_16bpc_420_csfl1_neon:    5143.2  11684.8   5926.4   3857.2   2604.7   2556.5
fguv_32x32xn_16bpc_422_csfl0_neon:    6347.3  11005.2   6797.5   4582.4   3300.4   3250.5
fguv_32x32xn_16bpc_422_csfl1_neon:    5275.2  11594.8   5992.6   3931.1   2668.7   2607.3
fguv_32x32xn_16bpc_444_csfl0_neon:    5181.6  11310.0   5575.4   3629.7   2383.8   2530.0
fguv_32x32xn_16bpc_444_csfl1_neon:    4081.9  10958.8   4868.5   2962.9   1870.3   2034.2
fgy_32x32xn_16bpc_neon:              15439.1  43129.0  19406.6  11542.3   7463.9   7827.8

Corresponding numbers for arm64:
                                   Cortex A53      A72      A73
fguv_32x32xn_16bpc_420_csfl0_neon:     4019.2   3247.4   3259.6
fguv_32x32xn_16bpc_420_csfl1_neon:     3460.1   2628.7   2640.8
fguv_32x32xn_16bpc_422_csfl0_neon:     4034.4   3329.9   3287.5
fguv_32x32xn_16bpc_422_csfl1_neon:     3468.3   2749.3   2686.6
fguv_32x32xn_16bpc_444_csfl0_neon:     3117.7   2447.4   2539.8
fguv_32x32xn_16bpc_444_csfl1_neon:     2641.2   1977.2   2132.8
fgy_32x32xn_16bpc_neon:                9873.5   7605.7   7656.2
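
To compare the arm32 and arm64 absolute numbers on the cores measured for both (A53, A72, A73), the per-function ratio can be computed with a small script along these lines. This is only a sketch and not part of the patch: the dictionary keys and the helper name are made up for illustration, the values are copied from the absolute-number tables in this message, and it assumes the absolute numbers are runtimes where lower is better.

```python
# Illustrative only: names below are hypothetical, values are copied from
# the absolute-number tables in this commit message.
CORES = ("A53", "A72", "A73")

# arm32 absolute numbers, restricted to the cores also measured on arm64.
ARM32 = {
    "fguv_420_csfl0": (4525.8, 3220.8, 3195.4),
    "fguv_420_csfl1": (3857.2, 2604.7, 2556.5),
    "fguv_422_csfl0": (4582.4, 3300.4, 3250.5),
    "fguv_422_csfl1": (3931.1, 2668.7, 2607.3),
    "fguv_444_csfl0": (3629.7, 2383.8, 2530.0),
    "fguv_444_csfl1": (2962.9, 1870.3, 2034.2),
    "fgy": (11542.3, 7463.9, 7827.8),
}

# arm64 absolute numbers for the same functions and cores.
ARM64 = {
    "fguv_420_csfl0": (4019.2, 3247.4, 3259.6),
    "fguv_420_csfl1": (3460.1, 2628.7, 2640.8),
    "fguv_422_csfl0": (4034.4, 3329.9, 3287.5),
    "fguv_422_csfl1": (3468.3, 2749.3, 2686.6),
    "fguv_444_csfl0": (3117.7, 2447.4, 2539.8),
    "fguv_444_csfl1": (2641.2, 1977.2, 2132.8),
    "fgy": (9873.5, 7605.7, 7656.2),
}


def arm32_over_arm64(name):
    """Return {core: arm32 / arm64}.

    Assuming lower absolute numbers are better, a ratio above 1.0 means
    the arm32 version is slower than the arm64 one on that core.
    """
    return {
        core: a32 / a64
        for core, a32, a64 in zip(CORES, ARM32[name], ARM64[name])
    }


if __name__ == "__main__":
    for name in ARM32:
        ratios = arm32_over_arm64(name)
        cells = "  ".join(f"{core}: {r:.2f}" for core, r in ratios.items())
        print(f"{name:<16}{cells}")
```

Running the script prints one line per function with the three ratios, which makes it easier to see where the two implementations differ most.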