arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc
Also fix an inefficiency in the existing 8 bpc fguv functions, which did 32 gathers even when only 16 were needed.
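As a rough illustration of the gather fix (a scalar sketch, not the NEON code itself): with horizontal chroma subsampling, a 32-pixel-wide luma block covers only 16 chroma pixels per row, so only 16 scaling lookups are needed. The names `chroma_row_width`, `gather_scaling_row` and `LUMA_BLOCK_W`, and the 256-entry table indexed by 8-bit values, are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: luma width covered by one fguv_32x32xn call. */
enum { LUMA_BLOCK_W = 32 };

/* Chroma pixels per row: 16 when horizontally subsampled (ss_x = 1,
 * i.e. 4:2:0 and 4:2:2), 32 for 4:4:4 (ss_x = 0). */
static int chroma_row_width(int ss_x)
{
    assert(ss_x == 0 || ss_x == 1);
    return LUMA_BLOCK_W >> ss_x;
}

/* Scalar stand-in for the gather: one scaling-table lookup per chroma
 * pixel. Returns the number of lookups performed, so the caller can see
 * that the subsampled case does half the work. */
static int gather_scaling_row(uint8_t *dst, const uint8_t scaling[256],
                              const uint8_t *idx, int ss_x)
{
    const int w = chroma_row_width(ss_x);
    for (int x = 0; x < w; x++)
        dst[x] = scaling[idx[x]];
    return w;
}
```

Calling `gather_scaling_row` with `ss_x = 1` touches only the first 16 entries of `dst`; the old behaviour described above corresponds to always looping to 32.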
Relative speedup over C code:
                                   Cortex A53    A72    A73   Apple M1
fguv_32x32xn_16bpc_420_csfl0_neon:       4.57   2.08   3.57       7.61
fguv_32x32xn_16bpc_420_csfl1_neon:       4.92   2.89   3.96       4.26
fguv_32x32xn_16bpc_422_csfl0_neon:       4.59   2.14   3.61       5.88
fguv_32x32xn_16bpc_422_csfl1_neon:       4.92   2.90   3.90       5.00
fguv_32x32xn_16bpc_444_csfl0_neon:       3.64   1.89   2.86       4.72
fguv_32x32xn_16bpc_444_csfl1_neon:       3.59   2.26   2.76       3.22
This takes Chimera 10 bpc with filmgrain from 204 fps to 276 fps on an Apple M1 (where decoding without filmgrain runs at 325 fps).