arm64: filmgrain: Add a NEON implementation of fgy_32x32xn for 16 bpc

Relative speedup over C code:
                         Cortex A53   A72   A73  Apple M1
fgy_32x32xn_16bpc_neon:        3.87  2.28  2.78      3.45