arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc (!1217) · Merge requests · VideoLAN / dav1d

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-fguv-16bpc into master Jun 07, 2021

Also fix an inefficiency in the existing 8bpc fguv functions, which did 32 gathers even if only 16 were needed.
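As a rough illustration of the gather count (a minimal C sketch, not the NEON code from this merge request): a 32x32 block is 32 luma pixels wide, so with horizontal chroma subsampling each chroma row has only 16 pixels and needs only 16 scaling-table lookups. The names `chroma_row_width` and `gather_scaling_row`, the `ss_x` parameter, and the 256-entry 8 bpc table are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: chroma pixels per row for a block that is
 * 32 luma pixels wide. ss_x is assumed to be 1 when chroma is
 * horizontally subsampled (4:2:0, 4:2:2) and 0 for 4:4:4. */
static int chroma_row_width(int ss_x)
{
    return 32 >> ss_x;
}

/* Hypothetical scalar stand-in for the gather step: look up one
 * scaling-table entry per chroma pixel in the row, and return how
 * many lookups were done. idx[] holds the 8 bpc values used to
 * index the table (assumed to have 256 entries). */
static int gather_scaling_row(const uint8_t *scaling, const uint8_t *idx,
                              int ss_x, uint8_t *out)
{
    const int w = chroma_row_width(ss_x);
    for (int x = 0; x < w; x++)
        out[x] = scaling[idx[x]];
    return w;
}
```

With `ss_x = 1` the loop runs 16 times rather than 32, which is the saving described above for the subsampled layouts.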

Relative speedup over C code:

                                     Cortex A53    A72    A73   Apple M1
fguv_32x32xn_16bpc_420_csfl0_neon:         4.57   2.08   3.57       7.61
fguv_32x32xn_16bpc_420_csfl1_neon:         4.92   2.89   3.96       4.26
fguv_32x32xn_16bpc_422_csfl0_neon:         4.59   2.14   3.61       5.88
fguv_32x32xn_16bpc_422_csfl1_neon:         4.92   2.90   3.90       5.00
fguv_32x32xn_16bpc_444_csfl0_neon:         3.64   1.89   2.86       4.72
fguv_32x32xn_16bpc_444_csfl1_neon:         3.59   2.26   2.76       3.22

This takes Chimera 10 bpc with filmgrain from 204 fps to 276 fps on an Apple M1 (where decoding without filmgrain runs at 325 fps).

