arm64: Add NEON implementation of fgy_32x32xn (!1180)

Relative speedup over C code:

                   Cortex A53    A72    A73   Apple M1
fgy_32x32xn_8bpc_neon:   4.48   2.84   3.73       5.64

The code assumes it is OK to write past the right edge, up to the next 32-pixel alignment boundary, and to always write at least 2 rows (for the vertical overlap case).
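
To illustrate what that assumption asks of the destination buffer, here is a minimal sketch. The helper names `padded_width` and `padded_rows` are hypothetical and not part of dav1d; they only show the rounding implied above.

```c
#include <stddef.h>

/* Hypothetical helper (not a dav1d function): round a row width in
 * pixels up to the next multiple of 32, i.e. the furthest point a
 * block write may reach under the assumption above. */
static size_t padded_width(size_t w) {
    return (w + 31) & ~(size_t)31;
}

/* Hypothetical helper: the buffer must have room for at least 2 rows,
 * since the vertical overlap case always writes 2 rows. */
static size_t padded_rows(size_t h) {
    return h < 2 ? 2 : h;
}
```

For example, a 33 pixel wide row would need room for 64 pixels, and a 1 row strip would need room for 2 rows.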

The code uses a C frontend function for the high-level logic, and calls one assembly function per 32x32 pixel block.
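
The split between frontend and per-block routine can be sketched roughly as follows. This is only an illustration: `process_block_stub` is a plain C stand-in for the assembly function, and its signature, the `frontend_row` name and the per-pixel operation are all assumptions, not the actual code in this MR.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32

/* Counts calls, only so the sketch can be checked. */
static int blocks_processed = 0;

/* Stand-in for the per-block assembly routine (hypothetical signature).
 * It just adds 1 to each pixel, saturating at 255. */
static void process_block_stub(uint8_t *dst, ptrdiff_t stride,
                               int bw, int bh) {
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            if (dst[y * stride + x] < 255)
                dst[y * stride + x]++;
    blocks_processed++;
}

/* Hypothetical frontend: walks one strip of rows in steps of 32 pixels
 * and calls the block routine once per step. */
static void frontend_row(uint8_t *dst, ptrdiff_t stride,
                         int width, int rows) {
    for (int bx = 0; bx < width; bx += BLOCK_SIZE) {
        int bw = width - bx < BLOCK_SIZE ? width - bx : BLOCK_SIZE;
        process_block_stub(dst + bx, stride, bw, rows);
    }
}
```

Unlike this stub, which clips the last block to the image width, the assembly function is allowed to write the full 32 pixels, as noted above.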

Only fgy_32x32xn is implemented for now, to get early feedback before proceeding with the other functions. CC @janne.
