arm64: Add NEON implementation of fgy_32x32xn
Relative speedup over C code:
                         Cortex A53    A72    A73   Apple M1
fgy_32x32xn_8bpc_neon:         4.48   2.84   3.73       5.64
The code assumes it is OK to write past the right edge of the picture, up to the next 32-pixel alignment boundary, and to write at least 2 rows (for the vertical overlap case).
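As a rough illustration of the alignment assumption above, a caller would need each row to have room for the width rounded up to a multiple of 32 pixels. The helper name `padded_width` is hypothetical and not part of this patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patch): round a row width in pixels
 * up to the next multiple of 32, i.e. the furthest column the NEON code
 * is assumed to be allowed to write to. */
static size_t padded_width(size_t w) {
    return (w + 31) & ~(size_t)31;
}
```

For example, a 40 pixel wide row would need 64 writable pixels under this assumption.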
The code uses a C frontend function for the high-level logic, which calls one assembly function per 32x32 pixel block.
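A minimal sketch of that split, assuming a simplified interface: `apply_blocks`, `block_fn` and `copy_block_c` are hypothetical names, and the plain C copy loop only stands in for the assembly function. The real functions take more parameters (grain LUT, scaling, offsets), which are omitted here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32

/* Hypothetical per-block kernel type; in the patch this role is played
 * by the NEON assembly function. */
typedef void (*block_fn)(uint8_t *dst, const uint8_t *src,
                         ptrdiff_t stride, int bh);

/* Stand-in kernel: always processes a full 32-pixel-wide block, so it
 * may write past the visible right edge, matching the stated assumption. */
static void copy_block_c(uint8_t *dst, const uint8_t *src,
                         ptrdiff_t stride, int bh) {
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < BLOCK_SIZE; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* C frontend: walks one row of blocks and calls the kernel once per
 * 32-pixel-wide block. Returns the number of kernel calls made. */
static int apply_blocks(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                        int w, int bh, block_fn fn) {
    assert(w > 0 && bh > 0 && bh <= BLOCK_SIZE);
    int calls = 0;
    for (int bx = 0; bx < w; bx += BLOCK_SIZE) {
        fn(dst + bx, src + bx, stride, bh);
        calls++;
    }
    return calls;
}
```

With a width of 40 pixels, the frontend makes two kernel calls and the second one writes columns 32 to 63, so the buffers must be padded accordingly.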
Only fgy_32x32xn is implemented for now, to get early feedback before proceeding with the other functions. CC @janne.