arm64: filmgrain: Add NEON implementation of the generate_grain_y function
Relative speedup over C code:
                             Cortex A53   A72   A73   Apple M1
gen_grain_y_ar0_8bpc_neon:         5.03  5.18  5.46       5.22
gen_grain_y_ar1_8bpc_neon:         3.40  3.16  3.48       2.78
gen_grain_y_ar2_8bpc_neon:         4.99  4.11  5.02       3.99
gen_grain_y_ar3_8bpc_neon:         6.73  5.69  6.66       5.34
The inner loop that does the AR filtering of each output entry (which depends on the previous output) uses scalar instructions rather than NEON, as it consists of fairly long chains of instructions that depend on each other, with no (for ar1) or little (ar2/3) opportunity for parallelism. The AR filtering from the rows above is done with NEON though, fully utilizing the SIMD instructions used.
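To illustrate the split described above, here is a minimal C sketch of a lag-1 AR filter over one row. It is not the code in this MR (which is written in assembly); the function names, the `MAX_W` limit, the coefficient layout (three taps for the row above, one for the left neighbour) and the 8 bpc clamping range are assumptions made for illustration. The first pass has no dependency between outputs and is the kind of work that maps onto SIMD; the second pass carries the dependency on the previous output and has to run serially.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_W = 128 }; /* hypothetical upper bound on the row width */

/* Clamp to the signed 8-bit grain range (assumed for 8 bpc). */
static int clamp_grain(int v) { return v < -128 ? -128 : v > 127 ? 127 : v; }

/* Rounding right shift; assumes an arithmetic shift for negative values. */
static int round2(int x, int shift) { return (x + (1 << (shift - 1))) >> shift; }

/* Straightforward reference: everything in one serial loop. */
static void ar1_row_ref(int8_t *cur, const int8_t *above, int w,
                        const int8_t coeff[4], int shift)
{
    for (int x = 1; x < w - 1; x++) {
        int sum = coeff[0] * above[x - 1] + coeff[1] * above[x] +
                  coeff[2] * above[x + 1] + coeff[3] * cur[x - 1];
        cur[x] = (int8_t)clamp_grain(cur[x] + round2(sum, shift));
    }
}

/* Same result, split into a parallel part and a serial part. */
static void ar1_row_split(int8_t *cur, const int8_t *above, int w,
                          const int8_t coeff[4], int shift)
{
    int top[MAX_W];
    assert(w <= MAX_W);

    /* Pass 1: contribution of the row above. Each entry is independent
     * of the others, so this loop is the vectorizable part. */
    for (int x = 1; x < w - 1; x++)
        top[x] = coeff[0] * above[x - 1] + coeff[1] * above[x] +
                 coeff[2] * above[x + 1];

    /* Pass 2: contribution of the left neighbour. Each output depends on
     * the previous (already filtered) output, so this chain is serial. */
    for (int x = 1; x < w - 1; x++) {
        int sum = top[x] + coeff[3] * cur[x - 1];
        cur[x] = (int8_t)clamp_grain(cur[x] + round2(sum, shift));
    }
}
```

Both functions produce identical output; the split version only makes explicit which part of the work is free of the serial dependency.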
The code is tuned to keep the binary size low (by preferring outlined functions over inlining with macros in many cases), sacrificing some performance to keep the binary size increase moderate. (This MR adds around 4.3 KB to the text section.)
For ar1, the previous output row is kept in registers, but for ar2 and ar3, the previous rows are read back from memory. For ar2, it would have been possible to keep the last two rows in registers, but that requires using more macros instead of functions; that variant (which I tried during development) cost around 1.1 KB of code size.
This branch is made on top of MR !1182 (merged), so this MR only covers the last 2 commits in the branch; the rest belong to the previous MR.