arm64: mc: NEON implementation of warp8x8{,t}
Relative speedup vs C code:
                      Cortex A53    A72    A73
warp_8x8_8bpc_neon:         3.19   2.60   3.66
warp_8x8t_8bpc_neon:        3.09   2.50   3.58
@gramner I'm making the warp filter table order conditional in tables.c/mc_tmpl.c here, which effectively reverts a0692eb8 for architectures other than x86. The order that benefits x86 SIMD does not benefit other architectures.
For a NEON implementation of the warp filter, reordering the filter coefficients back into the original order took 1/4 of the filter runtime.
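To illustrate the idea, here is a minimal sketch of a conditional table order. It is not the actual patch: `SKETCH_ARCH_X86`, `W`, `TAP`, `sketch_warp_filter`, `sketch_filter_8tap` and `sketch_warp_demo` are hypothetical names, the even/odd interleaving is only an assumed example of a SIMD-friendly layout, and the coefficients are made up for the example. The point is that the table macro and the accessor in the C code switch together, so the scalar result is the same for either layout while each architecture reads the table in the order it prefers.

```c
#include <stdint.h>

/* Hypothetical switch standing in for the build system's architecture macro. */
#ifndef SKETCH_ARCH_X86
#define SKETCH_ARCH_X86 0
#endif

#if SKETCH_ARCH_X86
/* Assumed interleaved storage: even taps first, then odd taps. */
#define W(v0, v1, v2, v3, v4, v5, v6, v7) { v0, v2, v4, v6, v1, v3, v5, v7 }
/* Map logical tap i to its slot in the interleaved row. */
#define TAP(F, i) ((F)[((i) >> 1) + (((i) & 1) << 2)])
#else
/* Natural storage: taps in sequential order, no remapping needed. */
#define W(v0, v1, v2, v3, v4, v5, v6, v7) { v0, v1, v2, v3, v4, v5, v6, v7 }
#define TAP(F, i) ((F)[i])
#endif

/* Illustrative coefficients only (each row sums to 128), not the real table. */
static const int8_t sketch_warp_filter[2][8] = {
    W( 0, 0,   0, 127,  1,   0, 0,  0),
    W(-1, 3, -10,  80, 68, -15, 4, -1),
};

/* 8-tap filter centred on src[0]; reads src[-3] .. src[4]. */
static int sketch_filter_8tap(const uint8_t *src, const int8_t *F)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += TAP(F, i) * src[i - 3];
    return sum;
}

/* Apply one table row to a fixed ramp of pixels; result is layout-independent. */
int sketch_warp_demo(int row)
{
    static const uint8_t px[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    return sketch_filter_8tap(px + 3, sketch_warp_filter[row]);
}
```

Building this with `-DSKETCH_ARCH_X86=1` or `-DSKETCH_ARCH_X86=0` gives the same value from `sketch_warp_demo()`. Only the in-memory order of each row changes, which is what lets a SIMD implementation on each architecture load the coefficients without a shuffle step.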
Edited by Martin Storsjö