arm: mc: Port the ARM64 warp filter to arm32

Relative speedup over C code:
                     Cortex A7     A8     A9    A53    A72    A73
warp_8x8_8bpc_neon:       2.79   5.45   4.18   3.96   4.16   4.51
warp_8x8t_8bpc_neon:      2.79   5.33   4.18   3.98   4.22   4.25

Comparison to original ARM64 assembly:

ARM64:              Cortex A53     A72     A73
warp_8x8_8bpc_neon:     1854.6  1072.5  1102.5
warp_8x8t_8bpc_neon:    1839.6  1069.4  1089.5
ARM32:
warp_8x8_8bpc_neon:     2132.5  1160.3  1218.0
warp_8x8t_8bpc_neon:    2113.7  1148.0  1209.1
Edited by Jean-Baptiste Kempf