arm32: mc: NEON implementation of warp8x8 for 16 bpc

Checkasm benchmarks:
Cortex                      A7      A8     A53     A72     A73
warp_8x8_16bpc_neon:    4062.6  2109.4  2462.0  1338.9  1391.1
warp_8x8t_16bpc_neon:   3996.3  2102.4  2412.0  1273.8  1368.9

Corresponding numbers for arm64, for comparison:
Cortex                     A53     A72     A73
warp_8x8_16bpc_neon:    2037.0  1148.8  1222.0
warp_8x8t_16bpc_neon:   2008.0  1120.4  1200.9