arm64: mc: NEON implementation of emu_edge for 8bpc
Relative speedups over C code:
                          Cortex A53    A72    A73
emu_edge_w4_8bpc_neon:          3.82   2.93   2.41
emu_edge_w8_8bpc_neon:          3.28   2.86   2.51
emu_edge_w16_8bpc_neon:         3.58   3.27   2.63
emu_edge_w32_8bpc_neon:         3.04   1.68   2.12
emu_edge_w64_8bpc_neon:         2.58   1.45   1.48
emu_edge_w128_8bpc_neon:        1.79   1.02   1.57
The benchmark numbers for the larger sizes on the A72 fluctuate considerably and thus seem unreliable.