arm64: msac: Implement NEON msac_decode_symbol_adapt

                                Cortex A53     A72     A73
msac_decode_symbol_adapt4_c:         107.5    57.1    67.8
msac_decode_symbol_adapt4_neon:       70.1    53.4    55.0
msac_decode_symbol_adapt8_c:         157.3    74.5    90.2
msac_decode_symbol_adapt8_neon:       75.3    57.9    56.2
msac_decode_symbol_adapt16_c:        257.4   106.3   136.0
msac_decode_symbol_adapt16_neon:     101.2    61.2    65.8

Total decoding speedup of Chimera is around 0.8%.
@janne Do you have an opinion on the use of macros here? I'm avoiding duplicating the main ~60-line block of SIMD code by templating it into three versions. Templating between widths 4 and 8 is trivial (just switching between the .4h and .8h register specifiers), but templating between one and two registers (for width 8 vs 16) takes a lot of small macros, one per instruction type. The macro definitions end up taking more lines of code than duplicating the block once more would...
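
To make the trade-off concrete, here is a rough C preprocessor analogue of the templating described above. It is only a sketch and not the assembly in this patch: `vec8`, `LOAD_N`, `ADD_N`, `STORE_N` and `add_arraysN` are hypothetical names made up for illustration, and a plain struct stands in for a 128-bit NEON register. The width 4 and 8 cases differ only in how many lanes are touched, while the width 16 case needs every per-instruction wrapper to also operate on a second register.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit register with 8 x 16-bit lanes. */
typedef struct { uint16_t lane[8]; } vec8;

static inline void load_lanes(vec8 *v, const uint16_t *src, int lanes) {
    for (int i = 0; i < lanes; i++) v->lane[i] = src[i];
}

static inline void store_lanes(uint16_t *dst, const vec8 *v, int lanes) {
    for (int i = 0; i < lanes; i++) dst[i] = v->lane[i];
}

static inline vec8 vec8_add(vec8 a, vec8 b) {
    vec8 r;
    for (int i = 0; i < 8; i++) r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* One small wrapper per "instruction type". Each always handles the first
 * register and handles the second register only when the width N is 16. */
#define LOAD_N(v, src, N) do {                                  \
        load_lanes(&(v)[0], (src), (N) < 8 ? (N) : 8);          \
        if ((N) == 16) load_lanes(&(v)[1], (src) + 8, 8);       \
    } while (0)

#define ADD_N(d, a, b, N) do {                                  \
        (d)[0] = vec8_add((a)[0], (b)[0]);                      \
        if ((N) == 16) (d)[1] = vec8_add((a)[1], (b)[1]);       \
    } while (0)

#define STORE_N(dst, v, N) do {                                 \
        store_lanes((dst), &(v)[0], (N) < 8 ? (N) : 8);         \
        if ((N) == 16) store_lanes((dst) + 8, &(v)[1], 8);      \
    } while (0)

/* The shared body is written once and instantiated for each width. */
#define DEFINE_ADD_ARRAYS(N)                                            \
void add_arrays##N(uint16_t *dst, const uint16_t *a, const uint16_t *b) \
{                                                                       \
    vec8 va[2] = {{{0}}}, vb[2] = {{{0}}}, vd[2] = {{{0}}};             \
    LOAD_N(va, a, N);                                                   \
    LOAD_N(vb, b, N);                                                   \
    ADD_N(vd, va, vb, N);                                               \
    STORE_N(dst, vd, N);                                                \
}

DEFINE_ADD_ARRAYS(4)
DEFINE_ADD_ARRAYS(8)
DEFINE_ADD_ARRAYS(16)
```

Even in this toy version, the three wrapper macros take about as many lines as the body they template, which is the same imbalance as in the question above.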