AArch64: Add Neon implementation of load_tmvs

This patch adds a vectorised variant of the mv_projection calculation
and a faster initialisation of motion vectors for load_tmvs_neon.
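
For context, the scalar calculation being vectorised can be sketched as
below. This is only an illustration, assuming the fixed-point formulation
from the AV1 specification (a 14-bit reciprocal of den, signed rounding,
then a clip to 14 bits). The names mv_t, round2_signed, project_component
and project_mv are hypothetical and are not dav1d's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a motion vector (illustrative only). */
typedef struct { int16_t y, x; } mv_t;

static int clip_int(int64_t v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : (int)v;
}

/* Round v / 2^n to the nearest integer, ties away from zero. */
static int64_t round2_signed(int64_t v, int n) {
    const int64_t half = (int64_t)1 << (n - 1);
    return v < 0 ? -((-v + half) >> n) : (v + half) >> n;
}

/* Scale one component by num/den using a 14-bit fixed-point reciprocal.
 * Assumption: (1 << 14) / den matches the spec's division lookup table. */
static int project_component(int c, int num, int den) {
    assert(den > 0 && den < 32);
    assert(num > -32 && num < 32);
    const int frac = num * ((1 << 14) / den);
    /* 64-bit product so large inputs cannot overflow in this sketch. */
    const int64_t v = (int64_t)c * frac;
    return clip_int(round2_signed(v, 14), -0x3fff, 0x3fff);
}

/* Project both components of a motion vector. */
static mv_t project_mv(mv_t mv, int num, int den) {
    mv_t out;
    out.y = (int16_t)project_component(mv.y, num, den);
    out.x = (int16_t)project_component(mv.x, num, den);
    return out;
}
```

Each component is computed independently with the same multiply, round
and clip steps, which is what makes this kind of calculation a natural
fit for SIMD lanes.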

checkasm uplifts after this patch on some Neoverse and Cortex CPU cores,
compared to the C reference compiled with GCC-13 and Clang-19:

                  GCC     Clang
AWS Graviton 4:   1.62x   1.59x
Cortex-X4:        1.45x   1.46x
Cortex-X3:        1.68x   1.69x
Cortex-X1:        1.55x   1.52x
Cortex-A720:      1.54x   1.57x
Cortex-A715:      1.47x   1.55x
Cortex-A78:       1.21x   1.18x
Cortex-A76:       1.38x   1.37x
Cortex-A72:       1.08x   1.11x
Cortex-A520:      0.97x   1.18x
Cortex-A510:      0.99x   1.14x
Cortex-A55:       1.16x   1.23x

This patch increases the .text size by ~660 bytes, but the Neon
implementation is still about 0.5 KiB smaller than the reference
implementation.