AArch64: Optimize put_neon function

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_put_opt into master

Optimize the put_neon function. Details are in the commit messages.

Relative performance of microbenchmarks with all commits applied (lower is better):

              w2      w4      w8      w16     w32     w64     w128
Cortex-A55    0.991x  0.992x  0.999x  0.875x  0.775x  0.914x  0.998x
Cortex-A510   0.159x  0.080x  0.583x  0.588x  0.966x  1.111x  0.957x
Cortex-A76    0.903x  0.683x  0.944x  0.948x  0.919x  0.855x  0.991x
Cortex-A78    -       -       -       -       0.867x  0.820x  1.011x
Cortex-A715   -       -       -       -       0.834x  0.778x  1.000x
Cortex-X1     -       -       -       -       0.809x  0.762x  1.000x
Cortex-X3     -       -       -       -       0.733x  0.720x  0.999x

