arm64: mc: Optimize mc_8tap_regular_w4_hv_8bpc for A53

Before:                       Cortex A53   Snapdragon 835
mc_8tap_regular_w4_hv_8bpc_neon:   543.6   359.1
After:
mc_8tap_regular_w4_hv_8bpc_neon:   466.7   355.5

The same kind of change doesn't seem to give any benefits on the 8
pixel wide hv filtering though, potentially related to the fact that
it uses not only smull/smlal but also smull2/smlal2.
