arm64: mc: Optimize mc_8tap_regular_w4_hv_8bpc for A53

Before:                            Cortex A53   Snapdragon 835
mc_8tap_regular_w4_hv_8bpc_neon:        543.6            359.1
After:
mc_8tap_regular_w4_hv_8bpc_neon:        466.7            355.5

The same kind of change doesn't seem to give any benefit for the 8
pixel wide hv filtering though, possibly because that code path uses
not only smull/smlal but also smull2/smlal2.