arm64: itx: Add NEON implementation of itx for 10 bpc

Add an element size specifier to the names of the existing individual
transform functions for 8 bpc, e.g. inv_dct_8h_x8_neon, to clarify
that they operate on input vectors of 8h elements, and make the symbols
public, so that the 10 bpc code can call them from a different object
file. The new itx16.S uses the same convention, e.g. inv_dct_4s_x8_neon.

Compile the existing itx.S regardless of whether 8 bpc support is
enabled. For builds with 8 bpc support disabled, this does include
the unused 8 bpc frontend functions, but that is hopefully tolerable,
as it avoids having to split the file into a shareable file for
transforms and a separate one for frontends.

This only implements the 10 bpc case, as that case can use transforms
operating on 16 bit coefficients in the second pass.
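
As a rough illustration of the handover between the two passes, the
sketch below narrows 32 bit first-pass output (4s lanes) to 16 bit
values (8h lanes) with saturation, roughly what a per-lane sqxtn does.
It assumes that the first-pass output is saturated to the int16_t range
before the second pass; the helper names are hypothetical and are not
taken from itx16.S.

```c
#include <stdint.h>

/* Hypothetical helper: saturate one 32 bit intermediate to the
 * int16_t range, similar in spirit to a per-lane sqxtn. */
static int16_t sat_narrow_s32_to_s16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical helper: narrow n first-pass coefficients (32 bit) into
 * the 16 bit buffer that a second pass working on 8h vectors would read. */
static void narrow_row(const int32_t *src, int16_t *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = sat_narrow_s32_to_s16(src[i]);
}
```

Values already within the 16 bit range pass through unchanged; anything
outside it is clamped rather than wrapped.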

Relative speedup vs C for a few functions:

                                     Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:     4.14   4.06   4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:     6.51   6.49   6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:     5.02   4.63   6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:     8.54   7.13  11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:   5.52   6.60   8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:  11.27   9.62  12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   9.60   6.97   8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:   2.60   3.48   3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:  14.65  12.64  16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:  11.57   8.80  12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:   8.79   8.00   9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:   7.58   6.21   7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:   2.41   2.85   2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:  12.91  10.27  12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:  10.96   7.97  10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:   8.95   7.42   9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   7.97   6.12   7.82