arm: itx: Add NEON implementation of itx for 8 bpc

The transforms process vectors of up to 8 elements at a time for
transforms up to size 8; larger transforms use vectors of 4 elements.

Overall, the speedup over C code seems to be around 8-14x for the
larger transforms, and 10-19x for the smaller ones.

Relative speedup over C code (built with GCC 7.5) for a few functions:

                                    Cortex A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:    3.83   3.42   2.57   3.36   2.97   7.47
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:    7.25  13.53   8.38   8.82   7.96  12.37
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:    4.78   6.61   4.82   4.65   5.27   9.76
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:   10.20  19.07  13.07  14.69  11.45  15.50
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:  4.26   5.06   3.00   3.74   4.05   4.49
inv_txfm_add_16x16_dct_dct_1_8bpc_neon: 10.51  16.02  13.57  14.03  12.86  18.16
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:  7.95  11.75   9.09  10.64  10.06  14.07
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:  5.31   5.58   3.14   4.18   4.80   4.57
inv_txfm_add_32x32_dct_dct_1_8bpc_neon: 12.66  16.07  14.34  16.00  15.24  21.32
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:  8.25  10.69   8.90  10.59  10.41  14.39
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:  4.69   5.97   3.17   3.96   4.57   4.34
inv_txfm_add_64x64_dct_dct_1_8bpc_neon: 11.47  12.68  10.18  14.73  14.20  17.95
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:  8.84  10.13   7.94  11.25  10.58  13.88