arm64: itx: Add NEON implementation of itx for 10 bpc
This branch contains a number of minor fixups for the existing 8 bpc itx as well.
Add an element size specifier to the names of the existing individual transform functions for 8 bpc, e.g. inv_dct_8h_x8_neon, to clarify that they operate on .8h input vectors, and make the symbols public so that the 10 bpc code can call them from a different object file. The new itx16.S follows the same convention, e.g. inv_dct_4s_x8_neon.
Compile the existing itx.S regardless of whether 8 bpc support is enabled. In builds with 8 bpc support disabled, this does include the unused frontend functions, but that is hopefully tolerable, as it avoids having to split the file into a shareable file for the transforms and a separate one for the frontends.
This only implements the 10 bpc case, as that case can use transforms operating on 16 bit coefficients in the second pass.
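As a rough illustration of the reasoning above, the following C sketch computes the signed bit width of the intermediate values entering each pass. The helper names are hypothetical and not part of dav1d, and the clamp ranges (bitdepth + 8 bits before the first pass, max(bitdepth + 6, 16) bits before the second pass) are an assumption based on my reading of the AV1 specification's 2D inverse transform process.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers for illustration only; not part of dav1d.
 * The clamp ranges are assumed from the AV1 specification's
 * 2D inverse transform process. */

/* Signed bit width of the values entering the first (row) pass. */
static int row_clamp_bits(int bitdepth) {
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    return bitdepth + 8;
}

/* Signed bit width of the values entering the second (column) pass. */
static int col_clamp_bits(int bitdepth) {
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    int bits = bitdepth + 6;
    return bits < 16 ? 16 : bits;
}

/* True if the second pass input fits in 16 bit lanes (.8h vectors). */
static bool second_pass_fits_16bit(int bitdepth) {
    return col_clamp_bits(bitdepth) <= 16;
}
```

Under these assumptions, 10 bpc needs 32 bit lanes in the first pass but fits in 16 bit lanes in the second pass, while 12 bpc would need more than 16 bits in both passes.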
Relative speedup vs C for a few functions:
                                     Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:     4.14   4.06   4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:     6.51   6.49   6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:     5.02   4.63   6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:     8.54   7.13  11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:   5.52   6.60   8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:  11.27   9.62  12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   9.60   6.97   8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:   2.60   3.48   3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:  14.65  12.64  16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:  11.57   8.80  12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:   8.79   8.00   9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:   7.58   6.21   7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:   2.41   2.85   2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:  12.91  10.27  12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:  10.96   7.97  10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:   8.95   7.42   9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   7.97   6.12   7.82