arm32: itx: Add a NEON implementation of itx for 10 bpc
Relative speedup vs C for a few functions:
Cortex A7 A8 A9 A53 A72 A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon: 2.79 5.08 2.99 2.83 3.49 4.44
inv_txfm_add_4x4_dct_dct_1_10bpc_neon: 5.74 9.43 5.72 7.19 6.73 6.92
inv_txfm_add_8x8_dct_dct_0_10bpc_neon: 3.13 3.68 2.79 3.25 3.21 3.33
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 7.09 10.41 7.00 10.55 8.06 9.02
inv_txfm_add_16x16_dct_dct_0_10bpc_neon: 5.01 6.76 4.56 5.58 5.52 2.97
inv_txfm_add_16x16_dct_dct_1_10bpc_neon: 8.62 12.48 13.71 11.75 15.94 16.86
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 6.05 8.81 6.13 8.18 7.90 12.27
inv_txfm_add_32x32_dct_dct_0_10bpc_neon: 2.90 3.90 2.16 2.63 3.56 2.74
inv_txfm_add_32x32_dct_dct_1_10bpc_neon: 13.57 17.00 13.30 13.76 14.54 17.08
inv_txfm_add_32x32_dct_dct_2_10bpc_neon: 8.29 10.54 8.05 10.68 12.75 14.36
inv_txfm_add_32x32_dct_dct_3_10bpc_neon: 6.78 8.40 7.60 10.12 8.97 12.96
inv_txfm_add_32x32_dct_dct_4_10bpc_neon: 6.48 6.74 6.00 7.38 7.67 9.70
inv_txfm_add_64x64_dct_dct_0_10bpc_neon: 3.02 4.59 2.21 2.65 3.36 2.47
inv_txfm_add_64x64_dct_dct_1_10bpc_neon: 9.86 11.30 9.14 13.80 12.46 14.83
inv_txfm_add_64x64_dct_dct_2_10bpc_neon: 8.65 9.76 7.60 12.05 10.55 12.62
inv_txfm_add_64x64_dct_dct_3_10bpc_neon: 7.78 8.65 6.98 10.63 9.15 11.73
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 6.61 7.01 5.52 8.41 8.33 9.69