arm64: itx: Add NEON implementation of itx for 10 bpc (!985)

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-itx-10bpc into master May 05, 2020

This branch contains a number of minor fixups for the existing 8 bpc itx as well.

Add an element size specifier to the names of the existing individual transform functions for 8 bpc, e.g. inv_dct_8h_x8_neon, to make it clear that they operate on 8h input vectors. Also make these symbols public, so that the 10 bpc code can call them from a different object file. The new itx16.S follows the same convention, e.g. inv_dct_4s_x8_neon.

Compile the existing itx.S regardless of whether 8 bpc support is enabled. In builds with 8 bpc support disabled, this means the unused frontend functions are included as well, but that is hopefully tolerable, as it avoids having to split the file into a shareable file for the transforms and a separate one for the frontends.

This only implements the 10 bpc case, as that case can use transforms operating on 16 bit coefficients in the second pass.
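As a rough illustration of why the second pass fits in 16 bit coefficients for 10 bpc, here is a minimal C sketch. It assumes the intermediate clamping widths of the AV1 specification's 2D inverse transform (BitDepth + 8 bits going into the first pass, max(BitDepth + 6, 16) bits going into the second pass); the function names are hypothetical and are not part of dav1d or of this merge request.

```c
#include <stdint.h>

/* Assumption: signed bit widths taken from the AV1 specification's
 * intermediate clamping rules, not from the code in this branch. */

/* Width of the coefficients entering the first (row) pass. */
static int first_pass_bits(int bitdepth)
{
    return bitdepth + 8;
}

/* Width of the coefficients entering the second (column) pass. */
static int second_pass_bits(int bitdepth)
{
    return bitdepth + 6 > 16 ? bitdepth + 6 : 16;
}

/* Hypothetical helper: nonzero if the second pass input fits in
 * int16_t lanes, i.e. a transform operating on 8h vectors is enough. */
static int second_pass_fits_16bit(int bitdepth)
{
    return second_pass_bits(bitdepth) <= 16;
}
```

Under these assumptions, 10 bpc needs 18 bits going into the first pass (hence the 4s transforms in itx16.S) but only 16 bits going into the second pass, whereas 12 bpc would need 18 bits in the second pass too and could not reuse the 8h transforms.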

Relative speedup vs C for a few functions:

                                     Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:     4.14   4.06   4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:     6.51   6.49   6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:     5.02   4.63   6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:     8.54   7.13  11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:   5.52   6.60   8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:  11.27   9.62  12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   9.60   6.97   8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:   2.60   3.48   3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:  14.65  12.64  16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:  11.57   8.80  12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:   8.79   8.00   9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:   7.58   6.21   7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:   2.41   2.85   2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:  12.91  10.27  12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:  10.96   7.97  10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:   8.95   7.42   9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   7.97   6.12   7.82
