arm: itx: Add NEON implementation of itx for 8 bpc (!1017) · Merge requests · VideoLAN / dav1d

Overall speedup for decoding the whole 8 bit Chimera is around 28% (goes from 63 to 81 fps).
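
As a quick sanity check of the figure above, the relative speedup can be derived from the two frame rates. This is only a minimal sketch; the helper name `speedup_percent` is hypothetical and not part of dav1d.

```c
/* Hypothetical helper, not part of dav1d: relative throughput gain in
 * percent when the frame rate goes from before_fps to after_fps. */
static double speedup_percent(double before_fps, double after_fps)
{
    return (after_fps / before_fps - 1.0) * 100.0;
}
```

Calling `speedup_percent(63.0, 81.0)` gives a value between 28 and 29, consistent with the "around 28%" quoted above.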

The transforms process vectors of up to 8 elements at a time for transform sizes up to 8; larger transforms use vectors of 4 elements.
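
The vector width rule described above could be written down roughly as follows. This is an illustrative sketch only: the function name `max_lanes_for_txfm_size` is hypothetical, and the actual NEON assembly does not contain such a helper; it just restates the rule from the description in C.

```c
/* Hypothetical illustration, not taken from the dav1d sources:
 * maximum number of 16 bit elements processed per vector for a
 * 1D transform of size n (n = 4, 8, 16, 32 or 64), following the
 * rule in the description: up to 8 elements for transforms up to
 * size 8, 4 elements for anything larger. */
static int max_lanes_for_txfm_size(int n)
{
    return n <= 8 ? 8 : 4;
}
```

Under this rule, an 8 point transform can work on 8 elements at once, while 16, 32 and 64 point transforms work on 4 elements at a time.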

Overall, the speedup over C code seems to be around 8-14x for the larger transforms, and 10-19x for the smaller ones.

Relative speedup over C code (built with GCC 7.5) for a few functions:

                                    Cortex A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:    3.83   3.42   2.57   3.36   2.97   7.47
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:    7.25  13.53   8.38   8.82   7.96  12.37
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:    4.78   6.61   4.82   4.65   5.27   9.76
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:   10.20  19.07  13.07  14.69  11.45  15.50
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:  4.26   5.06   3.00   3.74   4.05   4.49
inv_txfm_add_16x16_dct_dct_1_8bpc_neon: 10.51  16.02  13.57  14.03  12.86  18.16
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:  7.95  11.75   9.09  10.64  10.06  14.07
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:  5.31   5.58   3.14   4.18   4.80   4.57
inv_txfm_add_32x32_dct_dct_1_8bpc_neon: 12.66  16.07  14.34  16.00  15.24  21.32
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:  8.25  10.69   8.90  10.59  10.41  14.39
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:  4.69   5.97   3.17   3.96   4.57   4.34
inv_txfm_add_64x64_dct_dct_1_8bpc_neon: 11.47  12.68  10.18  14.73  14.20  17.95
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:  8.84  10.13   7.94  11.25  10.58  13.88

Edited Aug 09, 2021 by Jean-Baptiste Kempf
