arm64: ipred: NEON implementation of paeth/smooth/palette/filter/cfl_pred/cfl_ac prediction functions

Martin Storsjö requested to merge mstorsjo/dav1d:arm-ipred2 into master

Relative speedups over the C code:

                              Cortex A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:      8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:     15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:    16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:    10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:     8.37   7.07   7.45
intra_pred_smooth_h_w4_8bpc_neon:   8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:  16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon: 18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:  5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:  5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:   9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:  17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon: 17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon: 11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:  8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:     9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:    16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:   14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:    9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:    6.53   3.94   5.26
intra_pred_filter_w4_8bpc_neon:     6.38   2.81   4.43
intra_pred_filter_w8_8bpc_neon:     9.30   3.62   5.71
intra_pred_filter_w16_8bpc_neon:    9.85   3.98   6.42
intra_pred_filter_w32_8bpc_neon:   10.77   4.08   7.09
pal_pred_w4_8bpc_neon:              8.75   6.15   7.60
pal_pred_w8_8bpc_neon:             19.93  11.79  10.98
pal_pred_w16_8bpc_neon:            24.68  13.28  16.06
pal_pred_w32_8bpc_neon:            23.56  11.81  16.74
pal_pred_w64_8bpc_neon:            23.16  12.19  17.60
cfl_pred_cfl_128_w4_8bpc_neon:     10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:     18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:    16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:     3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:     9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:    17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:   16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:    3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:      9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:     17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:    16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:     3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:          8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:         15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:        15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:         3.23   3.58   3.71
cfl_ac_420_w4_8bpc_neon:            7.73   6.48   9.22
cfl_ac_420_w8_8bpc_neon:            6.70   5.56   6.95
cfl_ac_420_w16_8bpc_neon:           6.51   6.93   6.67
cfl_ac_422_w4_8bpc_neon:            9.25   7.70   9.75
cfl_ac_422_w8_8bpc_neon:            8.53   5.95   7.13
cfl_ac_422_w16_8bpc_neon:           7.08   6.87   6.06

The C code for cfl_pred gets autovectorized for w >= 32, so the C baseline is already much faster at that width. That is why the relative speedup drops to about 3x there; the absolute performance of the NEON functions is as expected.
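
To illustrate why this loop is an easy target for autovectorization, here is a minimal sketch of an 8 bpc cfl_pred style loop. It is a simplified illustration, not the code from this merge request: the names `cfl_pred_sketch` and `clip_u8` are hypothetical, and the rounding (add 32, shift right by 6, reapply the sign) is assumed from the usual AV1 CfL formulation. Each output pixel depends only on its own `ac[x]`, so the inner loop has no cross-iteration dependency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp a value to the 8 bpc pixel range. (Hypothetical helper.) */
static inline uint8_t clip_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Sketch of a CfL predictor: dst = clip(dc + round(alpha * ac / 64)).
 * The rounding is done on the magnitude and the sign is reapplied
 * afterwards (assumed formulation, see the text above). */
void cfl_pred_sketch(uint8_t *dst, ptrdiff_t stride,
                     int width, int height, int dc,
                     const int16_t *ac, int alpha)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        /* No iteration reads a value written by another iteration,
         * so a compiler is free to process several x at once. */
        for (int x = 0; x < width; x++) {
            const int diff = alpha * ac[x];
            const int mag = (abs(diff) + 32) >> 6;
            dst[x] = clip_u8(dc + (diff < 0 ? -mag : mag));
        }
        ac += width;   /* ac is stored densely, one row per line */
        dst += stride; /* dst rows are stride bytes apart */
    }
}
```

With a loop of this shape, whether the compiler actually emits vector code depends on the compiler version, flags, and the width it specializes for, which would be consistent with only the widest case showing a reduced relative speedup.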
