
arm: 32 bit implementation of remaining ipred functions, arm64 (8+16 bpc) cfl_ac_444

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-ipred into master

This provides the remaining (except for Z1/Z2/Z3) ipred functions for arm32, and also adds an implementation of cfl_ac for 444 for arm64 (8 and 16 bpc).

The overall speedup is around 1% in single-threaded mode.

Relative speedups over the C code (which is potentially autovectorized; built with Clang):

                                Cortex A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83
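
For context on what the C baseline in the table computes, here is a minimal scalar sketch of Paeth prediction at 8 bpc, which is the kind of per-pixel loop the intra_pred_paeth rows are measured against. This is not the code from this merge request: the function name `paeth_pred_sketch` and the separate `top`/`left`/`topleft` parameters are assumptions made for illustration, and the actual dav1d C function may lay out its edge pixels differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference sketch (not dav1d's actual function).
 * top[x]   : pixel directly above column x
 * left[y]  : pixel directly to the left of row y
 * topleft  : pixel above and to the left of the block
 * stride   : distance in bytes between rows of dst */
static void paeth_pred_sketch(uint8_t *dst, ptrdiff_t stride,
                              const uint8_t *top, const uint8_t *left,
                              uint8_t topleft, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Gradient estimate from the three neighbours. */
            const int base   = left[y] + top[x] - topleft;
            const int ldiff  = abs(left[y] - base);
            const int tdiff  = abs(top[x] - base);
            const int tldiff = abs(topleft - base);
            /* Pick the neighbour closest to the estimate;
             * ties prefer left, then top, then topleft. */
            dst[x] = (ldiff <= tdiff && ldiff <= tldiff) ? left[y]
                   : (tdiff <= tldiff)                   ? top[x]
                                                         : topleft;
        }
        dst += stride;
    }
}
```

Each output pixel depends only on three neighbours and a handful of absolute differences and compares, so the work maps naturally onto SIMD lanes.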
