arm64: ipred: NEON implementation of paeth/smooth/palette/filter/cfl_pred/cfl_ac prediction functions

Relative speedups over the C code:
                               Cortex A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:       8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:      15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:     16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:     10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:      8.37   7.07   7.45
intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26
intra_pred_filter_w4_8bpc_neon:      6.38   2.81   4.43
intra_pred_filter_w8_8bpc_neon:      9.30   3.62   5.71
intra_pred_filter_w16_8bpc_neon:     9.85   3.98   6.42
intra_pred_filter_w32_8bpc_neon:    10.77   4.08   7.09
pal_pred_w4_8bpc_neon:               8.75   6.15   7.60
pal_pred_w8_8bpc_neon:              19.93  11.79  10.98
pal_pred_w16_8bpc_neon:             24.68  13.28  16.06
pal_pred_w32_8bpc_neon:             23.56  11.81  16.74
pal_pred_w64_8bpc_neon:             23.16  12.19  17.60
cfl_pred_cfl_128_w4_8bpc_neon:      10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:      18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:     16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:      3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:      9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:     17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:    16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:     3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:       9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:      17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:     16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:      3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:           8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:          15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:         15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:          3.23   3.58   3.71
cfl_ac_420_w4_8bpc_neon:             7.73   6.48   9.22
cfl_ac_420_w8_8bpc_neon:             6.70   5.56   6.95
cfl_ac_420_w16_8bpc_neon:            6.51   6.93   6.67
cfl_ac_422_w4_8bpc_neon:             9.25   7.70   9.75
cfl_ac_422_w8_8bpc_neon:             8.53   5.95   7.13
cfl_ac_422_w16_8bpc_neon:            7.08   6.87   6.06

The C code for cfl_pred gets autovectorized for w >= 32, so the relative
speedup for the cfl_pred w32 functions looks much lower than for the
narrower widths. The NEON functions themselves perform as expected; only
the C baseline they are compared against is faster.
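
For illustration, a minimal scalar sketch of the per-pixel CfL prediction
step, written from the AV1 formula (DC plus a rounded, sign-preserving
alpha * ac term, clamped to the pixel range). The function name
cfl_pred_sketch, the helper clip_u8 and the parameter layout are
hypothetical and are not taken from the project's C code. An inner loop of
this shape has no cross-iteration dependencies, which is the kind of loop a
compiler may be able to autovectorize once the row is wide enough.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp an intermediate value to the 8 bpc pixel range. */
static inline uint8_t clip_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical scalar sketch of CfL prediction for 8 bpc.
 * dst:    output block, `stride` bytes between rows
 * dc:     DC prediction value for the block
 * ac:     zero-mean luma AC coefficients, `width` entries per row
 * alpha:  signed CfL scaling factor */
static void cfl_pred_sketch(uint8_t *dst, ptrdiff_t stride,
                            int width, int height, int dc,
                            const int16_t *ac, int alpha)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            const int diff = alpha * ac[x];
            /* Round the magnitude, then restore the sign. */
            const int mag = (abs(diff) + 32) >> 6;
            dst[x] = clip_u8(dc + (diff < 0 ? -mag : mag));
        }
        ac  += width;
        dst += stride;
    }
}
```

Each output pixel depends only on dc, alpha and its own ac entry, so the
loop body maps directly onto SIMD lanes, whether the vector code comes from
the compiler or is hand-written.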