arm: 32-bit implementation of remaining ipred functions, arm64 (8+16 bpc) cfl_ac_444

This provides the remaining ipred functions (all except Z1/Z2/Z3) for arm32, and also adds an implementation of cfl_ac for 444 for arm64 (8 and 16 bpc).

Total speedup is around 1% in single-threaded mode.

Relative speedups over the C code (which may be autovectorized; built with Clang):
                             Cortex     A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:        4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:        7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:       4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:       4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:       4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:     5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:    10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:    2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:    2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:    2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:     4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:     7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:    3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:    3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:    4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:       4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:       5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:      3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:      3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:      3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:       5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:       7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:      7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:      7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:                5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:               11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:              14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:              12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:              14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:              8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:              7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:             7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:              8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:              8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:             8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:              7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:              6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:             6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:             5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:        7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:        5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:       4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:       4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:       6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:       4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:      4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:      4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:        7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:        4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:       4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:       4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:            5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:            4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:           4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:           4.47   5.34   3.72   4.20   4.42   4.83
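
For reference, a minimal scalar sketch of what a cfl_ac 444 function computes at 8 bpc, following the chroma-from-luma AC derivation in the AV1 specification: each luma sample is scaled by 8, and the rounded block average is then subtracted. The name cfl_ac_444_sketch and its signature are hypothetical and are not taken from this patch; the sketch also leaves out the edge padding that a real implementation has to handle.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, not the code added by this commit.
 * Assumes 8 bpc input, no padding, and a block of (1 << log2_w) x
 * (1 << log2_h) luma samples; ac is written densely, w entries per row. */
static void cfl_ac_444_sketch(int16_t *ac, const uint8_t *luma,
                              ptrdiff_t stride, int log2_w, int log2_h)
{
    const int w = 1 << log2_w, h = 1 << log2_h;
    const int log2_sz = log2_w + log2_h;
    int sum = (1 << log2_sz) >> 1; /* rounding term for the average */

    /* 4:4:4 has no subsampling, so each luma sample is just scaled by 8. */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const int v = luma[y * stride + x] << 3;
            ac[y * w + x] = (int16_t)v;
            sum += v;
        }
    }

    /* Subtract the rounded average so only the AC part remains. */
    const int avg = sum >> log2_sz;
    for (int i = 0; i < w * h; i++)
        ac[i] = (int16_t)(ac[i] - avg);
}
```

A NEON version would be expected to vectorize the same two passes (scale and accumulate, then subtract the average); how the patch actually arranges them is not shown here.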