arm64: ipred: 8 bpc NEON implementations of the Z1 and Z3 functions (!1478) · Merge requests · VideoLAN / dav1d

The Z3 implementation is a hybrid of two approaches. For cases with large max_base_y, it uses a generic (but non-ideal) approach that fills two pixel columns at a time, i.e. it loops over pixels vertically first and then horizontally, which is not the optimal access order.
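To illustrate the column-first loop order, here is a minimal scalar sketch in C. It is not the NEON code from this merge request: the function name `z3_columns_sketch`, the `edge` array layout (index growing downwards along the left edge) and the one-column-at-a-time loop are assumptions made for illustration, and edge upsampling and filtering are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of a Z3-style predictor, column by column.
 * edge[0..max_base_y] holds left-edge pixels, index growing downwards.
 * dy is the per-column step along the edge in 1/64 pixel units. */
static void z3_columns_sketch(uint8_t *dst, ptrdiff_t stride,
                              const uint8_t *edge, int width, int height,
                              int dy, int max_base_y)
{
    /* Outer loop over columns: each column shares one fractional weight. */
    for (int x = 0, ypos = dy; x < width; x++, ypos += dy) {
        const int frac = ypos & 0x3E;
        int base = ypos >> 6;
        /* Inner loop walks down the column, so consecutive stores are
         * one stride apart rather than adjacent in memory. */
        for (int y = 0; y < height; y++, base++) {
            if (base < max_base_y) {
                /* Linear interpolation between two neighbouring edge pixels. */
                const int v = edge[base] * (64 - frac) + edge[base + 1] * frac;
                dst[y * stride + x] = (uint8_t)((v + 32) >> 6);
            } else {
                /* Past the end of the edge: pad with the last edge pixel. */
                dst[y * stride + x] = edge[max_base_y];
            }
        }
    }
}
```

The inner loop writes pixels that are a full stride apart, which is why this order is less cache- and store-friendly than filling whole rows.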

For cases with smaller max_base_y, it does two rows at a time, essentially doing gathers with the TBX instruction.
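The gather idea can be modelled in portable C. The sketch below emulates the lane semantics of TBX (an out-of-range index leaves the destination lane unchanged, whereas TBL would write zero) and uses it to fetch one row of integer-position edge pixels. The names `tbx_model` and `z3_row_gather_sketch` are hypothetical, interpolation between neighbouring pixels is omitted, and pre-filling the row with the padding pixel is only one plausible way to exploit the TBX behaviour, not necessarily what the assembly does.

```c
#include <stdint.h>

/* Scalar model of a TBX lookup over n lanes: lanes whose index is
 * outside the table keep their previous value in dst. */
static void tbx_model(uint8_t *dst, const uint8_t *table, int table_len,
                      const uint8_t *idx, int n)
{
    for (int i = 0; i < n; i++)
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
}

/* Hypothetical row gather for a Z3-style predictor (width <= 64).
 * Each pixel in the row reads a different edge position, hence a gather. */
static void z3_row_gather_sketch(uint8_t *row, const uint8_t *edge,
                                 int width, int y, int dy, int max_base_y)
{
    uint8_t idx[64];
    for (int x = 0; x < width; x++) {
        const int base = ((dy * (x + 1)) >> 6) + y;
        idx[x] = (uint8_t)(base < 255 ? base : 255); /* clamp to a byte index */
        row[x] = edge[max_base_y];                   /* pre-fill with padding */
    }
    /* In-range lanes are overwritten; out-of-range lanes keep the padding. */
    tbx_model(row, edge, max_base_y, idx, width);
}
```

A real TBX takes at most four 16-byte table registers, so the table is limited to 64 bytes; that limit is presumably why this path is reserved for smaller max_base_y.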

Relative speedup over the C code:

                         Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z1_w4_8bpc_neon:    4.09   3.15   3.63   4.16   3.27  13.00
intra_pred_z1_w8_8bpc_neon:    6.93   5.66   5.57   6.76   5.51   5.50
intra_pred_z1_w16_8bpc_neon:   7.81   6.85   6.24   7.78   6.59   9.00
intra_pred_z1_w32_8bpc_neon:  10.56   9.95   8.72  10.95   8.28  13.33
intra_pred_z1_w64_8bpc_neon:  11.00  11.38   9.11  11.62   8.65  14.61
intra_pred_z3_w4_8bpc_neon:    3.32   2.89   2.78   3.52   2.52   9.67
intra_pred_z3_w8_8bpc_neon:    6.24   5.55   4.76   5.60   4.11   6.40
intra_pred_z3_w16_8bpc_neon:   7.64   7.07   4.37   6.23   4.18   8.60
intra_pred_z3_w32_8bpc_neon:   7.51   7.21   4.34   5.92   4.27   7.88
intra_pred_z3_w64_8bpc_neon:   6.82   6.25   4.08   5.83   3.52   7.31

(The speedup numbers for M1 are kinda noisy due to the very coarse granularity of the timer used there.)
