AArch64: Optimize vertical i8mm subpel filters (!1657) · Merge requests · VideoLAN / dav1d

Replace the accumulator initializations of the vertical subpel filters with zero fills of the registers (usually zero-latency operations in this feature class). This requires rounding shifts at the end in the prep cases. Out-of-order CPU cores can benefit from this change.
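The arithmetic behind the change can be sketched in scalar C. This is only an illustration under stated assumptions, not the code from this merge request: the coefficient set `example_coef`, the shift of 2, and all function names are hypothetical. It shows that seeding the accumulator with the rounding bias and then truncating gives the same result as starting from zero and applying a rounding shift at the end.

```c
#include <assert.h>
#include <stdint.h>

enum { TAPS = 8 };

/* Hypothetical 8-tap coefficients (sum = 64), for illustration only. */
static const int8_t example_coef[TAPS] = { -1, 3, -7, 37, 37, -7, 3, -1 };

/* Add the dot product of 8 unsigned pixels and 8 signed coefficients to acc. */
static int32_t dot8(int32_t acc, const uint8_t *px, const int8_t *coef)
{
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)px[i] * coef[i];
    return acc;
}

/* Variant A: accumulator seeded with the rounding bias, truncating shift.
 * Assumes >> on negative values is an arithmetic shift, as on mainstream
 * compilers. */
static int32_t filter_bias_seed(const uint8_t *px, const int8_t *coef, int shift)
{
    int32_t acc = dot8(1 << (shift - 1), px, coef);
    return acc >> shift;
}

/* Variant B: accumulator seeded with zero, rounding shift at the end. */
static int32_t filter_zero_seed(const uint8_t *px, const int8_t *coef, int shift)
{
    int32_t acc = dot8(0, px, coef);
    return (acc + (1 << (shift - 1))) >> shift;
}

/* Compare both variants on pseudo-random pixels; returns the mismatch count. */
static int count_mismatches(int shift)
{
    uint32_t state = 12345u;
    int mismatches = 0;
    for (int n = 0; n < 10000; n++) {
        uint8_t px[TAPS];
        for (int i = 0; i < TAPS; i++) {
            state = state * 1664525u + 1013904223u; /* simple LCG */
            px[i] = (uint8_t)(state >> 24);
        }
        if (filter_bias_seed(px, example_coef, shift) !=
            filter_zero_seed(px, example_coef, shift))
            mismatches++;
    }
    return mismatches;
}

/* Hypothetical self-check: both variants must agree for the assumed shift. */
static void self_check(void)
{
    assert(count_mismatches(2) == 0);
}
```

In vector code the add inside variant B would typically be folded into a rounding shift instruction, so the zero-seeded form costs no extra operation while removing the dependency on a bias register load.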

Relative performance of micro benchmarks (lower is better):

Cortex-X3:

mct_8tap_sharp_w16_v_8bpc_i8mm:  0.910x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.986x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.864x
mc_8tap_sharp_w8_v_8bpc_i8mm:    0.882x
mc_8tap_sharp_w4_v_8bpc_i8mm:    0.933x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.926x

Cortex-A715:

mct_8tap_sharp_w16_v_8bpc_i8mm:  0.855x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.784x
mct_8tap_sharp_w4_v_8bpc_i8mm:   1.069x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.850x
mc_8tap_sharp_w8_v_8bpc_i8mm:    0.779x
mc_8tap_sharp_w4_v_8bpc_i8mm:    0.971x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.975x

Cortex-A510:

mct_8tap_sharp_w16_v_8bpc_i8mm:  1.001x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.979x
mct_8tap_sharp_w4_v_8bpc_i8mm:   0.998x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.998x
mc_8tap_sharp_w8_v_8bpc_i8mm:    1.004x
mc_8tap_sharp_w4_v_8bpc_i8mm:    1.003x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.996x
