AArch64: Optimize horizontal i8mm prep filters

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_i8mm_h into master

Replace the accumulator initializations of the horizontal prep filters with zero fills of the registers. Most i8mm-capable CPUs can execute these zero fills with zero latency, but in exchange the filters need rounding shifts at the end. This change improves performance on out-of-order CPUs.
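
The trade described above can be illustrated with a scalar model in C. This is only a sketch and not the dav1d assembly: the function names, the tap values, and the shift amount `SHIFT` are assumptions made for illustration. It shows why an accumulator preloaded with the rounding bias followed by a truncating shift gives the same result as a zeroed accumulator followed by a rounding shift (the kind of operation AArch64 instructions such as SRSHR or RSHRN perform).

```c
#include <stdint.h>

/* Assumed shift amount, chosen for illustration only. */
enum { SHIFT = 2 };

/* Variant A: accumulator starts at the rounding bias, and the final
 * shift truncates. Right shift of a negative value is assumed to be
 * arithmetic, as it is on common compilers. */
static int32_t filter_bias_init(const uint8_t *src, const int8_t *taps)
{
    int32_t acc = 1 << (SHIFT - 1);   /* rounding bias loaded up front */
    for (int i = 0; i < 8; i++)
        acc += (int32_t)src[i] * taps[i];
    return acc >> SHIFT;              /* truncating shift */
}

/* Scalar model of a rounding right shift. */
static int32_t rounding_shift(int32_t v, int shift)
{
    return (v + (1 << (shift - 1))) >> shift;
}

/* Variant B: accumulator starts at zero, and the final shift rounds. */
static int32_t filter_zero_init(const uint8_t *src, const int8_t *taps)
{
    int32_t acc = 0;                  /* plain zero fill */
    for (int i = 0; i < 8; i++)
        acc += (int32_t)src[i] * taps[i];
    return rounding_shift(acc, SHIFT);
}
```

Both variants add the same bias exactly once, so their outputs match for any input. The difference is only where the bias enters: variant B starts from a zeroed register and defers the rounding to the final shift.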

Relative performance of micro benchmarks (lower is better):

Cortex-X3:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.914x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.906x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.877x

Cortex-A715:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.819x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.805x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.779x

Cortex-A510:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.999x
mct_8tap_sharp_w16_h_8bpc_i8mm:  1.001x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.996x
mct_8tap_sharp_w4_h_8bpc_i8mm:   0.915x