
Add ipred_z3 AVX2 asm

Henrik Gramner requested to merge gramner/dav1d:ipred_z3 into master

Somewhat similar to z1 at first glance, but there are subtle differences pretty much everywhere, so very little code is actually identical.

The main differences compared to z1 are that we do a jump table on height instead of width, the order of input values is inverted, and the output has to be transposed, which adds some additional complexity.
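To make the inverted input order and the transposed output concrete, here is a rough scalar sketch in C. It is not the code from this MR and not the reference C function: the names `z1_sketch` and `z3_sketch`, the 6-bit fixed-point step with an even fraction (`& 0x3E`), and the omission of edge filtering and upsampling are all simplifying assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified z1: reads the top edge forward and writes
 * the block row by row. dx is an assumed 6-bit fixed-point step. */
void z1_sketch(uint8_t *dst, ptrdiff_t stride, const uint8_t *top,
               int w, int h, int dx, int max_base_x)
{
    assert(w > 0 && h > 0);
    for (int y = 0, xpos = dx; y < h; y++, xpos += dx) {
        const int frac = xpos & 0x3E; /* even 6-bit fraction (assumption) */
        for (int x = 0, base = xpos >> 6; x < w; x++, base++) {
            if (base < max_base_x) {
                const int v = top[base] * (64 - frac) + top[base + 1] * frac;
                dst[y * stride + x] = (uint8_t)((v + 32) >> 6);
            } else {
                /* past the end of the edge: repeat the last pixel */
                dst[y * stride + x] = top[max_base_x];
            }
        }
    }
}

/* Hypothetical simplified z3: same interpolation, but the left edge is
 * read backwards (negative indices) and the outer loop walks columns,
 * so the result is the transpose of what z1 would produce. */
void z3_sketch(uint8_t *dst, ptrdiff_t stride, const uint8_t *left,
               int w, int h, int dy, int max_base_y)
{
    assert(w > 0 && h > 0);
    for (int x = 0, ypos = dy; x < w; x++, ypos += dy) {
        const int frac = ypos & 0x3E;
        for (int y = 0, base = ypos >> 6; y < h; y++, base++) {
            if (base < max_base_y) {
                const int v = left[-base] * (64 - frac) +
                              left[-(base + 1)] * frac;
                dst[y * stride + x] = (uint8_t)((v + 32) >> 6);
            } else {
                dst[y * stride + x] = left[-max_base_y];
            }
        }
    }
}
```

Under these assumptions, calling `z3_sketch` for a w×h block gives the same pixels as calling `z1_sketch` for an h×w block on the reversed edge and then transposing the result. A SIMD version that produces rows of interpolated pixels therefore needs a transpose before storing, which is where the extra complexity mentioned above comes from.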

Average benchmark results on Skylake-X from a large number of runs with random input (code is branchy so run time can vary a lot depending on input):

intra_pred_z3_w4_8bpc_c: 225.5
intra_pred_z3_w4_8bpc_avx2: 44.0
intra_pred_z3_w8_8bpc_c: 614.9
intra_pred_z3_w8_8bpc_avx2: 67.5
intra_pred_z3_w16_8bpc_c: 1769.5
intra_pred_z3_w16_8bpc_avx2: 140.7
intra_pred_z3_w32_8bpc_c: 3931.0
intra_pred_z3_w32_8bpc_avx2: 290.0
intra_pred_z3_w64_8bpc_c: 9099.9
intra_pred_z3_w64_8bpc_avx2: 637.5
