x86: Add high bit-depth loopfilter AVX-512 (Ice Lake) asm

Henrik Gramner requested to merge gramner/dav1d:loopfilter16_avx512icl into master

Overall a decent amount faster than AVX2. The vertical filters benefit more than the horizontal ones, mainly because the transposes in the latter are a bit of a bottleneck: current Intel CPUs can do two 256-bit shuffles or one 512-bit shuffle per cycle, so the wider vectors do not increase shuffle throughput.

                                w4      w8      w16
lpf_v_sb_y_16bpc_avx2:         184.3   370.9   544.6
lpf_v_sb_y_16bpc_avx512icl:    111.7   210.2   336.4

lpf_h_sb_y_16bpc_avx2:         321.8   546.1   844.6
lpf_h_sb_y_16bpc_avx512icl:    253.9   405.9   717.7

                                w4      w6
lpf_v_sb_uv_16bpc_avx2:         95.4   161.2
lpf_v_sb_uv_16bpc_avx512icl:    59.2    90.3

lpf_h_sb_uv_16bpc_avx2:        163.3   236.2
lpf_h_sb_uv_16bpc_avx512icl:   133.1   168.9
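
To make the vertical/horizontal difference easier to see, the timings above can be turned into speedup ratios (AVX2 time divided by AVX-512 time). The following is only a small sketch: the numbers are copied from the tables, their unit is taken as-is since it is not stated here, and the `timings` layout and the `speedups` helper are made up for illustration and are not part of dav1d or checkasm.

```python
# Timings copied from the tables above (lower is better).
# Luma columns are w4, w8, w16; chroma columns are w4, w6.
timings = {
    "lpf_v_sb_y":  {"avx2": [184.3, 370.9, 544.6], "avx512icl": [111.7, 210.2, 336.4]},
    "lpf_h_sb_y":  {"avx2": [321.8, 546.1, 844.6], "avx512icl": [253.9, 405.9, 717.7]},
    "lpf_v_sb_uv": {"avx2": [95.4, 161.2],         "avx512icl": [59.2, 90.3]},
    "lpf_h_sb_uv": {"avx2": [163.3, 236.2],        "avx512icl": [133.1, 168.9]},
}


def speedups(entry):
    """Return AVX2 time / AVX-512 time for each column (hypothetical helper)."""
    return [old / new for old, new in zip(entry["avx2"], entry["avx512icl"])]


ratios = {name: speedups(entry) for name, entry in timings.items()}

for name, values in ratios.items():
    print(f"{name}_16bpc: " + "  ".join(f"{v:.2f}x" for v in values))
```

With these numbers every vertical ratio comes out higher than every horizontal one, which matches the description above.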
