x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2

Ronald S. Bultje requested to merge rbultje/dav1d:itx-avx512icl-hbd-32x64 into master
inv_txfm_add_32x64_dct_dct_0_10bpc_c:           1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4:         243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2:         119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl:    142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c:          50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4:        4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2:        1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl:    960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c:          50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4:        4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2:        2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl:   1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c:          50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4:        5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2:        2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl:   1867.2 (27.02x)

As with 16x64, the dc-only case (the `_0` rows) is slightly slower than AVX2 (142.6 vs. 119.1). This appears to be an artifact of my test setup, since @gramner could not reproduce it.
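
For readers comparing the rows, the parenthesized figures are each implementation's speedup over the C reference, i.e. the C timing divided by that implementation's timing. A minimal sketch of that arithmetic for the dc-only rows follows; the `speedup` helper and the `timings` dictionary are illustrative names, not part of dav1d or checkasm. Because the printed timings are rounded to one decimal, a recomputed ratio can differ from the printed one in the last digit.

```python
# Timings copied from the dc-only (_0) rows of the benchmark output above.
timings = {"c": 1783.5, "sse4": 243.3, "avx2": 119.1, "avx512icl": 142.6}


def speedup(timings, baseline="c"):
    """Return each implementation's speedup relative to the baseline timing."""
    base = timings[baseline]
    return {name: round(base / t, 2) for name, t in timings.items()}


ratios = speedup(timings)
for name, ratio in ratios.items():
    print(f"{name:>10}: {ratio:.2f}x")
```

In this case AVX2 comes out ahead of AVX512-IceLake, which matches the dc-only discrepancy noted above; for the other four cases the AVX512-IceLake timing is the lowest.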