x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2
inv_txfm_add_32x64_dct_dct_0_10bpc_c: 1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4: 243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2: 119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl: 142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c: 50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4: 2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2: 1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl: 741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c: 50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4: 4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2: 1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl: 960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c: 50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4: 4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2: 2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl: 1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c: 50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4: 5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2: 2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl: 1867.2 (27.02x)
As with the 16x64, the dc-only case (the 0 variant) is a bit slower than AVX2. This is apparently an issue with my test setup, since @gramner could not reproduce it.
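To make the AVX512-vs-AVX2 comparison easier to read, here is a small sketch that recomputes the relative speedup per variant from the numbers above. The timings are copied verbatim from the checkasm output; the dictionary layout and the `relative_speedup` helper are only illustrative and not part of dav1d or checkasm.

```python
# Timings copied from the checkasm output above (lower is faster).
# Keys are the variant numbers (0 = dc-only).
timings = {
    0: {"avx2": 119.1, "avx512icl": 142.6},
    1: {"avx2": 1423.4, "avx512icl": 741.6},
    2: {"avx2": 1767.7, "avx512icl": 960.8},
    3: {"avx2": 2111.7, "avx512icl": 1777.1},
    4: {"avx2": 2458.1, "avx512icl": 1867.2},
}


def relative_speedup(variant):
    """Return how many times faster avx512icl is than avx2 (hypothetical helper)."""
    t = timings[variant]
    return t["avx2"] / t["avx512icl"]


for v in sorted(timings):
    print(f"variant {v}: avx512icl is {relative_speedup(v):.2f}x vs avx2")
```

A ratio below 1.0 means AVX512 is slower than AVX2, which only happens for variant 0 here. Variants 1 and 2 come out close to 2x, while 3 and 4 are noticeably lower.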
requested review from @gramner
assigned to @rbultje
mentioned in issue #316
- Resolved by Ronald S. Bultje
I think the somewhat mediocre gains on the 3 variant are because I am doing a 1/2 (16 non-zero) DCT32 for the right half, whereas the AVX2 version presumably does a 1/4 (8 non-zero) DCT32 on the third-of-4. We already have a "packed 8 non-zero" path for the left half; plugging that into the right half (also for the 32x32) should be fairly trivial and should yield gains for this codepath. I'm not sure how relevant this is for real file playback, though: most coef blocks have only a handful of non-zero coefs and will run the 1 variant.
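To illustrate why a "1/4 (8 non-zero)" DCT32 is cheaper than a "1/2 (16 non-zero)" one while producing the same output, here is a minimal floating-point sketch. It uses a naive inverse DCT, not the integer butterfly network dav1d actually uses, and `idct_naive` and its `n_nonzero` parameter are hypothetical names chosen for this example. The point is only that terms for coefficients known to be zero can be skipped.

```python
import math


def idct_naive(coefs, n_nonzero=None):
    """Naive 1-D inverse DCT (DCT-III, orthonormal scaling), floating point.

    If n_nonzero is given, only the first n_nonzero coefficients are read;
    the remaining ones are assumed to be zero and skipped entirely.
    """
    n = len(coefs)
    k_max = n if n_nonzero is None else n_nonzero
    out = []
    for x in range(n):
        acc = 0.0
        for k in range(k_max):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * coefs[k] * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(acc)
    return out


# 32-point input where only the first 8 coefficients are non-zero.
coefs = [float(10 - i) for i in range(8)] + [0.0] * 24

full = idct_naive(coefs)                  # visits all 32 coefficients
half = idct_naive(coefs, n_nonzero=16)    # "1/2" variant: visits 16
quarter = idct_naive(coefs, n_nonzero=8)  # "1/4" variant: visits 8

max_diff_half = max(abs(a - b) for a, b in zip(full, half))
max_diff_quarter = max(abs(a - b) for a, b in zip(full, quarter))
print(max_diff_half, max_diff_quarter)
```

Both reduced variants match the full transform for this input, but the 1/4 variant does half the multiply-accumulate work of the 1/2 variant. That is the kind of saving the "packed 8 non-zero" path would bring to the right half.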
added 5 commits
- d0d5fae3...ed997f5f - 4 commits from branch videolan:master
- 6ae57667 - x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2
changed milestone to %1.2.0
added performance x86 labels