x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2

requested review from @gramner

assigned to @rbultje

mentioned in issue #316

I think the somewhat mediocre gains on the 3 variant are because I am doing a 1/2 (16 non-zero) DCT32 for the right half, whereas the AVX2 version presumably does a 1/4 (8 non-zero) DCT32 on the third-of-4. We already have a "packed 8 non-zero" path for the left half; plugging that into the right half (also for the 32x32) should be fairly trivial and lead to gains for this codepath. I'm not sure whether it's super relevant for real file playback, though: most coef blocks will have only a handful of non-zero coefs and run the 1 variant.
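
To make the cost argument above concrete, here is a minimal Python sketch. It is not the dav1d code: it uses a naive floating-point inverse DCT rather than the packed butterfly network of the assembly, and the function name `idct_truncated` and the coefficient row are hypothetical. It only assumes that work scales with the number of input coefficients a path reads, and shows that a 1/4 (8 non-zero) path gives the same output as a 1/2 (16 non-zero) path when the extra inputs are zero.

```python
import math

N = 32  # transform size along one dimension (DCT32)

def idct_truncated(coefs, nz):
    """Naive inverse DCT (DCT-III) that reads only the first `nz` inputs.

    Returns (output samples, number of multiply-adds performed).
    Illustrative model only; real implementations use butterflies.
    """
    out = []
    macs = 0
    for n in range(N):
        acc = 0.0
        for k in range(nz):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            acc += scale * coefs[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            macs += 1
        out.append(acc)
    return out, macs

# Hypothetical coefficient row: only the first 8 entries are non-zero.
coefs = [float(8 - k) for k in range(8)] + [0.0] * (N - 8)

full, macs_full = idct_truncated(coefs, 32)       # no truncation
half, macs_half = idct_truncated(coefs, 16)       # "1/2" path
quarter, macs_quarter = idct_truncated(coefs, 8)  # "1/4" path

# The truncated paths reproduce the full result, since skipped inputs are zero.
same_half = all(abs(a - b) < 1e-9 for a, b in zip(full, half))
same_quarter = all(abs(a - b) < 1e-9 for a, b in zip(full, quarter))
```

In this naive model the 1/4 path does half the multiply-adds of the 1/2 path (256 versus 512 per row). The real saving in a butterfly implementation will differ, but the direction is the same.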

resolved all threads

added 5 commits

d0d5fae3...ed997f5f - 4 commits from branch videolan:master
6ae57667 - x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2

approved this merge request

merged

changed milestone to %1.2.0

added performance x86 labels
