x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2

Merged Ronald S. Bultje requested to merge rbultje/dav1d:itx-avx512icl-hbd-32x64 into master
All threads resolved!
inv_txfm_add_32x64_dct_dct_0_10bpc_c:           1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4:         243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2:         119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl:    142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c:          50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4:        4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2:        1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl:    960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c:          50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4:        4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2:        2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl:   1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c:          50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4:        5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2:        2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl:   1867.2 (27.02x)

As with the 16x64, the dc-only case (the _0 variant above) is a bit slower than AVX2. This is apparently an issue on my testing side (@gramner could not reproduce).
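
To make the AVX2 vs. AVX512-IceLake comparison easier to read, the ratios can be computed directly from the numbers above. This is only a small sketch: the dictionary layout and the function name `avx512_vs_avx2` are illustrative assumptions and not part of dav1d or checkasm; the timings are copied verbatim from the benchmark output.

```python
# Timings copied from the checkasm output above (10bpc, 32x64 DCT_DCT).
# Key: variant index; value: (avx2 timing, avx512icl timing).
TIMINGS = {
    0: (119.1, 142.6),
    1: (1423.4, 741.6),
    2: (1767.7, 960.8),
    3: (2111.7, 1777.1),
    4: (2458.1, 1867.2),
}


def avx512_vs_avx2(timings):
    """Return {variant: avx2_time / avx512icl_time}.

    A ratio above 1 means the AVX512-IceLake version is faster than AVX2.
    (Hypothetical helper, used here only for illustration.)
    """
    return {variant: avx2 / icl for variant, (avx2, icl) in timings.items()}


if __name__ == "__main__":
    for variant, ratio in avx512_vs_avx2(TIMINGS).items():
        print(f"variant {variant}: {ratio:.2f}x relative to AVX2")
```

Variant 0 comes out below 1 (the dc-only slowdown mentioned above), variants 1 and 2 come out close to 2x, and variants 3 and 4 show noticeably smaller gains.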

Merge request reports

Pipeline #332157 passed

Pipeline passed for 6ae57667 on rbultje:itx-avx512icl-hbd-32x64

Test coverage 92.08% (-0.02%) from 1 job

Merged by Ronald S. Bultje 1 year ago (Apr 12, 2023 11:45pm UTC)

Merge details

  • Changes merged into master with 6ae57667.
  • Deleted the source branch.

Pipeline #332168 passed

Pipeline passed for 6ae57667 on master

Test coverage 92.00% (-0.02%) from 1 job

Activity

  • Ronald S. Bultje (Author, Developer)

    I think the somewhat-mediocre gains on the 3 variant are because I am doing a 1/2 (16 non-zero) DCT32 for the right half, whereas the AVX2 presumably does a 1/4 (8 non-zero) on the third-of-4. We already have a "packed 8 non-zero" path for the left half; plugging that into the right half (also for the 32x32) should be fairly trivial and lead to gains for this codepath. I'm not sure whether it's super-relevant for real file playback, since most coef blocks will have only a handful of non-zero coefs and run the 1 variant.
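
    The "16 non-zero" vs. "8 non-zero" distinction can be illustrated with a deliberately naive model. The sketch below is not how the assembly in this merge request works (it uses a plain cosine sum rather than a fast transform, and the function name `idct_first_k` and the input values are hypothetical); it only shows that when the trailing inputs of a 32-point inverse transform are known to be zero, skipping them gives the same output for proportionally less work.

```python
import math


def idct_first_k(coefs, n, k):
    """Unnormalized n-point inverse-DCT-style cosine sum over the first k inputs.

    Inputs at index >= k are assumed to be zero and are never read.
    Returns (samples, multiply_count).
    """
    samples = []
    multiplies = 0
    for x in range(n):
        acc = 0.0
        for u in range(k):
            acc += coefs[u] * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
            multiplies += 1
        samples.append(acc)
    return samples, multiplies


N = 32
# Hypothetical input: 8 non-zero coefficients, the remaining 24 are zero.
coefs = [float(i + 1) for i in range(8)] + [0.0] * (N - 8)

full, full_muls = idct_first_k(coefs, N, 32)        # no pruning
half, half_muls = idct_first_k(coefs, N, 16)        # "1/2 (16 non-zero)"
quarter, quarter_muls = idct_first_k(coefs, N, 8)   # "1/4 (8 non-zero)"

# All three agree, because the skipped inputs are zero anyway.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, half))
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, quarter))

print(full_muls, half_muls, quarter_muls)
```

    In this toy model the multiply count halves each time the non-zero range halves. The savings in a real butterfly-based SIMD implementation will differ, but the direction of the effect is the same.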

  • Henrik Gramner
  • Ronald S. Bultje resolved all threads

  • Ronald S. Bultje added 5 commits

  • Henrik Gramner approved this merge request

  • Jean-Baptiste Kempf changed milestone to %1.2.0
