
x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2

Merged Ronald S. Bultje requested to merge rbultje/dav1d:itx-avx512icl-hbd-64x64 into master

Also implement a "fast3" path for pass2.dct64 (for the case where only 1/8th of the coefficients are non-zero), which affects 32x64 as well as 64x64.
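To illustrate why a path specialized for mostly-zero input can be cheaper, here is a minimal sketch in Python. It is not the assembly from this merge request: `idct_full` and `idct_fast` are hypothetical names, and an unnormalized cosine sum stands in for the real 64-point inverse DCT. The only point it makes is that when just the first 8 of 64 coefficients are non-zero, summing over those 8 gives the same output as summing over all 64.

```python
import math

def idct_full(coefs):
    """Unnormalized cosine sum over all n coefficients (toy inverse DCT)."""
    n = len(coefs)
    return [
        sum(c * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
            for k, c in enumerate(coefs))
        for x in range(n)
    ]

def idct_fast(coefs, nonzero):
    """Same sum, restricted to the first `nonzero` coefficients.

    Only valid when every coefficient at index >= nonzero is zero.
    """
    n = len(coefs)
    return [
        sum(coefs[k] * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
            for k in range(nonzero))
        for x in range(n)
    ]

# 64 coefficients, of which only the first 8 (1/8th) are non-zero.
coefs = [float(k + 1) for k in range(8)] + [0.0] * 56
full = idct_full(coefs)
fast = idct_fast(coefs, nonzero=8)

# Both paths agree, but the fast one does 1/8th of the multiply-adds.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, fast))
```

A real implementation would presumably select such a path from information the decoder already has about where the last non-zero coefficient lies; how this merge request does that selection is not shown here.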

Before:

inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    744.8 (68.49x)

After:

inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    668.3 (76.34x)

(Not sure why the SSE4 speed changed.)
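The before/after change can be quantified directly from the two listings above. The following is a small sketch that parses lines in the `name: timing (speedup)` layout shown here and reports the relative reduction in timing per function (lower timings are better). The helper names `parse` and `improvement` are made up for this example and are not part of checkasm or dav1d.

```python
import re

# Matches e.g. "inv_txfm_add_32x64_dct_dct_1_10bpc_c:   51008.6 ( 1.00x)"
LINE = re.compile(r"^(\S+):\s+([\d.]+)\s+\(\s*([\d.]+)x\)$")

def parse(report):
    """Map function name -> timing for each benchmark line in `report`."""
    timings = {}
    for line in report.splitlines():
        m = LINE.match(line.strip())
        if m:
            timings[m.group(1)] = float(m.group(2))
    return timings

def improvement(before, after):
    """Relative timing reduction per function; positive means faster."""
    return {
        name: (before[name] - after[name]) / before[name]
        for name in before
        if name in after
    }

BEFORE = """
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    744.8 (68.49x)
"""

AFTER = """
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    668.3 (76.34x)
"""

gains = improvement(parse(BEFORE), parse(AFTER))
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

On these numbers the avx512icl timing drops by roughly 10%, while the C and AVX2 timings move by well under 1%.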

And speed for 64x64:

inv_txfm_add_64x64_dct_dct_0_10bpc_c:           3506.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_0_10bpc_sse4:         535.6 ( 6.55x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx2:         223.5 (15.69x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx512icl:    252.4 (13.89x)
inv_txfm_add_64x64_dct_dct_1_10bpc_c:         108353.7 ( 1.00x)
inv_txfm_add_64x64_dct_dct_1_10bpc_sse4:        6551.9 (16.54x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx2:        2876.8 (37.66x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx512icl:   1310.1 (82.70x)
inv_txfm_add_64x64_dct_dct_2_10bpc_c:         108347.6 ( 1.00x)
inv_txfm_add_64x64_dct_dct_2_10bpc_sse4:        7985.4 (13.57x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx2:        3561.8 (30.42x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx512icl:   1962.6 (55.20x)
inv_txfm_add_64x64_dct_dct_3_10bpc_c:         108455.5 ( 1.00x)
inv_txfm_add_64x64_dct_dct_3_10bpc_sse4:        9709.0 (11.17x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx2:        4220.5 (25.70x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx512icl:   2991.1 (36.26x)
inv_txfm_add_64x64_dct_dct_4_10bpc_c:         108349.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_4_10bpc_sse4:       11048.0 ( 9.81x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx2:        4898.1 (22.12x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx512icl:   3108.1 (34.86x)

Activity

  • Ronald S. Bultje resolved all threads


  • added 1 commit

    • d14c773c - x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2


  • Ronald S. Bultje resolved all threads


  • Ronald S. Bultje resolved all threads


  • Ronald S. Bultje resolved all threads


  • Henrik Gramner approved this merge request


  • Ronald S. Bultje added 2 commits


    • feeeccb6 - 1 commit from branch videolan:master
    • ad0f3e6a - x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2


  • Ronald S. Bultje enabled an automatic merge when the pipeline for ad0f3e6a succeeds


  • Jean-Baptiste Kempf changed milestone to %1.2.0

