x86: add AVX512-IceLake implementation of HBD 64x32 DCT^2

Merged Ronald S. Bultje requested to merge rbultje/dav1d:itx-avx512icl-hbd-64x32 into master
1 unresolved thread
inv_txfm_add_64x32_dct_dct_0_10bpc_c:           1760.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_0_10bpc_sse4:         271.1 ( 6.49x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx2:         121.3 (14.52x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx512icl:    116.3 (15.14x)
inv_txfm_add_64x32_dct_dct_1_10bpc_c:          66507.4 ( 1.00x)
inv_txfm_add_64x32_dct_dct_1_10bpc_sse4:        3712.4 (17.91x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx2:        1830.5 (36.33x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx512icl:    805.4 (82.58x)
inv_txfm_add_64x32_dct_dct_2_10bpc_c:          66491.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_2_10bpc_sse4:        5325.3 (12.49x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx2:        2578.5 (25.79x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx512icl:   1394.5 (47.68x)
inv_txfm_add_64x32_dct_dct_3_10bpc_c:          66490.2 ( 1.00x)
inv_txfm_add_64x32_dct_dct_3_10bpc_sse4:        6418.5 (10.36x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx2:        3305.6 (20.11x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx512icl:   2571.5 (25.86x)
inv_txfm_add_64x32_dct_dct_4_10bpc_c:          66508.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_4_10bpc_sse4:        8671.2 ( 7.67x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx2:        4054.2 (16.40x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx512icl:   2691.6 (24.71x)
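
The parenthesised column in the numbers above is the C reference timing divided by the timing of each implementation. As a minimal sketch (the `speedup` helper and the `timings` dictionary are illustrative names, not part of dav1d or checkasm), the ratios for the `_1` case can be recomputed from the printed values. Because the printed timings are already rounded, a recomputed ratio can differ from the listed one in the last digit.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster `optimized` is than `baseline`."""
    return baseline / optimized


# Timings copied from the inv_txfm_add_64x32_dct_dct_1_10bpc rows above.
timings = {
    "c": 66507.4,
    "sse4": 3712.4,
    "avx2": 1830.5,
    "avx512icl": 805.4,
}

# Speedup of every implementation relative to the C reference.
for name, value in timings.items():
    print(f"{name:>10}: {speedup(timings['c'], value):6.2f}x")

# Relative gain of the new AVX-512 code over the existing AVX2 code.
print(f"avx512icl vs avx2: {speedup(timings['avx2'], timings['avx512icl']):.2f}x")
```

The last line shows the comparison most relevant to this merge request: how much the new AVX-512 path gains over the existing AVX2 path, rather than over plain C.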

Merge request reports

Pipeline #334035 passed

Pipeline passed for 68d7a76d on rbultje:itx-avx512icl-hbd-64x32

Test coverage 92.05% (0.16%) from 1 job

Merged by Ronald S. Bultje (Apr 18, 2023 3:53pm UTC)

Pipeline #334046 passed

Pipeline passed for 68d7a76d on master

Test coverage 92.07% (0.16%) from 1 job

Activity

    call m(inv_txfm_add_dct_dct_32x32_10bpc).pass2_fast2_start
    mov r7d, 16*4
    mov r4, dstq
    pxor m12, m12
    call m(inv_txfm_add_dct_dct_32x32_10bpc).pass2_end
    lea dstq, [r4+64]
    mova m0, [rsp+16*mmsize]
    mova m1, [rsp+17*mmsize]
    mova m2, [rsp+18*mmsize]
    mova m3, [rsp+19*mmsize]
    mova m4, [rsp+20*mmsize]
    mova m5, [rsp+21*mmsize]
    mova m6, [rsp+22*mmsize]
    mova m7, [rsp+23*mmsize]
    lea r5, [o_base]
    vpbroadcastd m13, [o(pd_2048)]
  • Author Developer

    I should be able to avoid clobbering m10 (I believe it's clobbered in .transpose_8x32), and then I would only need to load it once instead of once per call. Not super-important, but this feels a bit silly.

  • Author Developer

    This is trickier than expected, since this function is used in a fair number of places. I think I'll leave it as-is for now, since the impact of this load is tiny.

  • Henrik Gramner
  • added 1 commit

    • 68d7a76d - x86: add AVX512-IceLake implementation of HBD 64x32 DCT^2

  • Henrik Gramner approved this merge request

  • Jean-Baptiste Kempf changed milestone to %1.2.0
