
x86: Add high bit-depth film grain AVX-512 (Ice Lake) asm

Merged Henrik Gramner requested to merge gramner/dav1d:avx512_filmgrain_16bpc into master
fgy_32x32xn_16bpc_avx2:                 1217.0
fgy_32x32xn_16bpc_avx512icl:            1175.0

fguv_32x32xn_16bpc_420_csfl0_avx2:       714.9
fguv_32x32xn_16bpc_420_csfl0_avx512icl:  498.1
fguv_32x32xn_16bpc_420_csfl1_avx2:       420.9
fguv_32x32xn_16bpc_420_csfl1_avx512icl:  413.0

fguv_32x32xn_16bpc_422_csfl0_avx2:      1400.8
fguv_32x32xn_16bpc_422_csfl0_avx512icl:  967.5
fguv_32x32xn_16bpc_422_csfl1_avx2:       821.8
fguv_32x32xn_16bpc_422_csfl1_avx512icl:  818.0

fguv_32x32xn_16bpc_444_csfl0_avx2:      2691.0
fguv_32x32xn_16bpc_444_csfl0_avx512icl: 1530.8
fguv_32x32xn_16bpc_444_csfl1_avx2:      1378.8
fguv_32x32xn_16bpc_444_csfl1_avx512icl: 1193.8

Helps the most in the csfl0 chroma functions, since those are the most computationally expensive relative to the number of memory loads.
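To make the relative gains easier to compare, here is a minimal sketch that turns the figures listed above into AVX2-to-AVX-512 ratios. It assumes the numbers are timings where lower is better; the short dictionary keys and the `speedup` helper are illustrative names, not part of dav1d.

```python
# Timings copied from the benchmark listing above as (avx2, avx512icl).
# Assumption: these are timings, so lower is better.
timings = {
    "fgy":            (1217.0, 1175.0),
    "fguv_420_csfl0": (714.9, 498.1),
    "fguv_420_csfl1": (420.9, 413.0),
    "fguv_422_csfl0": (1400.8, 967.5),
    "fguv_422_csfl1": (821.8, 818.0),
    "fguv_444_csfl0": (2691.0, 1530.8),
    "fguv_444_csfl1": (1378.8, 1193.8),
}


def speedup(avx2: float, avx512: float) -> float:
    """Return how many times faster the AVX-512 version is than AVX2."""
    return avx2 / avx512


speedups = {name: speedup(*pair) for name, pair in timings.items()}

# Print the functions ordered from largest to smallest gain.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {ratio:.2f}x")
```

Under that assumption, the three csfl0 chroma functions come out at roughly 1.4x to 1.8x, while fgy and the 420/422 csfl1 variants stay within a few percent of AVX2.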

The other functions are bottlenecked by the number of memory loads that can be executed per cycle, at least on RKL (Rocket Lake), which has two load ports. Some newer µarchs have three load ports, so it might help more on such systems. In any case, the number of retired instructions is much lower than with AVX2, which should free up more resources for the other SMT thread in multi-threaded scenarios.
