Closed
Created Oct 30, 2019 by Ronald S. Bultje (@rbultje, Developer). 0 of 1 task completed.

Use 8-bit instructions for fully edged cdef filter SIMD

I think the performance can be improved for 8-bit video and blocks that are not along any edges (which make up the vast majority). In that case 32 pixels (an 8x4 block) can be filtered in parallel inside __m256i registers all the way up to the _mm256_maddubs_epi16 operation before the final summing and packing. It's a bit hard for me to read cdef.asm, but it appears that operations are mostly 16-bit there. The filter was designed with 8-bit operations in mind to make very efficient implementations possible (and encoders can also ignore the clipping during the strength selection for further speed-ups).

In libaom the blocks are first converted to 16 bit and pixels outside edges are set to CDEF_VERY_LARGE, so that these pixels effectively get zero weight. This trick is not necessary if the block does not border an edge, and the filter can then work directly on the 8-bit values.

Below is some code based on the libaom code demonstrating how it can be done. Also included in that file is a definition for v256_adiff_u8, which is missing in libaom (it has no single equivalent instruction on x86, but maps to vabdq_u8 on arm). The gather/scatter corresponds to the AVX2/AVX-512 instructions (but individual loads/stores seem faster anyway on AVX2).
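Separately from the file referred to above, here is a minimal scalar sketch in C of why the magnitude part of the CDEF constrain step fits in 8 bits. It models what a single byte lane would do. The helper names (`subs_u8`, `adiff_u8`, `constrain_mag_u8`, `verify_all`) are hypothetical and come from neither dav1d nor libaom. It assumes the constrain magnitude is min(|diff|, max(0, threshold - (|diff| >> shift))), with `shift` standing for the adjusted damping. The absolute difference is built from two unsigned saturating subtracts OR'd together, a common substitute on x86 for the missing single instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-lane model of an unsigned saturating subtract (clamps at 0). */
static uint8_t subs_u8(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a - b| from two saturating subtracts: one of the two is always 0,
 * so OR-ing them yields the absolute difference. */
static uint8_t adiff_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

static uint8_t min_u8(uint8_t a, uint8_t b) {
    return a < b ? a : b;
}

/* Magnitude of the constrained difference using only 8-bit operations.
 * The sign would be reapplied separately by the caller. */
static uint8_t constrain_mag_u8(uint8_t px, uint8_t ref,
                                uint8_t threshold, unsigned shift) {
    uint8_t mag = adiff_u8(px, ref);
    return min_u8(mag, subs_u8(threshold, (uint8_t)(mag >> shift)));
}

/* Wide-integer reference for the same magnitude. */
static int constrain_mag_ref(int px, int ref, int threshold, unsigned shift) {
    int mag = abs(px - ref);
    int lim = threshold - (mag >> shift);
    if (lim < 0) lim = 0;
    return mag < lim ? mag : lim;
}

/* Exhaustively compare both versions over all 8-bit pixel pairs. */
static void verify_all(uint8_t threshold, unsigned shift) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(constrain_mag_u8((uint8_t)a, (uint8_t)b, threshold, shift)
                   == constrain_mag_ref(a, b, threshold, shift));
}
```

Calling `verify_all(4, 2)`, for example, checks every pixel pair for one threshold/shift combination. A SIMD version would apply the same three operations (saturating subtract, OR, unsigned min) to all byte lanes at once.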

See !253 (comment 46332)

To do:

  • SSSE3

Already done:

  • arm32, arm64, AVX2
Edited Feb 12, 2021 by Ronald S. Bultje