Cdef filter simd

Merged Ronald S. Bultje requested to merge rbultje/dav1d:cdef-filter-simd into master
All threads resolved!

cdef_filter_4x4_8bpc_c: 2273.6
cdef_filter_4x4_8bpc_avx2: 113.6
cdef_filter_8x8_8bpc_c: 7913.0
cdef_filter_8x8_8bpc_avx2: 309.9

Decoding time for the first 1000 frames of chimera 1080p drops to 15.51s, from 23.1s before any cdef_filter SIMD and 17.86s with only the 8x8 cdef_filter SIMD.

Also add unit tests and rewrite C code to remove last remnants of libaom code in cdef.c.

Merge request reports

Pipeline #1427 passed

Pipeline passed for 46a3fd20 on rbultje:cdef-filter-simd

Approval is optional

Merged by Ronald S. Bultje 6 years ago (Oct 29, 2018 4:19pm UTC)

Merge details

  • Changes merged into master with 46a3fd20.
  • Deleted the source branch.
  • Auto-merge enabled

Pipeline #1428 passed

Pipeline passed for 46a3fd20 on master

Activity

  • Ronald S. Bultje added 11 commits

    • bfd16f58...d8996b18 - 7 commits from branch videolan:master
    • 1760cb74 - Simplify/rewrite cdef filtering code.
    • f1dc0ee2 - Add CDEF filter checkasm unit test
    • 8cfc78b3 - Add 8x8 cdef_filter AVX2 implementation
    • 61cd6b7b - Add a 4x4 cdef_filter AVX2 implementation

  • Ronald S. Bultje resolved all discussions

  • Ronald S. Bultje added 4 commits

    • fd811f6c - Simplify/rewrite cdef filtering code
    • 5c2d6c31 - Add CDEF filter checkasm unit test
    • 961eafe9 - Add 8x8 cdef_filter AVX2 implementation
    • 70fe1a3e - Add a 4x4 cdef_filter AVX2 implementation

  • Henrik Gramner
  • Ronald S. Bultje added 2 commits

    • 8abf7246 - Add 8x8 cdef_filter AVX2 implementation
    • a73c7d82 - Add a 4x4 cdef_filter AVX2 implementation

  • Henrik Gramner resolved all discussions

  • Henrik Gramner
  • Henrik Gramner resolved all discussions

  • Ronald S. Bultje added 5 commits

    • ba08e37c - 1 commit from branch videolan:master
    • 8007c79f - Simplify/rewrite cdef filtering code
    • 3b02d3a9 - Add CDEF filter checkasm unit test
    • e2c6d029 - Add 8x8 cdef_filter AVX2 implementation
    • 46a3fd20 - Add a 4x4 cdef_filter AVX2 implementation

  • Ronald S. Bultje enabled an automatic merge when the pipeline for 46a3fd20 succeeds

  • I think the performance can be improved for 8-bit video and blocks that are not along any edges (which make up the vast majority). In that case 32 pixels (an 8x4 block) can be filtered in parallel inside __m256i registers all the way up to the _mm256_maddubs_epi16 operation, before the final summing and packing. It's a bit hard for me to read cdef.asm, but it appears that operations are mostly 16-bit there. The filter was designed with 8-bit operations in mind to make very efficient implementations possible (and encoders can also ignore the clipping during the strength selection for further speed-ups).

    In libaom the blocks are first converted to 16 bit and pixels outside edges are set to CDEF_VERY_LARGE, so that these pixels effectively get zero weight. This trick is not necessary if the block does not border an edge, and the filter can then work directly on the 8-bit values. Below is some code based on the libaom code demonstrating how it can be done. Also included in that file is a definition for v256_adiff_u8, which is missing in libaom (it has no single equivalent instruction on x86, but maps to vabdq_u8 on Arm). The gather/scatter corresponds to the AVX2/AVX-512 instructions (but individual loads/stores seem faster anyway on AVX2).

    cdef.c

    The cdef_dir AVX2 implementation is brilliant, by the way.

  • Ronald S. Bultje mentioned in issue #305

  • added x86 label
