Cdef filter simd

mentioned in issue #78

At a macro level, a different approach would be to convert one slice of pre-CDEF data into a uint16_t buffer that includes the edges, and then pass this buffer (in luma: (w+4) x 12) directly to the DSP function. The whole stack set-up would then be unnecessary.

Advantages: for frames where most blocks are CDEF'ed, the SIMD is faster and simpler. Disadvantages: for frames with mostly skipped blocks, we incur a lot of unnecessary overhead.
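To make the buffer idea concrete, here is a minimal sketch of the padding step. It is not code from this patch: the function name pad_slice_u16, the fill parameter and the assumption that frame still holds pre-CDEF pixels are all hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAD = 2 }; /* 2-pixel border: 8 luma rows become 12, w becomes w + 4 */

/* Copy an h-row slice of 8-bit pixels, starting at row y0, into a
 * (frame_w + 4) x (h + 4) uint16_t buffer. Positions that fall outside the
 * frame are set to `fill` (hypothetical sentinel chosen by the caller). */
static void pad_slice_u16(uint16_t *dst, const uint8_t *frame, ptrdiff_t stride,
                          int frame_w, int frame_h, int y0, int h, uint16_t fill)
{
    assert(frame_w > 0 && frame_h > 0 && h > 0);
    const int dst_w = frame_w + 2 * PAD;
    for (int y = -PAD; y < h + PAD; y++) {
        uint16_t *row = dst + (size_t)(y + PAD) * (size_t)dst_w;
        const int fy = y0 + y; /* row in frame coordinates */
        for (int x = -PAD; x < frame_w + PAD; x++) {
            const int inside = fy >= 0 && fy < frame_h && x >= 0 && x < frame_w;
            row[x + PAD] = inside ? frame[fy * stride + x] : fill;
        }
    }
}
```

With such a buffer the filter could read all taps with plain offsets, at the cost of doing the copy even for slices where most blocks are skipped, which is the trade-off described above.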

Not sure how relevant this is for the review of this patch - it's more philosophical in a way, but I've added it to the task list on the wiki for future exploration.

added 17 commits

915e3701...acde4240 - 13 commits from branch videolan:master
4feb96b9 - Simplify/rewrite cdef filtering code.
6d3847b4 - Add CDEF filter checkasm unit test
f7d4242b - Add 8x8 cdef_filter AVX2 implementation
bfd16f58 - Add a 4x4 cdef_filter AVX2 implementation


mentioned in issue #15 (closed)

added 11 commits

bfd16f58...d8996b18 - 7 commits from branch videolan:master
1760cb74 - Simplify/rewrite cdef filtering code.
f1dc0ee2 - Add CDEF filter checkasm unit test
8cfc78b3 - Add 8x8 cdef_filter AVX2 implementation
61cd6b7b - Add a 4x4 cdef_filter AVX2 implementation


resolved all discussions

added 4 commits

fd811f6c - Simplify/rewrite cdef filtering code
5c2d6c31 - Add CDEF filter checkasm unit test
961eafe9 - Add 8x8 cdef_filter AVX2 implementation
70fe1a3e - Add a 4x4 cdef_filter AVX2 implementation


added 2 commits

8abf7246 - Add 8x8 cdef_filter AVX2 implementation
a73c7d82 - Add a 4x4 cdef_filter AVX2 implementation


resolved all discussions

added 5 commits

ba08e37c - 1 commit from branch videolan:master
8007c79f - Simplify/rewrite cdef filtering code
3b02d3a9 - Add CDEF filter checkasm unit test
e2c6d029 - Add 8x8 cdef_filter AVX2 implementation
46a3fd20 - Add a 4x4 cdef_filter AVX2 implementation


enabled an automatic merge when the pipeline for 46a3fd20 succeeds

merged

I think the performance can be improved for 8-bit video and blocks that are not along any edges (which make up the vast majority). In that case 32 pixels (an 8x4 block) can be filtered in parallel inside __m256i registers all the way up to the _mm256_maddubs_epi16 operation, before the final summing and packing. It's a bit hard for me to read cdef.asm, but it appears that operations are mostly 16-bit there. The filter was designed with 8-bit operations in mind to make very efficient implementations possible (and encoders can also ignore the clipping during the strength selection for further speed-ups).
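As a rough illustration of why 8-bit lanes suffice away from edges, the sketch below models the per-byte work in scalar C. It assumes the usual constrain form min(|d|, max(0, threshold - (|d| >> shift))); the helper names are hypothetical, and the absolute difference is built from two saturating subtracts and an OR, which is one common way to get it on x86 (via _mm256_subs_epu8 and _mm256_or_si256) and not necessarily what any particular implementation does.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating unsigned subtract, the scalar analogue of _mm256_subs_epu8. */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a - b| from two saturating subtracts: one of the two is always zero. */
static uint8_t adiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

/* Magnitude of the constrained difference, using only values that fit in
 * 8 bits. The sign of (tap - px) would be applied afterwards. */
static uint8_t constrain_mag_u8(uint8_t px, uint8_t tap, uint8_t threshold, int shift)
{
    assert(shift >= 0 && shift < 8);
    const uint8_t ad  = adiff_u8(px, tap);
    const uint8_t lim = subs_u8(threshold, (uint8_t)(ad >> shift));
    return ad < lim ? ad : lim;
}
```

Every intermediate stays within 0..255, so 32 pixels per 256-bit register remains plausible until the weighted sum needs more range.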

In libaom the blocks are first converted to 16 bit, and pixels outside edges are set to CDEF_VERY_LARGE, so that these pixels effectively get zero weight. This trick is not necessary if the block does not border an edge, and the filter can then work directly on the 8-bit values. Below is some code based on the libaom code demonstrating how it can be done. Also included in that file is a definition for v256_adiff_u8, which is missing in libaom (it has no single equivalent instruction on x86, but maps to vabdq_u8 on ARM). The gather/scatter corresponds to the AVX2/AVX-512 instructions (but individual loads/stores seem faster anyway on AVX2).

cdef.c
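Separately from the attached file, the zero-weight effect of the sentinel can be checked with a small scalar sketch. The value 30000 and the names SKETCH_VERY_LARGE, msb and constrain_ref are assumptions made for illustration; the constrain form follows the usual CDEF definition, with the damping shift reduced by the position of the threshold's highest set bit.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed sentinel value; the real CDEF_VERY_LARGE is not quoted in this thread. */
#define SKETCH_VERY_LARGE 30000

/* Index of the highest set bit (v must be non-zero). */
static int msb(unsigned v)
{
    int n = 0;
    while (v >>= 1) n++;
    return n;
}

/* Reference constrain in plain int arithmetic. */
static int constrain_ref(int diff, int threshold, int damping)
{
    if (!threshold) return 0;
    assert(threshold > 0);
    int shift = damping - msb((unsigned)threshold);
    if (shift < 0) shift = 0;
    int mag = threshold - (abs(diff) >> shift);
    if (mag < 0) mag = 0;                 /* large differences get no weight */
    if (mag > abs(diff)) mag = abs(diff); /* never exceed the raw difference */
    return diff < 0 ? -mag : mag;
}
```

Because the difference between the sentinel and any 8-bit pixel is huge, the shifted magnitude exceeds every threshold tried here and the tap contributes nothing, which is why the trick is only needed for blocks that actually touch an edge.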

The cdef_dir AVX2 implementation is brilliant, by the way.

mentioned in issue #305

added x86 label


Merged by Ronald S. Bultje (Oct 29, 2018 4:19pm UTC)
