Cdef filter simd
cdef_filter_4x4_8bpc_c: 2273.6
cdef_filter_4x4_8bpc_avx2: 113.6
cdef_filter_8x8_8bpc_c: 7913.0
cdef_filter_8x8_8bpc_avx2: 309.9
Decoding time for the first 1000 frames of Chimera 1080p drops to 15.51s, from 23.1s before any cdef_filter SIMD, or 17.86s with only the 8x8 cdef_filter SIMD.
Also add unit tests and rewrite the C code to remove the last remnants of libaom code in cdef.c.
mentioned in issue #78
At a macro level, a different approach would be to convert one slice of pre-CDEF data into a uint16_t buffer that includes the edges, and then feed this buffer (in luma: w+4 x 12) directly into the DSP function. The whole stack make-up would then no longer be necessary. Advantages: for frames where most blocks are CDEF'ed, the SIMD becomes faster and simpler. Disadvantages: for frames with mostly skipped blocks, we incur a lot of unnecessary overhead.
Not sure how relevant this is for the review of this patch - it's more philosophical in a way, but I've added it to the task list on the wiki for future exploration.
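The padded-buffer idea above could look roughly like the following. This is only a sketch under assumptions, not code from the patch: the helper name pad_slice_to_u16, the CDEF_BORDER constant and the row-major layout are hypothetical, and it assumes the 2-pixel border around the slice is readable in the source picture (real code would have to handle frame edges separately). For an 8-row luma slice it produces the (w+4) x 12 buffer mentioned in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed border: 2 pixels on each side, matching the w+4 / 8+4 sizes above. */
#define CDEF_BORDER 2

/* Hypothetical helper: widen a w x h slice of 8-bit pixels, plus a
 * 2-pixel border on every side, into a (w+4) x (h+4) uint16_t buffer.
 * `src` points at the top-left pixel of the slice itself; the caller
 * guarantees that the surrounding border pixels are readable. */
static void pad_slice_to_u16(uint16_t *dst, const uint8_t *src,
                             ptrdiff_t src_stride, int w, int h)
{
    assert(w > 0 && h > 0);
    const int dst_stride = w + 2 * CDEF_BORDER;

    for (int y = -CDEF_BORDER; y < h + CDEF_BORDER; y++)
        for (int x = -CDEF_BORDER; x < w + CDEF_BORDER; x++)
            dst[(y + CDEF_BORDER) * dst_stride + (x + CDEF_BORDER)] =
                src[y * src_stride + x];
}
```

The trade-off described in the comment is visible here: the copy touches every pixel of the slice whether or not the blocks in it end up being filtered.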
added 17 commits
- 915e3701...acde4240 - 13 commits from branch videolan:master
- 4feb96b9 - Simplify/rewrite cdef filtering code.
- 6d3847b4 - Add CDEF filter checkasm unit test
- f7d4242b - Add 8x8 cdef_filter AVX2 implementation
- bfd16f58 - Add a 4x4 cdef_filter AVX2 implementation
mentioned in issue #15 (closed)
- Resolved by Ronald S. Bultje
added 11 commits
- bfd16f58...d8996b18 - 7 commits from branch videolan:master
- 1760cb74 - Simplify/rewrite cdef filtering code.
- f1dc0ee2 - Add CDEF filter checkasm unit test
- 8cfc78b3 - Add 8x8 cdef_filter AVX2 implementation
- 61cd6b7b - Add a 4x4 cdef_filter AVX2 implementation
- Resolved by Ronald S. Bultje
- Resolved by Ronald S. Bultje
- Resolved by Henrik Gramner
- Resolved by Henrik Gramner
enabled an automatic merge when the pipeline for 46a3fd20 succeeds
I think the performance can be improved for 8-bit video and blocks that are not along any edges (which make up the vast majority). In that case 32 pixels (an 8x4 block) can be filtered in parallel inside __m256i registers all the way up to the _mm256_maddubs_epi16 operation, before the final summing and packing. It's a bit hard for me to read cdef.asm, but it appears that the operations there are mostly 16-bit. The filter was designed with 8-bit operations in mind to make very efficient implementations possible (and encoders can also ignore the clipping during strength selection for further speed-ups).
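To make the 8-bit claim concrete, here is a scalar model of one 8-bit lane, assuming the constrain step is min(|d|, max(0, threshold - (|d| >> shift))) with shift = max(0, damping - floor(log2(threshold))), which is how I understand the CDEF definition. All function names are hypothetical and this is not the AVX2 code. It only shows that the magnitude can be computed entirely with unsigned 8-bit saturating operations; the sign of the difference would be applied separately. On x86 the absolute difference can be formed the same way, from two unsigned saturating subtractions OR-ed together.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned saturating subtract, one 8-bit lane. */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b|: one of the two saturating differences is always zero,
 * so OR-ing them yields the absolute difference. */
static uint8_t adiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

/* Damping shift: max(0, damping - floor(log2(threshold))).
 * For threshold == 0 the limit below is 0 whatever the shift is. */
static int cdef_shift(int threshold, int damping)
{
    assert(threshold >= 0 && threshold < 256 && damping >= 0);
    int log2t = 0;
    for (int t = threshold; t > 1; t >>= 1)
        log2t++;
    return damping > log2t ? damping - log2t : 0;
}

/* Magnitude of the constrained difference between neighbour p and
 * centre pixel x, using only 8-bit unsigned arithmetic. */
static uint8_t constrain_mag_u8(uint8_t p, uint8_t x,
                                uint8_t threshold, int shift)
{
    const uint8_t ad  = adiff_u8(p, x);
    const uint8_t lim = subs_u8(threshold, (uint8_t)(ad >> shift));
    return ad < lim ? ad : lim;
}
```

The saturating subtract replaces the explicit max(0, ...), which is what keeps the whole step inside unsigned 8-bit lanes.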
In libaom the blocks are first converted to 16 bit and pixels outside the edges are set to CDEF_VERY_LARGE, so that these pixels effectively get zero weight. This trick is not necessary if the block does not border an edge, and the filter can then work directly on the 8-bit values. Below is some code based on the libaom code demonstrating how it can be done. Also included in that file is a definition for v256_adiff_u8, which is missing in libaom (it has no single equivalent instruction on x86, but maps to vabdq_u8 on Arm). The gather/scatter corresponds to the AVX2/AVX-512 instructions (but individual loads/stores seem faster anyway on AVX2).
The cdef_dir AVX2 implementation is brilliant, by the way.
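The zero-weight effect of CDEF_VERY_LARGE described above can be checked with a small scalar sketch. It is not taken from libaom or from this patch: the names constrain and tap are illustrative, the sentinel value 30000 is an assumption standing in for CDEF_VERY_LARGE, and the constrain formula is the one I understand CDEF to use. Because the difference to any 8-bit pixel is huge, threshold - (|diff| >> shift) goes negative and is clamped to zero, so the tap contributes nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stands in for CDEF_VERY_LARGE; the value is an assumption. */
#define SENTINEL 30000

/* Scalar constrain step (assumed form):
 * sign(diff) * min(|diff|, max(0, threshold - (|diff| >> shift))). */
static int constrain(int diff, int threshold, int damping)
{
    assert(threshold >= 0 && damping >= 0);
    if (!threshold)
        return 0;

    int log2t = 0;
    for (int t = threshold; t > 1; t >>= 1)
        log2t++;
    const int shift = damping > log2t ? damping - log2t : 0;

    const int mag = abs(diff);
    int lim = threshold - (mag >> shift);
    if (lim < 0)
        lim = 0;
    const int c = mag < lim ? mag : lim;
    return diff < 0 ? -c : c;
}

/* Contribution of one neighbour to the filter sum for centre pixel x.
 * The neighbour is 16-bit so it can hold the out-of-frame sentinel. */
static int tap(uint16_t neighbour, uint8_t x, int threshold, int damping)
{
    return constrain((int)neighbour - (int)x, threshold, damping);
}
```

For example, with threshold 4 and damping 3, a neighbour of 102 next to a centre of 100 contributes 2, while a SENTINEL neighbour contributes 0 for any 8-bit centre value.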
mentioned in issue #305
added x86 label