Use 8-bit instructions for fully edged cdef filter SIMD
I think performance can be improved for 8-bit video on blocks that do not touch any edge, which are the vast majority. In that case 32 pixels (an 8x4 block) can be filtered in parallel in __m256i registers all the way up to the _mm256_maddubs_epi16 operation, before the final summing and packing. cdef.asm is a bit hard for me to read, but the operations there appear to be mostly 16-bit. The filter was designed with 8-bit operations in mind to make very efficient implementations possible (and encoders can also ignore the clipping during strength selection for further speed-ups).

In libaom the blocks are first converted to 16 bits, and pixels outside the edges are set to CDEF_VERY_LARGE so that they effectively get zero weight. This trick is not necessary when the block does not border an edge, and the filter can then work directly on the 8-bit values.

Below is some code based on the libaom code demonstrating how it can be done. That file also includes a definition of v256_adiff_u8, which is missing in libaom (it has no single equivalent instruction on x86, but maps to vabdq_u8 on Arm). The gather/scatter corresponds to the AVX2/AVX-512 instructions (but individual loads/stores seem faster anyway on AVX2).
- arm32, arm64, AVX2
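
To make the point about the missing 8-bit absolute difference concrete, here is a rough scalar C model of a single 8-bit lane. It is only a sketch, not the libaom or dav1d code: the names subs_u8, adiff_u8, constrain_mag_u8 and check_adiff_u8 are made up for illustration, and the shift is passed in directly instead of being derived from the damping and threshold. The idea is that two saturating subtractions OR'ed together give the absolute difference without widening to 16 bits, which on AVX2 would be two _mm256_subs_epu8 and one _mm256_or_si256.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating unsigned subtract: models one 8-bit lane of _mm256_subs_epu8. */
static uint8_t subs_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a > b ? a - b : 0);
}

/* Absolute difference of two unsigned bytes without widening.
 * One of the two saturating subtractions is always 0, so OR-ing them
 * yields |a - b|. On Arm this is a single vabdq_u8. */
static uint8_t adiff_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

/* Magnitude of the constrained difference using only 8-bit operations:
 * min(|a - b|, max(0, threshold - (|a - b| >> shift))).
 * Assumption: shift is precomputed by the caller. */
static uint8_t constrain_mag_u8(uint8_t a, uint8_t b,
                                uint8_t threshold, unsigned shift) {
    uint8_t d = adiff_u8(a, b);
    uint8_t limit = subs_u8(threshold, (uint8_t)(d >> shift));
    return d < limit ? d : limit;
}

/* Exhaustively compare adiff_u8 against a widened reference. */
static void check_adiff_u8(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(adiff_u8((uint8_t)a, (uint8_t)b) ==
                   (a > b ? a - b : b - a));
}
```

Since every intermediate value fits in a byte, a 256-bit register could hold 32 such lanes at once. The sign of the difference would still have to be handled separately before the weighted sum.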