x86: Add 8-bit AVX-512 (Ice Lake) asm

Merged Henrik Gramner requested to merge gramner/dav1d:avx512icl_8bpc into master

Overall performance of SSE4.1 vs AVX2 vs AVX-512 on an 8-core/16-thread Intel Rocket Lake system:

[Figures: Chimera 1080p, HoliFestival 2160p, SummerNature 2160p]

On this system AVX-512 speeds up overall decoding performance by around 10-20% over AVX2 at low thread counts. At high thread counts the improvement shrinks to around 5%, mainly because DRAM bandwidth becomes more of a bottleneck, with the CPU spending an ever-increasing portion of overall runtime waiting on memory instead of doing any useful work. Dual-channel DDR4 is clearly not cutting it anymore, and faster memory/more memory channels/more L3 cache would be helpful.

On an AWS m6i.4xlarge (16 vCPU Ice Lake-SP) instance, which has more DRAM bandwidth available, the performance delta between AVX2 and AVX-512 remains more consistent across a wide range of thread counts:

[Figures: HoliFestival 2160p (ICL-SP)]

For power consumption, I took measurements as reported by the CPU SVID during real-time decoding of some 4K samples (1080p barely puts any load on the CPU, so the power usage is hardly above idle, with no difference between instruction sets):

| Avg. power usage | SSE4.1 | AVX2   | AVX-512 |
|------------------|--------|--------|---------|
| HoliFestival     | 49.7 W | 46.1 W | 42.5 W  |
| SummerNature     | 42.8 W | 43.2 W | 43.5 W  |
| SummerInTomsk    | 40.3 W | 39.6 W | 37.4 W  |

Overall, wider SIMD generally results in better power efficiency. The outlier is SummerNature, which can likely be explained by the fact that it contains only static shots with little or no movement at a very high bitrate, so the CPU time is spent very differently compared to the other clips.
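To put the table into relative terms, here is a small sketch that computes the percentage change in average power when going from AVX2 to AVX-512. The numbers are copied from the table; the helper names (`relative_change`, `power_deltas`) are made up for illustration. Since the measurements are for real-time decoding, the same amount of work is done per second in every column, so lower average power should roughly correspond to less energy per decoded frame (an inference, not something that was measured separately).

```python
# Average power usage in watts during real-time 4K decoding (from the table).
power_w = {
    "HoliFestival":  {"SSE4.1": 49.7, "AVX2": 46.1, "AVX-512": 42.5},
    "SummerNature":  {"SSE4.1": 42.8, "AVX2": 43.2, "AVX-512": 43.5},
    "SummerInTomsk": {"SSE4.1": 40.3, "AVX2": 39.6, "AVX-512": 37.4},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old` (negative = less power)."""
    return (new - old) / old * 100.0


def power_deltas(table, baseline="AVX2", candidate="AVX-512"):
    """Per-clip percentage change in power, candidate vs. baseline."""
    return {
        clip: relative_change(row[candidate], row[baseline])
        for clip, row in table.items()
    }


for clip, delta in power_deltas(power_w).items():
    print(f"{clip:14s} AVX-512 vs AVX2: {delta:+.1f}%")
```

This gives roughly -7.8% for HoliFestival and -5.6% for SummerInTomsk, while SummerNature moves by less than +1%.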

The current generation Intel µarchitectures have 3 SIMD execution units (p0, p1, p5), two of which (p0, p1) are 256-bit and one (p5) 512-bit. On client CPUs p5 can only execute shuffles and basic arithmetic/logical operations (add/sub/and/or/xor etc.), while on server CPUs p5 is also capable of executing more complex arithmetic instructions (almost everything, in fact). p0+p1 can fuse into a single combined 512-bit unit in order to perform 512-bit operations, which allows for either 3x256-bit or 2x512-bit per cycle. Pure throughput under ideal circumstances, with an instruction mix where all execution units can be fully utilized, is therefore increased by 33% when using AVX-512 compared to AVX2. New instructions, like VNNI and VBMI, improve things further though by reducing the number of instructions required to perform certain calculations.
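The 33% figure follows directly from the port widths. A minimal sketch of the arithmetic, assuming the ideal case described above where every SIMD port issues one vector operation per cycle (`peak_bits_per_cycle` is a hypothetical helper, not anything from dav1d):

```python
def peak_bits_per_cycle(unit_widths):
    """Sum of vector widths (in bits) that can be issued in one cycle."""
    return sum(unit_widths)


# AVX2: p0, p1 and p5 each execute one 256-bit operation.
avx2_units = [256, 256, 256]
# AVX-512: p0+p1 fuse into one 512-bit unit, p5 is natively 512-bit.
avx512_units = [512, 512]

avx2_peak = peak_bits_per_cycle(avx2_units)      # 768 bits/cycle
avx512_peak = peak_bits_per_cycle(avx512_units)  # 1024 bits/cycle

gain_pct = (avx512_peak / avx2_peak - 1) * 100
print(f"AVX2: {avx2_peak} bits/cycle, AVX-512: {avx512_peak} bits/cycle")
print(f"Ideal throughput gain: {gain_pct:.1f}%")
```

This only models raw vector width. It ignores the client/server difference in what p5 can execute, and any instruction-count savings from VNNI or VBMI.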

It's somewhat content-dependent, but around half of the overall runtime in the decoder is spent in scalar code, which doesn't benefit from SIMD, and some of the DSP code operates on small blocks that don't benefit much, if at all, from wider SIMD.
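As a rough sanity check, Amdahl's law shows how a large scalar share caps the overall gain. The inputs below are assumptions for illustration only: a SIMD share of 0.5 (from "around half" above) and a SIMD speedup of 4/3 (the ideal 33% width-only figure). Real numbers vary per clip, and memory stalls are not modeled at all.

```python
def overall_speedup(simd_fraction, simd_speedup):
    """Amdahl's law: only `simd_fraction` of the runtime gets faster."""
    scalar_fraction = 1.0 - simd_fraction
    return 1.0 / (scalar_fraction + simd_fraction / simd_speedup)


# Assumed inputs, not measurements.
simd_fraction = 0.5      # about half of the runtime is SIMD code
simd_speedup = 4.0 / 3   # ideal 2x512-bit vs 3x256-bit throughput

speedup = overall_speedup(simd_fraction, simd_speedup)
print(f"Estimated overall gain: {(speedup - 1) * 100:.1f}%")

# Upper bound if the SIMD part became infinitely fast.
limit = overall_speedup(simd_fraction, float("inf"))
print(f"Upper bound with free SIMD: {limit:.1f}x")
```

Under these assumptions the estimate comes out at about 14%, which sits inside the 10-20% range observed at low thread counts. Fewer instructions via VNNI/VBMI would push it up, and small blocks that can't fill 512-bit vectors would pull it down.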
