AArch64: Add i8mm support for convolutions

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_i8mm into master

This is a follow-up to !1632 (merged).

Add an Armv8.6-A i8mm code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element USDOT instruction.
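
As a rough illustration of why 4- and 8-tap filters map well onto USDOT, the scalar C sketch below models one 32-bit lane of the instruction: four unsigned 8-bit values are multiplied by four signed 8-bit values and accumulated. The helper names and the tap values mentioned afterwards are hypothetical and not taken from the patch, whose actual implementation is AArch64 assembly.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of USDOT: accumulate four
 * unsigned-by-signed 8-bit products into a 32-bit sum.
 * (Illustrative helper, not part of dav1d.) */
static int32_t usdot_lane(int32_t acc, const uint8_t px[4], const int8_t coef[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)px[i] * (int32_t)coef[i];
    return acc;
}

/* 4-tap filter for one output sample: a single 4-element dot product. */
static int32_t filter_4tap(const uint8_t *src, const int8_t taps[4])
{
    return usdot_lane(0, src, taps);
}

/* 8-tap filter for one output sample: two chained 4-element dot products.
 * `src` points at the first of the 8 input pixels. */
static int32_t filter_8tap(const uint8_t *src, const int8_t taps[8])
{
    int32_t acc = usdot_lane(0, src, taps);
    return usdot_lane(acc, src + 4, taps + 4);
}
```

With the made-up taps {0, 2, -12, 74, 74, -12, 2, 0} (sum 128) and eight pixels all equal to 255, `filter_8tap` returns 255 * 128 = 32640. Because the pixel operand is unsigned, the pixels can be fed to the dot product as they are.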

Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A code path, depending on the input video and the CPU used.

This patch increases the .text section by around 5.7 KiB.

Relative performance to the C reference on some CPUs:

Horizontal-vertical micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_hv_8bpc_neon:     5.64x  7.21x  2.86x
regular_w2_hv_8bpc_dotprod:  6.05x  7.98x  3.00x
regular_w2_hv_8bpc_i8mm:     7.06x  8.69x  3.04x

sharp_w2_hv_8bpc_neon:     5.20x  6.04x  2.66x
sharp_w2_hv_8bpc_dotprod:  4.78x  5.83x  2.63x
sharp_w2_hv_8bpc_i8mm:     5.31x  6.41x  2.71x

regular_w4_hv_8bpc_neon:     7.20x  6.34x 11.20x  9.54x  4.40x  3.91x
regular_w4_hv_8bpc_dotprod: 12.77x 10.98x 18.35x 14.57x  6.21x  5.45x
regular_w4_hv_8bpc_i8mm:    14.50x 12.83x 21.42x 15.85x  6.16x  5.54x

sharp_w4_hv_8bpc_neon:     6.24x  5.40x  9.77x  8.24x  3.96x  3.48x
sharp_w4_hv_8bpc_dotprod:  9.76x  8.77x 14.02x 11.61x  5.20x  4.78x
sharp_w4_hv_8bpc_i8mm:    10.84x  9.70x 16.09x 12.68x  5.42x  4.90x

regular_w8_hv_8bpc_neon:     2.17x  2.27x  2.46x  2.57x  3.17x  3.28x
regular_w8_hv_8bpc_dotprod:  3.04x  3.18x  3.11x  3.42x  3.03x  2.98x
regular_w8_hv_8bpc_i8mm:     3.57x  3.87x  3.40x  3.69x  3.27x  3.26x

sharp_w8_hv_8bpc_neon:     1.72x  1.82x  1.93x  2.05x  2.75x  2.86x
sharp_w8_hv_8bpc_dotprod:  2.49x  2.65x  2.54x  2.81x  2.62x  2.38x
sharp_w8_hv_8bpc_i8mm:     2.80x  3.03x  2.79x  3.07x  2.70x  2.70x

regular_w16_hv_8bpc_neon:     1.90x  2.09x  2.17x  2.18x  2.02x  1.99x
regular_w16_hv_8bpc_dotprod:  2.59x  2.85x  2.64x  2.79x  1.93x  1.83x
regular_w16_hv_8bpc_i8mm:     3.01x  3.33x  2.85x  2.94x  2.05x  1.97x

sharp_w16_hv_8bpc_neon:     1.51x  1.67x  1.72x  1.76x  1.74x  1.73x
sharp_w16_hv_8bpc_dotprod:  2.17x  2.41x  2.22x  2.35x  1.70x  1.46x
sharp_w16_hv_8bpc_i8mm:     2.42x  2.69x  2.42x  2.54x  1.72x  1.65x

regular_w32_hv_8bpc_neon:     1.80x  2.01x  1.96x  2.04x  1.81x  1.81x
regular_w32_hv_8bpc_dotprod:  2.43x  2.68x  2.36x  2.55x  1.74x  1.67x
regular_w32_hv_8bpc_i8mm:     2.83x  3.17x  2.51x  2.67x  1.83x  1.78x

sharp_w32_hv_8bpc_neon:     1.42x  1.59x  1.54x  1.64x  1.56x  1.57x
sharp_w32_hv_8bpc_dotprod:  2.07x  2.30x  2.00x  2.17x  1.55x  1.34x
sharp_w32_hv_8bpc_i8mm:     2.29x  2.55x  2.16x  2.33x  1.55x  1.49x

regular_w64_hv_8bpc_neon:     1.82x  1.94x  1.89x  1.95x  1.70x  1.80x
regular_w64_hv_8bpc_dotprod:  2.43x  2.59x  2.25x  2.43x  1.65x  1.66x
regular_w64_hv_8bpc_i8mm:     2.84x  3.04x  2.39x  2.52x  1.73x  1.76x

sharp_w64_hv_8bpc_neon:     1.43x  1.53x  1.47x  1.57x  1.49x  1.49x
sharp_w64_hv_8bpc_dotprod:  2.08x  2.24x  1.91x  2.07x  1.49x  1.28x
sharp_w64_hv_8bpc_i8mm:     2.30x  2.46x  2.07x  2.22x  1.48x  1.42x

regular_w128_hv_8bpc_neon:     1.77x  1.94x  1.84x  1.92x  1.75x  1.69x
regular_w128_hv_8bpc_dotprod:  2.37x  2.57x  2.18x  2.37x  1.70x  1.56x
regular_w128_hv_8bpc_i8mm:     2.76x  3.02x  2.33x  2.45x  1.78x  1.65x

sharp_w128_hv_8bpc_neon:     1.40x  1.53x  1.45x  1.54x  1.42x  1.44x
sharp_w128_hv_8bpc_dotprod:  2.04x  2.23x  1.87x  2.03x  1.43x  1.24x
sharp_w128_hv_8bpc_i8mm:     2.24x  2.45x  2.02x  2.17x  1.42x  1.38x

Horizontal micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_h_8bpc_neon:     2.42x
regular_w2_h_8bpc_dotprod:  3.75x
regular_w2_h_8bpc_i8mm:     4.22x

sharp_w2_h_8bpc_neon:     2.42x
sharp_w2_h_8bpc_dotprod:  3.76x
sharp_w2_h_8bpc_i8mm:     4.23x

regular_w4_h_8bpc_neon:     4.81x  4.11x
regular_w4_h_8bpc_dotprod:  9.14x  7.22x
regular_w4_h_8bpc_i8mm:    11.18x  8.12x

sharp_w4_h_8bpc_neon:     4.78x  4.10x
sharp_w4_h_8bpc_dotprod:  9.14x  7.17x
sharp_w4_h_8bpc_i8mm:    11.11x  8.10x

regular_w8_h_8bpc_neon:     3.16x  3.20x  3.51x  3.32x  3.43x  3.37x
regular_w8_h_8bpc_dotprod:  4.97x  5.12x  7.43x  7.27x  4.95x  5.06x
regular_w8_h_8bpc_i8mm:     7.28x  5.87x 10.38x  8.59x  5.69x  5.69x

sharp_w8_h_8bpc_neon:     2.71x  2.64x  2.77x  2.75x  3.10x  3.09x
sharp_w8_h_8bpc_dotprod:  4.92x  5.09x  7.14x  7.03x  4.94x  5.09x
sharp_w8_h_8bpc_i8mm:     7.21x  5.82x 10.11x  8.45x  5.70x  5.68x

regular_w16_h_8bpc_neon:     2.79x  2.61x  2.76x  2.75x  3.53x  3.22x
regular_w16_h_8bpc_dotprod:  3.81x  4.09x  4.77x  4.90x  3.13x  3.10x
regular_w16_h_8bpc_i8mm:     5.21x  4.55x  6.04x  5.66x  3.56x  3.23x

sharp_w16_h_8bpc_neon:     2.31x  2.22x  2.38x  2.36x  3.12x  2.89x
sharp_w16_h_8bpc_dotprod:  3.80x  4.10x  4.74x  4.87x  3.13x  3.09x
sharp_w16_h_8bpc_i8mm:     5.20x  4.55x  5.98x  5.61x  3.56x  3.22x

regular_w32_h_8bpc_neon:     2.58x  2.40x  2.61x  2.54x  3.14x  2.91x
regular_w32_h_8bpc_dotprod:  3.36x  3.54x  3.92x  4.03x  2.57x  2.11x
regular_w32_h_8bpc_i8mm:     4.48x  3.88x  4.81x  4.55x  2.91x  2.70x

sharp_w32_h_8bpc_neon:     2.15x  2.03x  2.19x  2.17x  2.78x  2.62x
sharp_w32_h_8bpc_dotprod:  3.33x  3.52x  3.90x  3.94x  2.57x  2.10x
sharp_w32_h_8bpc_i8mm:     4.45x  3.85x  4.79x  4.45x  2.89x  2.70x

regular_w64_h_8bpc_neon:     2.49x  2.31x  2.46x  2.41x  2.94x  2.79x
regular_w64_h_8bpc_dotprod:  3.17x  3.33x  3.60x  3.62x  2.41x  2.22x
regular_w64_h_8bpc_i8mm:     4.22x  3.63x  4.40x  4.08x  2.72x  2.53x

sharp_w64_h_8bpc_neon:     2.07x  1.97x  2.06x  2.05x  2.60x  2.49x
sharp_w64_h_8bpc_dotprod:  3.16x  3.32x  3.58x  3.58x  2.40x  2.21x
sharp_w64_h_8bpc_i8mm:     4.20x  3.63x  4.38x  4.04x  2.71x  2.51x

regular_w128_h_8bpc_neon:     2.45x  2.28x  2.38x  2.33x  2.78x  2.69x
regular_w128_h_8bpc_dotprod:  3.09x  3.25x  3.47x  3.47x  2.24x  2.23x
regular_w128_h_8bpc_i8mm:     4.10x  3.55x  4.25x  3.92x  2.52x  2.31x

sharp_w128_h_8bpc_neon:     2.05x  1.94x  2.01x  2.01x  2.47x  2.39x
sharp_w128_h_8bpc_dotprod:  3.09x  3.25x  3.44x  3.46x  2.24x  2.23x
sharp_w128_h_8bpc_i8mm:     4.10x  3.55x  4.22x  3.89x  2.52x  2.31x

Vertical micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_v_8bpc_neon:     3.68x
regular_w2_v_8bpc_dotprod:  3.29x
regular_w2_v_8bpc_i8mm:     3.49x

sharp_w2_v_8bpc_neon:     3.29x
sharp_w2_v_8bpc_dotprod:  3.27x
sharp_w2_v_8bpc_i8mm:     3.46x

regular_w4_v_8bpc_neon:     7.15x  5.62x
regular_w4_v_8bpc_dotprod:  7.43x  5.85x
regular_w4_v_8bpc_i8mm:     7.89x  6.20x

sharp_w4_v_8bpc_neon:     5.83x  4.71x
sharp_w4_v_8bpc_dotprod:  7.36x  5.85x
sharp_w4_v_8bpc_i8mm:     7.90x  6.18x

regular_w8_v_8bpc_neon:     6.11x  6.55x  8.05x  8.24x  4.07x  4.38x
regular_w8_v_8bpc_dotprod:  5.45x  5.61x  8.15x  7.00x  4.01x  4.30x
regular_w8_v_8bpc_i8mm:     7.30x  7.59x  9.46x  9.12x  4.19x  4.49x

sharp_w8_v_8bpc_neon:     4.23x  4.51x  5.46x  5.54x  3.09x  3.33x
sharp_w8_v_8bpc_dotprod:  5.43x  5.58x  7.96x  6.74x  4.01x  4.29x
sharp_w8_v_8bpc_i8mm:     7.26x  7.44x  9.12x  9.02x  4.19x  4.47x

regular_w16_v_8bpc_neon:     3.44x  3.61x  4.33x  4.52x  2.40x  2.36x
regular_w16_v_8bpc_dotprod:  3.20x  3.34x  4.53x  4.53x  2.85x  2.60x
regular_w16_v_8bpc_i8mm:     4.09x  4.33x  5.27x  5.53x  2.87x  2.62x

sharp_w16_v_8bpc_neon:     2.50x  2.61x  3.14x  3.31x  1.82x  1.81x
sharp_w16_v_8bpc_dotprod:  3.20x  3.34x  4.52x  4.51x  2.86x  2.62x
sharp_w16_v_8bpc_i8mm:     4.09x  4.32x  5.15x  5.55x  2.86x  2.65x

regular_w32_v_8bpc_neon:     2.94x  3.12x  3.52x  3.70x  1.81x  1.84x
regular_w32_v_8bpc_dotprod:  2.80x  2.95x  3.74x  3.75x  2.17x  2.06x
regular_w32_v_8bpc_i8mm:     3.54x  3.76x  4.19x  4.48x  2.16x  2.06x

sharp_w32_v_8bpc_neon:     2.14x  2.27x  2.58x  2.73x  1.37x  1.40x
sharp_w32_v_8bpc_dotprod:  2.78x  2.93x  3.70x  3.71x  2.17x  2.05x
sharp_w32_v_8bpc_i8mm:     3.50x  3.73x  4.15x  4.46x  2.18x  2.06x

regular_w64_v_8bpc_neon:     2.74x  2.88x  3.11x  3.33x  1.53x  1.65x
regular_w64_v_8bpc_dotprod:  2.63x  2.75x  3.30x  3.35x  1.84x  1.82x
regular_w64_v_8bpc_i8mm:     3.31x  3.48x  3.73x  3.99x  1.84x  1.82x

sharp_w64_v_8bpc_neon:     2.01x  2.12x  2.29x  2.45x  1.16x  1.25x
sharp_w64_v_8bpc_dotprod:  2.61x  2.75x  3.27x  3.32x  1.83x  1.82x
sharp_w64_v_8bpc_i8mm:     3.29x  3.48x  3.68x  3.94x  1.84x  1.82x

regular_w128_v_8bpc_neon:     2.66x  2.80x  2.92x  3.16x  1.39x  1.53x
regular_w128_v_8bpc_dotprod:  2.56x  2.68x  3.11x  3.18x  1.63x  1.69x
regular_w128_v_8bpc_i8mm:     3.21x  3.39x  3.48x  3.78x  1.63x  1.69x

sharp_w128_v_8bpc_neon:     1.95x  2.06x  2.16x  2.34x  1.06x  1.17x
sharp_w128_v_8bpc_dotprod:  2.55x  2.68x  3.10x  3.17x  1.63x  1.69x
sharp_w128_v_8bpc_i8mm:     3.19x  3.37x  3.49x  3.76x  1.63x  1.69x
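
Every figure above is a speedup over the same C reference, so dividing two entries for the same function and column gives an estimate of how much faster one SIMD code path is than another. A minimal sketch of that arithmetic in C, with a purely illustrative function name:

```c
/* Both arguments are speedups measured against the same C reference,
 * so their ratio estimates the speedup of the new code path over the
 * old one. (Illustrative helper, not part of dav1d or checkasm.) */
static double relative_speedup(double new_vs_c, double old_vs_c)
{
    return new_vs_c / old_vs_c;
}
```

For example, the first entries of regular_w8_h_8bpc give `relative_speedup(7.28, 4.97)`, which is about 1.46, so the i8mm version is roughly 1.46x as fast as the DotProd one in that micro benchmark. The result is only approximate because the inputs are rounded.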

Some benchmark results against the Armv8.4-A (DotProd) version:

Models 1080p:

 - AWS Graviton 3:  178.16 fps  ->  183.38 fps ( +2.93 % )

Balloons 1080p:

 - AWS Graviton 3:  162.45 fps  ->  166.60 fps ( +2.55 % )

Mountain Bike 1080p:

 - AWS Graviton 3:  133.95 fps  ->  136.51 fps ( +1.91 % )

Nature 1080p:

 - AWS Graviton 3:  130.15 fps  ->  132.68 fps ( +1.94 % )

Vision Pro 1080p:

 - AWS Graviton 3:  192.59 fps  ->  197.09 fps ( +2.34 % )

Bosphorus 1080p:

 - AWS Graviton 3:  213.57 fps  ->  226.32 fps ( +5.97 % )
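
The percentages above are relative FPS gains, (after - before) / before. A minimal sketch of the calculation in C, with an illustrative function name:

```c
/* Relative FPS gain in percent when going from `before_fps` to `after_fps`.
 * (Illustrative helper, not part of dav1d.) */
static double fps_gain_percent(double before_fps, double after_fps)
{
    return (after_fps - before_fps) / before_fps * 100.0;
}
```

For the Bosphorus clip, `fps_gain_percent(213.57, 226.32)` evaluates to about 5.97.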

Bosphorus 1080p was encoded by aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m

