AArch64: Add i8mm support for convolutions (!1650) · Merge requests · VideoLAN / dav1d

This is a follow-up to !1632 (merged).

Add an Armv8.6-A i8mm code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element USDOT instruction.
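To illustrate why 4- and 8-tap filters map neatly onto USDOT, here is a minimal scalar sketch in Python of a single 32-bit USDOT lane, which accumulates four unsigned-by-signed 8-bit products. The function names and tap values are hypothetical and chosen only for illustration; they are not taken from dav1d's filter tables or assembly.

```python
def usdot_lane(acc, pixels, coeffs):
    """Model one 32-bit lane of USDOT: acc += sum of four u8 * s8 products."""
    if len(pixels) != 4 or len(coeffs) != 4:
        raise ValueError("USDOT consumes exactly four 8-bit elements per lane")
    if not all(0 <= p <= 255 for p in pixels):
        raise ValueError("first operand elements are unsigned 8-bit")
    if not all(-128 <= c <= 127 for c in coeffs):
        raise ValueError("second operand elements are signed 8-bit")
    return acc + sum(p * c for p, c in zip(pixels, coeffs))


def filter_4tap(pixels, coeffs):
    """A 4-tap filter output needs one USDOT per lane."""
    return usdot_lane(0, pixels, coeffs)


def filter_8tap(pixels, coeffs):
    """An 8-tap filter output needs two chained USDOTs per lane."""
    acc = usdot_lane(0, pixels[0:4], coeffs[0:4])
    return usdot_lane(acc, pixels[4:8], coeffs[4:8])


# Hypothetical taps (each set sums to 64), for illustration only.
TAPS_4 = [-4, 36, 36, -4]
TAPS_8 = [-1, 3, -7, 37, 37, -7, 3, -1]

if __name__ == "__main__":
    flat = [10] * 8
    print(filter_4tap(flat[:4], TAPS_4), filter_8tap(flat, TAPS_8))
```

In this model a 4-tap filter fills one 4-element group exactly and an 8-tap filter fills two, so no multiplier slot is wasted. Unsigned pixels are multiplied by signed coefficients directly, with no sign adjustment of the pixel data.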

Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A code path, depending on the input video and the CPU used.

This patch increases the size of .text by around 5.7 KiB.

Performance relative to the C reference on some CPUs:

Horizontal-vertical micro benchmarks


                             A715-mct    A715-mc      X3-mct     X3-mc   A510-mct    A510-mc
regular_w2_hv_8bpc_neon:                   5.64x                 7.21x                 2.86x
regular_w2_hv_8bpc_dotprod:                6.05x                 7.98x                 3.00x
regular_w2_hv_8bpc_i8mm:                   7.06x                 8.69x                 3.04x
sharp_w2_hv_8bpc_neon:                     5.20x                 6.04x                 2.66x
sharp_w2_hv_8bpc_dotprod:                  4.78x                 5.83x                 2.63x
sharp_w2_hv_8bpc_i8mm:                     5.31x                 6.41x                 2.71x
regular_w4_hv_8bpc_neon:        7.20x      6.34x     11.20x      9.54x      4.40x      3.91x
regular_w4_hv_8bpc_dotprod:    12.77x     10.98x     18.35x     14.57x      6.21x      5.45x
regular_w4_hv_8bpc_i8mm:       14.50x     12.83x     21.42x     15.85x      6.16x      5.54x
sharp_w4_hv_8bpc_neon:          6.24x      5.40x      9.77x      8.24x      3.96x      3.48x
sharp_w4_hv_8bpc_dotprod:       9.76x      8.77x     14.02x     11.61x      5.20x      4.78x
sharp_w4_hv_8bpc_i8mm:         10.84x      9.70x     16.09x     12.68x      5.42x      4.90x
regular_w8_hv_8bpc_neon:        2.17x      2.27x      2.46x      2.57x      3.17x      3.28x
regular_w8_hv_8bpc_dotprod:     3.04x      3.18x      3.11x      3.42x      3.03x      2.98x
regular_w8_hv_8bpc_i8mm:        3.57x      3.87x      3.40x      3.69x      3.27x      3.26x
sharp_w8_hv_8bpc_neon:          1.72x      1.82x      1.93x      2.05x      2.75x      2.86x
sharp_w8_hv_8bpc_dotprod:       2.49x      2.65x      2.54x      2.81x      2.62x      2.38x
sharp_w8_hv_8bpc_i8mm:          2.80x      3.03x      2.79x      3.07x      2.70x      2.70x
regular_w16_hv_8bpc_neon:       1.90x      2.09x      2.17x      2.18x      2.02x      1.99x
regular_w16_hv_8bpc_dotprod:    2.59x      2.85x      2.64x      2.79x      1.93x      1.83x
regular_w16_hv_8bpc_i8mm:       3.01x      3.33x      2.85x      2.94x      2.05x      1.97x
sharp_w16_hv_8bpc_neon:         1.51x      1.67x      1.72x      1.76x      1.74x      1.73x
sharp_w16_hv_8bpc_dotprod:      2.17x      2.41x      2.22x      2.35x      1.70x      1.46x
sharp_w16_hv_8bpc_i8mm:         2.42x      2.69x      2.42x      2.54x      1.72x      1.65x
regular_w32_hv_8bpc_neon:       1.80x      2.01x      1.96x      2.04x      1.81x      1.81x
regular_w32_hv_8bpc_dotprod:    2.43x      2.68x      2.36x      2.55x      1.74x      1.67x
regular_w32_hv_8bpc_i8mm:       2.83x      3.17x      2.51x      2.67x      1.83x      1.78x
sharp_w32_hv_8bpc_neon:         1.42x      1.59x      1.54x      1.64x      1.56x      1.57x
sharp_w32_hv_8bpc_dotprod:      2.07x      2.30x      2.00x      2.17x      1.55x      1.34x
sharp_w32_hv_8bpc_i8mm:         2.29x      2.55x      2.16x      2.33x      1.55x      1.49x
regular_w64_hv_8bpc_neon:       1.82x      1.94x      1.89x      1.95x      1.70x      1.80x
regular_w64_hv_8bpc_dotprod:    2.43x      2.59x      2.25x      2.43x      1.65x      1.66x
regular_w64_hv_8bpc_i8mm:       2.84x      3.04x      2.39x      2.52x      1.73x      1.76x
sharp_w64_hv_8bpc_neon:         1.43x      1.53x      1.47x      1.57x      1.49x      1.49x
sharp_w64_hv_8bpc_dotprod:      2.08x      2.24x      1.91x      2.07x      1.49x      1.28x
sharp_w64_hv_8bpc_i8mm:         2.30x      2.46x      2.07x      2.22x      1.48x      1.42x
regular_w128_hv_8bpc_neon:      1.77x      1.94x      1.84x      1.92x      1.75x      1.69x
regular_w128_hv_8bpc_dotprod:   2.37x      2.57x      2.18x      2.37x      1.70x      1.56x
regular_w128_hv_8bpc_i8mm:      2.76x      3.02x      2.33x      2.45x      1.78x      1.65x
sharp_w128_hv_8bpc_neon:        1.40x      1.53x      1.45x      1.54x      1.42x      1.44x
sharp_w128_hv_8bpc_dotprod:     2.04x      2.23x      1.87x      2.03x      1.43x      1.24x
sharp_w128_hv_8bpc_i8mm:        2.24x      2.45x      2.02x      2.17x      1.42x      1.38x

Horizontal micro benchmarks


                             A715-mct    A715-mc      X3-mct     X3-mc   A510-mct    A510-mc
regular_w2_h_8bpc_neon:                                                                2.42x
regular_w2_h_8bpc_dotprod:                                                             3.75x
regular_w2_h_8bpc_i8mm:                                                                4.22x
sharp_w2_h_8bpc_neon:                                                                  2.42x
sharp_w2_h_8bpc_dotprod:                                                               3.76x
sharp_w2_h_8bpc_i8mm:                                                                  4.23x
regular_w4_h_8bpc_neon:                                                     4.81x      4.11x
regular_w4_h_8bpc_dotprod:                                                  9.14x      7.22x
regular_w4_h_8bpc_i8mm:                                                    11.18x      8.12x
sharp_w4_h_8bpc_neon:                                                       4.78x      4.10x
sharp_w4_h_8bpc_dotprod:                                                    9.14x      7.17x
sharp_w4_h_8bpc_i8mm:                                                      11.11x      8.10x
regular_w8_h_8bpc_neon:         3.16x      3.20x      3.51x      3.32x      3.43x      3.37x
regular_w8_h_8bpc_dotprod:      4.97x      5.12x      7.43x      7.27x      4.95x      5.06x
regular_w8_h_8bpc_i8mm:         7.28x      5.87x     10.38x      8.59x      5.69x      5.69x
sharp_w8_h_8bpc_neon:           2.71x      2.64x      2.77x      2.75x      3.10x      3.09x
sharp_w8_h_8bpc_dotprod:        4.92x      5.09x      7.14x      7.03x      4.94x      5.09x
sharp_w8_h_8bpc_i8mm:           7.21x      5.82x     10.11x      8.45x      5.70x      5.68x
regular_w16_h_8bpc_neon:        2.79x      2.61x      2.76x      2.75x      3.53x      3.22x
regular_w16_h_8bpc_dotprod:     3.81x      4.09x      4.77x      4.90x      3.13x      3.10x
regular_w16_h_8bpc_i8mm:        5.21x      4.55x      6.04x      5.66x      3.56x      3.23x
sharp_w16_h_8bpc_neon:          2.31x      2.22x      2.38x      2.36x      3.12x      2.89x
sharp_w16_h_8bpc_dotprod:       3.80x      4.10x      4.74x      4.87x      3.13x      3.09x
sharp_w16_h_8bpc_i8mm:          5.20x      4.55x      5.98x      5.61x      3.56x      3.22x
regular_w32_h_8bpc_neon:        2.58x      2.40x      2.61x      2.54x      3.14x      2.91x
regular_w32_h_8bpc_dotprod:     3.36x      3.54x      3.92x      4.03x      2.57x      2.11x
regular_w32_h_8bpc_i8mm:        4.48x      3.88x      4.81x      4.55x      2.91x      2.70x
sharp_w32_h_8bpc_neon:          2.15x      2.03x      2.19x      2.17x      2.78x      2.62x
sharp_w32_h_8bpc_dotprod:       3.33x      3.52x      3.90x      3.94x      2.57x      2.10x
sharp_w32_h_8bpc_i8mm:          4.45x      3.85x      4.79x      4.45x      2.89x      2.70x
regular_w64_h_8bpc_neon:        2.49x      2.31x      2.46x      2.41x      2.94x      2.79x
regular_w64_h_8bpc_dotprod:     3.17x      3.33x      3.60x      3.62x      2.41x      2.22x
regular_w64_h_8bpc_i8mm:        4.22x      3.63x      4.40x      4.08x      2.72x      2.53x
sharp_w64_h_8bpc_neon:          2.07x      1.97x      2.06x      2.05x      2.60x      2.49x
sharp_w64_h_8bpc_dotprod:       3.16x      3.32x      3.58x      3.58x      2.40x      2.21x
sharp_w64_h_8bpc_i8mm:          4.20x      3.63x      4.38x      4.04x      2.71x      2.51x
regular_w128_h_8bpc_neon:       2.45x      2.28x      2.38x      2.33x      2.78x      2.69x
regular_w128_h_8bpc_dotprod:    3.09x      3.25x      3.47x      3.47x      2.24x      2.23x
regular_w128_h_8bpc_i8mm:       4.10x      3.55x      4.25x      3.92x      2.52x      2.31x
sharp_w128_h_8bpc_neon:         2.05x      1.94x      2.01x      2.01x      2.47x      2.39x
sharp_w128_h_8bpc_dotprod:      3.09x      3.25x      3.44x      3.46x      2.24x      2.23x
sharp_w128_h_8bpc_i8mm:         4.10x      3.55x      4.22x      3.89x      2.52x      2.31x

Vertical micro benchmarks


                             A715-mct    A715-mc      X3-mct     X3-mc   A510-mct    A510-mc
regular_w2_v_8bpc_neon:                                                                3.68x
regular_w2_v_8bpc_dotprod:                                                             3.29x
regular_w2_v_8bpc_i8mm:                                                                3.49x
sharp_w2_v_8bpc_neon:                                                                  3.29x
sharp_w2_v_8bpc_dotprod:                                                               3.27x
sharp_w2_v_8bpc_i8mm:                                                                  3.46x
regular_w4_v_8bpc_neon:                                                     7.15x      5.62x
regular_w4_v_8bpc_dotprod:                                                  7.43x      5.85x
regular_w4_v_8bpc_i8mm:                                                     7.89x      6.20x
sharp_w4_v_8bpc_neon:                                                       5.83x      4.71x
sharp_w4_v_8bpc_dotprod:                                                    7.36x      5.85x
sharp_w4_v_8bpc_i8mm:                                                       7.90x      6.18x
regular_w8_v_8bpc_neon:         6.11x      6.55x      8.05x      8.24x      4.07x      4.38x
regular_w8_v_8bpc_dotprod:      5.45x      5.61x      8.15x      7.00x      4.01x      4.30x
regular_w8_v_8bpc_i8mm:         7.30x      7.59x      9.46x      9.12x      4.19x      4.49x
sharp_w8_v_8bpc_neon:           4.23x      4.51x      5.46x      5.54x      3.09x      3.33x
sharp_w8_v_8bpc_dotprod:        5.43x      5.58x      7.96x      6.74x      4.01x      4.29x
sharp_w8_v_8bpc_i8mm:           7.26x      7.44x      9.12x      9.02x      4.19x      4.47x
regular_w16_v_8bpc_neon:        3.44x      3.61x      4.33x      4.52x      2.40x      2.36x
regular_w16_v_8bpc_dotprod:     3.20x      3.34x      4.53x      4.53x      2.85x      2.60x
regular_w16_v_8bpc_i8mm:        4.09x      4.33x      5.27x      5.53x      2.87x      2.62x
sharp_w16_v_8bpc_neon:          2.50x      2.61x      3.14x      3.31x      1.82x      1.81x
sharp_w16_v_8bpc_dotprod:       3.20x      3.34x      4.52x      4.51x      2.86x      2.62x
sharp_w16_v_8bpc_i8mm:          4.09x      4.32x      5.15x      5.55x      2.86x      2.65x
regular_w32_v_8bpc_neon:        2.94x      3.12x      3.52x      3.70x      1.81x      1.84x
regular_w32_v_8bpc_dotprod:     2.80x      2.95x      3.74x      3.75x      2.17x      2.06x
regular_w32_v_8bpc_i8mm:        3.54x      3.76x      4.19x      4.48x      2.16x      2.06x
sharp_w32_v_8bpc_neon:          2.14x      2.27x      2.58x      2.73x      1.37x      1.40x
sharp_w32_v_8bpc_dotprod:       2.78x      2.93x      3.70x      3.71x      2.17x      2.05x
sharp_w32_v_8bpc_i8mm:          3.50x      3.73x      4.15x      4.46x      2.18x      2.06x
regular_w64_v_8bpc_neon:        2.74x      2.88x      3.11x      3.33x      1.53x      1.65x
regular_w64_v_8bpc_dotprod:     2.63x      2.75x      3.30x      3.35x      1.84x      1.82x
regular_w64_v_8bpc_i8mm:        3.31x      3.48x      3.73x      3.99x      1.84x      1.82x
sharp_w64_v_8bpc_neon:          2.01x      2.12x      2.29x      2.45x      1.16x      1.25x
sharp_w64_v_8bpc_dotprod:       2.61x      2.75x      3.27x      3.32x      1.83x      1.82x
sharp_w64_v_8bpc_i8mm:          3.29x      3.48x      3.68x      3.94x      1.84x      1.82x
regular_w128_v_8bpc_neon:       2.66x      2.80x      2.92x      3.16x      1.39x      1.53x
regular_w128_v_8bpc_dotprod:    2.56x      2.68x      3.11x      3.18x      1.63x      1.69x
regular_w128_v_8bpc_i8mm:       3.21x      3.39x      3.48x      3.78x      1.63x      1.69x
sharp_w128_v_8bpc_neon:         1.95x      2.06x      2.16x      2.34x      1.06x      1.17x
sharp_w128_v_8bpc_dotprod:      2.55x      2.68x      3.10x      3.17x      1.63x      1.69x
sharp_w128_v_8bpc_i8mm:         3.19x      3.37x      3.49x      3.76x      1.63x      1.69x
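Every figure in the tables above is a speedup over the same C reference. Dividing an i8mm entry by the matching dotprod entry therefore gives the i8mm gain over DotProd for that function and CPU, because the C time cancels out. The sketch below applies this to three regular_w8 rows from the A715-mct column; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(i8mm_vs_c, dotprod_vs_c):
    """Speedup of i8mm over dotprod, given both speedups over the same C reference."""
    return i8mm_vs_c / dotprod_vs_c


# (dotprod, i8mm) speedups over C, copied from the A715-mct column above.
A715_MCT_W8 = {
    "regular_w8_hv": (3.04, 3.57),
    "regular_w8_h": (4.97, 7.28),
    "regular_w8_v": (5.45, 7.30),
}

if __name__ == "__main__":
    for name, (dotprod, i8mm) in A715_MCT_W8.items():
        print(f"{name}: {relative_gain(i8mm, dotprod):.2f}x over dotprod")
```

These ratios come from isolated micro benchmarks, so they are expected to be larger than the whole-decoder FPS gains.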

Some benchmark results against the Armv8.4-A (DotProd) version:

Models 1080p:

 - AWS Graviton 3:  178.16 fps  ->  183.38 fps ( +2.93 % )

Balloons 1080p:

 - AWS Graviton 3:  162.45 fps  ->  166.60 fps ( +2.55 % )

Mountain Bike 1080p:

 - AWS Graviton 3:  133.95 fps  ->  136.51 fps ( +1.91 % )

Nature 1080p:

 - AWS Graviton 3:  130.15 fps  ->  132.68 fps ( +1.94 % )

Vision Pro 1080p:

 - AWS Graviton 3:  192.59 fps  ->  197.09 fps ( +2.34 % )

Bosphorus 1080p:

 - AWS Graviton 3:  213.57 fps  ->  226.32 fps ( +5.97 % )

Bosphorus 1080p was encoded with aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
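The percentage gains in the AWS Graviton 3 list above follow directly from the paired FPS figures. A short sketch that recomputes them is shown here; the helper name `fps_gain_percent` is hypothetical, and the FPS values are copied from the list.

```python
def fps_gain_percent(before_fps, after_fps):
    """Relative FPS change in percent, rounded to two decimals."""
    return round((after_fps / before_fps - 1.0) * 100.0, 2)


# (DotProd fps, i8mm fps) on AWS Graviton 3, copied from the results above.
GRAVITON3_1080P = {
    "Models": (178.16, 183.38),
    "Balloons": (162.45, 166.60),
    "Mountain Bike": (133.95, 136.51),
    "Nature": (130.15, 132.68),
    "Vision Pro": (192.59, 197.09),
    "Bosphorus": (213.57, 226.32),
}

if __name__ == "__main__":
    for clip, (before, after) in GRAVITON3_1080P.items():
        print(f"{clip}: {fps_gain_percent(before, after):+.2f} %")
```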

Edited Apr 25, 2024 by Arpad Panyik
