AArch64: Add i8mm support for convolutions
This is a follow-up to !1632 (merged).
Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element USDOT instruction.
Benchmarks show 4-9% FPS increase relative to the Armv8.4-A code path depending on the input video and the CPU used.
This patch will increase the size of the .text section by around 5.7 KiB.
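To illustrate why 4- and 8-tap filters map well onto a 4-element dot product, here is a minimal scalar sketch in C. It models a single 32-bit lane of USDOT (four unsigned bytes times four signed bytes, accumulated into a signed 32-bit value) and is not the assembly from this patch. The function names, the sample pixels and the tap values are hypothetical and chosen only for illustration; rounding, shifting and clipping of the result are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of USDOT: four unsigned bytes (pixels)
 * times four signed bytes (filter taps), added to a signed accumulator. */
static int32_t usdot_lane(int32_t acc, const uint8_t u[4], const int8_t s[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* A 4-tap filter is exactly one USDOT lane per output sample. */
static int32_t filter4_usdot(const uint8_t *src, const int8_t taps[4])
{
    return usdot_lane(0, src, taps);
}

/* An 8-tap filter is two chained USDOT lanes per output sample. */
static int32_t filter8_usdot(const uint8_t *src, const int8_t taps[8])
{
    int32_t acc = usdot_lane(0, src, taps);
    return usdot_lane(acc, src + 4, taps + 4);
}

/* Hypothetical sample data: both tap sets sum to 64. */
static const uint8_t example_px[8]    = { 10, 20, 30, 40, 50, 60, 70, 80 };
static const int8_t  example_taps[8]  = { 0, 1, -3, 63, 4, -1, 0, 0 };
static const int8_t  example_taps4[4] = { -4, 36, 36, -4 };

/* Usage example: unscaled filter sums for the sample data above. */
static void example_check(void)
{
    assert(filter8_usdot(example_px, example_taps) == 2590);
    assert(filter4_usdot(example_px, example_taps4) == 1600);
}
```

A 6-tap filter does not divide evenly into 4-element groups, which is consistent with the description above reserving 6-tap specialisations for the vertical passes of the HV convolutions.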
Relative performance to the C reference on some CPUs:
Horizontal-vertical micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_hv_8bpc_neon: 5.64x 7.21x 2.86x
regular_w2_hv_8bpc_dotprod: 6.05x 7.98x 3.00x
regular_w2_hv_8bpc_i8mm: 7.06x 8.69x 3.04x
sharp_w2_hv_8bpc_neon: 5.20x 6.04x 2.66x
sharp_w2_hv_8bpc_dotprod: 4.78x 5.83x 2.63x
sharp_w2_hv_8bpc_i8mm: 5.31x 6.41x 2.71x
regular_w4_hv_8bpc_neon: 7.20x 6.34x 11.20x 9.54x 4.40x 3.91x
regular_w4_hv_8bpc_dotprod: 12.77x 10.98x 18.35x 14.57x 6.21x 5.45x
regular_w4_hv_8bpc_i8mm: 14.50x 12.83x 21.42x 15.85x 6.16x 5.54x
sharp_w4_hv_8bpc_neon: 6.24x 5.40x 9.77x 8.24x 3.96x 3.48x
sharp_w4_hv_8bpc_dotprod: 9.76x 8.77x 14.02x 11.61x 5.20x 4.78x
sharp_w4_hv_8bpc_i8mm: 10.84x 9.70x 16.09x 12.68x 5.42x 4.90x
regular_w8_hv_8bpc_neon: 2.17x 2.27x 2.46x 2.57x 3.17x 3.28x
regular_w8_hv_8bpc_dotprod: 3.04x 3.18x 3.11x 3.42x 3.03x 2.98x
regular_w8_hv_8bpc_i8mm: 3.57x 3.87x 3.40x 3.69x 3.27x 3.26x
sharp_w8_hv_8bpc_neon: 1.72x 1.82x 1.93x 2.05x 2.75x 2.86x
sharp_w8_hv_8bpc_dotprod: 2.49x 2.65x 2.54x 2.81x 2.62x 2.38x
sharp_w8_hv_8bpc_i8mm: 2.80x 3.03x 2.79x 3.07x 2.70x 2.70x
regular_w16_hv_8bpc_neon: 1.90x 2.09x 2.17x 2.18x 2.02x 1.99x
regular_w16_hv_8bpc_dotprod: 2.59x 2.85x 2.64x 2.79x 1.93x 1.83x
regular_w16_hv_8bpc_i8mm: 3.01x 3.33x 2.85x 2.94x 2.05x 1.97x
sharp_w16_hv_8bpc_neon: 1.51x 1.67x 1.72x 1.76x 1.74x 1.73x
sharp_w16_hv_8bpc_dotprod: 2.17x 2.41x 2.22x 2.35x 1.70x 1.46x
sharp_w16_hv_8bpc_i8mm: 2.42x 2.69x 2.42x 2.54x 1.72x 1.65x
regular_w32_hv_8bpc_neon: 1.80x 2.01x 1.96x 2.04x 1.81x 1.81x
regular_w32_hv_8bpc_dotprod: 2.43x 2.68x 2.36x 2.55x 1.74x 1.67x
regular_w32_hv_8bpc_i8mm: 2.83x 3.17x 2.51x 2.67x 1.83x 1.78x
sharp_w32_hv_8bpc_neon: 1.42x 1.59x 1.54x 1.64x 1.56x 1.57x
sharp_w32_hv_8bpc_dotprod: 2.07x 2.30x 2.00x 2.17x 1.55x 1.34x
sharp_w32_hv_8bpc_i8mm: 2.29x 2.55x 2.16x 2.33x 1.55x 1.49x
regular_w64_hv_8bpc_neon: 1.82x 1.94x 1.89x 1.95x 1.70x 1.80x
regular_w64_hv_8bpc_dotprod: 2.43x 2.59x 2.25x 2.43x 1.65x 1.66x
regular_w64_hv_8bpc_i8mm: 2.84x 3.04x 2.39x 2.52x 1.73x 1.76x
sharp_w64_hv_8bpc_neon: 1.43x 1.53x 1.47x 1.57x 1.49x 1.49x
sharp_w64_hv_8bpc_dotprod: 2.08x 2.24x 1.91x 2.07x 1.49x 1.28x
sharp_w64_hv_8bpc_i8mm: 2.30x 2.46x 2.07x 2.22x 1.48x 1.42x
regular_w128_hv_8bpc_neon: 1.77x 1.94x 1.84x 1.92x 1.75x 1.69x
regular_w128_hv_8bpc_dotprod: 2.37x 2.57x 2.18x 2.37x 1.70x 1.56x
regular_w128_hv_8bpc_i8mm: 2.76x 3.02x 2.33x 2.45x 1.78x 1.65x
sharp_w128_hv_8bpc_neon: 1.40x 1.53x 1.45x 1.54x 1.42x 1.44x
sharp_w128_hv_8bpc_dotprod: 2.04x 2.23x 1.87x 2.03x 1.43x 1.24x
sharp_w128_hv_8bpc_i8mm: 2.24x 2.45x 2.02x 2.17x 1.42x 1.38x
Horizontal micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_h_8bpc_neon: 2.42x
regular_w2_h_8bpc_dotprod: 3.75x
regular_w2_h_8bpc_i8mm: 4.22x
sharp_w2_h_8bpc_neon: 2.42x
sharp_w2_h_8bpc_dotprod: 3.76x
sharp_w2_h_8bpc_i8mm: 4.23x
regular_w4_h_8bpc_neon: 4.81x 4.11x
regular_w4_h_8bpc_dotprod: 9.14x 7.22x
regular_w4_h_8bpc_i8mm: 11.18x 8.12x
sharp_w4_h_8bpc_neon: 4.78x 4.10x
sharp_w4_h_8bpc_dotprod: 9.14x 7.17x
sharp_w4_h_8bpc_i8mm: 11.11x 8.10x
regular_w8_h_8bpc_neon: 3.16x 3.20x 3.51x 3.32x 3.43x 3.37x
regular_w8_h_8bpc_dotprod: 4.97x 5.12x 7.43x 7.27x 4.95x 5.06x
regular_w8_h_8bpc_i8mm: 7.28x 5.87x 10.38x 8.59x 5.69x 5.69x
sharp_w8_h_8bpc_neon: 2.71x 2.64x 2.77x 2.75x 3.10x 3.09x
sharp_w8_h_8bpc_dotprod: 4.92x 5.09x 7.14x 7.03x 4.94x 5.09x
sharp_w8_h_8bpc_i8mm: 7.21x 5.82x 10.11x 8.45x 5.70x 5.68x
regular_w16_h_8bpc_neon: 2.79x 2.61x 2.76x 2.75x 3.53x 3.22x
regular_w16_h_8bpc_dotprod: 3.81x 4.09x 4.77x 4.90x 3.13x 3.10x
regular_w16_h_8bpc_i8mm: 5.21x 4.55x 6.04x 5.66x 3.56x 3.23x
sharp_w16_h_8bpc_neon: 2.31x 2.22x 2.38x 2.36x 3.12x 2.89x
sharp_w16_h_8bpc_dotprod: 3.80x 4.10x 4.74x 4.87x 3.13x 3.09x
sharp_w16_h_8bpc_i8mm: 5.20x 4.55x 5.98x 5.61x 3.56x 3.22x
regular_w32_h_8bpc_neon: 2.58x 2.40x 2.61x 2.54x 3.14x 2.91x
regular_w32_h_8bpc_dotprod: 3.36x 3.54x 3.92x 4.03x 2.57x 2.11x
regular_w32_h_8bpc_i8mm: 4.48x 3.88x 4.81x 4.55x 2.91x 2.70x
sharp_w32_h_8bpc_neon: 2.15x 2.03x 2.19x 2.17x 2.78x 2.62x
sharp_w32_h_8bpc_dotprod: 3.33x 3.52x 3.90x 3.94x 2.57x 2.10x
sharp_w32_h_8bpc_i8mm: 4.45x 3.85x 4.79x 4.45x 2.89x 2.70x
regular_w64_h_8bpc_neon: 2.49x 2.31x 2.46x 2.41x 2.94x 2.79x
regular_w64_h_8bpc_dotprod: 3.17x 3.33x 3.60x 3.62x 2.41x 2.22x
regular_w64_h_8bpc_i8mm: 4.22x 3.63x 4.40x 4.08x 2.72x 2.53x
sharp_w64_h_8bpc_neon: 2.07x 1.97x 2.06x 2.05x 2.60x 2.49x
sharp_w64_h_8bpc_dotprod: 3.16x 3.32x 3.58x 3.58x 2.40x 2.21x
sharp_w64_h_8bpc_i8mm: 4.20x 3.63x 4.38x 4.04x 2.71x 2.51x
regular_w128_h_8bpc_neon: 2.45x 2.28x 2.38x 2.33x 2.78x 2.69x
regular_w128_h_8bpc_dotprod: 3.09x 3.25x 3.47x 3.47x 2.24x 2.23x
regular_w128_h_8bpc_i8mm: 4.10x 3.55x 4.25x 3.92x 2.52x 2.31x
sharp_w128_h_8bpc_neon: 2.05x 1.94x 2.01x 2.01x 2.47x 2.39x
sharp_w128_h_8bpc_dotprod: 3.09x 3.25x 3.44x 3.46x 2.24x 2.23x
sharp_w128_h_8bpc_i8mm: 4.10x 3.55x 4.22x 3.89x 2.52x 2.31x
Vertical micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_v_8bpc_neon: 3.68x
regular_w2_v_8bpc_dotprod: 3.29x
regular_w2_v_8bpc_i8mm: 3.49x
sharp_w2_v_8bpc_neon: 3.29x
sharp_w2_v_8bpc_dotprod: 3.27x
sharp_w2_v_8bpc_i8mm: 3.46x
regular_w4_v_8bpc_neon: 7.15x 5.62x
regular_w4_v_8bpc_dotprod: 7.43x 5.85x
regular_w4_v_8bpc_i8mm: 7.89x 6.20x
sharp_w4_v_8bpc_neon: 5.83x 4.71x
sharp_w4_v_8bpc_dotprod: 7.36x 5.85x
sharp_w4_v_8bpc_i8mm: 7.90x 6.18x
regular_w8_v_8bpc_neon: 6.11x 6.55x 8.05x 8.24x 4.07x 4.38x
regular_w8_v_8bpc_dotprod: 5.45x 5.61x 8.15x 7.00x 4.01x 4.30x
regular_w8_v_8bpc_i8mm: 7.30x 7.59x 9.46x 9.12x 4.19x 4.49x
sharp_w8_v_8bpc_neon: 4.23x 4.51x 5.46x 5.54x 3.09x 3.33x
sharp_w8_v_8bpc_dotprod: 5.43x 5.58x 7.96x 6.74x 4.01x 4.29x
sharp_w8_v_8bpc_i8mm: 7.26x 7.44x 9.12x 9.02x 4.19x 4.47x
regular_w16_v_8bpc_neon: 3.44x 3.61x 4.33x 4.52x 2.40x 2.36x
regular_w16_v_8bpc_dotprod: 3.20x 3.34x 4.53x 4.53x 2.85x 2.60x
regular_w16_v_8bpc_i8mm: 4.09x 4.33x 5.27x 5.53x 2.87x 2.62x
sharp_w16_v_8bpc_neon: 2.50x 2.61x 3.14x 3.31x 1.82x 1.81x
sharp_w16_v_8bpc_dotprod: 3.20x 3.34x 4.52x 4.51x 2.86x 2.62x
sharp_w16_v_8bpc_i8mm: 4.09x 4.32x 5.15x 5.55x 2.86x 2.65x
regular_w32_v_8bpc_neon: 2.94x 3.12x 3.52x 3.70x 1.81x 1.84x
regular_w32_v_8bpc_dotprod: 2.80x 2.95x 3.74x 3.75x 2.17x 2.06x
regular_w32_v_8bpc_i8mm: 3.54x 3.76x 4.19x 4.48x 2.16x 2.06x
sharp_w32_v_8bpc_neon: 2.14x 2.27x 2.58x 2.73x 1.37x 1.40x
sharp_w32_v_8bpc_dotprod: 2.78x 2.93x 3.70x 3.71x 2.17x 2.05x
sharp_w32_v_8bpc_i8mm: 3.50x 3.73x 4.15x 4.46x 2.18x 2.06x
regular_w64_v_8bpc_neon: 2.74x 2.88x 3.11x 3.33x 1.53x 1.65x
regular_w64_v_8bpc_dotprod: 2.63x 2.75x 3.30x 3.35x 1.84x 1.82x
regular_w64_v_8bpc_i8mm: 3.31x 3.48x 3.73x 3.99x 1.84x 1.82x
sharp_w64_v_8bpc_neon: 2.01x 2.12x 2.29x 2.45x 1.16x 1.25x
sharp_w64_v_8bpc_dotprod: 2.61x 2.75x 3.27x 3.32x 1.83x 1.82x
sharp_w64_v_8bpc_i8mm: 3.29x 3.48x 3.68x 3.94x 1.84x 1.82x
regular_w128_v_8bpc_neon: 2.66x 2.80x 2.92x 3.16x 1.39x 1.53x
regular_w128_v_8bpc_dotprod: 2.56x 2.68x 3.11x 3.18x 1.63x 1.69x
regular_w128_v_8bpc_i8mm: 3.21x 3.39x 3.48x 3.78x 1.63x 1.69x
sharp_w128_v_8bpc_neon: 1.95x 2.06x 2.16x 2.34x 1.06x 1.17x
sharp_w128_v_8bpc_dotprod: 2.55x 2.68x 3.10x 3.17x 1.63x 1.69x
sharp_w128_v_8bpc_i8mm: 3.19x 3.37x 3.49x 3.76x 1.63x 1.69x
Some benchmark results against Armv8.4-A (DotProd) version:
- AWS Graviton 3: 178.16 fps -> 183.38 fps ( +2.93 % )
- AWS Graviton 3: 162.45 fps -> 166.60 fps ( +2.55 % )
- AWS Graviton 3: 133.95 fps -> 136.51 fps ( +1.91 % )
- AWS Graviton 3: 130.15 fps -> 132.68 fps ( +1.94 % )
- AWS Graviton 3: 192.59 fps -> 197.09 fps ( +2.34 % )
- AWS Graviton 3: 213.57 fps -> 226.32 fps ( +5.97 % )
Bosphorus 1080p was encoded by aomenc (3.7.1+):
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
requested review from @mstorsjo
Some measurements from an M3:
mc_8tap_regular_w64_h_8bpc_neon: 1648.3 ( 3.05x)
mc_8tap_regular_w64_h_8bpc_dotprod: 1031.9 ( 4.87x)
mc_8tap_regular_w64_h_8bpc_i8mm: 1128.5 ( 4.45x)
mc_8tap_regular_w64_hv_8bpc_neon: 3390.1 ( 3.38x)
mc_8tap_regular_w64_hv_8bpc_dotprod: 2730.3 ( 4.19x)
mc_8tap_regular_w64_hv_8bpc_i8mm: 2590.4 ( 4.42x)
mc_8tap_regular_w64_v_8bpc_neon: 891.4 ( 6.11x)
mc_8tap_regular_w64_v_8bpc_dotprod: 1517.7 ( 3.59x)
mc_8tap_regular_w64_v_8bpc_i8mm: 1105.5 ( 4.93x)
mc_8tap_sharp_w64_h_8bpc_neon: 1936.9 ( 2.55x)
mc_8tap_sharp_w64_h_8bpc_dotprod: 1031.2 ( 4.80x)
mc_8tap_sharp_w64_h_8bpc_i8mm: 1128.5 ( 4.38x)
mc_8tap_sharp_w64_hv_8bpc_neon: 4234.2 ( 2.70x)
mc_8tap_sharp_w64_hv_8bpc_dotprod: 3176.3 ( 3.60x)
mc_8tap_sharp_w64_hv_8bpc_i8mm: 3046.1 ( 3.76x)
mc_8tap_sharp_w64_v_8bpc_neon: 1104.5 ( 4.92x)
mc_8tap_sharp_w64_v_8bpc_dotprod: 1519.7 ( 3.57x)
mc_8tap_sharp_w64_v_8bpc_i8mm: 1108.4 ( 4.90x)
So for the vertical case, this reduces the overhead of the dotprod version, so we're almost equal to the original neon case (almost, for regular, and quite equal, for sharp).
For the horizontal case, the i8mm version is surprisingly marginally slower than dotprod. Not by much, and it's still faster than plain neon, but it seems to be consistent. Not sure why this is. This seems to be the case for all horizontal functions from w16 and up.
For the hv case, this is a gain. Not very large, nowhere near the gain you got on the Cortex A/X series, but at least consistently better.
Overall, it looks quite good, but I left a bunch of comments on things I found surprising.
In particular within the hv functions, I see more changes and more duplicated .if conditions than I would have expected. In principle, I would expect the dotprod->i8mm change to be only about getting rid of the offsetting sub, initializing the accumulator differently, and possibly doing rounding differently (via the accumulator, or fused with downshifts).
For all the extra conditionalizing/specialcasing within hv, if it does make a measurable difference, is it possible to split that out as a later MR after this one, so we focus this one solely on the mechanical dotprod/i8mm differences, and can do other extra tuning separately afterwards?
In particular, if there are tuning differences, I'm curious about why we shouldn't apply the same to the dotprod cases as well.
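As a rough illustration of the "mechanical" difference mentioned in this comment, the scalar C sketch below contrasts the two approaches. It assumes that a signed-by-signed dot product (as with SDOT) is fed pixels offset by 128 and an accumulator pre-loaded with a matching correction term, while a mixed-sign dot product (as with USDOT) takes the unsigned pixels directly. All names and sample values are hypothetical, and this models only the arithmetic, not the actual assembly in the MR.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one SDOT lane: signed bytes times signed bytes. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Scalar model of one USDOT lane: unsigned bytes times signed bytes. */
static int32_t usdot_lane(int32_t acc, const uint8_t u[4], const int8_t s[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* DotProd-style 8-tap sum: pixels are shifted into the signed range with
 * an offsetting subtraction, and the accumulator starts at 128 * sum(taps)
 * to cancel that offset. */
static int32_t filter8_dotprod(const uint8_t *src, const int8_t taps[8])
{
    int8_t biased[8];
    int32_t tap_sum = 0;
    for (int i = 0; i < 8; i++) {
        biased[i] = (int8_t)(src[i] - 128); /* the offsetting sub */
        tap_sum += taps[i];
    }
    int32_t acc = 128 * tap_sum;            /* accumulator correction */
    acc = sdot_lane(acc, biased, taps);
    return sdot_lane(acc, biased + 4, taps + 4);
}

/* i8mm-style 8-tap sum: no offset, accumulator can start at zero. */
static int32_t filter8_i8mm(const uint8_t *src, const int8_t taps[8])
{
    int32_t acc = usdot_lane(0, src, taps);
    return usdot_lane(acc, src + 4, taps + 4);
}

/* Hypothetical sample data; the taps sum to 64. */
static const uint8_t example_px[8]   = { 10, 20, 30, 40, 50, 60, 70, 80 };
static const int8_t  example_taps[8] = { 0, 1, -3, 63, 4, -1, 0, 0 };

/* Both models must agree for every uniform pixel value and the sample. */
static void check_paths_agree(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t px[8];
        for (int i = 0; i < 8; i++)
            px[i] = (uint8_t)v;
        assert(filter8_dotprod(px, example_taps) == filter8_i8mm(px, example_taps));
    }
    assert(filter8_dotprod(example_px, example_taps) ==
           filter8_i8mm(example_px, example_taps));
}
```

In this model the only differences between the two paths are the subtraction and the accumulator's initial value. A rounding constant could be folded into that initial value in either path.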
added 10 commits
- b81b29d8...fbf23637 - 9 commits from branch videolan:master
- 75647dcf - AArch64: Add basic i8mm support for convolutions
added 1 commit
- 1776c45a - AArch64: Add basic i8mm support for convolutions
changed milestone to %1.4.2
added ARM label
added performance label