AArch64: Add i8mm support for convolutions

Merged Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_i8mm into master
All threads resolved!

This is a follow-up to !1632 (merged).

Add an Armv8.6-A i8mm code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element USDOT instruction.
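
To illustrate why 4- and 8-tap filters map well onto USDOT, here is a minimal scalar sketch in Python of one 32-bit lane of the instruction: four unsigned 8-bit values are multiplied by four signed 8-bit values and the products are added to an accumulator. The function names and tap values are hypothetical and chosen for illustration only; they are not taken from the patch or from dav1d's filter tables, and 32-bit wraparound is not modelled.

```python
def usdot_lane(acc, pixels_u8, taps_s8):
    """Scalar model of one 32-bit USDOT lane: acc + sum(u8[i] * s8[i]), i = 0..3."""
    assert len(pixels_u8) == 4 and len(taps_s8) == 4
    assert all(0 <= p <= 255 for p in pixels_u8)     # unsigned operand (pixels)
    assert all(-128 <= t <= 127 for t in taps_s8)    # signed operand (filter taps)
    return acc + sum(p * t for p, t in zip(pixels_u8, taps_s8))


def filter_4tap(pixels, taps):
    """A 4-tap filter needs exactly one 4-element dot product."""
    return usdot_lane(0, pixels, taps)


def filter_8tap(pixels, taps):
    """An 8-tap filter is two chained 4-element dot products."""
    acc = usdot_lane(0, pixels[:4], taps[:4])
    return usdot_lane(acc, pixels[4:], taps[4:])


# Hypothetical tap values (illustration only, not dav1d's tables).
EXAMPLE_TAPS_8 = [-1, 4, -10, 71, 71, -10, 4, -1]
EXAMPLE_TAPS_4 = [-8, 72, 72, -8]

row = [10, 20, 30, 40, 50, 60, 70, 80]
reference = sum(p * t for p, t in zip(row, EXAMPLE_TAPS_8))
assert filter_8tap(row, EXAMPLE_TAPS_8) == reference
```

In this model a 6-tap filter padded to 8 taps would spend two of the eight multiplications on zero coefficients, which is one way to read why 6-tap specialisations are kept only where noted above.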

Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A code path, depending on the input video and the CPU used.

This patch increases the size of .text by around 5.7 KiB.


Performance relative to the C reference on some CPUs:

Horizontal-vertical micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_hv_8bpc_neon:       5.64x   7.21x   2.86x
regular_w2_hv_8bpc_dotprod:    6.05x   7.98x   3.00x
regular_w2_hv_8bpc_i8mm:       7.06x   8.69x   3.04x

sharp_w2_hv_8bpc_neon:         5.20x   6.04x   2.66x
sharp_w2_hv_8bpc_dotprod:      4.78x   5.83x   2.63x
sharp_w2_hv_8bpc_i8mm:         5.31x   6.41x   2.71x

regular_w4_hv_8bpc_neon:       7.20x   6.34x  11.20x   9.54x   4.40x   3.91x
regular_w4_hv_8bpc_dotprod:   12.77x  10.98x  18.35x  14.57x   6.21x   5.45x
regular_w4_hv_8bpc_i8mm:      14.50x  12.83x  21.42x  15.85x   6.16x   5.54x

sharp_w4_hv_8bpc_neon:         6.24x   5.40x   9.77x   8.24x   3.96x   3.48x
sharp_w4_hv_8bpc_dotprod:      9.76x   8.77x  14.02x  11.61x   5.20x   4.78x
sharp_w4_hv_8bpc_i8mm:        10.84x   9.70x  16.09x  12.68x   5.42x   4.90x

regular_w8_hv_8bpc_neon:       2.17x   2.27x   2.46x   2.57x   3.17x   3.28x
regular_w8_hv_8bpc_dotprod:    3.04x   3.18x   3.11x   3.42x   3.03x   2.98x
regular_w8_hv_8bpc_i8mm:       3.57x   3.87x   3.40x   3.69x   3.27x   3.26x

sharp_w8_hv_8bpc_neon:         1.72x   1.82x   1.93x   2.05x   2.75x   2.86x
sharp_w8_hv_8bpc_dotprod:      2.49x   2.65x   2.54x   2.81x   2.62x   2.38x
sharp_w8_hv_8bpc_i8mm:         2.80x   3.03x   2.79x   3.07x   2.70x   2.70x

regular_w16_hv_8bpc_neon:      1.90x   2.09x   2.17x   2.18x   2.02x   1.99x
regular_w16_hv_8bpc_dotprod:   2.59x   2.85x   2.64x   2.79x   1.93x   1.83x
regular_w16_hv_8bpc_i8mm:      3.01x   3.33x   2.85x   2.94x   2.05x   1.97x

sharp_w16_hv_8bpc_neon:        1.51x   1.67x   1.72x   1.76x   1.74x   1.73x
sharp_w16_hv_8bpc_dotprod:     2.17x   2.41x   2.22x   2.35x   1.70x   1.46x
sharp_w16_hv_8bpc_i8mm:        2.42x   2.69x   2.42x   2.54x   1.72x   1.65x

regular_w32_hv_8bpc_neon:      1.80x   2.01x   1.96x   2.04x   1.81x   1.81x
regular_w32_hv_8bpc_dotprod:   2.43x   2.68x   2.36x   2.55x   1.74x   1.67x
regular_w32_hv_8bpc_i8mm:      2.83x   3.17x   2.51x   2.67x   1.83x   1.78x

sharp_w32_hv_8bpc_neon:        1.42x   1.59x   1.54x   1.64x   1.56x   1.57x
sharp_w32_hv_8bpc_dotprod:     2.07x   2.30x   2.00x   2.17x   1.55x   1.34x
sharp_w32_hv_8bpc_i8mm:        2.29x   2.55x   2.16x   2.33x   1.55x   1.49x

regular_w64_hv_8bpc_neon:      1.82x   1.94x   1.89x   1.95x   1.70x   1.80x
regular_w64_hv_8bpc_dotprod:   2.43x   2.59x   2.25x   2.43x   1.65x   1.66x
regular_w64_hv_8bpc_i8mm:      2.84x   3.04x   2.39x   2.52x   1.73x   1.76x

sharp_w64_hv_8bpc_neon:        1.43x   1.53x   1.47x   1.57x   1.49x   1.49x
sharp_w64_hv_8bpc_dotprod:     2.08x   2.24x   1.91x   2.07x   1.49x   1.28x
sharp_w64_hv_8bpc_i8mm:        2.30x   2.46x   2.07x   2.22x   1.48x   1.42x

regular_w128_hv_8bpc_neon:     1.77x   1.94x   1.84x   1.92x   1.75x   1.69x
regular_w128_hv_8bpc_dotprod:  2.37x   2.57x   2.18x   2.37x   1.70x   1.56x
regular_w128_hv_8bpc_i8mm:     2.76x   3.02x   2.33x   2.45x   1.78x   1.65x

sharp_w128_hv_8bpc_neon:       1.40x   1.53x   1.45x   1.54x   1.42x   1.44x
sharp_w128_hv_8bpc_dotprod:    2.04x   2.23x   1.87x   2.03x   1.43x   1.24x
sharp_w128_hv_8bpc_i8mm:       2.24x   2.45x   2.02x   2.17x   1.42x   1.38x

Horizontal micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_h_8bpc_neon:        2.42x
regular_w2_h_8bpc_dotprod:     3.75x
regular_w2_h_8bpc_i8mm:        4.22x

sharp_w2_h_8bpc_neon:          2.42x
sharp_w2_h_8bpc_dotprod:       3.76x
sharp_w2_h_8bpc_i8mm:          4.23x

regular_w4_h_8bpc_neon:        4.81x   4.11x
regular_w4_h_8bpc_dotprod:     9.14x   7.22x
regular_w4_h_8bpc_i8mm:       11.18x   8.12x

sharp_w4_h_8bpc_neon:          4.78x   4.10x
sharp_w4_h_8bpc_dotprod:       9.14x   7.17x
sharp_w4_h_8bpc_i8mm:         11.11x   8.10x

regular_w8_h_8bpc_neon:        3.16x   3.20x   3.51x   3.32x   3.43x   3.37x
regular_w8_h_8bpc_dotprod:     4.97x   5.12x   7.43x   7.27x   4.95x   5.06x
regular_w8_h_8bpc_i8mm:        7.28x   5.87x  10.38x   8.59x   5.69x   5.69x

sharp_w8_h_8bpc_neon:          2.71x   2.64x   2.77x   2.75x   3.10x   3.09x
sharp_w8_h_8bpc_dotprod:       4.92x   5.09x   7.14x   7.03x   4.94x   5.09x
sharp_w8_h_8bpc_i8mm:          7.21x   5.82x  10.11x   8.45x   5.70x   5.68x

regular_w16_h_8bpc_neon:       2.79x   2.61x   2.76x   2.75x   3.53x   3.22x
regular_w16_h_8bpc_dotprod:    3.81x   4.09x   4.77x   4.90x   3.13x   3.10x
regular_w16_h_8bpc_i8mm:       5.21x   4.55x   6.04x   5.66x   3.56x   3.23x

sharp_w16_h_8bpc_neon:         2.31x   2.22x   2.38x   2.36x   3.12x   2.89x
sharp_w16_h_8bpc_dotprod:      3.80x   4.10x   4.74x   4.87x   3.13x   3.09x
sharp_w16_h_8bpc_i8mm:         5.20x   4.55x   5.98x   5.61x   3.56x   3.22x

regular_w32_h_8bpc_neon:       2.58x   2.40x   2.61x   2.54x   3.14x   2.91x
regular_w32_h_8bpc_dotprod:    3.36x   3.54x   3.92x   4.03x   2.57x   2.11x
regular_w32_h_8bpc_i8mm:       4.48x   3.88x   4.81x   4.55x   2.91x   2.70x

sharp_w32_h_8bpc_neon:         2.15x   2.03x   2.19x   2.17x   2.78x   2.62x
sharp_w32_h_8bpc_dotprod:      3.33x   3.52x   3.90x   3.94x   2.57x   2.10x
sharp_w32_h_8bpc_i8mm:         4.45x   3.85x   4.79x   4.45x   2.89x   2.70x

regular_w64_h_8bpc_neon:       2.49x   2.31x   2.46x   2.41x   2.94x   2.79x
regular_w64_h_8bpc_dotprod:    3.17x   3.33x   3.60x   3.62x   2.41x   2.22x
regular_w64_h_8bpc_i8mm:       4.22x   3.63x   4.40x   4.08x   2.72x   2.53x

sharp_w64_h_8bpc_neon:         2.07x   1.97x   2.06x   2.05x   2.60x   2.49x
sharp_w64_h_8bpc_dotprod:      3.16x   3.32x   3.58x   3.58x   2.40x   2.21x
sharp_w64_h_8bpc_i8mm:         4.20x   3.63x   4.38x   4.04x   2.71x   2.51x

regular_w128_h_8bpc_neon:      2.45x   2.28x   2.38x   2.33x   2.78x   2.69x
regular_w128_h_8bpc_dotprod:   3.09x   3.25x   3.47x   3.47x   2.24x   2.23x
regular_w128_h_8bpc_i8mm:      4.10x   3.55x   4.25x   3.92x   2.52x   2.31x

sharp_w128_h_8bpc_neon:        2.05x   1.94x   2.01x   2.01x   2.47x   2.39x
sharp_w128_h_8bpc_dotprod:     3.09x   3.25x   3.44x   3.46x   2.24x   2.23x
sharp_w128_h_8bpc_i8mm:        4.10x   3.55x   4.22x   3.89x   2.52x   2.31x

Vertical micro benchmarks A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc

regular_w2_v_8bpc_neon:        3.68x
regular_w2_v_8bpc_dotprod:     3.29x
regular_w2_v_8bpc_i8mm:        3.49x

sharp_w2_v_8bpc_neon:          3.29x
sharp_w2_v_8bpc_dotprod:       3.27x
sharp_w2_v_8bpc_i8mm:          3.46x

regular_w4_v_8bpc_neon:        7.15x   5.62x
regular_w4_v_8bpc_dotprod:     7.43x   5.85x
regular_w4_v_8bpc_i8mm:        7.89x   6.20x

sharp_w4_v_8bpc_neon:          5.83x   4.71x
sharp_w4_v_8bpc_dotprod:       7.36x   5.85x
sharp_w4_v_8bpc_i8mm:          7.90x   6.18x

regular_w8_v_8bpc_neon:        6.11x   6.55x   8.05x   8.24x   4.07x   4.38x
regular_w8_v_8bpc_dotprod:     5.45x   5.61x   8.15x   7.00x   4.01x   4.30x
regular_w8_v_8bpc_i8mm:        7.30x   7.59x   9.46x   9.12x   4.19x   4.49x

sharp_w8_v_8bpc_neon:          4.23x   4.51x   5.46x   5.54x   3.09x   3.33x
sharp_w8_v_8bpc_dotprod:       5.43x   5.58x   7.96x   6.74x   4.01x   4.29x
sharp_w8_v_8bpc_i8mm:          7.26x   7.44x   9.12x   9.02x   4.19x   4.47x

regular_w16_v_8bpc_neon:       3.44x   3.61x   4.33x   4.52x   2.40x   2.36x
regular_w16_v_8bpc_dotprod:    3.20x   3.34x   4.53x   4.53x   2.85x   2.60x
regular_w16_v_8bpc_i8mm:       4.09x   4.33x   5.27x   5.53x   2.87x   2.62x

sharp_w16_v_8bpc_neon:         2.50x   2.61x   3.14x   3.31x   1.82x   1.81x
sharp_w16_v_8bpc_dotprod:      3.20x   3.34x   4.52x   4.51x   2.86x   2.62x
sharp_w16_v_8bpc_i8mm:         4.09x   4.32x   5.15x   5.55x   2.86x   2.65x

regular_w32_v_8bpc_neon:       2.94x   3.12x   3.52x   3.70x   1.81x   1.84x
regular_w32_v_8bpc_dotprod:    2.80x   2.95x   3.74x   3.75x   2.17x   2.06x
regular_w32_v_8bpc_i8mm:       3.54x   3.76x   4.19x   4.48x   2.16x   2.06x

sharp_w32_v_8bpc_neon:         2.14x   2.27x   2.58x   2.73x   1.37x   1.40x
sharp_w32_v_8bpc_dotprod:      2.78x   2.93x   3.70x   3.71x   2.17x   2.05x
sharp_w32_v_8bpc_i8mm:         3.50x   3.73x   4.15x   4.46x   2.18x   2.06x

regular_w64_v_8bpc_neon:       2.74x   2.88x   3.11x   3.33x   1.53x   1.65x
regular_w64_v_8bpc_dotprod:    2.63x   2.75x   3.30x   3.35x   1.84x   1.82x
regular_w64_v_8bpc_i8mm:       3.31x   3.48x   3.73x   3.99x   1.84x   1.82x

sharp_w64_v_8bpc_neon:         2.01x   2.12x   2.29x   2.45x   1.16x   1.25x
sharp_w64_v_8bpc_dotprod:      2.61x   2.75x   3.27x   3.32x   1.83x   1.82x
sharp_w64_v_8bpc_i8mm:         3.29x   3.48x   3.68x   3.94x   1.84x   1.82x

regular_w128_v_8bpc_neon:      2.66x   2.80x   2.92x   3.16x   1.39x   1.53x
regular_w128_v_8bpc_dotprod:   2.56x   2.68x   3.11x   3.18x   1.63x   1.69x
regular_w128_v_8bpc_i8mm:      3.21x   3.39x   3.48x   3.78x   1.63x   1.69x

sharp_w128_v_8bpc_neon:        1.95x   2.06x   2.16x   2.34x   1.06x   1.17x
sharp_w128_v_8bpc_dotprod:     2.55x   2.68x   3.10x   3.17x   1.63x   1.69x
sharp_w128_v_8bpc_i8mm:        3.19x   3.37x   3.49x   3.76x   1.63x   1.69x


Some benchmark results against the Armv8.4-A (DotProd) version:

Models 1080p:

 - AWS Graviton 3:  178.16 fps  ->  183.38 fps ( +2.93 % )

Balloons 1080p:

 - AWS Graviton 3:  162.45 fps  ->  166.60 fps ( +2.55 % )

Mountain Bike 1080p:

 - AWS Graviton 3:  133.95 fps  ->  136.51 fps ( +1.91 % )

Nature 1080p:

 - AWS Graviton 3:  130.15 fps  ->  132.68 fps ( +1.94 % )

Vision Pro 1080p:

 - AWS Graviton 3:  192.59 fps  ->  197.09 fps ( +2.34 % )

Bosphorus 1080p:

 - AWS Graviton 3:  213.57 fps  ->  226.32 fps ( +5.97 % )

Bosphorus 1080p was encoded by aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
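
The percentage gains in the list above follow directly from the before/after frame rates. A small Python sketch of that arithmetic, using only the Graviton 3 numbers quoted in this description (the dictionary and function names are hypothetical):

```python
# (fps with the DotProd path, fps with the i8mm path) on AWS Graviton 3,
# copied from the benchmark list in this description.
GRAVITON3_FPS = {
    "Models 1080p": (178.16, 183.38),
    "Balloons 1080p": (162.45, 166.60),
    "Mountain Bike 1080p": (133.95, 136.51),
    "Nature 1080p": (130.15, 132.68),
    "Vision Pro 1080p": (192.59, 197.09),
    "Bosphorus 1080p": (213.57, 226.32),
}


def gain_percent(before_fps, after_fps):
    """Relative FPS change in percent."""
    return (after_fps / before_fps - 1.0) * 100.0


for clip, (before, after) in GRAVITON3_FPS.items():
    print(f"{clip}: {before:.2f} fps -> {after:.2f} fps ({gain_percent(before, after):+.2f} %)")
```
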
Edited by Arpad Panyik

Activity

  • Martin Storsjö requested review from @mstorsjo

  • Arpad Panyik changed the description

    • Resolved by Martin Storsjö

      Some measurements from an M3:

      mc_8tap_regular_w64_h_8bpc_neon:                  1648.3 ( 3.05x)
      mc_8tap_regular_w64_h_8bpc_dotprod:               1031.9 ( 4.87x)
      mc_8tap_regular_w64_h_8bpc_i8mm:                  1128.5 ( 4.45x)
      
      mc_8tap_regular_w64_hv_8bpc_neon:                 3390.1 ( 3.38x)
      mc_8tap_regular_w64_hv_8bpc_dotprod:              2730.3 ( 4.19x)
      mc_8tap_regular_w64_hv_8bpc_i8mm:                 2590.4 ( 4.42x)
      
      mc_8tap_regular_w64_v_8bpc_neon:                   891.4 ( 6.11x)
      mc_8tap_regular_w64_v_8bpc_dotprod:               1517.7 ( 3.59x)
      mc_8tap_regular_w64_v_8bpc_i8mm:                  1105.5 ( 4.93x)
      
      mc_8tap_sharp_w64_h_8bpc_neon:                    1936.9 ( 2.55x)
      mc_8tap_sharp_w64_h_8bpc_dotprod:                 1031.2 ( 4.80x)
      mc_8tap_sharp_w64_h_8bpc_i8mm:                    1128.5 ( 4.38x)
      
      mc_8tap_sharp_w64_hv_8bpc_neon:                   4234.2 ( 2.70x)
      mc_8tap_sharp_w64_hv_8bpc_dotprod:                3176.3 ( 3.60x)
      mc_8tap_sharp_w64_hv_8bpc_i8mm:                   3046.1 ( 3.76x)
      
      mc_8tap_sharp_w64_v_8bpc_neon:                    1104.5 ( 4.92x)
      mc_8tap_sharp_w64_v_8bpc_dotprod:                 1519.7 ( 3.57x)
      mc_8tap_sharp_w64_v_8bpc_i8mm:                    1108.4 ( 4.90x)

      So for the vertical case, this reduces the overhead of the dotprod version, bringing us close to the original neon case (still slower for regular, practically equal for sharp).

      For the horizontal case, the i8mm version is surprisingly marginally slower than dotprod. Not by much, and it's still faster than plain neon, but it seems to be consistent. Not sure why this is. This seems to be the case for all horizontal functions from w16 and up.

      For the hv case, this is a gain. Not very large, nowhere near the gain you got on the Cortex A/X series, but at least consistently better.
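
The comparisons in this comment can be reproduced from the M3 table above. The sketch below assumes that the first number in each row is a runtime where lower is better, which is consistent with the speedup factors in parentheses; the dictionary and helper names are hypothetical, and only the regular w64 rows are included.

```python
# Timings for the regular w64 functions, copied from the M3 measurements above.
M3_REGULAR_W64 = {
    "h": {"neon": 1648.3, "dotprod": 1031.9, "i8mm": 1128.5},
    "hv": {"neon": 3390.1, "dotprod": 2730.3, "i8mm": 2590.4},
    "v": {"neon": 891.4, "dotprod": 1517.7, "i8mm": 1105.5},
}


def time_ratio(direction, variant, baseline):
    """Runtime of `variant` divided by runtime of `baseline`; below 1.0 means faster."""
    timings = M3_REGULAR_W64[direction]
    return timings[variant] / timings[baseline]


for direction in M3_REGULAR_W64:
    print(
        f"{direction}: i8mm/dotprod = {time_ratio(direction, 'i8mm', 'dotprod'):.3f}, "
        f"i8mm/neon = {time_ratio(direction, 'i8mm', 'neon'):.3f}"
    )
```

A ratio above 1.0 in the `h` row and below 1.0 in the `hv` and `v` rows matches the observations in the comment: i8mm is slightly slower than dotprod horizontally, and faster than dotprod in the other two cases.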

  • Martin Storsjö
    • Resolved by Martin Storsjö

      Overall, it looks quite good, but I left a bunch of comments on things I found surprising.

      In particular within the hv functions, I see more changes and more duplicated .if conditions than I would have expected. In principle, I would expect the dotprod->i8mm change to be only about getting rid of the offsetting sub, initializing the accumulator differently, and possibly doing rounding differently (via the accumulator, or fused with downshifts).

      For all the extra conditionalizing/special-casing within hv, if it does make a measurable difference, is it possible to split that out as a later MR after this one, so that this one focuses solely on the mechanical dotprod/i8mm differences and any extra tuning can be done separately afterwards?

      In particular, if there are tuning differences, I'm curious about why we shouldn't apply the same to the dotprod cases as well.
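
The "offsetting sub" and accumulator initialisation mentioned in this comment can be sketched with a scalar Python model. It assumes that the DotProd path makes unsigned pixels fit a signed-by-signed dot product by subtracting a constant (128 here) and compensates through the accumulator's initial value, while the i8mm path feeds unsigned pixels to USDOT directly. The constant, the function names, and the tap values are assumptions for illustration; rounding and downshifts are not modelled, and the actual patch may differ.

```python
PIXEL_OFFSET = 128  # assumed offset that moves u8 pixels into the s8 range


def sdot_lane(acc, a_s8, b_s8):
    """Scalar model of one SDOT lane: signed 8-bit times signed 8-bit."""
    assert all(-128 <= x <= 127 for x in list(a_s8) + list(b_s8))
    return acc + sum(a * b for a, b in zip(a_s8, b_s8))


def usdot_lane(acc, a_u8, b_s8):
    """Scalar model of one USDOT lane: unsigned 8-bit times signed 8-bit."""
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))


def filter_dotprod_style(pixels, taps):
    shifted = [p - PIXEL_OFFSET for p in pixels]   # the offsetting sub
    acc = PIXEL_OFFSET * sum(taps)                 # bias that undoes the offset
    for i in range(0, len(taps), 4):
        acc = sdot_lane(acc, shifted[i:i + 4], taps[i:i + 4])
    return acc


def filter_i8mm_style(pixels, taps):
    acc = 0                                        # no offset, so no bias needed
    for i in range(0, len(taps), 4):
        acc = usdot_lane(acc, pixels[i:i + 4], taps[i:i + 4])
    return acc


# Hypothetical tap values (illustration only, not dav1d's tables).
EXAMPLE_TAPS = [-1, 4, -10, 71, 71, -10, 4, -1]
for pixels in ([0] * 8, [255] * 8, [3, 200, 17, 255, 0, 128, 64, 9]):
    assert filter_dotprod_style(pixels, EXAMPLE_TAPS) == filter_i8mm_style(pixels, EXAMPLE_TAPS)
```

Both functions return the same sum because subtracting 128 from every pixel changes the result by exactly 128 times the sum of the taps, which the bias adds back.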

  • Arpad Panyik changed the description

  • Arpad Panyik added 1 commit

    • b81b29d8 - AArch64: Add i8mm support for convolutions

  • Arpad Panyik marked this merge request as draft

  • Arpad Panyik added 10 commits

  • Arpad Panyik marked this merge request as ready

  • Arpad Panyik changed the description

  • Martin Storsjö

    Looking really good and straightforward now, thanks! There's just one case of unexpected extra differences left (present in all vertical filters) that I'd like to discuss.

  • Arpad Panyik resolved all threads

  • Arpad Panyik added 1 commit

    • 1776c45a - AArch64: Add basic i8mm support for convolutions

  • Martin Storsjö approved this merge request

  • changed milestone to %1.4.2

  • added ARM label
