  1. Sep 01, 2024
  2. Aug 30, 2024
  3. Aug 29, 2024
  4. Aug 26, 2024
  5. Aug 24, 2024
    • aarch64: Fix a label typo · 27491dd9
      Martin Storsjö authored
      Apparently, this case isn't actually ever executed, at least in most
      checkasm runs, but some tools could complain about the relocation
      against 160b, which pointed elsewhere than intended.
      27491dd9
  6. Aug 23, 2024
  7. Aug 22, 2024
    • AArch64: SVE MS armasm64 fix of HBD subpel filters · 472b31f8
      Arpad Panyik authored and Martin Storsjö committed
      MS armasm64 cannot compile some SVE instructions with immediate
      operands, e.g.:
        sub  z0.h, z0.h, #8192
      
      The proper form is:
        sub  z0.h, z0.h, #32, lsl #8
      
      This patch contains the needed fixes.
      472b31f8
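The rewrite described in this commit follows from how SVE arithmetic immediates are encoded: an unsigned 8-bit value, optionally shifted left by 8. The following is a minimal Python sketch of that split, assuming 16-bit or wider elements; the helper names `sve_arith_imm` and `format_sub` are hypothetical and are not part of dav1d or of any assembler.

```python
def sve_arith_imm(value):
    """Split an immediate into (imm8, shift) for an SVE add/sub immediate.

    Returns None when the value fits neither the plain 8-bit form nor the
    "imm8, lsl #8" form.
    """
    if 0 <= value <= 255:
        return (value, 0)
    if value % 256 == 0 and 0 < value <= 255 * 256:
        return (value >> 8, 8)
    return None


def format_sub(reg, value):
    """Render a `sub` with the shift written out explicitly."""
    enc = sve_arith_imm(value)
    if enc is None:
        raise ValueError(f"{value} is not encodable as an SVE immediate")
    imm8, shift = enc
    if shift:
        return f"sub  {reg}, {reg}, #{imm8}, lsl #{shift}"
    return f"sub  {reg}, {reg}, #{imm8}"


print(format_sub("z0.h", 8192))
```

For 8192 this prints the explicit form quoted in the commit message, since 8192 = 32 << 8.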
    • aarch64: mc16: Optimize the BTI landing pads in put/prep_neon · 3329f8d1
      Martin Storsjö authored
      Don't include the BTI landing pad instruction in the loops.
      
      If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
      a no-op instruction that indicates that indirect jumps can land
      there. But there's no need for the loops to include that instruction.
      3329f8d1
    • AArch64: Add HBD subpel filters using 128-bit SVE2 · 01558f3f
      Arpad Panyik authored and Martin Storsjö committed
      Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
      2D convolutions have 6-tap specialisations of their vertical passes.
      All other convolutions are 4- or 8-tap filters which fit well with
      the 4-element 16-bit SDOT instruction of SVE2.
      
      This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
      exports put_16bpc_neon.
      
      Benchmarks show up to 17% FPS increase depending on the input video
      and the CPU used.
      
      This patch will increase the .text by around 8 KiB.
      
      Performance relative to the C reference on some Cortex-A/X CPUs:
      
          regular     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
       w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
       w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
       w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
      w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
      w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
      w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
      w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
      w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
      w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
      
       w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
       w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
       w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
       w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
      w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
      w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
      w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
      w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
      w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
      w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
      
       w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
       w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
       w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
       w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
      w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
      w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
      w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
      w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
      w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
      w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
      
       w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
       w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
       w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
       w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
      w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
      w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
      w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
      w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
      w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
      w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x
      
            sharp     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
       w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
       w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
       w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
      w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
      w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
      w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
      w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
      w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
      w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
      
       w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
       w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
       w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
       w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
      w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
      w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
      w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
      w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
      w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
      w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
      
       w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
       w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
       w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
       w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
      w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
      w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
      w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
      w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
      w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
      w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
      01558f3f
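The remark that 4- and 8-tap filters fit the 4-element 16-bit SDOT can be illustrated with a scalar model. This is a sketch only, not the assembly from the patch: `sdot4_i16` and `filter_8tap` are hypothetical names, and the tap values used below are made up for illustration rather than taken from dav1d's filter tables.

```python
def sdot4_i16(acc, a, b):
    """Model one SDOT lane: add the dot product of four signed 16-bit
    pairs to a wide accumulator."""
    assert len(a) == len(b) == 4
    for x, y in zip(a, b):
        assert -32768 <= x <= 32767 and -32768 <= y <= 32767
        acc += x * y
    return acc


def filter_8tap(pixels, taps):
    """An 8-tap filter is two chained 4-element dot products."""
    assert len(pixels) == len(taps) == 8
    acc = sdot4_i16(0, pixels[:4], taps[:4])
    return sdot4_i16(acc, pixels[4:], taps[4:])


# Made-up taps that sum to 128, applied to a flat run of pixels.
example_taps = [-1, 3, -7, 127, 8, -3, 1, 0]
print(filter_8tap([512] * 8, example_taps))
```

A 4-tap filter needs one such step and an 8-tap filter needs two, which is why these sizes map onto the instruction without leftover lanes.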
  8. Aug 21, 2024
    • AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters · 713c076d
      Arpad Panyik authored
      Add 6-tap variant of standard bit-depth horizontal subpel filters
      using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
      also extends the HV filter with 6-tap horizontal pass using USMMLA.
      
      Benchmarks show up to 6-7% FPS increase depending on the input video
      and the CPU used.
      
      This patch will increase the .text by around 1.2 KiB.
      
      Relative runtime of micro benchmarks after this patch on Neoverse
      and Cortex CPU cores:
      
      regular      V2      V1      X3    A720    A715    A520    A510
        w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
       w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
       w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
       w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
      
        w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
       w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
       w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
       w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
      713c076d
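Unlike the speedup tables elsewhere in this log, the table in this commit reports relative runtime, so lower is better. A small Python sketch of the conversion, with hypothetical helper names, may make the figures easier to compare:

```python
def speedup_from_relative_runtime(ratio):
    """Convert a relative runtime (new / old) into a speedup factor."""
    if ratio <= 0:
        raise ValueError("relative runtime must be positive")
    return 1.0 / ratio


def time_saved_percent(ratio):
    """Percentage of the old runtime that the new code no longer spends."""
    return (1.0 - ratio) * 100.0


# "w8 h" on Neoverse V2 from the table above: 0.746x relative runtime.
ratio = 0.746
print(round(speedup_from_relative_runtime(ratio), 2))
print(round(time_saved_percent(ratio), 1))
```

With the 0.746x entry, the function takes roughly a quarter less time, which corresponds to a speedup of about 1.34x over the previous code.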
  9. Aug 12, 2024
  10. Aug 04, 2024
  11. Jun 26, 2024
    • AArch64: Move constants of DotProd subpel filters to .rodata · 2355eeb8
      Arpad Panyik authored
      The constants used for the subpel filters were placed in the .text
      section for simplicity and peak performance, but this does not work on
      systems with execute-only .text sections (e.g. OpenBSD).
      
      The performance cost of moving the constants to the .rodata section
      is small and mostly within the measurable noise.
      2355eeb8
  12. Jun 25, 2024
    • aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S · 7fbcdc6d
      Martin Storsjö authored
      The ldr instruction can only handle offsets that are a multiple
      of the element size; most assemblers implicitly produce the ldur
      instruction when a non-aligned offset is provided.
      
      Older versions of MS armasm64, however, error out on this. Since
      MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7
      and earlier require explicitly writing the instruction as ldur.
      
      Despite this, even older versions still fail to build the mc_dotprod.S
      sources, with errors like this:
      
          src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
              mov             x10, (((0*15-1)<<7)|(3*15-1))
      
      This happens on MSVC 2022 17.1 and older, while 17.2 and newer
      accept the negative value expression here.
      
      In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
      script at the moment, as it uses inline assembly to test for external
      assembler features.
      7fbcdc6d
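The constant expression in the error message above evaluates to a negative number, which is what older armasm64 versions reject. A short Python check of the arithmetic (the function name is hypothetical):

```python
def mov_constant():
    """Evaluate the expression from the armasm64 error message."""
    return ((0 * 15 - 1) << 7) | (3 * 15 - 1)


value = mov_constant()
# View the same value as a 64-bit two's complement register pattern.
as_u64 = value & 0xFFFFFFFFFFFFFFFF

print(value)
print(hex(as_u64))
```

The result is -84, i.e. 0xffffffffffffffac as a 64-bit pattern. Such a value is encodable in a single instruction through the MOVN form of mov, so the failure lies in how the older assembler handles the negative expression rather than in the instruction set.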
    • Add Arm OpenBSD run-time CPU feature detection support · 431f4fb2
      Brad Smith authored and Martin Storsjö committed
      Add run-time CPU feature detection for DotProd and i8mm on AArch64.
      431f4fb2
  13. Jun 17, 2024
  14. Jun 10, 2024
  15. Jun 05, 2024
  16. May 27, 2024
  17. May 25, 2024
  18. May 20, 2024
  19. May 19, 2024
    • arm64: msac: Explicitly use the ldur instruction · 9469e184
      Martin Storsjö authored
      The ldr instruction can take an immediate offset which is a multiple
      of the loaded element size. If the ldr instruction is given an
      immediate offset which isn't a multiple of the element size,
      most assemblers implicitly generate a "ldur" instruction instead.
      
      Older versions of MS armasm64.exe don't do this, but instead error
      out with "error A2518: operand 2: Memory offset must be aligned".
      (Current versions don't do this but correctly generate "ldur"
      implicitly.)
      
      Switch this instruction to an explicit "ldur", like we do elsewhere,
      to fix building with these older tools.
      9469e184
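The offset rule described in this commit can be sketched as a small decision function. This is a simplified model, assuming only the immediate-offset forms (ldr with a scaled unsigned 12-bit offset, ldur with an unscaled signed 9-bit offset); `pick_load` is a hypothetical helper, not something an assembler exposes.

```python
def pick_load(offset, size):
    """Return which immediate-offset load can encode `offset` for an
    access of `size` bytes, or None if neither form can."""
    # ldr: non-negative multiple of the access size, scaled into 12 bits.
    if offset >= 0 and offset % size == 0 and offset // size <= 4095:
        return "ldr"
    # ldur: any byte offset that fits a signed 9-bit field.
    if -256 <= offset <= 255:
        return "ldur"
    return None


print(pick_load(8, 8))   # aligned offset
print(pick_load(6, 8))   # unaligned offset
print(pick_load(-8, 8))  # negative offset
```

The first call selects ldr, while the unaligned and negative offsets need ldur. Assemblers that pick between the two automatically are applying a rule of this kind; writing ldur explicitly removes the dependency on that behaviour.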
  20. May 18, 2024