  1. Sep 06, 2024
    • Improve density of group context setting macros · 4385e7e1
      Kyle Siefring authored and Ronald S. Bultje committed
      Shared object binary size reduction:
      x86_64           : 16112 bytes
      ARM64            : 16008 bytes
      ARM64(+Os)       : 21592 bytes
      ARMv7(+Os+mthumb): 18480 bytes
      
      Size reduction of symbols:
      x86_64           : 15712 bytes
      ARM64            : 18688 bytes
      ARM64(+Os)       : 18404 bytes
      ARMv7(+Os+mthumb): 17322 bytes
      
      Builds were done with clang version 18.1.8, and symbol sizes were
      obtained by running nm on the shared object.
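
The symbol-size measurement can be sketched roughly as follows. This is a minimal illustration, not the script used for the commit: it assumes GNU-style `nm -S` output (address, size, type, name, with sizes in hexadecimal), and the two sample listings and their symbol names are made up.

```python
def total_symbol_size(nm_output):
    """Sum the size column of `nm -S` style output (sizes are hexadecimal)."""
    total = 0
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols with a known size have four fields:
        # address, size, type, name. Undefined symbols have fewer.
        if len(fields) == 4:
            total += int(fields[1], 16)
    return total


# Hypothetical listings standing in for the "before" and "after" builds.
before = """\
0000000000001000 0000000000000400 T example_set_ctx
0000000000001400 0000000000000100 T example_other
                 U memcpy
"""
after = """\
0000000000001000 0000000000000180 T example_set_ctx
0000000000001180 0000000000000100 T example_other
                 U memcpy
"""

saved = total_symbol_size(before) - total_symbol_size(after)
print(f"symbol size reduction: {saved} bytes")
```

Summing per-symbol sizes ignores padding and section alignment, which is one reason the shared-object totals and the symbol totals above need not match.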
      
      Provides speedups on older ARM64 CPUs with very little impact on other
      CPUs.
      
      Speedup:
      
      c7i (skylake)
       Nature1080p      : x0.999
       Chimera          : x0.998
      
      odroid C4
       Nature1080p      : x1.007
       Chimera          : x1.016
       Models1080p      : x1.005
       MountainBike1080p: x1.009
       Balloons1080p    : x1.008
      
      Raspberry Pi 4
       Nature1080p      : x1.005
       Chimera          : x0.999
       Models1080p      : x0.999
       MountainBike1080p: x1.004
       Balloons1080p    : x1.003
      
      Raspberry Pi 2 (Cortex-A7):
       (using size optimized build)
       Nature1080p      : x1.003
       Models1080p      : x0.997
    • tests: Add an option to dav1d_argon.bash for using a wrapper tool · 166e1df5
      Martin Storsjö authored
      This allows executing all the tools within e.g. valgrind.
      
      This matches the "meson test --wrap <tool>" feature.
    • AArch64: New method for calculating sgr table · 79db1624
      Kyle Siefring authored and Martin Storsjö committed
      For the 3x3 part, double the width of the vertical loop. This is done to
      provide more latency in the new sgr calculation.
      
      Initial (master):  Cortex A53        A55        A72        A73       A76   Apple M1
      sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1  185420.7   472.2
      sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6  128311.3   332.9
      sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8  281956.0   711.2
      
      Current:
      sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3  169614.4   432.7
      sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8  120481.3   319.2
      sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7  258977.7   659.3
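
To compare the two tables, the figures can be turned into relative runtimes (after divided by before). The sketch below does this for the sgr_3x3_8bpc_neon row only, assuming that lower numbers are better, as is usual for benchmark timings; the commit message does not state the unit.

```python
# Values copied from the sgr_3x3_8bpc_neon rows above.
cores = ["Cortex A53", "A55", "A72", "A73", "A76", "Apple M1"]
initial = [387702.8, 383154.2, 295742.4, 302100.1, 185420.7, 472.2]
current = [368331.4, 363949.7, 275499.0, 272056.3, 169614.4, 432.7]


def relative_runtime(before, after):
    """Ratio below 1.0 means the new code takes less time than the old."""
    return after / before


ratios = {
    core: relative_runtime(b, a)
    for core, b, a in zip(cores, initial, current)
}

for core, ratio in ratios.items():
    print(f"{core:>10}: {ratio:.3f}x")
```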
      
      Include a minor improvement that gets rid of a dup instruction.
    • AArch64: Optimize lane load/store in MC functions · ec5c3052
      Arpad Panyik authored and Martin Storsjö committed
      Partial register writes can create long dependency chains, which can
      reduce performance on out-of-order CPUs. This patch removes most of
      these kinds of problems in MC functions by filling the full register
      before other lane loading instructions.
      
      Most lane extracting stores can also be optimized using FP scalar
      stores when the 0th lane would be extracted.
      
      Relative runtime of micro benchmarks after this patch on some Neoverse
      and Cortex CPU cores:
      
      8bpc neon                V2      V1      X3      X1    A715     A78     A76
       avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
       w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
       mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
       w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
       w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
       blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
       blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
       blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
       blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
       blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
       blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
       blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
       blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x
      
      16bpc neon               V2      V1      X3      X1    A715     A78     A76
       avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
       w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
       mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
       w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
       w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
       blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
       blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
       blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
       blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
       blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
       blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x
      
      bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
       mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
       mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
       mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
       mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
       mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
       mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
       mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
       mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x
      
      bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
       mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
       mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
       mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
       mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
       mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
       mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
       mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
       mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
       mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x
      
      8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
       mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
       mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
       mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
       mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
       mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
       mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
       mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
       mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
       mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
       mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
       mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
       mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
       mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
       mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
       mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
       mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
       mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
       mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
       mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x
      
      8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
       mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
       mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
       mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
       mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
       mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
       mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
       mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
       mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
       mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
       mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
       mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
       mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
       mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
       mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
       mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
       mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
       mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
       mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
       mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
       mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
    • AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters · a992a9be
      Arpad Panyik authored and Martin Storsjö committed
      The 6-tap horizontal and the horizontal parts of 6-tap HV subpel
      filters can be further improved by some pointer arithmetic and saving
      some instructions (EXTs) in their data rearrangement codes.
      
      Relative runtime of micro benchmarks after this patch on Cortex CPU
      cores:
      
      SBD mct h         X1     A78     A76     A72     A55
       regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
       regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
       regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
       regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x
      
      SBD mct hv        X1     A78     A76     A72     A55
       regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
       regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
       regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
       regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x
    • AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters · 2d808de1
      Arpad Panyik authored and Martin Storsjö committed
      The horizontal parts of 6-tap HV subpel filters can be further
      improved by some pointer arithmetic and saving some instructions
      (EXTs) in their data rearrangement codes.
      
      Relative runtime of micro benchmarks after this patch on Cortex CPU
      cores:
      
      HBD mct hv        X1     A78     A76     A72     A55
       regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
       regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
       regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
       regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x
    • AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters · 93339ce8
      Arpad Panyik authored and Martin Storsjö committed
      The 6-tap horizontal subpel filters can be further improved by some
      pointer arithmetic and saving some instructions (EXTs) in their data
      rearrangement codes.
      
      Relative runtime of micro benchmarks after this patch on some Cortex
      CPU cores:
      
      regular:     X1      A78      A76      A55
       mc  w8:  0.915x   0.937x   0.900x   0.982x
       mc w16:  0.917x   0.947x   0.911x   0.971x
       mc w32:  0.914x   0.938x   0.873x   0.961x
       mc w64:  0.918x   0.932x   0.882x   0.964x
    • AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters · 109b2427
      Arpad Panyik authored and Martin Storsjö committed
      The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
      SRSHL instruction sequences. In the horizontal case this can be
      rewritten using a single SQSHRUN instruction with an additional
      rounding value (34 for 10-bit and 40 for 12-bit).
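
The arithmetic behind the rounding values can be sketched as follows. This is a scalar model, not the Neon code: it assumes the two rounding shifts are by 6 - intermediate_bits and by intermediate_bits, with intermediate_bits = 14 - bitdepth (an assumption inferred from the constants, not stated in the commit message). It models only the lower bound of the unsigned saturation and ignores the final clamp to the pixel maximum. The function names are hypothetical.

```python
def two_stage(x, bitdepth):
    """Rounding shift, clamp at zero, then a second rounding shift."""
    inter_bits = 14 - bitdepth          # assumed: 4 for 10-bit, 2 for 12-bit
    first_shift = 6 - inter_bits
    y = (x + (1 << (first_shift - 1))) >> first_shift
    y = max(y, 0)                       # lower bound of unsigned saturation
    return (y + (1 << (inter_bits - 1))) >> inter_bits


def combined_rounding(bitdepth):
    """Fold both rounding terms into one bias for a single shift by 6."""
    inter_bits = 14 - bitdepth
    first_shift = 6 - inter_bits
    return (1 << (first_shift - 1)) + ((1 << (inter_bits - 1)) << first_shift)


def single_stage(x, bitdepth):
    """One biased shift by 6 followed by the clamp at zero."""
    return max((x + combined_rounding(bitdepth)) >> 6, 0)


for bitdepth in (10, 12):
    rnd = combined_rounding(bitdepth)
    same = all(two_stage(x, bitdepth) == single_stage(x, bitdepth)
               for x in range(-5000, 70000))
    print(f"{bitdepth}-bit: rounding value {rnd}, results identical: {same}")
```

Under these assumptions the second rounding term is scaled up by the first shift amount, which is where 34 = 2 + 32 and 40 = 8 + 32 come from.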
      
      Relative runtime of micro benchmarks after this patch on some Cortex
      CPU cores:
      
      regular:     X1      A78      A76      A55
       mc  w2:  0.847x   0.864x   0.822x   0.859x
       mc  w4:  0.889x   0.994x   0.868x   0.917x
       mc  w8:  0.857x   0.911x   0.915x   0.978x
       mc w16:  0.890x   0.982x   0.868x   0.974x
       mc w32:  0.904x   0.991x   0.873x   0.967x
       mc w64:  0.919x   1.003x   0.860x   0.970x
  2. Sep 05, 2024
  3. Sep 04, 2024
  4. Sep 01, 2024
  5. Aug 30, 2024
  6. Aug 29, 2024
  7. Aug 26, 2024
  8. Aug 24, 2024
    • aarch64: Fix a label typo · 27491dd9
      Martin Storsjö authored
      Apparently, this case isn't actually ever executed, at least in most
      checkasm runs, but some tools could complain about the relocation
      against 160b, which pointed elsewhere than intended.
  9. Aug 23, 2024
  10. Aug 22, 2024
    • AArch64: SVE MS armasm64 fix of HBD subpel filters · 472b31f8
      Arpad Panyik authored and Martin Storsjö committed
      MS armasm64 cannot compile some SVE instructions with immediate
      operands, e.g.:
        sub  z0.h, z0.h, #8192
      
      The proper form is:
        sub  z0.h, z0.h, #32, lsl #8
      
      This patch contains the needed fixes.
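
The relation between the two forms is that the immediate is written as an 8-bit value with an optional left shift by 8, and 8192 = 32 << 8. The sketch below splits a constant that way for the `.h` case shown in the commit message; the helper names are hypothetical and are not part of dav1d or of any assembler.

```python
def split_sve_imm(value):
    """Split a constant into (imm8, shift) with shift being 0 or 8."""
    if 0 <= value <= 0xFF:
        return value, 0
    if value % 256 == 0 and 0 < (value >> 8) <= 0xFF:
        return value >> 8, 8
    raise ValueError(f"{value} does not fit an 8-bit immediate with lsl #0 or #8")


def format_sub(reg, value):
    """Render the explicit form of the sub instruction for one register."""
    imm8, shift = split_sve_imm(value)
    if shift:
        return f"sub  {reg}, {reg}, #{imm8}, lsl #{shift}"
    return f"sub  {reg}, {reg}, #{imm8}"


print(format_sub("z0.h", 8192))
```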
    • aarch64: mc16: Optimize the BTI landing pads in put/prep_neon · 3329f8d1
      Martin Storsjö authored
      Don't include the BTI landing pad instruction in the loops.
      
      If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
      a no-op instruction that indicates that indirect jumps can land
      there. But there's no need for the loops to include that instruction.
    • AArch64: Add HBD subpel filters using 128-bit SVE2 · 01558f3f
      Arpad Panyik authored and Martin Storsjö committed
      Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
      2D convolutions have 6-tap specialisations of their vertical passes.
      All other convolutions are 4- or 8-tap filters which fit well with
      the 4-element 16-bit SDOT instruction of SVE2.
      
      This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
      exports put_16bpc_neon.
      
      Benchmarks show up to a 17% FPS increase, depending on the input video
      and the CPU used.
      
      This patch will increase the .text by around 8 KiB.
      
      Relative performance to the C reference on some Cortex-A/X CPUs:
      
          regular     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
       w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
       w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
       w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
      w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
      w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
      w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
      w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
      w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
      w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
      
       w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
       w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
       w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
       w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
      w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
      w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
      w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
      w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
      w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
      w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
      
       w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
       w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
       w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
       w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
      w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
      w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
      w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
      w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
      w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
      w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
      
       w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
       w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
       w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
       w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
      w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
      w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
      w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
      w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
      w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
      w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x
      
            sharp     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
       w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
       w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
       w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
      w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
      w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
      w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
      w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
      w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
      w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
      
       w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
       w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
       w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
       w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
      w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
      w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
      w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
      w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
      w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
      w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
      
       w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
       w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
       w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
       w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
      w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
      w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
      w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
      w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
      w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
      w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
  11. Aug 21, 2024
    • AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters · 713c076d
      Arpad Panyik authored
      Add 6-tap variant of standard bit-depth horizontal subpel filters
      using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
      also extends the HV filter with 6-tap horizontal pass using USMMLA.
      
      Benchmarks show up to a 6-7% FPS increase, depending on the input video
      and the CPU used.
      
      This patch will increase the .text by around 1.2 KiB.
      
      Relative runtime of micro benchmarks after this patch on Neoverse
      and Cortex CPU cores:
      
      regular      V2      V1      X3    A720    A715    A520    A510
        w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
       w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
       w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
       w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
      
        w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
       w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
       w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
       w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
  12. Aug 12, 2024
  13. Aug 04, 2024
  14. Jun 26, 2024
    • AArch64: Move constants of DotProd subpel filters to .rodata · 2355eeb8
      Arpad Panyik authored
      The constants used for the subpel filters were placed in the .text
      section for simplicity and peak performance, but this does not work on
      systems with execute-only .text sections (e.g. OpenBSD).
      
      The performance cost of moving the constants to the .rodata section
      is small and mostly within the measurable noise.
  15. Jun 25, 2024
    • aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S · 7fbcdc6d
      Martin Storsjö authored
      The ldr instruction can only handle offsets that are a multiple
      of the element size; most assemblers implicitly produce the ldur
      instruction when a non-aligned offset is provided.
      
      Older versions of MS armasm64, however, error out on this. Since
      MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7
      and earlier require explicitly writing the instruction as ldur.
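
The choice an assembler makes here can be sketched as below. This is a simplified model of the immediate-offset forms only: it assumes the scaled unsigned 12-bit offset for ldr and the signed 9-bit byte offset for ldur, and ignores pre/post-indexed and register-offset addressing. The function name is hypothetical.

```python
def pick_load(offset, access_size):
    """Return which mnemonic can encode a base+offset load of access_size bytes."""
    # ldr (unsigned offset): offset must be a non-negative multiple of the
    # access size, and the scaled value must fit in 12 bits.
    if offset >= 0 and offset % access_size == 0 and offset // access_size <= 4095:
        return "ldr"
    # ldur: unscaled signed 9-bit byte offset.
    if -256 <= offset <= 255:
        return "ldur"
    raise ValueError(f"offset {offset} needs a separate address computation")


for offset in (16, 6, -8):
    print(f"16-byte load at offset {offset:>3}: {pick_load(offset, 16)}")
```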
      
      Despite this, even older versions still fail to build the mc_dotprod.S
      sources, with errors like this:
      
          src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
              mov             x10, (((0*15-1)<<7)|(3*15-1))
      
      This happens on MSVC 2022 17.1 and older, while 17.2 and newer
      accept the negative value expression here.
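
For reference, the expression in the error message does evaluate to a negative constant. The snippet below evaluates it with Python integers, which behave like two's complement values of unlimited width, and then shows the 64-bit register pattern; it says nothing about how armasm64 evaluates the expression internally.

```python
# The expression from the error message above, evaluated as a signed integer.
value = ((0 * 15 - 1) << 7) | (3 * 15 - 1)

# The same value as a 64-bit two's complement bit pattern.
as_u64 = value & 0xFFFFFFFFFFFFFFFF

print(value, hex(as_u64))  # prints: -84 0xffffffffffffffac
```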
      
      In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
      script at the moment, as it uses inline assembly to test for external
      assembler features.
    • Add Arm OpenBSD run-time CPU feature detection support · 431f4fb2
      Brad Smith authored and Martin Storsjö committed
      Add run-time CPU feature detection for DotProd and i8mm on AArch64.
  16. Jun 17, 2024