  1. May 25, 2024
  2. May 20, 2024
  3. May 19, 2024
    • arm64: msac: Explicitly use the ldur instruction · 9469e184
      Martin Storsjö authored
      The ldr instruction can take an immediate offset which is a multiple
      of the loaded element size. If the ldr instruction is given an
      immediate offset which isn't a multiple of the element size,
      most assemblers implicitly generate a "ldur" instruction instead.
      
      Older versions of MS armasm64.exe don't do this, but instead error
      out with "error A2518: operand 2: Memory offset must be aligned".
      (Current versions don't do this but correctly generate "ldur"
      implicitly.)
      
      Switch this instruction to an explicit "ldur", like we do elsewhere,
      to fix building with these older tools.
      9469e184
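      To illustrate the distinction described above, here is a minimal
      sketch (registers and offsets are placeholders, not taken from the
      actual msac code): the scaled-immediate ldr form requires the offset
      to be a multiple of the element size, while ldur takes an unscaled
      offset.

          ldr  w1, [x0, #4]    // ok: 4-byte load, offset is a multiple of 4
          ldur w1, [x0, #2]    // unscaled offset; writing "ldr" here relies on
                               // the assembler rewriting it, which old
                               // armasm64.exe versions reject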
  4. May 18, 2024
  5. May 14, 2024
    • ARM64: Various optimizations for symbol decode · 7f68f23c
      Kyle Siefring authored
      Changes stem from redesigning the reduction stage of the multisymbol
      decode function.
      * No longer use adapt4 for 5 possible symbol values
      * Specialize reduction for 4/8/16 decode functions
      * Modify control flow
      
      +------------------------+--------------+--------------+---------------+
      |                        |  Neoverse V1 |  Neoverse N1 |   Cortex A72  |
      |                        | (Graviton 3) | (Graviton 2) |  (Graviton 1) |
      +------------------------+-------+------+-------+------+-------+-------+
      |                        |  Old  |  New |  Old  |  New |  Old  |  New  |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_neon       |  13.0 | 12.9 |  14.9 | 14.0 |  39.3 |  29.0 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_adapt_neon |  15.4 | 15.6 |  17.5 | 16.8 |  41.6 |  33.5 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_equi_neon  |  11.3 | 12.0 |  14.0 | 12.2 |  35.0 |  26.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_c        |  73.7 | 57.8 |  73.4 | 60.5 | 130.1 | 103.9 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_neon     |  63.3 | 48.2 |  65.2 | 51.2 | 119.0 | 105.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  28.6 | 22.5 |  28.4 | 23.5 |  67.8 |  55.1 |
      | adapt4_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  29.5 | 26.6 |  29.0 | 28.8 |  76.6 |  74.0 |
      | adapt8_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  31.6 | 31.2 |  33.3 | 33.0 |  77.5 |  68.1 |
      | adapt16_neon           |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      7f68f23c
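      As background for the reduction stage mentioned above, one common way
      to turn a vector of CDF comparisons into a symbol index on NEON is a
      compare followed by a horizontal add. This is only an illustrative
      sketch with placeholder registers, not the exact sequence used in
      this commit:

          // v0.8h: CDF entries, v1.8h: broadcast decoder threshold (placeholders)
          cmhs v2.8h, v0.8h, v1.8h   // 0xffff in every lane where cdf >= threshold
          addv h2, v2.8h             // sum all lanes; each hit contributes -1
          smov w0, v2.h[0]
          neg  w0, w0                // negate to get the number of hits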
    • AArch64: Optimize prep_neon function · d835c6bf
      Arpad Panyik authored and Martin Storsjö committed
      Optimize the widening copy part of subpel filters (the prep_neon
      function). In this patch we combine widening shifts with widening
      multiplications in the inner loops to get maximum throughput.
      
      The change will increase .text by 36 bytes.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct_w4:   0.795x
        mct_w8:   0.913x
        mct_w16:  0.912x
        mct_w32:  0.838x
        mct_w64:  1.025x
        mct_w128: 1.002x
      
      Cortex-A510:
        mct_w4:   0.760x
        mct_w8:   0.636x
        mct_w16:  0.640x
        mct_w32:  0.854x
        mct_w64:  0.864x
        mct_w128: 0.995x
      
      Cortex-A72:
        mct_w4:   0.616x
        mct_w8:   0.854x
        mct_w16:  0.756x
        mct_w32:  1.052x
        mct_w64:  1.044x
        mct_w128: 0.702x
      
      Cortex-A76:
        mct_w4:   0.837x
        mct_w8:   0.797x
        mct_w16:  0.841x
        mct_w32:  0.804x
        mct_w64:  0.948x
        mct_w128: 0.904x
      
      Cortex-A78:
        mct_w16:  0.542x
        mct_w32:  0.725x
        mct_w64:  0.741x
        mct_w128: 0.745x
      
      Cortex-A715:
        mct_w16:  0.561x
        mct_w32:  0.720x
        mct_w64:  0.740x
        mct_w128: 0.748x
      
      Cortex-X1:
        mct_w32:  0.886x
        mct_w64:  0.882x
        mct_w128: 0.917x
      
      Cortex-X3:
        mct_w32:  0.835x
        mct_w64:  0.803x
        mct_w128: 0.808x
      d835c6bf
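      A minimal sketch of the idea of mixing a widening shift with a
      widening multiply in one row (registers are placeholders, and v31 is
      assumed to be preloaded with the constant 16, since multiplying by 16
      equals shifting left by 4):

          ld1    {v0.16b}, [x1], x2       // load one row of pixels
          ushll  v16.8h, v0.8b,  #4       // widening shift for the low 8 pixels
          umull2 v17.8h, v0.16b, v31.16b  // widening multiply for the high 8 pixels
          st1    {v16.8h, v17.8h}, [x0], #32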
    • AArch64: Optimize jump table calculation of prep_neon · f0e779bc
      Arpad Panyik authored and Martin Storsjö committed
      Save a complex arithmetic instruction in the jump table address
      calculation of prep_neon function.
      f0e779bc
    • AArch64: Optimize BTI landing pads of prep_neon · 1790e132
      Arpad Panyik authored and Martin Storsjö committed
      Move the BTI landing pads out of the inner loops of prep_neon
      function. Only the width=4 and width=8 cases are affected.
      
      With BTI enabled, moving AARCH64_VALID_JUMP_TARGET out of the inner
      loops gives better execution speed on Cortex-A510 relative to the
      original (lower is better):
        w4: 0.969x
        w8: 0.722x
      
      Out-of-order cores are not affected.
      1790e132
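      A rough sketch of the pattern, using dav1d's AARCH64_VALID_JUMP_TARGET
      macro; the label name and loop body are hypothetical:

          prep_w8:
              AARCH64_VALID_JUMP_TARGET   // landing pad once, at the indirect-branch target
          8:                              // the loop is entered by a direct branch,
              ld1   {v0.8b},  [x1], x2    // so no landing pad is needed inside it
              ushll v16.8h, v0.8b, #4
              st1   {v16.8h}, [x0], #16
              subs  w4, w4, #1
              b.gt  8b
              ret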
  6. May 13, 2024
    • AArch64: Optimize put_neon function · 8141546d
      Arpad Panyik authored
      Optimize the copy part of subpel filters (the put_neon function).
      For small block sizes (<16), using general purpose registers is
      usually the best way to do the copy.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        w2:   0.991x
        w4:   0.992x
        w8:   0.999x
        w16:  0.875x
        w32:  0.775x
        w64:  0.914x
        w128: 0.998x
      
      Cortex-A510:
        w2:   0.159x
        w4:   0.080x
        w8:   0.583x
        w16:  0.588x
        w32:  0.966x
        w64:  1.111x
        w128: 0.957x
      
      Cortex-A76:
        w2:   0.903x
        w4:   0.683x
        w8:   0.944x
        w16:  0.948x
        w32:  0.919x
        w64:  0.855x
        w128: 0.991x
      
      Cortex-A78:
        w32:  0.867x
        w64:  0.820x
        w128: 1.011x
      
      Cortex-A715:
        w32:  0.834x
        w64:  0.778x
        w128: 1.000x
      
      Cortex-X1:
        w32:  0.809x
        w64:  0.762x
        w128: 1.000x
      
      Cortex-X3:
        w32: 0.733x
        w64: 0.720x
        w128: 0.999x
      8141546d
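      A minimal sketch of the general-purpose-register copy for a small
      width (here width=4, two rows per iteration; the registers follow no
      particular calling convention and are placeholders):

          4:
              ldr  w8, [x2]            // 4 source pixels
              ldr  w9, [x2, x3]        // 4 pixels from the next row
              add  x2, x2, x3, lsl #1
              str  w8, [x0]
              str  w9, [x0, x1]
              add  x0, x0, x1, lsl #1
              subs w4, w4, #2
              b.gt 4b
              ret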
    • AArch64: Optimize jump table calculation of put_neon · 645d1f9f
      Arpad Panyik authored
      Save a complex arithmetic instruction in the jump table address
      calculation of put_neon function.
      645d1f9f
    • AArch64: Optimize BTI landing pads of put_neon · 83452c6e
      Arpad Panyik authored
      Move the BTI landing pads out of the inner loops of the put_neon
      function; the only exception is the width=16 case, where the landing
      pad is already outside of the loops.
      
      With BTI enabled, the relative performance of omitting
      AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 is
      (lower is better):
        w2:   0.981x
        w4:   0.991x
        w8:   0.612x
        w32:  0.687x
        w64:  0.813x
        w128: 0.892x
      
      Out-of-order CPUs are mostly unaffected.
      83452c6e
    • checkasm: Avoid UB in setjmp() invocations · 471549f2
      Henrik Gramner authored
      Both POSIX and the C standard place several environmental limits on
      setjmp() invocations, with essentially anything beyond comparing the
      return value with a constant as a simple branch condition being UB.
      
      We were previously performing a function call using the setjmp()
      return value as an argument, which is technically not allowed
      even though it happened to work correctly in practice.
      
      Some systems may loosen those restrictions and allow for more
      flexible usage, but we shouldn't be relying on that.
      471549f2
  7. May 12, 2024
    • AArch64: Optimize the init of DotProd+ 2D subpel filters · a6d57b11
      Arpad Panyik authored and Jean-Baptiste Kempf committed
      Remove some unnecessary vector register copies from the initial
      horizontal filter parts of the HV subpel filters. The performance
      improvements are larger for the smaller filter block sizes.
      
      The narrowing shifts at the end of *filter8* were also rewritten,
      because the old sequence was only beneficial on the Cortex-A55 among
      the DotProd-capable CPU cores. On other out-of-order or newer CPUs
      the UZP1+SHRN instruction combination is better.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct regular w4:  0.980x
        mct regular w8:  1.007x
        mct regular w16: 1.007x
      
        mct sharp w4:    0.983x
        mct sharp w8:    1.012x
        mct sharp w16:   1.005x
      
      Cortex-A510:
        mct regular w4:  0.935x
        mct regular w8:  0.984x
        mct regular w16: 0.986x
      
        mct sharp w4:    0.927x
        mct sharp w8:    0.983x
        mct sharp w16:   0.987x
      
      Cortex-A78:
        mct regular w4:  0.974x
        mct regular w8:  0.988x
        mct regular w16: 0.991x
      
        mct sharp w4:    0.971x
        mct sharp w8:    0.987x
        mct sharp w16:   0.979x
      
      Cortex-A715:
        mct regular w4:  0.958x
        mct regular w8:  0.993x
        mct regular w16: 0.998x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.991x
        mct sharp w16:   0.997x
      
      Cortex-X1:
        mct regular w4:  0.983x
        mct regular w8:  0.993x
        mct regular w16: 0.996x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.990x
        mct sharp w16:   0.995x
      
      Cortex-X3:
        mct regular w4:  0.953x
        mct regular w8:  0.993x
        mct regular w16: 0.997x
      
        mct sharp w4:    0.981x
        mct sharp w8:    0.993x
        mct sharp w16:   0.995x
      a6d57b11
  8. May 10, 2024
  9. May 09, 2024
    • AArch64: Optimize 2D i8mm subpel filters · 643195f5
      Arpad Panyik authored
      Rewrite the accumulator initializations of the horizontal part of the
      2D filters to use zero register fills. This can improve performance
      on out-of-order CPUs, which can zero vector registers with zero
      latency. Zeroed accumulators imply the use of rounding shifts at the
      end of the filters.
      
      The only exception is the very short *hv_filter4*, where the longer
      latency of the rounding shift could decrease performance.
      
      The *filter8* function uses a different (alternating) dot product
      computation order at the DotProd+ feature level, which gives better
      overall performance on out-of-order and some in-order CPU cores.
      
      The i8mm version does not need to bias the loaded samples, so a
      different instruction scheduling is beneficial, mostly affecting the
      order of TBL instructions in the 8-tap case.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.982x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.979x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.972x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.969x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   0.942x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.935x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.988x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.982x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.981x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.975x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    0.998x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.996x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    1.006x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.993x
      
      Cortex-A715:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.883x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.931x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.882x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.928x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   0.969x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.934x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.881x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.925x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.879x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.925x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    0.917x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.976x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    0.915x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.972x
      
      Cortex-A510:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.994x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.949x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.987x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.947x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   1.002x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.999x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.989x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     1.003x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.986x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      1.000x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    1.007x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      1.000x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    1.005x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      1.000x
      643195f5
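      A hedged sketch of the accumulator-zeroing idea (registers and the
      final shift amount are placeholders; the real filters accumulate more
      taps before the shift):

          movi  v16.4s, #0                // zeroing is typically free on big OoO cores
          movi  v17.4s, #0
          usdot v16.4s, v2.16b, v30.16b   // i8mm: unsigned samples x signed coefficients
          usdot v17.4s, v3.16b, v30.16b
          srshr v16.4s, v16.4s, #2        // rounding shift replaces a preloaded bias
          srshr v17.4s, v17.4s, #2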
  10. May 08, 2024
    • AArch64: Optimize vertical i8mm subpel filters · b2eca1ac
      Arpad Panyik authored
      Replace the accumulator initializations of the vertical subpel
      filters with zero register fills (usually zero-latency operations in
      this feature class); this implies the use of rounding shifts at the
      end in the prep cases. Out-of-order CPU cores can benefit from this
      change.
      
      The width=16 case uses a simpler register duplication scheme that
      relies on MOV instructions for the subsequent shuffles. This approach
      loads the data into a different register, which improves instruction
      scheduling and shortens the data dependency chain.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  0.910x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.986x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.864x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    0.882x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    0.933x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.926x
      
      Cortex-A715:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  0.855x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.784x
      mct_8tap_sharp_w4_v_8bpc_i8mm:   1.069x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.850x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    0.779x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    0.971x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.975x
      
      Cortex-A510:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  1.001x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.979x
      mct_8tap_sharp_w4_v_8bpc_i8mm:   0.998x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.998x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    1.004x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    1.003x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.996x
      b2eca1ac
    • AArch64: Optimize horizontal i8mm prep filters · d1bdf4f1
      Arpad Panyik authored and Jean-Baptiste Kempf committed
      Replace the accumulator initializations of the horizontal prep
      filters with zero register fills. Most i8mm-capable CPUs can do
      these with zero latency, but rounding shifts are then needed at the
      end of the filter. This change gives better performance on
      out-of-order CPUs.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.914x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  0.906x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.877x
      
      Cortex-A715:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.819x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  0.805x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.779x
      
      Cortex-A510:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.999x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  1.001x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.996x
      mct_8tap_sharp_w4_h_8bpc_i8mm:   0.915x
      d1bdf4f1
  11. May 06, 2024
  12. May 01, 2024
  13. Apr 29, 2024
  14. Apr 26, 2024
    • tools: Make ARM cpu flags imply relevant lower level flags · 236e1d19
      Martin Storsjö authored
      The --cpumask option only takes a single flag name; one can't set
      a combination like neon+dotprod.
      
      Therefore, apply the same pattern as for x86, by adding mask values
      that contain all the implied lower level flags.
      
      This is somewhat complicated, as the set of features isn't entirely
      linear - in particular, SVE doesn't imply either dotprod or i8mm,
      and SVE2 only implies dotprod, but not i8mm.
      
      This makes sure that "dav1d --cpumask dotprod" actually uses any
      SIMD at all, as it previously only set the dotprod flag but not
      neon, which essentially opted out from all SIMD.
      236e1d19
    • AArch64: Add basic i8mm support for convolutions · 1776c45a
      Arpad Panyik authored
      Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
      Only horizontal-vertical (HV) convolutions have 6-tap specialisations
      of their vertical passes. All other convolutions are 4- or 8-tap
      filters which fit well with the 4-element USDOT instruction.
      
      Benchmarks show 4-9% FPS increase relative to the Armv8.4-A
      code path depending on the input video and the CPU used.
      
      This patch will increase the .text by around 5.7 KiB.
      
      Relative performance to the C reference on some Cortex CPU cores:
      
                             Cortex-A715   Cortex-X3  Cortex-A510
      regular w4 hv neon:          7.20x      11.20x        4.40x
      regular w4 hv dotprod:      12.77x      18.35x        6.21x
      regular w4 hv i8mm:         14.50x      21.42x        6.16x
      
        sharp w4 hv neon:          6.24x       9.77x        3.96x
        sharp w4 hv dotprod:       9.76x      14.02x        5.20x
        sharp w4 hv i8mm:         10.84x      16.09x        5.42x
      
      regular w8 hv neon:          2.17x       2.46x        3.17x
      regular w8 hv dotprod:       3.04x       3.11x        3.03x
      regular w8 hv i8mm:          3.57x       3.40x        3.27x
      
        sharp w8 hv neon:          1.72x       1.93x        2.75x
        sharp w8 hv dotprod:       2.49x       2.54x        2.62x
        sharp w8 hv i8mm:          2.80x       2.79x        2.70x
      
      regular w16 hv neon:         1.90x       2.17x        2.02x
      regular w16 hv dotprod:      2.59x       2.64x        1.93x
      regular w16 hv i8mm:         3.01x       2.85x        2.05x
      
        sharp w16 hv neon:         1.51x       1.72x        1.74x
        sharp w16 hv dotprod:      2.17x       2.22x        1.70x
        sharp w16 hv i8mm:         2.42x       2.42x        1.72x
      
      regular w32 hv neon:         1.80x       1.96x        1.81x
      regular w32 hv dotprod:      2.43x       2.36x        1.74x
      regular w32 hv i8mm:         2.83x       2.51x        1.83x
      
        sharp w32 hv neon:         1.42x       1.54x        1.56x
        sharp w32 hv dotprod:      2.07x       2.00x        1.55x
        sharp w32 hv i8mm:         2.29x       2.16x        1.55x
      
      regular w64 hv neon:         1.82x       1.89x        1.70x
      regular w64 hv dotprod:      2.43x       2.25x        1.65x
      regular w64 hv i8mm:         2.84x       2.39x        1.73x
      
        sharp w64 hv neon:         1.43x       1.47x        1.49x
        sharp w64 hv dotprod:      2.08x       1.91x        1.49x
        sharp w64 hv i8mm:         2.30x       2.07x        1.48x
      
      regular w128 hv neon:        1.77x       1.84x        1.75x
      regular w128 hv dotprod:     2.37x       2.18x        1.70x
      regular w128 hv i8mm:        2.76x       2.33x        1.78x
      
        sharp w128 hv neon:        1.40x       1.45x        1.42x
        sharp w128 hv dotprod:     2.04x       1.87x        1.43x
        sharp w128 hv i8mm:        2.24x       2.02x        1.42x
      
      regular w8 h neon:           3.16x       3.51x        3.43x
      regular w8 h dotprod:        4.97x       7.43x        4.95x
      regular w8 h i8mm:           7.28x      10.38x        5.69x
      
        sharp w8 h neon:           2.71x       2.77x        3.10x
        sharp w8 h dotprod:        4.92x       7.14x        4.94x
        sharp w8 h i8mm:           7.21x      10.11x        5.70x
      
      regular w16 h neon:          2.79x       2.76x        3.53x
      regular w16 h dotprod:       3.81x       4.77x        3.13x
      regular w16 h i8mm:          5.21x       6.04x        3.56x
      
        sharp w16 h neon:          2.31x       2.38x        3.12x
        sharp w16 h dotprod:       3.80x       4.74x        3.13x
        sharp w16 h i8mm:          5.20x       5.98x        3.56x
      
      regular w64 h neon:          2.49x       2.46x        2.94x
      regular w64 h dotprod:       3.17x       3.60x        2.41x
      regular w64 h i8mm:          4.22x       4.40x        2.72x
      
        sharp w64 h neon:          2.07x       2.06x        2.60x
        sharp w64 h dotprod:       3.16x       3.58x        2.40x
        sharp w64 h i8mm:          4.20x       4.38x        2.71x
      
      regular w8 v neon:           6.11x       8.05x        4.07x
      regular w8 v dotprod:        5.45x       8.15x        4.01x
      regular w8 v i8mm:           7.30x       9.46x        4.19x
      
        sharp w8 v neon:           4.23x       5.46x        3.09x
        sharp w8 v dotprod:        5.43x       7.96x        4.01x
        sharp w8 v i8mm:           7.26x       9.12x        4.19x
      
      regular w16 v neon:          3.44x       4.33x        2.40x
      regular w16 v dotprod:       3.20x       4.53x        2.85x
      regular w16 v i8mm:          4.09x       5.27x        2.87x
      
        sharp w16 v neon:          2.50x       3.14x        1.82x
        sharp w16 v dotprod:       3.20x       4.52x        2.86x
        sharp w16 v i8mm:          4.09x       5.15x        2.86x
      
      regular w64 v neon:          2.74x       3.11x        1.53x
      regular w64 v dotprod:       2.63x       3.30x        1.84x
      regular w64 v i8mm:          3.31x       3.73x        1.84x
      
        sharp w64 v neon:          2.01x       2.29x        1.16x
        sharp w64 v dotprod:       2.61x       3.27x        1.83x
        sharp w64 v i8mm:          3.29x       3.68x        1.84x
      1776c45a
  15. Apr 25, 2024
    • AArch64: Simplify DotProd path of 2D subpel filters · fbf23637
      Arpad Panyik authored
      Simplify the DotProd code path of the 2D (horizontal-vertical) subpel
      filters. It contains some instruction reordering and some macro
      simplifications to be more similar to the upcoming i8mm version.
      
      These changes have negligible effect on performance.
      
      Cortex-A510:
      mc_8tap_regular_w2_hv_8bpc_dotprod:   8.3769 ->  8.3380
      mc_8tap_sharp_w2_hv_8bpc_dotprod:     9.5441 ->  9.5457
      mc_8tap_regular_w4_hv_8bpc_dotprod:   8.3422 ->  8.3444
      mc_8tap_sharp_w4_hv_8bpc_dotprod:     9.5441 ->  9.5367
      mc_8tap_regular_w8_hv_8bpc_dotprod:   9.9852 ->  9.9666
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    12.5554 -> 12.5314
      
      Cortex-A55:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  6.4504  ->  6.4892
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    7.5732  ->  7.6078
      mc_8tap_regular_w4_hv_8bpc_dotprod:  6.5088  ->  6.4760
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    7.5796  ->  7.5763
      mc_8tap_regular_w8_hv_8bpc_dotprod:  9.3384  ->  9.3078
      mc_8tap_sharp_w8_hv_8bpc_dotprod:   11.1159  -> 11.1401
      
      Cortex-A78:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  1.4122  ->  1.4250
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.7696  ->  1.7821
      mc_8tap_regular_w4_hv_8bpc_dotprod:  1.4243  ->  1.4243
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.7866  ->  1.7863
      mc_8tap_regular_w8_hv_8bpc_dotprod:  2.5304  ->  2.5171
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    3.0815  ->  3.0632
      
      Cortex-X1:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  0.8195  ->  0.8194
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.0092  ->  1.0081
      mc_8tap_regular_w4_hv_8bpc_dotprod:  0.8197  ->  0.8166
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.0089  ->  1.0068
      mc_8tap_regular_w8_hv_8bpc_dotprod:  1.5230  ->  1.5166
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    1.8683  ->  1.8625
      fbf23637
    • AArch64: Simplify loads in *hv_filter* of DotProd path · a40301b3
      Arpad Panyik authored
      Simplify the load sequences in *hv_filter* functions (ldr + add -> ld1)
      to be more uniform and smaller. Performance is not affected.
      a40301b3
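      The ldr + add -> ld1 rewrite mentioned above boils down to replacing
      a load plus a separate pointer increment with one post-indexed
      structure load (operands are placeholders):

          // before
          ldr  q0, [x8]
          add  x8, x8, x9
          // after
          ld1  {v0.16b}, [x8], x9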
    • AArch64: Simplify TBL usage in 2D DotProd filters · b0685c38
      Arpad Panyik authored
      Simplify the TBL usages in the small block size (2, 4) parts of the
      2D (horizontal-vertical) put subpel filters. The 2-register TBLs are
      replaced with the 1-register form, because we only need the lower
      64 bits of the result, which can be extracted from a single source
      register. Performance is not affected by this change.
      b0685c38
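      A small sketch of the two TBL forms (register numbers are
      placeholders): the 2-register form indexes up to 32 table bytes,
      while the 1-register form is sufficient when every byte selected for
      the lower 64-bit result comes from a single register.

          tbl  v4.8b, {v0.16b, v1.16b}, v28.8b   // 2-register table
          tbl  v4.8b, {v0.16b}, v28.8b           // 1-register table is enough here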
    • AArch64: Simplify DotProd path of horizontal subpel filters · ad7938d5
      Arpad Panyik authored
      Simplify the inner loops of the DotProd code path of horizontal
      subpel filters to avoid using 2-register TBL instructions. The
      store part of block size 16 of the horizontal put case is also
      simplified (str + add -> st1). This patch can improve performance
      mostly on small cores like Cortex-A510 and newer. Other CPUs are
      mostly unaffected.
      
      Cortex-A510:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  2.77x -> 3.13x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  2.32x -> 2.56x
      
      Cortex-A55:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  3.89x -> 3.89x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.35x -> 3.35x
      
      Cortex-A715:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  3.79x -> 3.78x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.30x -> 3.30x
      
      Cortex-A78:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.30x -> 4.31x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.79x -> 3.80x
      
      Cortex-X3:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.74x -> 4.75x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.89x -> 3.91x
      
      Cortex-X1:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.61x -> 4.62x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.67x -> 3.66x
      ad7938d5
    • AArch64: Simplify DotProd path of vertical subpel filters · 317a94c6
      Arpad Panyik authored
      Simplify the accumulator initializations of the DotProd code path of
      vertical subpel filters. This also makes it possible for some CPUs to
      use zero-latency vector register moves. The load is also simplified
      (ldr + add -> ld1) in the inner loop of the vertical filter for
      block size 16.
      317a94c6
    • AArch64: Add \dot parameter to filter_8tap_fn macro · 7eee4a20
      Arpad Panyik authored
      Add a \dot parameter to the filter_8tap_fn macro in preparation for
      extending it with an i8mm code path. This patch also contains string
      fixes, some instruction reorderings, and some register renaming to
      make the code more uniform. These changes don't affect performance
      but simplify the code a bit.
      7eee4a20
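      A hypothetical shape of such a parameterised macro in GAS syntax,
      only to illustrate how a \dot argument can select the instruction
      set; this is not the actual filter_8tap_fn definition:

          .macro filter_8tap_fn type, dot
              // ... shared setup ...
          .ifc \dot, i8mm
              usdot v16.4s, v2.16b, v30.16b   // i8mm path: unbiased samples
          .else
              sdot  v16.4s, v2.16b, v30.16b   // DotProd path: samples biased to signed
          .endif
              // ... shared epilogue ...
          .endm

          filter_8tap_fn put,  i8mm
          filter_8tap_fn prep, dotprod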
  16. Apr 22, 2024
    • aarch64: Avoid unaligned jump tables · cb8151c9
      Martin Storsjö authored
      Manually add a padding 0 entry to make the odd number of .hword
      entries align with the instruction size.
      
      This fixes assembling with GAS, with the --gdwarf2 option, where
      it previously produced the error message "unaligned opcodes detected
      in executable segment".
      
      The message is slightly misleading, as the error is printed even
      if no opcodes actually are misaligned, since the jump table is the
      last thing within the .text section. The issue can
      be reproduced with an input as small as this, assembled with
      "as --gdwarf2 -c test.s".
      
              .text
              nop
              .hword 0
      
      See a6228f47 for earlier cases of
      the same error - although in those cases, we actually did have more
      code and labels following the unaligned jump tables.
      
      This error is present with binutils 2.39 and earlier; in
      binutils 2.40, this input is no longer considered an error, fixed
      in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
      cb8151c9
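      Continuing the reproducer above, the fix amounts to keeping the
      number of 2-byte table entries even; the values and the label are
      placeholders:

          jtbl:
              .hword 12, 34, 56   // an odd number of .hword entries ends mid-word
              .hword 0            // padding entry restores 4-byte alignment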
  17. Apr 21, 2024
  18. Apr 16, 2024
  19. Apr 15, 2024
    • ARM64: Port msac improvements to more functions · 37d52435
      Kyle Siefring authored and Henrik Gramner committed
      Port improvements from the hi token functions to the rest of the symbol
      adaption functions. These weren't originally ported since they didn't
      work with arbitrary padding. In practice, zero padding is already used
      and only the tests need to be updated.
      
      Results - Neoverse N1
      
      Old:
      msac_decode_symbol_adapt4_c:         41.4 ( 1.00x)
      msac_decode_symbol_adapt4_neon:      31.0 ( 1.34x)
      msac_decode_symbol_adapt8_c:         54.5 ( 1.00x)
      msac_decode_symbol_adapt8_neon:      32.2 ( 1.69x)
      msac_decode_symbol_adapt16_c:        85.6 ( 1.00x)
      msac_decode_symbol_adapt16_neon:     37.5 ( 2.28x)
      
      New:
      msac_decode_symbol_adapt4_c:         41.5 ( 1.00x)
      msac_decode_symbol_adapt4_neon:      27.7 ( 1.50x)
      msac_decode_symbol_adapt8_c:         55.7 ( 1.00x)
      msac_decode_symbol_adapt8_neon:      30.1 ( 1.85x)
      msac_decode_symbol_adapt16_c:        82.4 ( 1.00x)
      msac_decode_symbol_adapt16_neon:     35.2 ( 2.34x)
      37d52435
    • x86: Add 6-tap variants of 8bpc mc AVX-512 (Ice Lake) functions · 5b539991
      Henrik Gramner authored
      6-tap filtering is only performed vertically, due to the use of VNNI
      instructions processing 4 pixels per instruction horizontally.
      5b539991