  1. Dec 20, 2024
    • arm32: looprestoration: Rewrite the wiener functions · 2ba57aa5
      Martin Storsjö authored
      Switch to the same cache-friendly algorithm as was done for arm64
      in 2e73051c and for the reference
      C code in 8291a66e.
      
      Unlike the arm64 implementation, this uses a main loop in C
      (very similar to the one in the main C implementation in
      8291a66e) rather than in assembly;
      this adds a bit of overhead to each function call, but
      it shouldn't affect the big picture much.
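
      A rough sketch of what such a C main loop can look like (the helper
      names, signatures, padding and rounding below are hypothetical, not
      dav1d's actual internal API): a small ring of intermediate rows is
      rotated, and per-row filter helpers (NEON assembly in the real code)
      are called from C.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Hypothetical per-row helpers; in dav1d these would be NEON asm.
       * filter_h: horizontal 7-tap filter of one padded source row into a
       *           row of 16 bit intermediates.
       * filter_v: vertical 7-tap filter across seven intermediate rows,
       *           producing one output row of pixels. */
      typedef void (*wiener_h_fn)(int16_t *mid, const uint8_t *src, int w,
                                  const int16_t fh[7]);
      typedef void (*wiener_v_fn)(uint8_t *dst, int16_t *const *mid, int w,
                                  const int16_t fv[7]);

      /* Illustrative driver: src is assumed to be padded so that source
       * rows y..y+6 are the taps for output row y; edge handling and
       * rounding are omitted. */
      static void wiener_driver(uint8_t *dst, ptrdiff_t dst_stride,
                                const uint8_t *src, ptrdiff_t src_stride,
                                int w, int h,
                                const int16_t fh[7], const int16_t fv[7],
                                wiener_h_fn filter_h, wiener_v_fn filter_v,
                                int16_t *row_buf /* 7 rows of w values */)
      {
          int16_t *rows[7];
          for (int i = 0; i < 7; i++)
              rows[i] = row_buf + i * w;

          /* Prime the window with the first six intermediate rows. */
          for (int i = 0; i < 6; i++)
              filter_h(rows[i], src + i * src_stride, w, fh);

          for (int y = 0; y < h; y++) {
              /* Filter one new row into the last slot of the window... */
              filter_h(rows[6], src + (y + 6) * src_stride, w, fh);
              /* ...produce one output row from the 7-row window... */
              filter_v(dst + y * dst_stride, rows, w, fv);
              /* ...and rotate the window by one row. */
              int16_t *const oldest = rows[0];
              memmove(&rows[0], &rows[1], 6 * sizeof(rows[0]));
              rows[6] = oldest;
          }
      }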
      
      Performance-wise, this doesn't make much of a difference; it makes
      things a little bit faster on some cores, and a little bit slower
      on others:
      
      Before:                 Cortex A7        A8       A53       A72       A73
      wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
      wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
      After:
      wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
      wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0
      
      This is mostly in line with the results on arm64 in
      2e73051c. On arm64, there was a
      bit larger speedup for the 7tap case, mostly attributed to
      unrolling the vertical filter (and the new filter_hv function) to
      operate on 16 pixels at a time. On arm32, there aren't enough
      registers to do that, so we can't get the same gains from unrolling.
      (Reducing the unrolling on the arm64 version to match the case
      on arm32 also shows similar performance numbers as on arm32 here.)
      
      In the arm64 version, we also added separate 5tap versions of all
      functions; not doing that for arm32 at this point.
      
      This increases the binary size by 2 KB.
      
      This doesn't have any immediate effect on how much stack space
      dav1d requires in total, since the largest stack users on arm
      currently are the 8tap_scaled functions.
      2ba57aa5
  2. Dec 19, 2024
    • looprestoration: Use only 6 row buffer for wiener, like NEON/x86 · 8291a66e
      Martin Storsjö authored
      This uses a separate function for combined horizontal and vertical
      filtering, without needing to write the intermediate results
      back to memory in between.
      
      This mostly serves as an example of how to adjust the logic for
      that case; unless we actually merge the horizontal and vertical
      filtering within the _hv function, we still need space for a
      7th row on the stack within that function (which means we use just
      as much stack as before), but we also need one extra memcpy to
      write it into the right destination.
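
      As a sketch of that scheme (hypothetical names, signatures and
      rounding, not the actual dav1d functions): the combined _hv helper
      horizontally filters the newest row into a scratch row on its own
      stack, runs the vertical pass over the six stored rows plus that
      scratch row, and then copies the scratch row back out - the 7th row
      and the extra memcpy mentioned above.

      #include <stdint.h>
      #include <string.h>

      #define MAX_UNIT_W 384  /* illustrative bound on the row width */

      static inline uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

      /* The caller keeps 6 already horizontally-filtered rows in mid[];
       * this helper filters the newest source row on its own stack (the
       * "7th row"), runs the vertical 7-tap filter over all 7 rows, and
       * hands the new row back so it can join the caller's 6-row window. */
      static void wiener_filter_hv(uint8_t *dst, const uint8_t *src,
                                   int16_t *const mid[6], int16_t *mid_out,
                                   int w,
                                   const int16_t fh[7], const int16_t fv[7])
      {
          int16_t new_row[MAX_UNIT_W];

          /* Horizontal 7-tap pass into the local scratch row
           * (left/right edge padding elided). */
          for (int x = 0; x < w; x++) {
              int sum = 0;
              for (int t = 0; t < 7; t++)
                  sum += src[x + t - 3] * fh[t];
              new_row[x] = (int16_t)(sum >> 3);      /* illustrative rounding */
          }

          /* Vertical 7-tap pass over the 6 stored rows plus the new one. */
          for (int x = 0; x < w; x++) {
              int sum = 0;
              for (int t = 0; t < 6; t++)
                  sum += mid[t][x] * fv[t];
              sum += new_row[x] * fv[6];
              dst[x] = clip8((sum + 64) >> 7);       /* illustrative rounding */
          }

          /* The extra memcpy: write the freshly filtered row into the slot
           * the caller wants it in, so the stored window stays 6 rows. */
          memcpy(mid_out, new_row, w * sizeof(*new_row));
      }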
      
      In a build where the compiler is allowed to vectorize and inline
      the wiener functions into each other, this change actually reduces
      the final binary size by 4 KB, if the C version of the wiener filter
      is retained.
      
      This change makes the vectorized C code as fast as it was before
      with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.
      
      Unfortunately, with GCC, this change makes the code a bit slower
      again.
      8291a66e
    • looprestoration: Make the C wiener h filter more optimizable for the compiler · a149f5c3
      Martin Storsjö authored
      This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
      if the C version of the filter is retained (which it isn't
      by default).
      
      With GCC, this makes the vectorized C code roughly as fast as it was
      before the rewrite; with Clang it also becomes 1.3x-2.0x faster,
      while still being slower than it was initially.
      a149f5c3
    • looprestoration: Rewrite the C version of the wiener filter · 9da303e9
      Martin Storsjö authored
      This reduces the stack usage of these functions (the C version)
      significantly.
      
      These C versions aren't used on architectures that already have
      wiener filters implemented in assembly, but they matter both when
      running with assembly disabled (e.g. for sanitizer builds), and as
      an example of how to do a cache efficient SIMD implementation.
      
      This roughly matches how these functions are implemented in the
      aarch64 assembly (although that implementation uses a main loop
      function written in assembly, and custom calling conventions
      between the functions).
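
      As a back-of-the-envelope illustration of where the stack savings come
      from (the sizes, padding and types below are simplified and do not
      match dav1d's actual buffers exactly):

      #include <stddef.h>
      #include <stdint.h>

      enum {
          UNIT_W      = 384,
          UNIT_H      = 64,
          PAD         = 3,                 /* 7-tap filter: 3 extra rows per side */
          FULL_ROWS   = UNIT_H + 2 * PAD,  /* 70 rows kept by the old approach    */
          WINDOW_ROWS = 7                  /* rows kept by a sliding window       */
      };

      /* ~52.5 KB of intermediate data when filtering a whole unit at once... */
      static const size_t full_buf_bytes   = FULL_ROWS   * UNIT_W * sizeof(int16_t);
      /* ...versus ~5.3 KB when only a 7-row window is kept alive. */
      static const size_t window_buf_bytes = WINDOW_ROWS * UNIT_W * sizeof(int16_t);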
      
      With this in place, dav1d can run with around 76 KB of stack
      with assembly disabled.
      
      This increases the binary size by around 14 KB (in the case of
      aarch64 with Xcode Clang 16), unless built with the default
      -Dtrim_dsp=true, in which case the C version of the wiener filter
      gets skipped entirely.
      
      On 32 bit arm, the assembly wiener implementation still uses large
      buffers on the stack, but since other functions there now use less
      stack, dav1d can still run with 72 KB of stack.
      
      Unfortunately, this change also makes the functions slower, depending
      on how well the compiler was able to optimize the previous version.
      On GCC (which didn't manage to vectorize the functions so well before),
      it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
      (where it was very well vectorized before).
      
      Most of this performance can be gained back with later changes on
      top, though.
      9da303e9
  3. Dec 02, 2024
  4. Nov 28, 2024
    • flush: Reset f->task_thread.error · 575af258
      Victorien Le Couviour--Tuffet authored
      f->task_thread.error can be set during flushing; not resetting it can
      lead to c->task_thread.first being increased after a frame has already
      been submitted post flushing. That's fine if it happens on the very
      first frame, but if it happens on any subsequent frame it will cause
      a wrong frame ordering.
      Now that a non-first frame will be considered as such, its tasks won't
      be able to execute (since they depend on a truly previous frame that is
      considered as coming after it), and c->task_thread.cur will be increased
      past that frame, with no way of being reset, eventually leading to a hang.
      575af258
  5. Nov 26, 2024
  6. Nov 21, 2024
  7. Nov 19, 2024
    • arm32: looprestoration: Rewrite the SGR functions · 30c3dd8e
      Martin Storsjö authored
      Switch to the same cache-friendly algorithm as was done for arm64
      in c121b831.
      
      This uses much less stack memory, and is much more cache friendly.
      In this form, most of the individual asm functions only operate on
      a single row of data at a time.
      
      Some of the functions used to be unrolled to operate on two rows
      at a time, while they now only operate on one at a time. In practice,
      this is still a large performance win, as data is accessed in a
      much more cache friendly manner.
      
      This gives a 2-37% speedup, and reduces the peak amount of stack
      used for these functions from 255 KB to 33 KB.
      
      Before:              Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
      sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
      sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
      sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
      sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
      sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
      After:
      sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
      sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
      sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
      sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
      sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
      sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
      
      With this change in place, dav1d can run with around 72 KB of stack
      on arm targets.
      
      Not all functions have been merged in the same way as they were
      for arm64 in c121b831, so some
      minor differences remain; it's possible to incrementally optimize
      this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse
      finish_filter_row1/2 with sgr_weighted_row1, and make a version of
      finish_filter_row1 that produces 2 rows, as is done for arm64.
      
      It's also possible to rewrite the logic for calculating sgr_x_by_x
      in the same way as was done for arm64 in
      79db1624.
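
      To illustrate the row-at-a-time structure, here is a loose C-level
      sketch (the helper names echo the ones mentioned above, but the
      signatures, types and row bookkeeping are invented for illustration;
      the real code also reads the source a few rows ahead of the output,
      which is elided here):

      #include <stddef.h>
      #include <stdint.h>

      /* Empty stubs standing in for the per-row NEON helpers. */
      static void box3_row(int32_t *sumsq, int16_t *sum,
                           const uint8_t *src, int w) {}
      static void calc_row_ab1(int32_t *A, int16_t *B, const int32_t *sumsq,
                               const int16_t *sum, int w, unsigned strength) {}
      static void finish_filter_row1(int16_t *out, const uint8_t *src,
                                     int32_t *const A[3], int16_t *const B[3],
                                     int w) {}
      static void sgr_weighted_row1(uint8_t *dst, const uint8_t *src,
                                    const int16_t *filtered, int w, int w1) {}

      /* Illustrative 3x3 driver: every helper touches a single row, and only
       * a 3-row ring of A/B data stays alive, which is what keeps the stack
       * footprint and the cache working set small. */
      static void sgr_3x3_rows(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int w, int h, unsigned strength, int w1,
                               int32_t *sumsq_row, int16_t *sum_row,
                               int32_t *A_rows[3], int16_t *B_rows[3],
                               int16_t *filtered_row)
      {
          for (int y = 0; y < h; y++) {
              const uint8_t *const row = src + y * src_stride;

              box3_row(sumsq_row, sum_row, row, w);          /* box sums    */
              calc_row_ab1(A_rows[2], B_rows[2],             /* a/b coeffs  */
                           sumsq_row, sum_row, w, strength);
              finish_filter_row1(filtered_row, row,          /* 3x3 filter  */
                                 A_rows, B_rows, w);
              sgr_weighted_row1(dst + y * dst_stride, row,   /* final blend */
                                filtered_row, w, w1);

              /* Rotate the 3-row ring so the next iteration overwrites the
               * oldest row (top/bottom edge handling elided). */
              int32_t *const A0 = A_rows[0];
              int16_t *const B0 = B_rows[0];
              A_rows[0] = A_rows[1]; A_rows[1] = A_rows[2]; A_rows[2] = A0;
              B_rows[0] = B_rows[1]; B_rows[1] = B_rows[2]; B_rows[2] = B0;
          }
      }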
      30c3dd8e
  8. Nov 18, 2024
    • arm32: looprestoration: Apply simplifications to align with C code · 1b7f1263
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      and the arm64 assembly in ce80e6da,
      to the arm32 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:             Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:   926600.0   753468.3   553704.1   399379.1   369674.4
      sgr_5x5_8bpc_neon:   621722.9   540412.7   357275.9   274474.3   254996.0
      sgr_mix_8bpc_neon:  1529715.1  1171282.5   894982.9   659996.6   610407.2
      After:
      sgr_3x3_8bpc_neon:   899020.3   697278.6   541569.9   382824.3   353891.8
      sgr_5x5_8bpc_neon:   602183.2   498322.9   348974.5   264833.9   243837.7
      sgr_mix_8bpc_neon:  1497870.8  1182121.3   880470.9   635939.3   590909.3
      1b7f1263
    • Martin Storsjö · c43debf1
    • arm: looprestoration: Fix the single line loop in sgr_weighted2 · 1c7433a5
      Martin Storsjö authored
      After processing one block, this accidentally jumped to the loop
      for processing two lines at once.
      
      The same bug was replicated in both 32 and 64 bit versions.
      1c7433a5
    • looprestoration: Rewrite the C version of the SGR filter · f32b3146
      Martin Storsjö authored
      This reduces the stack usage of these functions (the C version)
      significantly, and gives them a 15-40% speedup (on an Apple M3,
      with Xcode Clang 16).
      
      The C version of this function does matter; even though we have
      assembly implementations of it on x86 and aarch64, those only
      cover the 8 and 10 bpc cases, while the C version is used as the
      fallback for 12 bpc.
      
      This matches how these functions are implemented in the aarch64
      assembly: they operate over a window of 3 or 5 lines (of 384 pixels
      each), instead of doing a full 384 x 64 block.
      
      The individual functions for filtering a line each end up
      much simpler, and closer to how this can be implemented in
      assembly - but the overall business logic ends up much, much
      more complex.
      
      The main difference from the aarch64 assembly implementation
      is that any buffer which is int16_t sized in the aarch64
      assembly implementation uses the type "coef" here, which
      is 32 bit in the 10/12 bpc cases. (This is required for handling
      the 12 bpc cases.)
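
      Roughly, the bitdepth templating behind this looks as follows
      (paraphrased and simplified, not the exact dav1d headers or buffer
      layout):

      #include <stdint.h>

      #ifndef BITDEPTH
      #define BITDEPTH 8            /* the template is compiled once per bitdepth */
      #endif

      #if BITDEPTH == 8
      typedef uint8_t  pixel;
      typedef int16_t  coef;        /* 16 bit intermediates are enough for 8 bpc */
      #else
      typedef uint16_t pixel;
      typedef int32_t  coef;        /* widened so 12 bpc intermediates can't overflow */
      #endif

      /* A row buffer in the C SGR filter is then declared roughly like this,
       * costing 768 bytes per row at 8 bpc and 1536 bytes at 10/12 bpc: */
      #define MAX_UNIT_W 384        /* illustrative width bound */
      typedef struct {
          coef filtered[MAX_UNIT_W];
      } sgr_row_scratch;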
      
      With this in place, dav1d can run with around 66 KB of stack
      on x86_64 with assembly enabled, with around 74 KB of stack on
      aarch64 with assembly enabled, and with 118 KB of stack with
      assembly disabled.
      
      This increases the binary size by around 14 KB (in the case of
      aarch64 with Xcode Clang 16).
      
      On 32 bit arm, dav1d still requires around 270 KB of stack, as
      that assembly implementation of the SGR filter uses a different
      algorithm.
      f32b3146
    • arm: looprestoration: Give symbols and defines unique names · 01d417c2
      Martin Storsjö authored
      As the machine specific init file is included in the common
      template, give symbols and defines unique names that won't
      clash with similar ones in the main template.
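
      A contrived illustration of the kind of clash this avoids (all names
      below are made up):

      /* The common template ends by textually including the machine
       * specific init header, so everything defined there ends up in the
       * same translation unit as the template itself. */

      /* Defined in the common template: */
      #define ROW_STRIDE 384
      static void pad_rows(void) { /* ... */ }

      /* If the included init header also defined ROW_STRIDE or its own
       * static pad_rows(), the macro would be silently redefined and the
       * function definitions would collide. Prefixing the arch-local
       * names sidesteps both problems: */
      #define LR_NEON_ROW_STRIDE 390
      static void lr_neon_pad_rows(void) { /* ... */ }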
      01d417c2
    • Martin Storsjö · 847eece1
  9. Nov 16, 2024
  10. Nov 15, 2024
  11. Nov 14, 2024
    • arm: Use /proc/cpuinfo on linux if getauxval is unavailable · bed3a343
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      On really old libc versions, getauxval isn't available. Fall back
      to /proc/cpuinfo in those cases, just like we already do on Android.
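
      A minimal sketch of such a fallback (not dav1d's actual parser): scan
      /proc/cpuinfo for a "Features" line that contains the neon flag.

      #include <stdio.h>
      #include <string.h>

      static int have_neon_from_cpuinfo(void)
      {
          FILE *f = fopen("/proc/cpuinfo", "r");
          if (!f) return 0;

          char line[512];
          int found = 0;
          while (!found && fgets(line, sizeof(line), f)) {
              if (strncmp(line, "Features", 8))
                  continue;
              /* Look for the whole word "neon" in the feature list. */
              for (char *p = strstr(line, "neon"); p; p = strstr(p + 1, "neon")) {
                  const char after = p[4];
                  if ((p[-1] == ' ' || p[-1] == '\t') &&
                      (after == ' ' || after == '\n' || after == '\0')) {
                      found = 1;
                      break;
                  }
              }
          }
          fclose(f);
          return found;
      }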
      bed3a343
    • ci: Raise the timeout multipliers for jobs that run in QEMU · 718b62c8
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      For individual tests in dav1d-test-data, the default timeout
      is 30 seconds (which is the Meson default if nothing is
      specified). Previously it ran with a multiplier of 4, resulting
      in a total timeout of 120 seconds.
      
      When running tests in QEMU, exceeding this 120 second timeout
      could happen occasionally. Raise the multiplier to 10, allowing
      each individual job to run for up to 5 minutes.
      
      This should hopefully reduce the amount of stray failures in the
      CI.
      
      For tests that already have a higher default timeout set, such
      as checkasm, which has a 180 second default timeout, this results
      in a much longer timeout period. However, as long as we don't
      frequently see issues where these actually hang, it should be
      beneficial to just let them run to completion, rather than
      aborting early due to a tight timeout.
      718b62c8
    • arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon · 1648c232
      Martin Storsjö authored
      Also fix one case where the 32 bit input parameter w (which was in
      x6, now in x4) was used without zero extension, by referring to
      it as w4 instead.
      1648c232
  12. Nov 13, 2024
    • arm64: looprestoration: Apply simplifications to align with C code · ce80e6da
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      to the arm64 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:            Cortex A53        A55        A72        A73       A76  Apple M3
      sgr_3x3_8bpc_neon:   368583.2   363654.2   279958.1   272065.1  169353.3  354.6
      sgr_5x5_8bpc_neon:   258570.7   255018.5   200410.6   199478.3  117968.3  260.9
      sgr_mix_8bpc_neon:   603698.1   577383.3   482468.3   436540.4  256632.9  541.8
      After:
      sgr_3x3_8bpc_neon:   367873.2   357884.1   275462.4   268363.9  165909.8  346.0
      sgr_5x5_8bpc_neon:   254988.4   248184.2   190875.1   196939.1  120517.2  252.1
      sgr_mix_8bpc_neon:   589204.7   563565.8   414025.6   427702.2  251651.2  533.4
      ce80e6da
    • Martin Storsjö · 8bd31a92
  13. Nov 10, 2024
  14. Nov 05, 2024
    • Brad Smith · 93f12c11
    • riscv64/mc: Only process w*3/4 elements in blend_v · a17c8625
      Nathan E. Egge authored
      Setting VL for this function only impacts the 16bpc performance, and
      only on the SpacemiT K1, which has two vector units of length 128b each.
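
      For reference, a rough paraphrase of the scalar logic in C (simplified;
      the real code is templated over bitdepth and takes its mask from a
      shared table): only the left w*3/4 columns are blended, since the mask
      leaves the remaining quarter of the destination untouched.

      #include <stddef.h>
      #include <stdint.h>

      static void blend_v_8bpc_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                      const uint8_t *tmp, int w, int h,
                                      const uint8_t *mask /* w entries */)
      {
          const int blend_w = (w * 3) >> 2;   /* e.g. 24 of 32 columns */
          do {
              for (int x = 0; x < blend_w; x++) {
                  const int m = mask[x];
                  dst[x] = (uint8_t)((dst[x] * (64 - m) + tmp[x] * m + 32) >> 6);
              }
              dst += dst_stride;
              tmp += w;
          } while (--h);
      }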
      
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_8bpc_c:        220.0 ( 1.00x)    221.3 ( 1.00x)    0.59%
      blend_v_w2_8bpc_rvv:      145.7 ( 1.51x)    148.2 ( 1.49x)    1.72%
      blend_v_w4_8bpc_c:        942.1 ( 1.00x)    943.7 ( 1.00x)    0.17%
      blend_v_w4_8bpc_rvv:      240.4 ( 3.92x)    242.9 ( 3.89x)    1.04%
      blend_v_w8_8bpc_c:       1782.3 ( 1.00x)   1783.8 ( 1.00x)    0.08%
      blend_v_w8_8bpc_rvv:      252.6 ( 7.06x)    254.9 ( 7.00x)    0.91%
      blend_v_w16_8bpc_c:      3650.9 ( 1.00x)   3647.0 ( 1.00x)   -0.11%
      blend_v_w16_8bpc_rvv:     495.5 ( 7.37x)    494.4 ( 7.38x)   -0.22%
      blend_v_w32_8bpc_c:      7013.0 ( 1.00x)   7018.2 ( 1.00x)    0.07%
      blend_v_w32_8bpc_rvv:     807.9 ( 8.68x)    802.0 ( 8.75x)   -0.73%
      
      blend_v_w2_16bpc_c:       226.1 ( 1.00x)    225.5 ( 1.00x)   -0.27%
      blend_v_w2_16bpc_rvv:     148.6 ( 1.52x)    148.9 ( 1.51x)    0.20%
      blend_v_w4_16bpc_c:      1010.7 ( 1.00x)   1006.7 ( 1.00x)   -0.40%
      blend_v_w4_16bpc_rvv:     306.7 ( 3.30x)    307.4 ( 3.27x)    0.23%
      blend_v_w8_16bpc_c:      1990.2 ( 1.00x)   1996.1 ( 1.00x)    0.30%
      blend_v_w8_16bpc_rvv:     519.5 ( 3.83x)    523.4 ( 3.81x)    0.75%
      blend_v_w16_16bpc_c:     3744.5 ( 1.00x)   3742.4 ( 1.00x)   -0.06%
      blend_v_w16_16bpc_rvv:    899.6 ( 4.16x)    906.4 ( 4.13x)    0.76%
      blend_v_w32_16bpc_c:     7047.5 ( 1.00x)   7079.3 ( 1.00x)    0.45%
      blend_v_w32_16bpc_rvv:   1475.5 ( 4.78x)   1483.3 ( 4.77x)    0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_8bpc_c:        216.3 ( 1.00x)    214.4 ( 1.00x)   -0.88%
      blend_v_w2_8bpc_rvv:      144.0 ( 1.50x)    143.6 ( 1.49x)   -0.28%
      blend_v_w4_8bpc_c:        919.8 ( 1.00x)    918.1 ( 1.00x)   -0.18%
      blend_v_w4_8bpc_rvv:      236.6 ( 3.89x)    236.4 ( 3.88x)   -0.08%
      blend_v_w8_8bpc_c:       1739.3 ( 1.00x)   1736.8 ( 1.00x)   -0.14%
      blend_v_w8_8bpc_rvv:      236.8 ( 7.34x)    236.3 ( 7.35x)   -0.21%
      blend_v_w16_8bpc_c:      3374.7 ( 1.00x)   3374.9 ( 1.00x)    0.01%
      blend_v_w16_8bpc_rvv:     297.0 (11.36x)    296.8 (11.37x)   -0.07%
      blend_v_w32_8bpc_c:      6647.5 ( 1.00x)   6645.5 ( 1.00x)   -0.03%
      blend_v_w32_8bpc_rvv:     403.3 (16.48x)    402.4 (16.51x)   -0.22%
      
      blend_v_w2_16bpc_c:       221.4 ( 1.00x)    220.1 ( 1.00x)   -0.59%
      blend_v_w2_16bpc_rvv:     146.3 ( 1.51x)    147.3 ( 1.49x)    0.68%
      blend_v_w4_16bpc_c:       973.3 ( 1.00x)    972.7 ( 1.00x)   -0.06%
      blend_v_w4_16bpc_rvv:     280.3 ( 3.47x)    282.1 ( 3.45x)    0.64%
      blend_v_w8_16bpc_c:      1814.8 ( 1.00x)   1816.2 ( 1.00x)    0.08%
      blend_v_w8_16bpc_rvv:     376.6 ( 4.82x)    376.9 ( 4.82x)    0.08%
      blend_v_w16_16bpc_c:     3485.5 ( 1.00x)   3485.5 ( 1.00x)    0.00%
      blend_v_w16_16bpc_rvv:    531.1 ( 6.56x)    525.6 ( 6.63x)   -1.04%
      blend_v_w32_16bpc_c:     6788.3 ( 1.00x)   6778.8 ( 1.00x)   -0.14%
      blend_v_w32_16bpc_rvv:    904.5 ( 7.51x)    854.6 ( 7.93x)   -5.52%
      a17c8625
  15. Nov 04, 2024
    • riscv64/mc16: Unroll 16bpc RVV blend_v 2x · 907dd871
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       225.8 ( 1.00x)    225.7 ( 1.00x)   -0.04%
      blend_v_w2_16bpc_rvv:     194.7 ( 1.16x)    148.6 ( 1.52x)  -23.68%
      blend_v_w4_16bpc_c:      1011.3 ( 1.00x)   1005.8 ( 1.00x)   -0.54%
      blend_v_w4_16bpc_rvv:     387.2 ( 2.61x)    305.4 ( 3.29x)  -21.13%
      blend_v_w8_16bpc_c:      1878.5 ( 1.00x)   1872.7 ( 1.00x)   -0.31%
      blend_v_w8_16bpc_rvv:     475.3 ( 3.95x)    435.6 ( 4.30x)   -8.35%
      blend_v_w16_16bpc_c:     3601.9 ( 1.00x)   3601.6 ( 1.00x)   -0.01%
      blend_v_w16_16bpc_rvv:    891.2 ( 4.04x)    892.7 ( 4.03x)    0.17%
      blend_v_w32_16bpc_c:     7043.7 ( 1.00x)   7058.8 ( 1.00x)    0.21%
      blend_v_w32_16bpc_rvv:   1384.5 ( 5.09x)   1478.0 ( 4.78x)    6.75%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       222.6 ( 1.00x)    220.5 ( 1.00x)   -0.94%
      blend_v_w2_16bpc_rvv:     195.7 ( 1.14x)    146.6 ( 1.50x)  -25.09%
      blend_v_w4_16bpc_c:       972.3 ( 1.00x)    972.0 ( 1.00x)   -0.03%
      blend_v_w4_16bpc_rvv:     349.1 ( 2.79x)    281.9 ( 3.45x)  -19.25%
      blend_v_w8_16bpc_c:      1812.1 ( 1.00x)   1813.0 ( 1.00x)    0.05%
      blend_v_w8_16bpc_rvv:     481.5 ( 3.76x)    376.0 ( 4.82x)  -21.91%
      blend_v_w16_16bpc_c:     3488.4 ( 1.00x)   3484.6 ( 1.00x)   -0.11%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.73x)    523.4 ( 6.66x)  -14.01%
      blend_v_w32_16bpc_c:     6795.3 ( 1.00x)   6792.4 ( 1.00x)   -0.04%
      blend_v_w32_16bpc_rvv:    934.8 ( 7.27x)    907.3 ( 7.49x)   -2.94%
      907dd871
    • riscv64/mc16: Branchless vsetvl in blend_v function · 9710e7de
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       226.0 ( 1.00x)    226.1 ( 1.00x)    0.04%
      blend_v_w2_16bpc_rvv:     194.0 ( 1.16x)    193.9 ( 1.17x)   -0.05%
      blend_v_w4_16bpc_c:      1011.8 ( 1.00x)   1009.4 ( 1.00x)   -0.24%
      blend_v_w4_16bpc_rvv:     392.7 ( 2.58x)    390.8 ( 2.58x)   -0.48%
      blend_v_w8_16bpc_c:      1987.9 ( 1.00x)   1988.0 ( 1.00x)    0.01%
      blend_v_w8_16bpc_rvv:     561.5 ( 3.54x)    560.2 ( 3.55x)   -0.23%
      blend_v_w16_16bpc_c:     3738.1 ( 1.00x)   3739.1 ( 1.00x)    0.03%
      blend_v_w16_16bpc_rvv:    934.1 ( 4.00x)    932.2 ( 4.01x)   -0.20%
      blend_v_w32_16bpc_c:     7031.0 ( 1.00x)   7030.1 ( 1.00x)   -0.01%
      blend_v_w32_16bpc_rvv:   1403.3 ( 5.01x)   1395.8 ( 5.04x)   -0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.0 ( 1.00x)    221.2 ( 1.00x)    0.09%
      blend_v_w2_16bpc_rvv:     195.2 ( 1.13x)    196.0 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       969.8 ( 1.00x)    971.9 ( 1.00x)    0.22%
      blend_v_w4_16bpc_rvv:     348.8 ( 2.78x)    349.1 ( 2.78x)    0.09%
      blend_v_w8_16bpc_c:      1812.6 ( 1.00x)   1814.9 ( 1.00x)    0.13%
      blend_v_w8_16bpc_rvv:     486.1 ( 3.73x)    484.3 ( 3.75x)   -0.37%
      blend_v_w16_16bpc_c:     3483.0 ( 1.00x)   3485.1 ( 1.00x)    0.06%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.72x)    607.4 ( 5.74x)   -0.21%
      blend_v_w32_16bpc_c:     6791.8 ( 1.00x)   6794.2 ( 1.00x)    0.04%
      blend_v_w32_16bpc_rvv:    940.6 ( 7.22x)    942.1 ( 7.21x)    0.16%
      9710e7de
    • riscv64/mc16: Add VLEN=256 8bpc RVV blend_v function · 28d1c217
      Nathan E. Egge authored
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.5 ( 1.00x)    220.3 ( 1.00x)   -0.54%
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)    194.3 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       968.8 ( 1.00x)    967.2 ( 1.00x)   -0.17%
      blend_v_w4_16bpc_rvv:     442.2 ( 2.19x)    347.4 ( 2.78x)  -21.44%
      blend_v_w8_16bpc_c:      1809.4 ( 1.00x)   1811.2 ( 1.00x)    0.10%
      blend_v_w8_16bpc_rvv:     557.4 ( 3.25x)    483.2 ( 3.75x)  -13.31%
      blend_v_w16_16bpc_c:     3481.4 ( 1.00x)   3473.4 ( 1.00x)   -0.23%
      blend_v_w16_16bpc_rvv:    844.3 ( 4.12x)    603.1 ( 5.76x)  -28.57%
      blend_v_w32_16bpc_c:     6783.1 ( 1.00x)   6749.8 ( 1.00x)   -0.49%
      blend_v_w32_16bpc_rvv:   1406.1 ( 4.82x)    919.4 ( 7.34x)  -34.61%
      28d1c217
    • riscv64/mc16: Add 16bpc RVV blend_v function · aa2deb89
      Nathan E. Egge authored
      Kendryte K230
      
      blend_v_w2_16bpc_c:       226.5 ( 1.00x)
      blend_v_w2_16bpc_rvv:     192.2 ( 1.18x)
      blend_v_w4_16bpc_c:      1010.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     390.5 ( 2.59x)
      blend_v_w8_16bpc_c:      1994.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     561.7 ( 3.55x)
      blend_v_w16_16bpc_c:     3737.9 ( 1.00x)
      blend_v_w16_16bpc_rvv:    928.0 ( 4.03x)
      blend_v_w32_16bpc_c:     7064.7 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1428.9 ( 4.94x)
      
      SpacemiT K1
      
      blend_v_w2_16bpc_c:       220.8 ( 1.00x)
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)
      blend_v_w4_16bpc_c:       967.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     439.5 ( 2.20x)
      blend_v_w8_16bpc_c:      1810.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     555.3 ( 3.26x)
      blend_v_w16_16bpc_c:     3476.4 ( 1.00x)
      blend_v_w16_16bpc_rvv:    830.9 ( 4.18x)
      blend_v_w32_16bpc_c:     6772.9 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1356.3 ( 4.99x)
      aa2deb89