  1. Dec 02, 2024
  2. Nov 28, 2024
    • Victorien Le Couviour--Tuffet's avatar
      flush: Reset f->task_thread.error · 575af258
      Victorien Le Couviour--Tuffet authored
      f->task_thread.error can be set during flushing. Not resetting it can
      lead to c->task_thread.first being increased after a frame has already
      been submitted post flushing. That's fine if it happens on the very first
      frame, but on any subsequent frame it results in a wrong frame ordering.
      Once a non-first frame is considered to be the first one, its tasks can't
      execute (since they depend on a truly previous frame that is now
      considered as being after it), and c->task_thread.cur is increased past
      that frame with no way of it being reset, eventually leading to a hang.
      575af258
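The idea behind this fix can be modelled in a few lines of C. This is only a simplified sketch, not dav1d's actual code: the FrameCtx and DecoderCtx types and the flush_reset_errors() helper are hypothetical stand-ins, and a plain int stands in for the per-frame error flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the decoder state. */
typedef struct {
    int error;          /* models f->task_thread.error */
} FrameCtx;

typedef struct {
    FrameCtx *frames;   /* per-frame contexts */
    size_t n_frames;
} DecoderCtx;

/* Clear every per-frame error flag as part of a flush, so that an error
 * raised while flushing cannot affect frames submitted afterwards. */
static void flush_reset_errors(DecoderCtx *c) {
    assert(c != NULL);
    for (size_t i = 0; i < c->n_frames; i++)
        c->frames[i].error = 0;
}
```

Without such a reset, a stale flag would still be visible when the next frame is scheduled, which is the situation the commit message describes.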
  3. Nov 26, 2024
  4. Nov 21, 2024
  5. Nov 19, 2024
    • Martin Storsjö's avatar
      arm32: looprestoration: Rewrite the SGR functions · 30c3dd8e
      Martin Storsjö authored
      Switch to the same cache-friendly algorithm as was done for arm64
      in c121b831.
      
      This uses much less stack memory, and is much more cache friendly.
      In this form, most of the individual asm functions only operate on
      one single row of data at a time.
      
      Some of the functions used to be unrolled to operate on two rows
      at a time, while they now only operate on one at a time. In practice,
      this is still a large performance win, as data is accessed in a
      much more cache friendly manner.
      
      This gives a 2-37% speedup, and reduces the peak amount of stack
      used for these functions from 255 KB to 33 KB.
      
      Before:              Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
      sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
      sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
      sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
      sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
      sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
      After:
      sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
      sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
      sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
      sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
      sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
      sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
      
      With this change in place, dav1d can run with around 72 KB of stack
      on arm targets.
      
      Not all functions have been merged in the same way as they were
      for arm64 in c121b831, so some
      minor differences remain; it's possible to incrementally optimize
      this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse
      finish_filter_row1/2 with sgr_weighted_row1, and make a version of
      finish_filter_row1 that produces 2 rows, like is done for arm64.
      
      It's also possible to rewrite the logic for calculating sgr_x_by_x
      in the same way as was done for arm64 in
      79db1624.
      30c3dd8e
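The row-at-a-time structure described above can be illustrated with a plain C sketch of a 3x3 box sum that keeps only three rows of intermediate data alive. This is an assumption-laden illustration, not the NEON code: ROW_W, box3_row_h and box3_sliding are made-up names, edges are simply replicated, and the real filter also tracks sums of squares.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROW_W 8 /* tiny row width for this sketch only */

/* Horizontal 3-tap box sum of one row; edge pixels are replicated. */
static void box3_row_h(int32_t *sum, const uint8_t *src, int w) {
    assert(w > 0);
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        sum[x] = src[l] + src[x] + src[r];
    }
}

/* Vertical 3-tap sum over the three rows currently in the window. */
static void box3_row_v(int32_t *out, int32_t *const rows[3], int w) {
    for (int x = 0; x < w; x++)
        out[x] = rows[0][x] + rows[1][x] + rows[2][x];
}

/* 3x3 box sums for a w x h image using a rotating window of 3 rows,
 * instead of a buffer covering the whole block. */
static void box3_sliding(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t buf[3][ROW_W];
    int32_t *rows[3] = { buf[0], buf[1], buf[2] };
    assert(w <= ROW_W && h > 0);

    box3_row_h(rows[1], src, w);                    /* row 0 */
    memcpy(rows[0], rows[1], sizeof(int32_t) * w);  /* replicate above top */
    for (int y = 0; y < h; y++) {
        const int yn = y + 1 < h ? y + 1 : h - 1;   /* clamp below bottom */
        box3_row_h(rows[2], src + yn * w, w);
        box3_row_v(dst + y * w, rows, w);
        /* Rotate: the oldest row buffer is reused for the next row. */
        int32_t *const oldest = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = oldest;
    }
}
```

Each helper touches a single row per call, and the working set stays at three rows regardless of the image height, which is where the stack and cache benefits come from.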
  6. Nov 18, 2024
    • Martin Storsjö's avatar
      arm32: looprestoration: Apply simplifications to align with C code · 1b7f1263
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      and the arm64 assembly in ce80e6da,
      to the arm32 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:             Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:   926600.0   753468.3   553704.1   399379.1   369674.4
      sgr_5x5_8bpc_neon:   621722.9   540412.7   357275.9   274474.3   254996.0
      sgr_mix_8bpc_neon:  1529715.1  1171282.5   894982.9   659996.6   610407.2
      After:
      sgr_3x3_8bpc_neon:   899020.3   697278.6   541569.9   382824.3   353891.8
      sgr_5x5_8bpc_neon:   602183.2   498322.9   348974.5   264833.9   243837.7
      sgr_mix_8bpc_neon:  1497870.8  1182121.3   880470.9   635939.3   590909.3
      1b7f1263
    • Martin Storsjö's avatar
      c43debf1
    • Martin Storsjö's avatar
      arm: looprestoration: Fix the single line loop in sgr_weighted2 · 1c7433a5
      Martin Storsjö authored
      After processing one block, this accidentally jumped to the loop
      for processing two lines at once.
      
      The same bug was replicated in both 32 and 64 bit versions.
      1c7433a5
    • Martin Storsjö's avatar
      looprestoration: Rewrite the C version of the SGR filter · f32b3146
      Martin Storsjö authored
      This reduces the stack usage of these functions (the C version)
      significantly, and gives them a 15-40% speedup (on an Apple M3,
      with Xcode Clang 16).
      
      The C version of these functions does matter; even though we have
      assembly implementations of them on x86 and aarch64, those only
      cover the 8 and 10 bpc cases, while the C version is used as
      fallback for 12 bpc.
      
      This matches how these functions are implemented in the aarch64
      assembly; operate over a window of 3 or 5 lines (of 384 pixels
      each), instead of doing a full 384 x 64 block.
      
      The individual functions for filtering a line each end up
      much simpler, and closer to how this can be implemented in
      assembly - but the overall business logic ends up much
      more complex.
      
      The main difference from the aarch64 assembly implementation
      is that any buffer which is of int16_t size in the aarch64
      assembly implementation uses the type "coef" here, which
      is 32 bit in the 10/12 bpc cases. (This is required for handling
      the 12 bpc cases.)
      
      With this in place, dav1d can run with around 66 KB of stack
      on x86_64 with assembly enabled, with around 74 KB of stack on
      aarch64 with assembly enabled, and with 118 KB of stack with
      assembly disabled.
      
      This increases the binary size by around 14 KB (in the case of
      aarch64 with Xcode Clang 16).
      
      On 32 bit arm, dav1d still requires around 270 KB of stack, as
      that assembly implementation of the SGR filter uses a different
      algorithm.
      f32b3146
    • Martin Storsjö's avatar
      arm: looprestoration: Give symbols and defines unique names · 01d417c2
      Martin Storsjö authored
      As the machine specific init file is included in the common
      template, give symbols and defines unique names that won't
      clash with similar ones in the main template.
      01d417c2
    • Martin Storsjö's avatar
      847eece1
    • Martin Storsjö's avatar
    • Martin Storsjö's avatar
  7. Nov 16, 2024
  8. Nov 15, 2024
  9. Nov 14, 2024
    • Martin Storsjö's avatar
      arm: Use /proc/cpuinfo on linux if getauxval is unavailable · bed3a343
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      On really old libc versions, getauxval isn't available. Fall back
      on /proc/cpuinfo in those cases, just like we do on android too.
      bed3a343
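A fallback of this kind can be sketched as a small parser for the "Features" line of /proc/cpuinfo. This is a minimal illustration rather than dav1d's implementation: cpuinfo_has_flag() and cpu_has_flag_from_procfs() are hypothetical helpers, and lines longer than the local buffer are not handled specially.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if a "Features" line read from f lists flag as a whole word. */
static int cpuinfo_has_flag(FILE *f, const char *flag) {
    char line[1024];
    const size_t flen = strlen(flag);
    assert(f != NULL && flen > 0);
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Features", 8) != 0)
            continue;
        const char *p = strchr(line, ':');
        if (!p)
            continue;
        for (p = strstr(p, flag); p; p = strstr(p + 1, flag)) {
            const char before = p[-1];
            const char after = p[flen];
            const int ok_before = before == ' ' || before == '\t' || before == ':';
            const int ok_after = after == ' ' || after == '\n' || after == '\0';
            if (ok_before && ok_after)
                return 1;
        }
    }
    return 0;
}

/* Query the running system; reports 0 if /proc/cpuinfo is unreadable. */
static int cpu_has_flag_from_procfs(const char *flag) {
    FILE *const f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    const int found = cpuinfo_has_flag(f, flag);
    fclose(f);
    return found;
}
```

Matching whole words avoids false positives such as treating "vfpv3" as "vfp"-prefixed noise or the other way around.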
    • Martin Storsjö's avatar
      ci: Raise the timeout multipliers for jobs that run in QEMU · 718b62c8
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      For individual tests in dav1d-test-data, the default timeout
      is 30 seconds (which is the Meson default if nothing is
      specified). Previously it ran with a multiplier of 4, resulting
      in a total timeout of 120 seconds.
      
      When running tests in QEMU, exceeding this 120 second timeout
      could happen occasionally. Raise the multiplier to 10, allowing
      each individual job to run for up to 5 minutes.
      
      This should hopefully reduce the number of stray failures in the
      CI.
      
      For tests that already have a higher default timeout set, such
      as checkasm which has got a 180 second default timeout, this results
      in a much longer timeout period. However as long as we don't
      frequently see issues where these actually hang, it should be
      beneficial to just let them run to completion, rather than
      aborting early due to a tight timeout.
      718b62c8
    • Martin Storsjö's avatar
      arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon · 1648c232
      Martin Storsjö authored
      Also fix one case where the 32 bit input parameter w (which was in
      x6, now in x4) was used without zero extension, by referring to
      it as w4 instead.
      1648c232
  10. Nov 13, 2024
    • Martin Storsjö's avatar
      arm64: looprestoration: Apply simplifications to align with C code · ce80e6da
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      to the arm64 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:            Cortex A53        A55        A72        A73       A76  Apple M3
      sgr_3x3_8bpc_neon:   368583.2   363654.2   279958.1   272065.1  169353.3  354.6
      sgr_5x5_8bpc_neon:   258570.7   255018.5   200410.6   199478.3  117968.3  260.9
      sgr_mix_8bpc_neon:   603698.1   577383.3   482468.3   436540.4  256632.9  541.8
      After:
      sgr_3x3_8bpc_neon:   367873.2   357884.1   275462.4   268363.9  165909.8  346.0
      sgr_5x5_8bpc_neon:   254988.4   248184.2   190875.1   196939.1  120517.2  252.1
      sgr_mix_8bpc_neon:   589204.7   563565.8   414025.6   427702.2  251651.2  533.4
      ce80e6da
    • Martin Storsjö's avatar
      8bd31a92
  11. Nov 10, 2024
  12. Nov 05, 2024
    • Brad Smith's avatar
      93f12c11
    • Nathan E. Egge's avatar
      riscv64/mc: Only process w*3/4 elements in blend_v · a17c8625
      Nathan E. Egge authored
      Setting VL for this function only impacts the 16bpc performance and only
      on the SpacemiT K1 which has two vector units of length 128b each.
      
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_8bpc_c:        220.0 ( 1.00x)    221.3 ( 1.00x)    0.59%
      blend_v_w2_8bpc_rvv:      145.7 ( 1.51x)    148.2 ( 1.49x)    1.72%
      blend_v_w4_8bpc_c:        942.1 ( 1.00x)    943.7 ( 1.00x)    0.17%
      blend_v_w4_8bpc_rvv:      240.4 ( 3.92x)    242.9 ( 3.89x)    1.04%
      blend_v_w8_8bpc_c:       1782.3 ( 1.00x)   1783.8 ( 1.00x)    0.08%
      blend_v_w8_8bpc_rvv:      252.6 ( 7.06x)    254.9 ( 7.00x)    0.91%
      blend_v_w16_8bpc_c:      3650.9 ( 1.00x)   3647.0 ( 1.00x)   -0.11%
      blend_v_w16_8bpc_rvv:     495.5 ( 7.37x)    494.4 ( 7.38x)   -0.22%
      blend_v_w32_8bpc_c:      7013.0 ( 1.00x)   7018.2 ( 1.00x)    0.07%
      blend_v_w32_8bpc_rvv:     807.9 ( 8.68x)    802.0 ( 8.75x)   -0.73%
      
      blend_v_w2_16bpc_c:       226.1 ( 1.00x)    225.5 ( 1.00x)   -0.27%
      blend_v_w2_16bpc_rvv:     148.6 ( 1.52x)    148.9 ( 1.51x)    0.20%
      blend_v_w4_16bpc_c:      1010.7 ( 1.00x)   1006.7 ( 1.00x)   -0.40%
      blend_v_w4_16bpc_rvv:     306.7 ( 3.30x)    307.4 ( 3.27x)    0.23%
      blend_v_w8_16bpc_c:      1990.2 ( 1.00x)   1996.1 ( 1.00x)    0.30%
      blend_v_w8_16bpc_rvv:     519.5 ( 3.83x)    523.4 ( 3.81x)    0.75%
      blend_v_w16_16bpc_c:     3744.5 ( 1.00x)   3742.4 ( 1.00x)   -0.06%
      blend_v_w16_16bpc_rvv:    899.6 ( 4.16x)    906.4 ( 4.13x)    0.76%
      blend_v_w32_16bpc_c:     7047.5 ( 1.00x)   7079.3 ( 1.00x)    0.45%
      blend_v_w32_16bpc_rvv:   1475.5 ( 4.78x)   1483.3 ( 4.77x)    0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_8bpc_c:        216.3 ( 1.00x)    214.4 ( 1.00x)   -0.88%
      blend_v_w2_8bpc_rvv:      144.0 ( 1.50x)    143.6 ( 1.49x)   -0.28%
      blend_v_w4_8bpc_c:        919.8 ( 1.00x)    918.1 ( 1.00x)   -0.18%
      blend_v_w4_8bpc_rvv:      236.6 ( 3.89x)    236.4 ( 3.88x)   -0.08%
      blend_v_w8_8bpc_c:       1739.3 ( 1.00x)   1736.8 ( 1.00x)   -0.14%
      blend_v_w8_8bpc_rvv:      236.8 ( 7.34x)    236.3 ( 7.35x)   -0.21%
      blend_v_w16_8bpc_c:      3374.7 ( 1.00x)   3374.9 ( 1.00x)    0.01%
      blend_v_w16_8bpc_rvv:     297.0 (11.36x)    296.8 (11.37x)   -0.07%
      blend_v_w32_8bpc_c:      6647.5 ( 1.00x)   6645.5 ( 1.00x)   -0.03%
      blend_v_w32_8bpc_rvv:     403.3 (16.48x)    402.4 (16.51x)   -0.22%
      
      blend_v_w2_16bpc_c:       221.4 ( 1.00x)    220.1 ( 1.00x)   -0.59%
      blend_v_w2_16bpc_rvv:     146.3 ( 1.51x)    147.3 ( 1.49x)    0.68%
      blend_v_w4_16bpc_c:       973.3 ( 1.00x)    972.7 ( 1.00x)   -0.06%
      blend_v_w4_16bpc_rvv:     280.3 ( 3.47x)    282.1 ( 3.45x)    0.64%
      blend_v_w8_16bpc_c:      1814.8 ( 1.00x)   1816.2 ( 1.00x)    0.08%
      blend_v_w8_16bpc_rvv:     376.6 ( 4.82x)    376.9 ( 4.82x)    0.08%
      blend_v_w16_16bpc_c:     3485.5 ( 1.00x)   3485.5 ( 1.00x)    0.00%
      blend_v_w16_16bpc_rvv:    531.1 ( 6.56x)    525.6 ( 6.63x)   -1.04%
      blend_v_w32_16bpc_c:     6788.3 ( 1.00x)   6778.8 ( 1.00x)   -0.14%
      blend_v_w32_16bpc_rvv:    904.5 ( 7.51x)    854.6 ( 7.93x)   -5.52%
      a17c8625
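What "only process w*3/4 elements" means can be shown with a scalar C sketch. It is an illustration under assumptions, not the RVV code: blend_v_sketch() is a made-up name, pixels are 8-bit for brevity, and the caller-supplied mask is assumed to be zero for the last quarter of the columns, so that blending there would be a no-op anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 6-bit weighted blend with rounding: m = 0 keeps a, m = 64 keeps b. */
static inline uint8_t blend_px(int a, int b, int m) {
    return (uint8_t)((a * (64 - m) + b * m + 32) >> 6);
}

/* Blend tmp into dst with a per-column mask, touching only the first
 * w*3/4 columns of each row. The mask values are supplied by the caller
 * in this sketch. */
static void blend_v_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *tmp, int w, int h,
                           const uint8_t *mask)
{
    assert(w > 0 && h > 0);
    const int active = (w * 3) >> 2; /* columns that can actually change */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < active; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst += dst_stride;
        tmp += w;
    }
}
```

In a vector implementation the same bound would be applied through the vector length rather than a loop limit, which is why the effect depends on how wide the hardware's vector units are.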
  13. Nov 04, 2024
    • Nathan E. Egge's avatar
      riscv64/mc16: Unroll 16bpc RVV blend_v 2x · 907dd871
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       225.8 ( 1.00x)    225.7 ( 1.00x)   -0.04%
      blend_v_w2_16bpc_rvv:     194.7 ( 1.16x)    148.6 ( 1.52x)  -23.68%
      blend_v_w4_16bpc_c:      1011.3 ( 1.00x)   1005.8 ( 1.00x)   -0.54%
      blend_v_w4_16bpc_rvv:     387.2 ( 2.61x)    305.4 ( 3.29x)  -21.13%
      blend_v_w8_16bpc_c:      1878.5 ( 1.00x)   1872.7 ( 1.00x)   -0.31%
      blend_v_w8_16bpc_rvv:     475.3 ( 3.95x)    435.6 ( 4.30x)   -8.35%
      blend_v_w16_16bpc_c:     3601.9 ( 1.00x)   3601.6 ( 1.00x)   -0.01%
      blend_v_w16_16bpc_rvv:    891.2 ( 4.04x)    892.7 ( 4.03x)    0.17%
      blend_v_w32_16bpc_c:     7043.7 ( 1.00x)   7058.8 ( 1.00x)    0.21%
      blend_v_w32_16bpc_rvv:   1384.5 ( 5.09x)   1478.0 ( 4.78x)    6.75%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       222.6 ( 1.00x)    220.5 ( 1.00x)   -0.94%
      blend_v_w2_16bpc_rvv:     195.7 ( 1.14x)    146.6 ( 1.50x)  -25.09%
      blend_v_w4_16bpc_c:       972.3 ( 1.00x)    972.0 ( 1.00x)   -0.03%
      blend_v_w4_16bpc_rvv:     349.1 ( 2.79x)    281.9 ( 3.45x)  -19.25%
      blend_v_w8_16bpc_c:      1812.1 ( 1.00x)   1813.0 ( 1.00x)    0.05%
      blend_v_w8_16bpc_rvv:     481.5 ( 3.76x)    376.0 ( 4.82x)  -21.91%
      blend_v_w16_16bpc_c:     3488.4 ( 1.00x)   3484.6 ( 1.00x)   -0.11%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.73x)    523.4 ( 6.66x)  -14.01%
      blend_v_w32_16bpc_c:     6795.3 ( 1.00x)   6792.4 ( 1.00x)   -0.04%
      blend_v_w32_16bpc_rvv:    934.8 ( 7.27x)    907.3 ( 7.49x)   -2.94%
      907dd871
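The Delta and speedup columns in these tables appear to follow from the raw timings by simple arithmetic. The helpers below are a hypothetical sketch of that calculation (delta_percent() and speedup() are illustrative names, not part of any benchmark tool).

```c
#include <assert.h>

/* Relative change in runtime, in percent; negative means faster. */
static double delta_percent(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Speedup of an optimized routine relative to the C reference timing. */
static double speedup(double c_time, double opt_time) {
    assert(opt_time > 0.0);
    return c_time / opt_time;
}
```

For example, the Kendryte K230 row blend_v_w2_16bpc_rvv goes from 194.7 to 148.6, which gives a delta of about -23.68%, and 225.7 / 148.6 gives the listed speedup of roughly 1.52x over the C version.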
    • Nathan E. Egge's avatar
      riscv64/mc16: Branchless vsetvl in blend_v function · 9710e7de
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       226.0 ( 1.00x)    226.1 ( 1.00x)    0.04%
      blend_v_w2_16bpc_rvv:     194.0 ( 1.16x)    193.9 ( 1.17x)   -0.05%
      blend_v_w4_16bpc_c:      1011.8 ( 1.00x)   1009.4 ( 1.00x)   -0.24%
      blend_v_w4_16bpc_rvv:     392.7 ( 2.58x)    390.8 ( 2.58x)   -0.48%
      blend_v_w8_16bpc_c:      1987.9 ( 1.00x)   1988.0 ( 1.00x)    0.01%
      blend_v_w8_16bpc_rvv:     561.5 ( 3.54x)    560.2 ( 3.55x)   -0.23%
      blend_v_w16_16bpc_c:     3738.1 ( 1.00x)   3739.1 ( 1.00x)    0.03%
      blend_v_w16_16bpc_rvv:    934.1 ( 4.00x)    932.2 ( 4.01x)   -0.20%
      blend_v_w32_16bpc_c:     7031.0 ( 1.00x)   7030.1 ( 1.00x)   -0.01%
      blend_v_w32_16bpc_rvv:   1403.3 ( 5.01x)   1395.8 ( 5.04x)   -0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.0 ( 1.00x)    221.2 ( 1.00x)    0.09%
      blend_v_w2_16bpc_rvv:     195.2 ( 1.13x)    196.0 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       969.8 ( 1.00x)    971.9 ( 1.00x)    0.22%
      blend_v_w4_16bpc_rvv:     348.8 ( 2.78x)    349.1 ( 2.78x)    0.09%
      blend_v_w8_16bpc_c:      1812.6 ( 1.00x)   1814.9 ( 1.00x)    0.13%
      blend_v_w8_16bpc_rvv:     486.1 ( 3.73x)    484.3 ( 3.75x)   -0.37%
      blend_v_w16_16bpc_c:     3483.0 ( 1.00x)   3485.1 ( 1.00x)    0.06%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.72x)    607.4 ( 5.74x)   -0.21%
      blend_v_w32_16bpc_c:     6791.8 ( 1.00x)   6794.2 ( 1.00x)    0.04%
      blend_v_w32_16bpc_rvv:    940.6 ( 7.22x)    942.1 ( 7.21x)    0.16%
      9710e7de
    • Nathan E. Egge's avatar
      riscv64/mc16: Add VLEN=256 8bpc RVV blend_v function · 28d1c217
      Nathan E. Egge authored
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.5 ( 1.00x)    220.3 ( 1.00x)   -0.54%
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)    194.3 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       968.8 ( 1.00x)    967.2 ( 1.00x)   -0.17%
      blend_v_w4_16bpc_rvv:     442.2 ( 2.19x)    347.4 ( 2.78x)  -21.44%
      blend_v_w8_16bpc_c:      1809.4 ( 1.00x)   1811.2 ( 1.00x)    0.10%
      blend_v_w8_16bpc_rvv:     557.4 ( 3.25x)    483.2 ( 3.75x)  -13.31%
      blend_v_w16_16bpc_c:     3481.4 ( 1.00x)   3473.4 ( 1.00x)   -0.23%
      blend_v_w16_16bpc_rvv:    844.3 ( 4.12x)    603.1 ( 5.76x)  -28.57%
      blend_v_w32_16bpc_c:     6783.1 ( 1.00x)   6749.8 ( 1.00x)   -0.49%
      blend_v_w32_16bpc_rvv:   1406.1 ( 4.82x)    919.4 ( 7.34x)  -34.61%
      28d1c217
    • Nathan E. Egge's avatar
      riscv64/mc16: Add 16bpc RVV blend_v function · aa2deb89
      Nathan E. Egge authored
      Kendryte K230
      
      blend_v_w2_16bpc_c:       226.5 ( 1.00x)
      blend_v_w2_16bpc_rvv:     192.2 ( 1.18x)
      blend_v_w4_16bpc_c:      1010.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     390.5 ( 2.59x)
      blend_v_w8_16bpc_c:      1994.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     561.7 ( 3.55x)
      blend_v_w16_16bpc_c:     3737.9 ( 1.00x)
      blend_v_w16_16bpc_rvv:    928.0 ( 4.03x)
      blend_v_w32_16bpc_c:     7064.7 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1428.9 ( 4.94x)
      
      SpacemiT K1
      
      blend_v_w2_16bpc_c:       220.8 ( 1.00x)
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)
      blend_v_w4_16bpc_c:       967.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     439.5 ( 2.20x)
      blend_v_w8_16bpc_c:      1810.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     555.3 ( 3.26x)
      blend_v_w16_16bpc_c:     3476.4 ( 1.00x)
      blend_v_w16_16bpc_rvv:    830.9 ( 4.18x)
      blend_v_w32_16bpc_c:     6772.9 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1356.3 ( 4.99x)
      aa2deb89
  14. Oct 31, 2024
    • Nathan E. Egge's avatar
      riscv64/mc16: Unroll 16bpc RVV blend 2x · c783088f
      Nathan E. Egge authored
      Kendryte K230              Before               After         Delta
      
      blend_w4_16bpc_c:       210.0 ( 1.00x)      208.9 ( 1.00x)   -0.52%
      blend_w4_16bpc_rvv:      88.5 ( 2.37x)       66.2 ( 3.15x)  -25.20%
      blend_w8_16bpc_c:       614.1 ( 1.00x)      613.5 ( 1.00x)   -0.10%
      blend_w8_16bpc_rvv:     143.1 ( 4.29x)      126.9 ( 4.83x)  -11.32%
      blend_w16_16bpc_c:     2371.2 ( 1.00x)     2371.3 ( 1.00x)    0.00%
      blend_w16_16bpc_rvv:    461.1 ( 5.14x)      413.2 ( 5.74x)  -10.39%
      blend_w32_16bpc_c:     5998.4 ( 1.00x)     5998.4 ( 1.00x)    0.00%
      blend_w32_16bpc_rvv:    978.4 ( 6.13x)     1013.1 ( 5.92x)    3.55%
      
      SpacemiT K1                Before               After         Delta
      
      blend_w4_16bpc_c:       205.8 ( 1.00x)      205.9 ( 1.00x)    0.05%
      blend_w4_16bpc_rvv:      80.9 ( 2.54x)       64.9 ( 3.17x)  -19.78%
      blend_w8_16bpc_c:       599.9 ( 1.00x)      599.9 ( 1.00x)    0.00%
      blend_w8_16bpc_rvv:     134.4 ( 4.46x)      101.9 ( 5.89x)  -24.18%
      blend_w16_16bpc_c:     2316.5 ( 1.00x)     2316.5 ( 1.00x)    0.00%
      blend_w16_16bpc_rvv:    302.0 ( 7.67x)      262.8 ( 8.81x)  -12.98%
      blend_w32_16bpc_c:     5861.9 ( 1.00x)     5861.4 ( 1.00x)   -0.01%
      blend_w32_16bpc_rvv:    589.6 ( 9.94x)      602.2 ( 9.73x)    2.14%
      c783088f
    • Nathan E. Egge's avatar
      riscv64/mc16: Branchless vsetvl in blend function · 67c60d76
      Nathan E. Egge authored
      Kendryte K230              Before               After         Delta
      
      blend_w4_16bpc_c:       208.8 ( 1.00x)      209.9 ( 1.00x)    0.53%
      blend_w4_16bpc_rvv:      85.9 ( 2.43x)       88.6 ( 2.37x)    3.14%
      blend_w8_16bpc_c:       613.2 ( 1.00x)      614.3 ( 1.00x)    0.18%
      blend_w8_16bpc_rvv:     145.4 ( 4.22x)      143.1 ( 4.29x)   -1.58%
      blend_w16_16bpc_c:     2371.9 ( 1.00x)     2373.6 ( 1.00x)    0.07%
      blend_w16_16bpc_rvv:    464.0 ( 5.11x)      461.2 ( 5.15x)   -0.60%
      blend_w32_16bpc_c:     6005.6 ( 1.00x)     6007.7 ( 1.00x)    0.03%
      blend_w32_16bpc_rvv:    981.6 ( 6.12x)      979.4 ( 6.13x)   -0.22%
      
      SpacemiT K1                Before               After         Delta
      
      blend_w4_16bpc_c:       206.4 ( 1.00x)      205.7 ( 1.00x)   -0.34%
      blend_w4_16bpc_rvv:      79.5 ( 2.60x)       81.0 ( 2.54x)    1.89%
      blend_w8_16bpc_c:       600.7 ( 1.00x)      599.7 ( 1.00x)   -0.17%
      blend_w8_16bpc_rvv:     133.3 ( 4.51x)      134.1 ( 4.47x)    0.60%
      blend_w16_16bpc_c:     2315.9 ( 1.00x)     2315.2 ( 1.00x)   -0.03%
      blend_w16_16bpc_rvv:    305.2 ( 7.59x)      300.7 ( 7.70x)   -1.47%
      blend_w32_16bpc_c:     5861.1 ( 1.00x)     5860.2 ( 1.00x)   -0.02%
      blend_w32_16bpc_rvv:    592.5 ( 9.89x)      589.5 ( 9.94x)   -0.51%
      67c60d76
    • Nathan E. Egge's avatar
      riscv64/mc16: Add VLEN=256 8bpc RVV blend function · 3437a26b
      Nathan E. Egge authored
      SpacemiT K1                Before               After         Delta
      
      blend_w4_16bpc_c:       206.8 ( 1.00x)      206.0 ( 1.00x)   -0.39%
      blend_w4_16bpc_rvv:      95.8 ( 2.16x)       77.8 ( 2.65x)  -18.79%
      blend_w8_16bpc_c:       600.4 ( 1.00x)      600.1 ( 1.00x)   -0.05%
      blend_w8_16bpc_rvv:     161.7 ( 3.71x)      131.3 ( 4.57x)  -18.80%
      blend_w16_16bpc_c:     2317.6 ( 1.00x)     2316.5 ( 1.00x)   -0.05%
      blend_w16_16bpc_rvv:    459.6 ( 5.04x)      302.9 ( 7.65x)  -34.09%
      blend_w32_16bpc_c:     5863.0 ( 1.00x)     5863.3 ( 1.00x)    0.01%
      blend_w32_16bpc_rvv:    992.7 ( 5.91x)      578.1 (10.14x)  -41.76%
      3437a26b
  15. Oct 29, 2024
    • Nathan E. Egge's avatar
      meson: Move riscv64 8bpc only files into bitdepth sources · e542f661
      Nathan E. Egge authored
      The cdef.S, itx.S and mc.S files contain only 8bpc implementations and
      should be compiled only when building with the -Dbitdepths=8 configuration.
      e542f661
    • Nathan E. Egge's avatar
      riscv64/mc16: Add 16bpc RVV blend function · ca489d8a
      Nathan E. Egge authored and Luca Barbato committed
      Kendryte K230
      
      blend_w4_16bpc_c:        214.4 ( 1.00x)
      blend_w4_16bpc_rvv:       90.2 ( 2.38x)
      blend_w8_16bpc_c:        618.9 ( 1.00x)
      blend_w8_16bpc_rvv:      147.4 ( 4.20x)
      blend_w16_16bpc_c:      2376.5 ( 1.00x)
      blend_w16_16bpc_rvv:     466.0 ( 5.10x)
      blend_w32_16bpc_c:      6008.6 ( 1.00x)
      blend_w32_16bpc_rvv:     985.0 ( 6.10x)
      
      SpacemiT K1
      
      blend_w4_16bpc_c:        204.9 ( 1.00x)
      blend_w4_16bpc_rvv:       88.3 ( 2.32x)
      blend_w8_16bpc_c:        598.5 ( 1.00x)
      blend_w8_16bpc_rvv:      155.3 ( 3.85x)
      blend_w16_16bpc_c:      2315.4 ( 1.00x)
      blend_w16_16bpc_rvv:     444.4 ( 5.21x)
      blend_w32_16bpc_c:      5860.1 ( 1.00x)
      blend_w32_16bpc_rvv:     993.0 ( 5.90x)
      ca489d8a