- Dec 20, 2024
-
-
Martin Storsjö authored
Switch to the same cache-friendly algorithm as was done for arm64 in 2e73051c and for the reference C code in 8291a66e. Contrary to the arm64 implementation, this uses a main loop in C (very similar to the one in the main C implementation in 8291a66e) rather than assembly; this adds a bit more overhead on the call to each function, but it shouldn't affect the big picture much.

Performance-wise, this doesn't make much of a difference - it makes things a little bit faster on some cores, and a little bit slower on others:

Before:                 Cortex A7       A8      A53      A72      A73
wiener_7tap_8bpc_neon:   269384.4 147730.7 140028.5  92662.5  92929.0
wiener_7tap_10bpc_neon:  352690.2 159970.2 169427.8 116614.9 119371.1
After:
wiener_7tap_8bpc_neon:   238328.0 157274.1 134588.6  92200.3  97619.6
wiener_7tap_10bpc_neon:  336369.3 162182.0 161954.4 125521.2 130634.0

This is mostly in line with the results on arm64 in 2e73051c. On arm64, there was a somewhat larger speedup for the 7tap case, mostly attributed to unrolling the vertical filter (and the new filter_hv function) to operate on 16 pixels at a time. On arm32, there aren't enough registers to do that, so we can't get such gains from unrolling. (Reducing the unrolling in the arm64 version to match the arm32 case also shows performance numbers similar to those on arm32 here.) In the arm64 version, we also added separate 5tap versions of all functions; that is not done for arm32 at this point.

This increases the binary size by 2 KB. It doesn't have any immediate effect on how much stack space dav1d requires in total, since the largest stack users on arm currently are the 8tap_scaled functions.
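The row-by-row structure described above (a main loop in C driving per-row filter kernels) can be sketched roughly as follows. All names here (wiener_rows, filter_h_row, filter_v_row, MAX_W) are hypothetical, the per-row kernels are trivial stand-ins for the NEON functions, and edge replication/padding is omitted (it assumes h >= 7):

```c
#include <stdint.h>
#include <string.h>

#define MAX_W 16
#define TAPS  7

/* Trivial stand-in for the horizontal NEON kernel (identity filter). */
static void filter_h_row(int16_t *mid, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++)
        mid[x] = src[x];
}

/* Trivial stand-in for the vertical NEON kernel (plain average of 7 rows). */
static void filter_v_row(uint8_t *dst, int16_t *const *mid, int w) {
    for (int x = 0; x < w; x++) {
        int sum = 0;
        for (int t = 0; t < TAPS; t++)
            sum += mid[t][x];
        dst[x] = (uint8_t)(sum / TAPS);
    }
}

/* Main loop in C: keep a ring of 7 horizontally filtered rows, produce one
 * output row per iteration, then rotate the ring and filter one new input
 * row. Edge replication is omitted, so only interior rows are produced. */
static void wiener_rows(uint8_t *dst, const uint8_t *src, int w, int h) {
    int16_t rows[TAPS][MAX_W];
    int16_t *ring[TAPS];
    for (int i = 0; i < TAPS; i++)
        ring[i] = rows[i];
    for (int i = 0; i < TAPS; i++) /* prime the window (assumes h >= TAPS) */
        filter_h_row(ring[i], src + i * w, w);
    for (int y = TAPS / 2; y < h - TAPS / 2; y++) {
        filter_v_row(dst + y * w, ring, w);
        if (y + TAPS / 2 + 1 < h) { /* rotate: oldest slot takes the next row */
            int16_t *oldest = ring[0];
            memmove(ring, ring + 1, (TAPS - 1) * sizeof(*ring));
            ring[TAPS - 1] = oldest;
            filter_h_row(ring[TAPS - 1], src + (y + TAPS / 2 + 1) * w, w);
        }
    }
}
```

The ring rotation is what keeps the working set down to seven rows of intermediate data at a time, which is the cache-friendliness the commit refers to; the real code additionally replicates the top/bottom rows instead of skipping the edge outputs.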
-
- Dec 19, 2024
-
-
Martin Storsjö authored
This uses a separate function for combined horizontal and vertical filtering, without needing to write the intermediate results back to memory in between.

This mostly serves as an example of how to adjust the logic for that case; unless we actually merge the horizontal and vertical filtering within the _hv function, we still need space for a 7th row on the stack within that function (which means we use just as much stack as before), and we also need one extra memcpy to write it into the right destination.

In a build where the compiler is allowed to vectorize and inline the wiener functions into each other, this change actually reduces the final binary size by 4 KB, if the C version of the wiener filter is retained.

This change makes the vectorized C code as fast as it was before with Clang 18; with Xcode Clang 16, it's 2x slower than it was before. Unfortunately, with GCC, this change makes the code a bit slower again.
-
Martin Storsjö authored
This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16, if the C version of the filter is retained (which it isn't by default). This makes the vectorized C code roughly as fast as it was before the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster, while still being slower than it was initially.
-
Martin Storsjö authored
This reduces the stack usage of these functions (the C version) significantly.

These C versions aren't used on architectures that already have wiener filters implemented in assembly, but they matter both when running with assembly disabled (e.g. for sanitizer builds), and as an example of how to do a cache-efficient SIMD implementation.

This roughly matches how these functions are implemented in the aarch64 assembly (although that assembly implementation uses a main loop function written in assembly, and custom calling conventions between the functions).

With this in place, dav1d can run with around 76 KB of stack with assembly disabled.

This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16), unless built with (the default) -Dtrim_dsp=true. (By default, the C version of the wiener filter gets skipped entirely.)

On 32 bit arm, the assembly wiener function implementation still uses large buffers on the stack, but since other functions use less stack there, dav1d can still run with 72 KB of stack on that architecture.

Unfortunately, this change also makes the functions slower, depending on how well the compiler was able to optimize the previous version. With GCC (which didn't manage to vectorize the functions very well before), they become 1.6x-2.0x slower, while with Clang (where they were vectorized very well before) they get 2.5x-5x slower. Most of this performance can be regained with later changes on top, though.
-
- Dec 02, 2024
-
-
Luc Trudeau authored
-
Henrik Gramner authored
It previously used 'pixel', which is typedefed to uint8_t in files that aren't bitdepth-templated, but those values are indices rather than pixels, so that was just confusing and misleading.
-
- Nov 28, 2024
-
-
Victorien Le Couviour--Tuffet authored
f->task_thread.error can be set during flushing; not resetting it can lead to c->task_thread.first being increased after a frame has already been submitted post-flushing. That's fine if it happens on the very first frame, but if it happens on any subsequent frame it will incur wrong frame ordering. With a non-first frame being considered the first one, its tasks won't be able to execute (since they depend on a truly previous frame that is considered as coming after it), and c->task_thread.cur will be increased past that frame, with no way of being reset, eventually leading to a hang.
-
- Nov 26, 2024
-
-
- Nov 21, 2024
-
-
This allows immediately detecting unintended out-of-bounds writes like the ones fixed in 72b53807 and 1c7433a5. Extend the PIXEL_RECT macro to provide a variable containing the full, padded height of the buffer, for uses that operate on the full buffer. Allow overwriting past the right edge of the target output rectangle, up to an alignment of 64 pixels, but allow no overwriting past the bottom.
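A simplified sketch of the idea (the names pixel_rect, rect_alloc, ALIGN_W and V_PAD are hypothetical; dav1d's actual PIXEL_RECT macro in the checkasm harness differs in its details): round the stride up to a 64-pixel alignment so writes past the right edge stay inside the allocation, and expose the full padded height for operations that touch the whole buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_W 64  /* writes past the right edge are tolerated up to this */
#define V_PAD    8  /* hypothetical vertical padding above and below */

typedef struct {
    uint8_t  *full_buf;  /* start of the whole allocation */
    uint8_t  *buf;       /* top-left of the usable w x h rectangle */
    ptrdiff_t stride;    /* row stride, rounded up to 64 pixels */
    int       w, h;      /* requested rectangle */
    int       padded_h;  /* full buffer height, for full-buffer operations */
} pixel_rect;

static int rect_alloc(pixel_rect *r, int w, int h) {
    r->w = w;
    r->h = h;
    /* Rounding the stride up to 64 pixels means a SIMD kernel may safely
     * overwrite a bit past the right edge of the rectangle... */
    r->stride = (w + ALIGN_W - 1) & ~(ALIGN_W - 1);
    /* ...but there is no slack past the padded height, so overwriting
     * below the bottom remains a detectable error. */
    r->padded_h = h + 2 * V_PAD;
    r->full_buf = malloc((size_t)r->stride * r->padded_h);
    if (!r->full_buf) return -1;
    r->buf = r->full_buf + (ptrdiff_t)V_PAD * r->stride;
    return 0;
}
```

With this layout, bounds checkers (or guard-byte checks) can flag any write below the padded buffer while still permitting the common SIMD pattern of writing whole aligned vectors per row.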
-
-
- Nov 19, 2024
-
-
Martin Storsjö authored
Switch to the same cache-friendly algorithm as was done for arm64 in c121b831. This uses much less stack memory and is much more cache friendly.

In this form, most of the individual asm functions only operate on a single row of data at a time. Some of the functions used to be unrolled to operate on two rows at a time, while they now only operate on one. In practice, this is still a large performance win, as data is accessed in a much more cache-friendly manner.

This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB.

Before:              Cortex A7        A8       A53      A72      A73
sgr_3x3_8bpc_neon:    873990.7  748341.9  543410.2 383200.4 357502.9
sgr_3x3_10bpc_neon:   909728.0  732594.5  560123.6 392765.5 359377.7
sgr_5x5_8bpc_neon:    591597.9  527353.1  350347.4 263464.9 243098.8
sgr_5x5_10bpc_neon:   637958.2  529462.8  364613.3 280664.6 255164.6
sgr_mix_8bpc_neon:   1458977.4 1185423.2  884017.7 632922.5 587395.2
sgr_mix_10bpc_neon:  1532376.5 1259111.4  918729.3 658787.6 600317.0
After:
sgr_3x3_8bpc_neon:    836138.7  635556.5  530596.1 335794.6 348209.9
sgr_3x3_10bpc_neon:   850835.4  596445.0  534583.2 342713.4 349713.5
sgr_5x5_8bpc_neon:    577039.7  443916.5  341684.8 223374.0 232841.3
sgr_5x5_10bpc_neon:   600975.7  400041.3  347529.8 234759.9 239351.7
sgr_mix_8bpc_neon:   1297988.7  925739.1  830360.7 545476.1 548706.6
sgr_mix_10bpc_neon:  1340112.6  914395.7  873342.4 574815.7 554681.6

With this change in place, dav1d can run with around 72 KB of stack on arm targets.

Not all functions have been merged in the same way as they were for arm64 in c121b831, so some minor differences remain; it's possible to incrementally optimize this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1, and make a version of finish_filter_row1 that produces 2 rows, as is done for arm64. It's also possible to rewrite the logic for calculating sgr_x_by_x in the same way as was done for arm64 in 79db1624.
-
- Nov 18, 2024
-
-
Martin Storsjö authored
This applies the same simplifications that were done for the C code and the x86 assembly in 4613d3a5, and the arm64 assembly in ce80e6da, to the arm32 implementation. This gives a minor speedup of around a couple percent.

Before:             Cortex A7        A8       A53      A72      A73
sgr_3x3_8bpc_neon:   926600.0  753468.3  553704.1 399379.1 369674.4
sgr_5x5_8bpc_neon:   621722.9  540412.7  357275.9 274474.3 254996.0
sgr_mix_8bpc_neon:  1529715.1 1171282.5  894982.9 659996.6 610407.2
After:
sgr_3x3_8bpc_neon:   899020.3  697278.6  541569.9 382824.3 353891.8
sgr_5x5_8bpc_neon:   602183.2  498322.9  348974.5 264833.9 243837.7
sgr_mix_8bpc_neon:  1497870.8 1182121.3  880470.9 635939.3 590909.3
-
Martin Storsjö authored
-
Martin Storsjö authored
After processing one block, this accidentally jumped to the loop for processing two lines at once. The same bug was replicated in both 32 and 64 bit versions.
-
Martin Storsjö authored
This reduces the stack usage of these functions (the C version) significantly, and gives them a 15-40% speedup (on an Apple M3, with Xcode Clang 16).

The C versions of these functions do matter; even though we have assembly implementations of them on x86 and aarch64, those only cover the 8 and 10 bpc cases, while the C version is used as the fallback for 12 bpc.

This matches how these functions are implemented in the aarch64 assembly: operate over a window of 3 or 5 lines (of 384 pixels each), instead of doing a full 384 x 64 block. The individual functions for filtering a line each end up much simpler, and closer to how this can be implemented in assembly - but the overall business logic ends up much more complex.

The main difference from the aarch64 assembly implementation is that any buffer which is of int16_t size in the aarch64 assembly uses the type "coef" here, which is 32 bit in the 10/12 bpc cases. (This is required for handling the 12 bpc cases.)

With this in place, dav1d can run with around 66 KB of stack on x86_64 with assembly enabled, around 74 KB of stack on aarch64 with assembly enabled, and 118 KB of stack with assembly disabled.

This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16).

On 32 bit arm, dav1d still requires around 270 KB of stack, as that assembly implementation of the SGR filter uses a different algorithm.
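The windowed approach can be illustrated with a hypothetical helper (box3_row is a made-up name, not dav1d's actual function): instead of accumulating over a whole 384 x 64 block, each call only touches a 3-line window, which is what keeps both the stack footprint and the cache working set small. Note the "coef" type, 32 bit as described above so the 12 bpc sums fit.

```c
#include <stdint.h>

typedef int32_t coef; /* 32 bit so the 10/12 bpc sums of squares also fit */

/* Compute 3x3 box sums and sums of squares for one output row from a
 * 3-row window; the rows carry one pixel of left/right padding, so
 * rows[i][x .. x+2] covers the window centered on output pixel x. */
static void box3_row(coef *sum, coef *sumsq,
                     const uint16_t *const rows[3], int w) {
    for (int x = 0; x < w; x++) {
        coef s = 0, ss = 0;
        for (int i = 0; i < 3; i++)
            for (int dx = 0; dx < 3; dx++) {
                const coef p = rows[i][x + dx];
                s  += p;
                ss += p * p;
            }
        sum[x]   = s;
        sumsq[x] = ss;
    }
}
```

A 5x5 variant works the same way over a 5-line window; in the self-guided filter these sums then feed the per-pixel a/b surface computation, one row at a time.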
-
Martin Storsjö authored
As the machine-specific init file is included in the common template, give symbols and defines unique names that won't clash with similar ones in the main template.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Nov 16, 2024
-
-
Marvin Scholz authored
This is not an object, so putting it in the objects variable seems wrong, and would also break using gaspp for that file.
-
- Nov 15, 2024
-
-
Maryla Ustarroz authored
The '///<' syntax is used to document a field after the field. Mistakenly using it before the field results in the documentation going to the wrong field, see: https://videolan.videolan.me/dav1d/structDav1dMasteringDisplay.html
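As an illustration (ExampleDisplay is a made-up struct, not Dav1dMasteringDisplay), the two placements behave differently:

```c
#include <stdint.h>

typedef struct ExampleDisplay {
    /// A plain '///' comment placed *before* a field documents that field
    /// (here: max_luminance).
    uint32_t max_luminance;
    uint32_t min_luminance; ///< A '///<' comment documents the field it *follows* (here: min_luminance).
} ExampleDisplay;
/* Placing '///<' *before* a field instead attaches the text to the
 * preceding member - the mistake fixed by this commit. */
```

To Doxygen only the comment placement matters; the struct itself is ordinary C.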
-
Martin Storsjö authored
When renumbering argument registers in 1648c232, this one register reference was missed. The comparison was meant to compare h with 2, but accidentally ended up comparing bitdepth_max with 2. In the case of 8 bpc, there's actually no bitdepth_max parameter, so it ended up comparing an uninitialized value.
-
- Nov 14, 2024
-
-
On really old libc versions, getauxval isn't available. Fall back on /proc/cpuinfo in those cases, just like we already do on Android.
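A simplified sketch of such a fallback (cpuinfo_has_feature is a hypothetical helper, not dav1d's actual parser; in practice the text would be read from /proc/cpuinfo): scan each "Features" line and look for an exact feature token such as "neon".

```c
#include <string.h>

/* Return 1 if a "Features" line of the given cpuinfo text contains the
 * feature as an exact whitespace-separated token, 0 otherwise. */
static int cpuinfo_has_feature(const char *cpuinfo, const char *feat) {
    const char *p = cpuinfo;
    while (p) {
        if (!strncmp(p, "Features", 8)) {
            /* Copy just this line so we can tokenize it in place. */
            const char *end = strchr(p, '\n');
            size_t n = end ? (size_t)(end - p) : strlen(p);
            char line[512];
            if (n >= sizeof(line)) n = sizeof(line) - 1;
            memcpy(line, p, n);
            line[n] = '\0';
            /* Exact token match avoids e.g. "neon" matching "neonx". */
            for (char *tok = strtok(line, " \t:"); tok; tok = strtok(NULL, " \t:"))
                if (!strcmp(tok, feat))
                    return 1;
        }
        p = strchr(p, '\n');
        p = p ? p + 1 : NULL;
    }
    return 0;
}
```

getauxval(AT_HWCAP) remains the preferred path where available; parsing /proc/cpuinfo is only a last resort for old libc versions.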
-
For individual tests in dav1d-test-data, the default timeout is 30 seconds (which is the Meson default if nothing is specified). Previously these ran with a multiplier of 4, resulting in a total timeout of 120 seconds. When running tests in QEMU, this 120 second timeout could occasionally be exceeded. Raise the multiplier to 10, allowing each individual job to run for up to 5 minutes. This should hopefully reduce the number of stray failures in the CI.

For tests that already have a higher default timeout set, such as checkasm with its 180 second default, this results in a much longer timeout period. However, as long as we don't frequently see issues where these actually hang, it should be beneficial to just let them run to completion, rather than aborting early due to a tight timeout.
-
Martin Storsjö authored
Also fix one case where the 32 bit input parameter w (which was in x6, now in x4) was used without zero extension, by referencing it as w4 instead.
-
- Nov 13, 2024
-
-
Martin Storsjö authored
This applies the same simplifications that were done for the C code and the x86 assembly in 4613d3a5, to the arm64 implementation. This gives a minor speedup of around a couple percent.

Before:            Cortex A53      A55      A72      A73      A76  Apple M3
sgr_3x3_8bpc_neon:   368583.2 363654.2 279958.1 272065.1 169353.3     354.6
sgr_5x5_8bpc_neon:   258570.7 255018.5 200410.6 199478.3 117968.3     260.9
sgr_mix_8bpc_neon:   603698.1 577383.3 482468.3 436540.4 256632.9     541.8
After:
sgr_3x3_8bpc_neon:   367873.2 357884.1 275462.4 268363.9 165909.8     346.0
sgr_5x5_8bpc_neon:   254988.4 248184.2 190875.1 196939.1 120517.2     252.1
sgr_mix_8bpc_neon:   589204.7 563565.8 414025.6 427702.2 251651.2     533.4
-
Martin Storsjö authored
-
- Nov 10, 2024
-
-
Luca Barbato authored
-
Luca Barbato authored
-
Luca Barbato authored
They can be used across arches.
-
Luca Barbato authored
It makes the code tidier, and the runtime cost is negligible.
-
Luca Barbato authored
blend_h_w2_8bpc_pwr9:    18.4 ( 1.20x)
blend_h_w4_8bpc_pwr9:    27.2 ( 1.26x)
blend_h_w8_8bpc_pwr9:    27.9 ( 2.22x)
blend_h_w16_8bpc_pwr9:   35.1 ( 3.28x)
blend_h_w32_8bpc_pwr9:   57.4 ( 3.88x)
blend_h_w64_8bpc_pwr9:   97.9 ( 4.70x)
blend_h_w128_8bpc_pwr9: 207.6 ( 5.18x)
-
Luca Barbato authored
blend_v_w2_8bpc_pwr9:    25.0 ( 1.12x)
blend_v_w4_8bpc_pwr9:    79.3 ( 1.35x)
blend_v_w8_8bpc_pwr9:    79.5 ( 2.43x)
blend_v_w16_8bpc_pwr9:  108.0 ( 3.58x)
blend_v_w32_8bpc_pwr9:  153.5 ( 4.69x)
-
Luca Barbato authored
blend_w4_8bpc_pwr9:     14.4 ( 1.90x)
blend_w8_8bpc_pwr9:     19.9 ( 3.62x)
blend_w16_8bpc_pwr9:    50.6 ( 5.17x)
blend_w32_8bpc_pwr9:   125.8 ( 5.33x)
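The blend functions benchmarked in these commits compute, per pixel, a weighted mix of two sources with 6-bit mask weights. A scalar sketch of that operation (this mirrors the generic C fallback in spirit; the names and exact loop structure here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Per-pixel blend with 6-bit weights: m = 0 keeps a, m = 64 takes b. */
static inline uint8_t blend_px(uint8_t a, uint8_t b, int m) {
    return (uint8_t)((a * (64 - m) + b * m + 32) >> 6);
}

/* Blend a w x h block of tmp into dst, one mask weight per pixel. */
static void blend_rows(uint8_t *dst, ptrdiff_t stride, const uint8_t *tmp,
                       int w, int h, const uint8_t *mask) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst  += stride;
        tmp  += w;
        mask += w;
    }
}
```

The pwr9 versions vectorize this kind of loop with VSX; the speedups above grow with block width because wider rows amortize the per-row overhead across more vector lanes.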
-
- Nov 05, 2024
-
-
-
Nathan E. Egge authored
Setting VL for this function only impacts the 16bpc performance, and only on the SpacemiT K1, which has two vector units of 128 bits each.

Kendryte K230                    Before            After     Delta
blend_v_w2_8bpc_c:       220.0 ( 1.00x)   221.3 ( 1.00x)    0.59%
blend_v_w2_8bpc_rvv:     145.7 ( 1.51x)   148.2 ( 1.49x)    1.72%
blend_v_w4_8bpc_c:       942.1 ( 1.00x)   943.7 ( 1.00x)    0.17%
blend_v_w4_8bpc_rvv:     240.4 ( 3.92x)   242.9 ( 3.89x)    1.04%
blend_v_w8_8bpc_c:      1782.3 ( 1.00x)  1783.8 ( 1.00x)    0.08%
blend_v_w8_8bpc_rvv:     252.6 ( 7.06x)   254.9 ( 7.00x)    0.91%
blend_v_w16_8bpc_c:     3650.9 ( 1.00x)  3647.0 ( 1.00x)   -0.11%
blend_v_w16_8bpc_rvv:    495.5 ( 7.37x)   494.4 ( 7.38x)   -0.22%
blend_v_w32_8bpc_c:     7013.0 ( 1.00x)  7018.2 ( 1.00x)    0.07%
blend_v_w32_8bpc_rvv:    807.9 ( 8.68x)   802.0 ( 8.75x)   -0.73%
blend_v_w2_16bpc_c:      226.1 ( 1.00x)   225.5 ( 1.00x)   -0.27%
blend_v_w2_16bpc_rvv:    148.6 ( 1.52x)   148.9 ( 1.51x)    0.20%
blend_v_w4_16bpc_c:     1010.7 ( 1.00x)  1006.7 ( 1.00x)   -0.40%
blend_v_w4_16bpc_rvv:    306.7 ( 3.30x)   307.4 ( 3.27x)    0.23%
blend_v_w8_16bpc_c:     1990.2 ( 1.00x)  1996.1 ( 1.00x)    0.30%
blend_v_w8_16bpc_rvv:    519.5 ( 3.83x)   523.4 ( 3.81x)    0.75%
blend_v_w16_16bpc_c:    3744.5 ( 1.00x)  3742.4 ( 1.00x)   -0.06%
blend_v_w16_16bpc_rvv:   899.6 ( 4.16x)   906.4 ( 4.13x)    0.76%
blend_v_w32_16bpc_c:    7047.5 ( 1.00x)  7079.3 ( 1.00x)    0.45%
blend_v_w32_16bpc_rvv:  1475.5 ( 4.78x)  1483.3 ( 4.77x)    0.53%

SpacemiT K1                      Before            After     Delta
blend_v_w2_8bpc_c:       216.3 ( 1.00x)   214.4 ( 1.00x)   -0.88%
blend_v_w2_8bpc_rvv:     144.0 ( 1.50x)   143.6 ( 1.49x)   -0.28%
blend_v_w4_8bpc_c:       919.8 ( 1.00x)   918.1 ( 1.00x)   -0.18%
blend_v_w4_8bpc_rvv:     236.6 ( 3.89x)   236.4 ( 3.88x)   -0.08%
blend_v_w8_8bpc_c:      1739.3 ( 1.00x)  1736.8 ( 1.00x)   -0.14%
blend_v_w8_8bpc_rvv:     236.8 ( 7.34x)   236.3 ( 7.35x)   -0.21%
blend_v_w16_8bpc_c:     3374.7 ( 1.00x)  3374.9 ( 1.00x)    0.01%
blend_v_w16_8bpc_rvv:    297.0 (11.36x)   296.8 (11.37x)   -0.07%
blend_v_w32_8bpc_c:     6647.5 ( 1.00x)  6645.5 ( 1.00x)   -0.03%
blend_v_w32_8bpc_rvv:    403.3 (16.48x)   402.4 (16.51x)   -0.22%
blend_v_w2_16bpc_c:      221.4 ( 1.00x)   220.1 ( 1.00x)   -0.59%
blend_v_w2_16bpc_rvv:    146.3 ( 1.51x)   147.3 ( 1.49x)    0.68%
blend_v_w4_16bpc_c:      973.3 ( 1.00x)   972.7 ( 1.00x)   -0.06%
blend_v_w4_16bpc_rvv:    280.3 ( 3.47x)   282.1 ( 3.45x)    0.64%
blend_v_w8_16bpc_c:     1814.8 ( 1.00x)  1816.2 ( 1.00x)    0.08%
blend_v_w8_16bpc_rvv:    376.6 ( 4.82x)   376.9 ( 4.82x)    0.08%
blend_v_w16_16bpc_c:    3485.5 ( 1.00x)  3485.5 ( 1.00x)    0.00%
blend_v_w16_16bpc_rvv:   531.1 ( 6.56x)   525.6 ( 6.63x)   -1.04%
blend_v_w32_16bpc_c:    6788.3 ( 1.00x)  6778.8 ( 1.00x)   -0.14%
blend_v_w32_16bpc_rvv:   904.5 ( 7.51x)   854.6 ( 7.93x)   -5.52%
-
- Nov 04, 2024
-
-
Nathan E. Egge authored
Kendryte K230                    Before            After     Delta
blend_v_w2_16bpc_c:      225.8 ( 1.00x)   225.7 ( 1.00x)   -0.04%
blend_v_w2_16bpc_rvv:    194.7 ( 1.16x)   148.6 ( 1.52x)  -23.68%
blend_v_w4_16bpc_c:     1011.3 ( 1.00x)  1005.8 ( 1.00x)   -0.54%
blend_v_w4_16bpc_rvv:    387.2 ( 2.61x)   305.4 ( 3.29x)  -21.13%
blend_v_w8_16bpc_c:     1878.5 ( 1.00x)  1872.7 ( 1.00x)   -0.31%
blend_v_w8_16bpc_rvv:    475.3 ( 3.95x)   435.6 ( 4.30x)   -8.35%
blend_v_w16_16bpc_c:    3601.9 ( 1.00x)  3601.6 ( 1.00x)   -0.01%
blend_v_w16_16bpc_rvv:   891.2 ( 4.04x)   892.7 ( 4.03x)    0.17%
blend_v_w32_16bpc_c:    7043.7 ( 1.00x)  7058.8 ( 1.00x)    0.21%
blend_v_w32_16bpc_rvv:  1384.5 ( 5.09x)  1478.0 ( 4.78x)    6.75%

SpacemiT K1                      Before            After     Delta
blend_v_w2_16bpc_c:      222.6 ( 1.00x)   220.5 ( 1.00x)   -0.94%
blend_v_w2_16bpc_rvv:    195.7 ( 1.14x)   146.6 ( 1.50x)  -25.09%
blend_v_w4_16bpc_c:      972.3 ( 1.00x)   972.0 ( 1.00x)   -0.03%
blend_v_w4_16bpc_rvv:    349.1 ( 2.79x)   281.9 ( 3.45x)  -19.25%
blend_v_w8_16bpc_c:     1812.1 ( 1.00x)  1813.0 ( 1.00x)    0.05%
blend_v_w8_16bpc_rvv:    481.5 ( 3.76x)   376.0 ( 4.82x)  -21.91%
blend_v_w16_16bpc_c:    3488.4 ( 1.00x)  3484.6 ( 1.00x)   -0.11%
blend_v_w16_16bpc_rvv:   608.7 ( 5.73x)   523.4 ( 6.66x)  -14.01%
blend_v_w32_16bpc_c:    6795.3 ( 1.00x)  6792.4 ( 1.00x)   -0.04%
blend_v_w32_16bpc_rvv:   934.8 ( 7.27x)   907.3 ( 7.49x)   -2.94%
-
Nathan E. Egge authored
Kendryte K230                    Before            After     Delta
blend_v_w2_16bpc_c:      226.0 ( 1.00x)   226.1 ( 1.00x)    0.04%
blend_v_w2_16bpc_rvv:    194.0 ( 1.16x)   193.9 ( 1.17x)   -0.05%
blend_v_w4_16bpc_c:     1011.8 ( 1.00x)  1009.4 ( 1.00x)   -0.24%
blend_v_w4_16bpc_rvv:    392.7 ( 2.58x)   390.8 ( 2.58x)   -0.48%
blend_v_w8_16bpc_c:     1987.9 ( 1.00x)  1988.0 ( 1.00x)    0.01%
blend_v_w8_16bpc_rvv:    561.5 ( 3.54x)   560.2 ( 3.55x)   -0.23%
blend_v_w16_16bpc_c:    3738.1 ( 1.00x)  3739.1 ( 1.00x)    0.03%
blend_v_w16_16bpc_rvv:   934.1 ( 4.00x)   932.2 ( 4.01x)   -0.20%
blend_v_w32_16bpc_c:    7031.0 ( 1.00x)  7030.1 ( 1.00x)   -0.01%
blend_v_w32_16bpc_rvv:  1403.3 ( 5.01x)  1395.8 ( 5.04x)   -0.53%

SpacemiT K1                      Before            After     Delta
blend_v_w2_16bpc_c:      221.0 ( 1.00x)   221.2 ( 1.00x)    0.09%
blend_v_w2_16bpc_rvv:    195.2 ( 1.13x)   196.0 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:      969.8 ( 1.00x)   971.9 ( 1.00x)    0.22%
blend_v_w4_16bpc_rvv:    348.8 ( 2.78x)   349.1 ( 2.78x)    0.09%
blend_v_w8_16bpc_c:     1812.6 ( 1.00x)  1814.9 ( 1.00x)    0.13%
blend_v_w8_16bpc_rvv:    486.1 ( 3.73x)   484.3 ( 3.75x)   -0.37%
blend_v_w16_16bpc_c:    3483.0 ( 1.00x)  3485.1 ( 1.00x)    0.06%
blend_v_w16_16bpc_rvv:   608.7 ( 5.72x)   607.4 ( 5.74x)   -0.21%
blend_v_w32_16bpc_c:    6791.8 ( 1.00x)  6794.2 ( 1.00x)    0.04%
blend_v_w32_16bpc_rvv:   940.6 ( 7.22x)   942.1 ( 7.21x)    0.16%
-
Nathan E. Egge authored
SpacemiT K1                      Before            After     Delta
blend_v_w2_16bpc_c:      221.5 ( 1.00x)   220.3 ( 1.00x)   -0.54%
blend_v_w2_16bpc_rvv:    193.5 ( 1.14x)   194.3 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:      968.8 ( 1.00x)   967.2 ( 1.00x)   -0.17%
blend_v_w4_16bpc_rvv:    442.2 ( 2.19x)   347.4 ( 2.78x)  -21.44%
blend_v_w8_16bpc_c:     1809.4 ( 1.00x)  1811.2 ( 1.00x)    0.10%
blend_v_w8_16bpc_rvv:    557.4 ( 3.25x)   483.2 ( 3.75x)  -13.31%
blend_v_w16_16bpc_c:    3481.4 ( 1.00x)  3473.4 ( 1.00x)   -0.23%
blend_v_w16_16bpc_rvv:   844.3 ( 4.12x)   603.1 ( 5.76x)  -28.57%
blend_v_w32_16bpc_c:    6783.1 ( 1.00x)  6749.8 ( 1.00x)   -0.49%
blend_v_w32_16bpc_rvv:  1406.1 ( 4.82x)   919.4 ( 7.34x)  -34.61%
-
Nathan E. Egge authored
Kendryte K230
blend_v_w2_16bpc_c:      226.5 ( 1.00x)
blend_v_w2_16bpc_rvv:    192.2 ( 1.18x)
blend_v_w4_16bpc_c:     1010.3 ( 1.00x)
blend_v_w4_16bpc_rvv:    390.5 ( 2.59x)
blend_v_w8_16bpc_c:     1994.2 ( 1.00x)
blend_v_w8_16bpc_rvv:    561.7 ( 3.55x)
blend_v_w16_16bpc_c:    3737.9 ( 1.00x)
blend_v_w16_16bpc_rvv:   928.0 ( 4.03x)
blend_v_w32_16bpc_c:    7064.7 ( 1.00x)
blend_v_w32_16bpc_rvv:  1428.9 ( 4.94x)

SpacemiT K1
blend_v_w2_16bpc_c:      220.8 ( 1.00x)
blend_v_w2_16bpc_rvv:    193.5 ( 1.14x)
blend_v_w4_16bpc_c:      967.3 ( 1.00x)
blend_v_w4_16bpc_rvv:    439.5 ( 2.20x)
blend_v_w8_16bpc_c:     1810.2 ( 1.00x)
blend_v_w8_16bpc_rvv:    555.3 ( 3.26x)
blend_v_w16_16bpc_c:    3476.4 ( 1.00x)
blend_v_w16_16bpc_rvv:   830.9 ( 4.18x)
blend_v_w32_16bpc_c:    6772.9 ( 1.00x)
blend_v_w32_16bpc_rvv:  1356.3 ( 4.99x)
-