- Dec 02, 2024
-
-
Henrik Gramner authored
It previously used 'pixel' which is typedefed to uint8_t in files that aren't bitdepth-templated, but those are indices and not pixels so that was just confusing and misleading.
-
- Nov 28, 2024
-
-
Victorien Le Couviour--Tuffet authored
f->task_thread.error can be set during flushing. Not resetting it can lead to c->task_thread.first being increased after a frame has already been submitted post-flush. That is harmless if it happens on the very first frame, but on any subsequent frame it results in wrong frame ordering: a non-first frame is then treated as the first one, so its tasks cannot execute (they depend on a genuinely earlier frame that is now considered to come after it), and c->task_thread.cur is advanced past that frame with no way of being reset, eventually leading to a hang.
-
- Nov 26, 2024
-
-
- Nov 21, 2024
-
-
This makes it possible to immediately detect unintended out-of-bounds writes like the ones fixed in 72b53807 and 1c7433a5. Extend the PIXEL_RECT macro to provide a variable containing the full, padded height of the buffer, for uses that operate on the full buffer. Allow overwriting past the right edge of the target output rectangle, up to an alignment of 64 pixels, but allow no overwrite past the bottom.
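The overwrite rule described above can be sketched in C as follows. This is only an illustration, not the checkasm implementation: GUARD_BYTE, guard_fill and guard_check are hypothetical names, and the sketch assumes the buffer starts at the top-left corner of the output rectangle (no padding above or to the left).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_BYTE 0xAA /* hypothetical fill value for the padding */
#define ALIGN64(x) (((size_t)(x) + 63) & ~(size_t)63)

/* Fill the whole padded buffer with the guard value before calling the
 * function under test. */
static void guard_fill(uint8_t *buf, size_t stride, int full_h) {
    memset(buf, GUARD_BYTE, stride * (size_t)full_h);
}

/* Return 1 if the padding is intact, 0 otherwise. Inside the h output rows,
 * writes are tolerated in columns [w, ALIGN64(w)); every byte past that
 * column, and every byte in the rows below the rectangle, must still hold
 * the guard value. */
static int guard_check(const uint8_t *buf, size_t stride, int full_h,
                       int w, int h) {
    const size_t allowed_w = ALIGN64(w);
    for (int y = 0; y < full_h; y++) {
        const size_t first = (y < h) ? allowed_w : 0;
        for (size_t x = first; x < stride; x++)
            if (buf[(size_t)y * stride + x] != GUARD_BYTE)
                return 0;
    }
    return 1;
}
```

A limitation of this kind of check is that a stray write which happens to store the guard value goes unnoticed.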
-
-
- Nov 19, 2024
-
-
Martin Storsjö authored
Switch to the same cache-friendly algorithm as was done for arm64 in c121b831.

This uses much less stack memory, and is much more cache friendly. In this form, most of the individual asm functions only operate on one single row of data at a time. Some of the functions used to be unrolled to operate on two rows at a time, while they now only operate on one at a time. In practice, this is still a large performance win, as data is accessed in a much more cache friendly manner.

This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB.

Before:              Cortex A7          A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7    748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0    732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9    527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2    529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4   1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5   1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:    836138.7    635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4    596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7    443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7    400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7    925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6    914395.7   873342.4   574815.7   554681.6

With this change in place, dav1d can run with around 72 KB of stack on arm targets.

Not all functions have been merged in the same way as they were for arm64 in c121b831, so some minor differences remain; it's possible to incrementally optimize this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1, and make a version of finish_filter_row1 that produces 2 rows, like is done for arm64.

It's also possible to rewrite the logic for calculating sgr_x_by_x in the same way as was done for arm64 in 79db1624.
-
- Nov 18, 2024
-
-
Martin Storsjö authored
This applies the same simplifications that were done for the C code and the x86 assembly in 4613d3a5, and the arm64 assembly in ce80e6da, to the arm32 implementation.

This gives a minor speedup of around a couple percent.

Before:              Cortex A7          A8        A53        A72        A73
sgr_3x3_8bpc_neon:    926600.0    753468.3   553704.1   399379.1   369674.4
sgr_5x5_8bpc_neon:    621722.9    540412.7   357275.9   274474.3   254996.0
sgr_mix_8bpc_neon:   1529715.1   1171282.5   894982.9   659996.6   610407.2
After:
sgr_3x3_8bpc_neon:    899020.3    697278.6   541569.9   382824.3   353891.8
sgr_5x5_8bpc_neon:    602183.2    498322.9   348974.5   264833.9   243837.7
sgr_mix_8bpc_neon:   1497870.8   1182121.3   880470.9   635939.3   590909.3
-
Martin Storsjö authored
-
Martin Storsjö authored
After processing one block, this accidentally jumped to the loop for processing two lines at once. The same bug was replicated in both 32 and 64 bit versions.
-
Martin Storsjö authored
This reduces the stack usage of these functions (the C version) significantly, and gives them a 15-40% speedup (on an Apple M3, with Xcode Clang 16).

The C version of these functions does matter; even though we have assembly implementations on x86 and aarch64, those only cover the 8 and 10 bpc cases, while the C version is used as fallback for 12 bpc.

This matches how these functions are implemented in the aarch64 assembly: operate over a window of 3 or 5 lines (of 384 pixels each), instead of doing a full 384 x 64 block. The individual functions for filtering a line each end up much simpler, and closer to how this can be implemented in assembly, but the overall business logic ends up much more complex.

The main difference to the aarch64 assembly implementation is that any buffer which is of int16_t size in the aarch64 assembly implementation uses the type "coef" here, which is 32 bit in the 10/12 bpc cases. (This is required for handling the 12 bpc cases.)

With this in place, dav1d can run with around 66 KB of stack on x86_64 with assembly enabled, with around 74 KB of stack on aarch64 with assembly enabled, and with 118 KB of stack with assembly disabled.

This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16).

On 32 bit arm, dav1d still requires around 270 KB of stack, as that assembly implementation of the SGR filter uses a different algorithm.
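To illustrate the windowing idea only (this is not the SGR filter itself), here is a minimal C sketch of a 3x3 box sum that keeps just three rows of intermediate sums alive and rotates them as it moves down the image. The names hsum3_row, box3_sum and MAX_W are hypothetical, and edge replication is an assumption made to keep the example short.

```c
#include <stdint.h>
#include <string.h>

enum { MAX_W = 384 }; /* row width taken from the commit message */

/* Horizontal 3-tap sum of one row, replicating the edge pixels. */
static void hsum3_row(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = src[x > 0 ? x - 1 : 0];
        const int r = src[x < w - 1 ? x + 1 : w - 1];
        dst[x] = l + src[x] + r;
    }
}

/* 3x3 box sum of a w x h image (w <= MAX_W, tightly packed rows).
 * Only three rows of intermediate sums exist at any time, instead of an
 * intermediate buffer covering the whole block. Rows outside the image
 * are replicated from the nearest edge row. */
static void box3_sum(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t rows[3][MAX_W];
    int32_t *win[3] = { rows[0], rows[1], rows[2] };

    /* Prime the window for output row 0: rows -1 (replicated), 0 and 1. */
    hsum3_row(win[0], src, w);
    memcpy(win[1], win[0], sizeof(int32_t) * (size_t)w);
    hsum3_row(win[2], src + (h > 1 ? w : 0), w);

    for (int y = 0; y < h; y++) {
        /* Vertical pass over the three rows currently in the window. */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = win[0][x] + win[1][x] + win[2][x];

        /* Rotate the window: reuse the oldest row for input row y + 2. */
        int32_t *const oldest = win[0];
        win[0] = win[1];
        win[1] = win[2];
        win[2] = oldest;
        const int next = y + 2 < h ? y + 2 : h - 1;
        hsum3_row(win[2], src + next * w, w);
    }
}
```

The per-row helper stays trivial, while the bookkeeping (priming, rotation, edge handling) moves into the caller, which mirrors the trade-off the commit message describes.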
-
Martin Storsjö authored
As the machine specific init file is included in the common template, give symbols and defines unique names that won't clash with similar ones in the main template.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Nov 16, 2024
-
-
Marvin Scholz authored
This is not an object so putting it in the objects variable seems wrong and would also break using gaspp for that file.
-
- Nov 15, 2024
-
-
Maryla Ustarroz authored
The '///<' syntax is used to document a field after the field. Mistakenly using it before the field results in the documentation going to the wrong field, see: https://videolan.videolan.me/dav1d/structDav1dMasteringDisplay.html
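As a short illustration of the two Doxygen comment placements, consider the following struct. ExampleDisplay and its fields are hypothetical and not the actual dav1d definition.

```c
#include <stdint.h>

typedef struct ExampleDisplay {
    /// Placed BEFORE the field: a plain '///' comment documents the
    /// declaration that follows it.
    uint32_t max_luminance;

    uint32_t min_luminance; ///< Placed AFTER the field: '///<' documents the declaration that precedes it.
} ExampleDisplay;
```

Putting a '///<' comment on the line above a field would attach the text to whatever member was declared before it, which is the mistake the commit fixes.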
-
Martin Storsjö authored
When renumbering argument registers in 1648c232, this one register reference was missed. The missed register was meant to compare h with 2, but accidentally ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's actually no bitdepth_max parameter, so it ended up comparing an uninitialized value.
-
- Nov 14, 2024
-
-
On really old libc versions, getauxval isn't available. Fall back on /proc/cpuinfo in those cases, just like we do on Android.
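A minimal sketch of such a fallback, under stated assumptions: HAVE_GETAUXVAL stands in for whatever define the build system provides, cpu_has_feature and cpuinfo_text_has_flag are hypothetical helpers, and the parser only looks at "Features" lines within the first 4 KB of /proc/cpuinfo. It is not the dav1d implementation.

```c
#include <stdio.h>
#include <string.h>

#ifdef HAVE_GETAUXVAL /* assumed build-system define */
#include <sys/auxv.h>
#endif

/* Return 1 if `flag` appears as a whole word in [p, end). */
static int line_has_flag(const char *p, const char *end, const char *flag) {
    const size_t n = strlen(flag);
    while (p < end) {
        while (p < end && (*p == ' ' || *p == '\t')) p++;
        const char *tok = p;
        while (p < end && *p != ' ' && *p != '\t') p++;
        if (n && (size_t)(p - tok) == n && !memcmp(tok, flag, n))
            return 1;
    }
    return 0;
}

/* Return 1 if any "Features" line of cpuinfo-style text lists `flag`. */
static int cpuinfo_text_has_flag(const char *text, const char *flag) {
    for (const char *line = text; *line; ) {
        const char *eol = strchr(line, '\n');
        const char *end = eol ? eol : line + strlen(line);
        if (!strncmp(line, "Features", 8)) {
            const char *colon = memchr(line, ':', (size_t)(end - line));
            if (colon && line_has_flag(colon + 1, end, flag))
                return 1;
        }
        line = eol ? eol + 1 : end;
    }
    return 0;
}

/* Prefer getauxval() when the libc has it, else parse /proc/cpuinfo. */
static int cpu_has_feature(unsigned long hwcap_mask, const char *flag) {
#ifdef HAVE_GETAUXVAL
    (void)flag;
    return !!(getauxval(AT_HWCAP) & hwcap_mask);
#else
    (void)hwcap_mask;
    char buf[4096];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return 0;
    const size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cpuinfo_text_has_flag(buf, flag);
#endif
}
```

Matching whole words matters here: a substring search for "vfp" would also match "vfpv3".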
-
For individual tests in dav1d-test-data, the default timeout is 30 seconds (which is the Meson default if nothing is specified). Previously it ran with a multiplier of 4, resulting in a total timeout of 120 seconds. When running tests in QEMU, exceeding this 120 second timeout could happen occasionally.

Raise the multiplier to 10, allowing each individual job to run for up to 5 minutes. This should hopefully reduce the number of stray failures in the CI.

For tests that already have a higher default timeout set, such as checkasm which has a 180 second default timeout, this results in a much longer timeout period (30 minutes in that case). However, as long as we don't frequently see issues where these actually hang, it should be beneficial to just let them run to completion, rather than aborting early due to a tight timeout.
-
Martin Storsjö authored
Also fix one case where the 32 bit input parameter w (which was in x6, now in x4) was used without zero extension, by referring to it as w4 instead.
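The effect can be modeled in plain C. On AArch64, the upper 32 bits of a register carrying a 32 bit argument are not guaranteed to be zero, so reading the full x register may pick up stale bits while reading the w view uses only the low half. The helper names below are hypothetical, and the 64 bit integer merely stands in for the register.

```c
#include <stdint.h>

/* Model of reading the full 64 bit register (like using x4 directly):
 * any stale upper bits become part of the value. */
static uint64_t read_full_register(uint64_t reg) {
    return reg;
}

/* Model of reading the 32 bit view (like using w4): only the low 32 bits
 * are used, and the result is zero-extended. */
static uint64_t read_low_half(uint64_t reg) {
    return (uint32_t)reg;
}
```

With a register value of 0xdeadbeef00000010, the first helper yields a huge number while the second yields 16, which is the width the caller actually passed.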
-
- Nov 13, 2024
-
-
Martin Storsjö authored
This applies the same simplifications that were done for the C code and the x86 assembly in 4613d3a5, to the arm64 implementation.

This gives a minor speedup of around a couple percent.

Before:             Cortex A53        A55        A72        A73        A76   Apple M3
sgr_3x3_8bpc_neon:    368583.2   363654.2   279958.1   272065.1   169353.3      354.6
sgr_5x5_8bpc_neon:    258570.7   255018.5   200410.6   199478.3   117968.3      260.9
sgr_mix_8bpc_neon:    603698.1   577383.3   482468.3   436540.4   256632.9      541.8
After:
sgr_3x3_8bpc_neon:    367873.2   357884.1   275462.4   268363.9   165909.8      346.0
sgr_5x5_8bpc_neon:    254988.4   248184.2   190875.1   196939.1   120517.2      252.1
sgr_mix_8bpc_neon:    589204.7   563565.8   414025.6   427702.2   251651.2      533.4
-
Martin Storsjö authored
-
- Nov 10, 2024
-
-
Luca Barbato authored
-
Luca Barbato authored
-
Luca Barbato authored
They can be used across arches.
-
Luca Barbato authored
It makes the code tidier and the runtime is not slow.
-
Luca Barbato authored
blend_h_w2_8bpc_pwr9:      18.4 ( 1.20x)
blend_h_w4_8bpc_pwr9:      27.2 ( 1.26x)
blend_h_w8_8bpc_pwr9:      27.9 ( 2.22x)
blend_h_w16_8bpc_pwr9:     35.1 ( 3.28x)
blend_h_w32_8bpc_pwr9:     57.4 ( 3.88x)
blend_h_w64_8bpc_pwr9:     97.9 ( 4.70x)
blend_h_w128_8bpc_pwr9:   207.6 ( 5.18x)
-
Luca Barbato authored
blend_v_w2_8bpc_pwr9:      25.0 ( 1.12x)
blend_v_w4_8bpc_pwr9:      79.3 ( 1.35x)
blend_v_w8_8bpc_pwr9:      79.5 ( 2.43x)
blend_v_w16_8bpc_pwr9:    108.0 ( 3.58x)
blend_v_w32_8bpc_pwr9:    153.5 ( 4.69x)
-
Luca Barbato authored
blend_w4_8bpc_pwr9:        14.4 ( 1.90x)
blend_w8_8bpc_pwr9:        19.9 ( 3.62x)
blend_w16_8bpc_pwr9:       50.6 ( 5.17x)
blend_w32_8bpc_pwr9:      125.8 ( 5.33x)
-
- Nov 05, 2024
-
-
-
Nathan E. Egge authored
Setting VL for this function only impacts the 16bpc performance and only on the SpacemiT K1 which has two vector units of length 128b each.

Kendryte K230                   Before            After    Delta
blend_v_w2_8bpc_c:        220.0 ( 1.00x)   221.3 ( 1.00x)    0.59%
blend_v_w2_8bpc_rvv:      145.7 ( 1.51x)   148.2 ( 1.49x)    1.72%
blend_v_w4_8bpc_c:        942.1 ( 1.00x)   943.7 ( 1.00x)    0.17%
blend_v_w4_8bpc_rvv:      240.4 ( 3.92x)   242.9 ( 3.89x)    1.04%
blend_v_w8_8bpc_c:       1782.3 ( 1.00x)  1783.8 ( 1.00x)    0.08%
blend_v_w8_8bpc_rvv:      252.6 ( 7.06x)   254.9 ( 7.00x)    0.91%
blend_v_w16_8bpc_c:      3650.9 ( 1.00x)  3647.0 ( 1.00x)   -0.11%
blend_v_w16_8bpc_rvv:     495.5 ( 7.37x)   494.4 ( 7.38x)   -0.22%
blend_v_w32_8bpc_c:      7013.0 ( 1.00x)  7018.2 ( 1.00x)    0.07%
blend_v_w32_8bpc_rvv:     807.9 ( 8.68x)   802.0 ( 8.75x)   -0.73%
blend_v_w2_16bpc_c:       226.1 ( 1.00x)   225.5 ( 1.00x)   -0.27%
blend_v_w2_16bpc_rvv:     148.6 ( 1.52x)   148.9 ( 1.51x)    0.20%
blend_v_w4_16bpc_c:      1010.7 ( 1.00x)  1006.7 ( 1.00x)   -0.40%
blend_v_w4_16bpc_rvv:     306.7 ( 3.30x)   307.4 ( 3.27x)    0.23%
blend_v_w8_16bpc_c:      1990.2 ( 1.00x)  1996.1 ( 1.00x)    0.30%
blend_v_w8_16bpc_rvv:     519.5 ( 3.83x)   523.4 ( 3.81x)    0.75%
blend_v_w16_16bpc_c:     3744.5 ( 1.00x)  3742.4 ( 1.00x)   -0.06%
blend_v_w16_16bpc_rvv:    899.6 ( 4.16x)   906.4 ( 4.13x)    0.76%
blend_v_w32_16bpc_c:     7047.5 ( 1.00x)  7079.3 ( 1.00x)    0.45%
blend_v_w32_16bpc_rvv:   1475.5 ( 4.78x)  1483.3 ( 4.77x)    0.53%

SpacemiT K1                     Before            After    Delta
blend_v_w2_8bpc_c:        216.3 ( 1.00x)   214.4 ( 1.00x)   -0.88%
blend_v_w2_8bpc_rvv:      144.0 ( 1.50x)   143.6 ( 1.49x)   -0.28%
blend_v_w4_8bpc_c:        919.8 ( 1.00x)   918.1 ( 1.00x)   -0.18%
blend_v_w4_8bpc_rvv:      236.6 ( 3.89x)   236.4 ( 3.88x)   -0.08%
blend_v_w8_8bpc_c:       1739.3 ( 1.00x)  1736.8 ( 1.00x)   -0.14%
blend_v_w8_8bpc_rvv:      236.8 ( 7.34x)   236.3 ( 7.35x)   -0.21%
blend_v_w16_8bpc_c:      3374.7 ( 1.00x)  3374.9 ( 1.00x)    0.01%
blend_v_w16_8bpc_rvv:     297.0 (11.36x)   296.8 (11.37x)   -0.07%
blend_v_w32_8bpc_c:      6647.5 ( 1.00x)  6645.5 ( 1.00x)   -0.03%
blend_v_w32_8bpc_rvv:     403.3 (16.48x)   402.4 (16.51x)   -0.22%
blend_v_w2_16bpc_c:       221.4 ( 1.00x)   220.1 ( 1.00x)   -0.59%
blend_v_w2_16bpc_rvv:     146.3 ( 1.51x)   147.3 ( 1.49x)    0.68%
blend_v_w4_16bpc_c:       973.3 ( 1.00x)   972.7 ( 1.00x)   -0.06%
blend_v_w4_16bpc_rvv:     280.3 ( 3.47x)   282.1 ( 3.45x)    0.64%
blend_v_w8_16bpc_c:      1814.8 ( 1.00x)  1816.2 ( 1.00x)    0.08%
blend_v_w8_16bpc_rvv:     376.6 ( 4.82x)   376.9 ( 4.82x)    0.08%
blend_v_w16_16bpc_c:     3485.5 ( 1.00x)  3485.5 ( 1.00x)    0.00%
blend_v_w16_16bpc_rvv:    531.1 ( 6.56x)   525.6 ( 6.63x)   -1.04%
blend_v_w32_16bpc_c:     6788.3 ( 1.00x)  6778.8 ( 1.00x)   -0.14%
blend_v_w32_16bpc_rvv:    904.5 ( 7.51x)   854.6 ( 7.93x)   -5.52%
-
- Nov 04, 2024
-
-
Nathan E. Egge authored
Kendryte K230                   Before            After    Delta
blend_v_w2_16bpc_c:       225.8 ( 1.00x)   225.7 ( 1.00x)   -0.04%
blend_v_w2_16bpc_rvv:     194.7 ( 1.16x)   148.6 ( 1.52x)  -23.68%
blend_v_w4_16bpc_c:      1011.3 ( 1.00x)  1005.8 ( 1.00x)   -0.54%
blend_v_w4_16bpc_rvv:     387.2 ( 2.61x)   305.4 ( 3.29x)  -21.13%
blend_v_w8_16bpc_c:      1878.5 ( 1.00x)  1872.7 ( 1.00x)   -0.31%
blend_v_w8_16bpc_rvv:     475.3 ( 3.95x)   435.6 ( 4.30x)   -8.35%
blend_v_w16_16bpc_c:     3601.9 ( 1.00x)  3601.6 ( 1.00x)   -0.01%
blend_v_w16_16bpc_rvv:    891.2 ( 4.04x)   892.7 ( 4.03x)    0.17%
blend_v_w32_16bpc_c:     7043.7 ( 1.00x)  7058.8 ( 1.00x)    0.21%
blend_v_w32_16bpc_rvv:   1384.5 ( 5.09x)  1478.0 ( 4.78x)    6.75%

SpacemiT K1                     Before            After    Delta
blend_v_w2_16bpc_c:       222.6 ( 1.00x)   220.5 ( 1.00x)   -0.94%
blend_v_w2_16bpc_rvv:     195.7 ( 1.14x)   146.6 ( 1.50x)  -25.09%
blend_v_w4_16bpc_c:       972.3 ( 1.00x)   972.0 ( 1.00x)   -0.03%
blend_v_w4_16bpc_rvv:     349.1 ( 2.79x)   281.9 ( 3.45x)  -19.25%
blend_v_w8_16bpc_c:      1812.1 ( 1.00x)  1813.0 ( 1.00x)    0.05%
blend_v_w8_16bpc_rvv:     481.5 ( 3.76x)   376.0 ( 4.82x)  -21.91%
blend_v_w16_16bpc_c:     3488.4 ( 1.00x)  3484.6 ( 1.00x)   -0.11%
blend_v_w16_16bpc_rvv:    608.7 ( 5.73x)   523.4 ( 6.66x)  -14.01%
blend_v_w32_16bpc_c:     6795.3 ( 1.00x)  6792.4 ( 1.00x)   -0.04%
blend_v_w32_16bpc_rvv:    934.8 ( 7.27x)   907.3 ( 7.49x)   -2.94%
-
Nathan E. Egge authored
Kendryte K230                   Before            After    Delta
blend_v_w2_16bpc_c:       226.0 ( 1.00x)   226.1 ( 1.00x)    0.04%
blend_v_w2_16bpc_rvv:     194.0 ( 1.16x)   193.9 ( 1.17x)   -0.05%
blend_v_w4_16bpc_c:      1011.8 ( 1.00x)  1009.4 ( 1.00x)   -0.24%
blend_v_w4_16bpc_rvv:     392.7 ( 2.58x)   390.8 ( 2.58x)   -0.48%
blend_v_w8_16bpc_c:      1987.9 ( 1.00x)  1988.0 ( 1.00x)    0.01%
blend_v_w8_16bpc_rvv:     561.5 ( 3.54x)   560.2 ( 3.55x)   -0.23%
blend_v_w16_16bpc_c:     3738.1 ( 1.00x)  3739.1 ( 1.00x)    0.03%
blend_v_w16_16bpc_rvv:    934.1 ( 4.00x)   932.2 ( 4.01x)   -0.20%
blend_v_w32_16bpc_c:     7031.0 ( 1.00x)  7030.1 ( 1.00x)   -0.01%
blend_v_w32_16bpc_rvv:   1403.3 ( 5.01x)  1395.8 ( 5.04x)   -0.53%

SpacemiT K1                     Before            After    Delta
blend_v_w2_16bpc_c:       221.0 ( 1.00x)   221.2 ( 1.00x)    0.09%
blend_v_w2_16bpc_rvv:     195.2 ( 1.13x)   196.0 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:       969.8 ( 1.00x)   971.9 ( 1.00x)    0.22%
blend_v_w4_16bpc_rvv:     348.8 ( 2.78x)   349.1 ( 2.78x)    0.09%
blend_v_w8_16bpc_c:      1812.6 ( 1.00x)  1814.9 ( 1.00x)    0.13%
blend_v_w8_16bpc_rvv:     486.1 ( 3.73x)   484.3 ( 3.75x)   -0.37%
blend_v_w16_16bpc_c:     3483.0 ( 1.00x)  3485.1 ( 1.00x)    0.06%
blend_v_w16_16bpc_rvv:    608.7 ( 5.72x)   607.4 ( 5.74x)   -0.21%
blend_v_w32_16bpc_c:     6791.8 ( 1.00x)  6794.2 ( 1.00x)    0.04%
blend_v_w32_16bpc_rvv:    940.6 ( 7.22x)   942.1 ( 7.21x)    0.16%
-
Nathan E. Egge authored
SpacemiT K1                     Before            After    Delta
blend_v_w2_16bpc_c:       221.5 ( 1.00x)   220.3 ( 1.00x)   -0.54%
blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)   194.3 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:       968.8 ( 1.00x)   967.2 ( 1.00x)   -0.17%
blend_v_w4_16bpc_rvv:     442.2 ( 2.19x)   347.4 ( 2.78x)  -21.44%
blend_v_w8_16bpc_c:      1809.4 ( 1.00x)  1811.2 ( 1.00x)    0.10%
blend_v_w8_16bpc_rvv:     557.4 ( 3.25x)   483.2 ( 3.75x)  -13.31%
blend_v_w16_16bpc_c:     3481.4 ( 1.00x)  3473.4 ( 1.00x)   -0.23%
blend_v_w16_16bpc_rvv:    844.3 ( 4.12x)   603.1 ( 5.76x)  -28.57%
blend_v_w32_16bpc_c:     6783.1 ( 1.00x)  6749.8 ( 1.00x)   -0.49%
blend_v_w32_16bpc_rvv:   1406.1 ( 4.82x)   919.4 ( 7.34x)  -34.61%
-
Nathan E. Egge authored
Kendryte K230
blend_v_w2_16bpc_c:       226.5 ( 1.00x)
blend_v_w2_16bpc_rvv:     192.2 ( 1.18x)
blend_v_w4_16bpc_c:      1010.3 ( 1.00x)
blend_v_w4_16bpc_rvv:     390.5 ( 2.59x)
blend_v_w8_16bpc_c:      1994.2 ( 1.00x)
blend_v_w8_16bpc_rvv:     561.7 ( 3.55x)
blend_v_w16_16bpc_c:     3737.9 ( 1.00x)
blend_v_w16_16bpc_rvv:    928.0 ( 4.03x)
blend_v_w32_16bpc_c:     7064.7 ( 1.00x)
blend_v_w32_16bpc_rvv:   1428.9 ( 4.94x)

SpacemiT K1
blend_v_w2_16bpc_c:       220.8 ( 1.00x)
blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)
blend_v_w4_16bpc_c:       967.3 ( 1.00x)
blend_v_w4_16bpc_rvv:     439.5 ( 2.20x)
blend_v_w8_16bpc_c:      1810.2 ( 1.00x)
blend_v_w8_16bpc_rvv:     555.3 ( 3.26x)
blend_v_w16_16bpc_c:     3476.4 ( 1.00x)
blend_v_w16_16bpc_rvv:    830.9 ( 4.18x)
blend_v_w32_16bpc_c:     6772.9 ( 1.00x)
blend_v_w32_16bpc_rvv:   1356.3 ( 4.99x)
-
- Oct 31, 2024
-
-
Nathan E. Egge authored
Kendryte K230                 Before            After    Delta
blend_w4_16bpc_c:       210.0 ( 1.00x)   208.9 ( 1.00x)   -0.52%
blend_w4_16bpc_rvv:      88.5 ( 2.37x)    66.2 ( 3.15x)  -25.20%
blend_w8_16bpc_c:       614.1 ( 1.00x)   613.5 ( 1.00x)   -0.10%
blend_w8_16bpc_rvv:     143.1 ( 4.29x)   126.9 ( 4.83x)  -11.32%
blend_w16_16bpc_c:     2371.2 ( 1.00x)  2371.3 ( 1.00x)    0.00%
blend_w16_16bpc_rvv:    461.1 ( 5.14x)   413.2 ( 5.74x)  -10.39%
blend_w32_16bpc_c:     5998.4 ( 1.00x)  5998.4 ( 1.00x)    0.00%
blend_w32_16bpc_rvv:    978.4 ( 6.13x)  1013.1 ( 5.92x)    3.55%

SpacemiT K1                   Before            After    Delta
blend_w4_16bpc_c:       205.8 ( 1.00x)   205.9 ( 1.00x)    0.05%
blend_w4_16bpc_rvv:      80.9 ( 2.54x)    64.9 ( 3.17x)  -19.78%
blend_w8_16bpc_c:       599.9 ( 1.00x)   599.9 ( 1.00x)    0.00%
blend_w8_16bpc_rvv:     134.4 ( 4.46x)   101.9 ( 5.89x)  -24.18%
blend_w16_16bpc_c:     2316.5 ( 1.00x)  2316.5 ( 1.00x)    0.00%
blend_w16_16bpc_rvv:    302.0 ( 7.67x)   262.8 ( 8.81x)  -12.98%
blend_w32_16bpc_c:     5861.9 ( 1.00x)  5861.4 ( 1.00x)   -0.01%
blend_w32_16bpc_rvv:    589.6 ( 9.94x)   602.2 ( 9.73x)    2.14%
-
Nathan E. Egge authored
Kendryte K230                 Before            After    Delta
blend_w4_16bpc_c:       208.8 ( 1.00x)   209.9 ( 1.00x)    0.53%
blend_w4_16bpc_rvv:      85.9 ( 2.43x)    88.6 ( 2.37x)    3.14%
blend_w8_16bpc_c:       613.2 ( 1.00x)   614.3 ( 1.00x)    0.18%
blend_w8_16bpc_rvv:     145.4 ( 4.22x)   143.1 ( 4.29x)   -1.58%
blend_w16_16bpc_c:     2371.9 ( 1.00x)  2373.6 ( 1.00x)    0.07%
blend_w16_16bpc_rvv:    464.0 ( 5.11x)   461.2 ( 5.15x)   -0.60%
blend_w32_16bpc_c:     6005.6 ( 1.00x)  6007.7 ( 1.00x)    0.03%
blend_w32_16bpc_rvv:    981.6 ( 6.12x)   979.4 ( 6.13x)   -0.22%

SpacemiT K1                   Before            After    Delta
blend_w4_16bpc_c:       206.4 ( 1.00x)   205.7 ( 1.00x)   -0.34%
blend_w4_16bpc_rvv:      79.5 ( 2.60x)    81.0 ( 2.54x)    1.89%
blend_w8_16bpc_c:       600.7 ( 1.00x)   599.7 ( 1.00x)   -0.17%
blend_w8_16bpc_rvv:     133.3 ( 4.51x)   134.1 ( 4.47x)    0.60%
blend_w16_16bpc_c:     2315.9 ( 1.00x)  2315.2 ( 1.00x)   -0.03%
blend_w16_16bpc_rvv:    305.2 ( 7.59x)   300.7 ( 7.70x)   -1.47%
blend_w32_16bpc_c:     5861.1 ( 1.00x)  5860.2 ( 1.00x)   -0.02%
blend_w32_16bpc_rvv:    592.5 ( 9.89x)   589.5 ( 9.94x)   -0.51%
-
Nathan E. Egge authored
SpacemiT K1                   Before            After    Delta
blend_w4_16bpc_c:       206.8 ( 1.00x)   206.0 ( 1.00x)   -0.39%
blend_w4_16bpc_rvv:      95.8 ( 2.16x)    77.8 ( 2.65x)  -18.79%
blend_w8_16bpc_c:       600.4 ( 1.00x)   600.1 ( 1.00x)   -0.05%
blend_w8_16bpc_rvv:     161.7 ( 3.71x)   131.3 ( 4.57x)  -18.80%
blend_w16_16bpc_c:     2317.6 ( 1.00x)  2316.5 ( 1.00x)   -0.05%
blend_w16_16bpc_rvv:    459.6 ( 5.04x)   302.9 ( 7.65x)  -34.09%
blend_w32_16bpc_c:     5863.0 ( 1.00x)  5863.3 ( 1.00x)    0.01%
blend_w32_16bpc_rvv:    992.7 ( 5.91x)   578.1 (10.14x)  -41.76%
-
- Oct 29, 2024
-
-
Nathan E. Egge authored
The cdef.S, itx.S and mc.S files contain only 8bpc implementations and should be compiled only when building with a -Dbitdepths=8 configuration.
-
Kendryte K230
blend_w4_16bpc_c:       214.4 ( 1.00x)
blend_w4_16bpc_rvv:      90.2 ( 2.38x)
blend_w8_16bpc_c:       618.9 ( 1.00x)
blend_w8_16bpc_rvv:     147.4 ( 4.20x)
blend_w16_16bpc_c:     2376.5 ( 1.00x)
blend_w16_16bpc_rvv:    466.0 ( 5.10x)
blend_w32_16bpc_c:     6008.6 ( 1.00x)
blend_w32_16bpc_rvv:    985.0 ( 6.10x)

SpacemiT K1
blend_w4_16bpc_c:       204.9 ( 1.00x)
blend_w4_16bpc_rvv:      88.3 ( 2.32x)
blend_w8_16bpc_c:       598.5 ( 1.00x)
blend_w8_16bpc_rvv:     155.3 ( 3.85x)
blend_w16_16bpc_c:     2315.4 ( 1.00x)
blend_w16_16bpc_rvv:    444.4 ( 5.21x)
blend_w32_16bpc_c:     5860.1 ( 1.00x)
blend_w32_16bpc_rvv:    993.0 ( 5.90x)
-