Commits (92)
Michail Alvanos authored
857232e4 -
James Almer authored
Based on a patch by Renato Cassaca.
9a9c0c7e -
Martin Storsjö authored
2ef970a8 -
B Krishnan Iyer authored
blend/blend_h/blend_v:

Before: Cortex A7 A8 A9 A53 A72 A73
blend_h_w2_8bpc_neon: 169.5 194.2 153.1 134.0 63.0 72.6
blend_h_w4_8bpc_neon: 164.4 171.8 142.2 137.8 60.5 60.2
blend_h_w8_8bpc_neon: 184.8 121.0 146.5 123.4 55.9 63.1
blend_h_w16_8bpc_neon: 291.0 178.6 237.3 181.0 88.6 83.9
blend_h_w32_8bpc_neon: 531.9 321.5 432.2 358.3 155.6 156.2
blend_h_w64_8bpc_neon: 957.6 600.3 827.4 631.2 279.7 268.4
blend_h_w128_8bpc_neon: 2161.5 1398.4 1931.8 1403.4 607.0 597.9
blend_v_w2_8bpc_neon: 249.3 373.4 269.2 195.6 107.9 117.6
blend_v_w4_8bpc_neon: 451.7 676.1 555.3 376.1 198.6 266.9
blend_v_w8_8bpc_neon: 561.0 475.2 607.6 357.0 213.9 204.1
blend_v_w16_8bpc_neon: 928.4 626.8 823.8 592.3 269.9 245.3
blend_v_w32_8bpc_neon: 1477.6 1024.8 1186.6 994.5 346.6 370.0
blend_w4_8bpc_neon: 103.3 113.0 86.2 91.5 38.6 35.2
blend_w8_8bpc_neon: 174.9 116.6 137.1 123.1 50.8 55.0
blend_w16_8bpc_neon: 533.0 334.3 446.6 348.6 150.7 155.4
blend_w32_8bpc_neon: 1299.2 836.8 1170.7 909.9 370.5 386.3
After:
blend_h_w2_8bpc_neon: 169.6 169.8 140.9 134.0 62.3 72.5
blend_h_w4_8bpc_neon: 164.5 149.1 127.6 137.7 59.1 60.1
blend_h_w8_8bpc_neon: 184.9 102.7 126.3 123.4 54.9 63.2
blend_h_w16_8bpc_neon: 291.0 163.8 232.1 180.9 88.4 83.9
blend_h_w32_8bpc_neon: 531.2 285.6 422.6 358.4 155.5 155.9
blend_h_w64_8bpc_neon: 956.0 541.9 809.9 631.6 280.0 270.6
blend_h_w128_8bpc_neon: 2159.0 1253.6 1889.0 1404.8 606.2 600.5
blend_v_w2_8bpc_neon: 249.9 362.0 269.4 195.6 107.8 117.6
blend_v_w4_8bpc_neon: 452.6 541.6 538.2 376.1 199.5 266.9
blend_v_w8_8bpc_neon: 561.0 348.9 551.3 357.7 214.3 204.4
blend_v_w16_8bpc_neon: 926.8 510.9 785.0 592.1 270.7 245.8
blend_v_w32_8bpc_neon: 1474.4 913.3 1151.4 995.7 347.5 371.2
blend_w4_8bpc_neon: 103.3 96.6 76.9 91.5 33.7 35.3
blend_w8_8bpc_neon: 174.9 88.2 114.8 123.1 51.5 55.0
blend_w16_8bpc_neon: 532.8 282.2 445.3 348.5 149.8 155.7
blend_w32_8bpc_neon: 1295.1 735.2 1122.8 908.4 372.0 386.5

w_mask_444/422/420:

Before: Cortex A7 A8 A9 A53 A72 A73
w_mask_420_w4_8bpc_neon: 218.1 144.4 187.3 152.7 86.9 89.0
w_mask_420_w8_8bpc_neon: 544.0 393.7 437.0 372.5 211.1 230.9
w_mask_420_w16_8bpc_neon: 1537.2 1063.5 1182.3 1024.3 566.4 667.7
w_mask_420_w32_8bpc_neon: 5734.7 4207.2 4716.8 3822.8 2340.5 2521.3
w_mask_420_w64_8bpc_neon: 14317.6 10165.0 13220.2 9578.5 5578.9 5989.9
w_mask_420_w128_8bpc_neon: 37932.8 25299.1 39562.9 25203.8 14916.4 15465.1
w_mask_422_w4_8bpc_neon: 206.8 141.4 177.9 143.4 82.1 84.8
w_mask_422_w8_8bpc_neon: 511.8 380.8 416.7 342.5 198.5 221.7
w_mask_422_w16_8bpc_neon: 1632.8 1154.4 1282.9 1061.2 595.3 684.9
w_mask_422_w32_8bpc_neon: 6087.8 4560.3 5173.3 3945.8 2319.1 2608.7
w_mask_422_w64_8bpc_neon: 15183.7 11013.9 14435.6 9904.6 5449.9 6100.9
w_mask_422_w128_8bpc_neon: 39951.2 27441.0 42398.2 25995.1 14624.9 15529.2
w_mask_444_w4_8bpc_neon: 193.4 127.0 170.0 135.4 76.8 81.4
w_mask_444_w8_8bpc_neon: 477.8 340.0 427.9 319.3 187.2 214.7
w_mask_444_w16_8bpc_neon: 1529.0 1058.8 1209.4 987.0 571.7 677.3
w_mask_444_w32_8bpc_neon: 5687.9 4166.9 4882.4 3667.0 2286.8 2518.7
w_mask_444_w64_8bpc_neon: 14394.7 10055.1 14057.9 9372.0 5369.3 5898.7
w_mask_444_w128_8bpc_neon: 37952.0 25008.8 42169.9 24988.8 22973.7 15241.1
After:
w_mask_420_w4_8bpc_neon: 219.7 120.7 178.0 152.7 87.2 89.0
w_mask_420_w8_8bpc_neon: 547.5 355.2 404.4 372.4 211.4 231.0
w_mask_420_w16_8bpc_neon: 1540.9 987.1 1113.0 1024.9 567.4 669.5
w_mask_420_w32_8bpc_neon: 5915.4 3905.8 4516.8 3929.3 2363.7 2523.6
w_mask_420_w64_8bpc_neon: 14860.9 9437.1 12609.7 9586.4 5627.3 6005.8
w_mask_420_w128_8bpc_neon: 38799.1 23536.1 38598.3 24787.7 14595.7 15474.9
w_mask_422_w4_8bpc_neon: 208.3 115.4 168.6 143.4 82.4 84.8
w_mask_422_w8_8bpc_neon: 515.2 335.7 383.2 342.5 198.9 221.8
w_mask_422_w16_8bpc_neon: 1643.2 1053.6 1199.3 1062.2 595.6 685.7
w_mask_422_w32_8bpc_neon: 6335.1 4161.0 4959.3 4088.5 2353.0 2606.4
w_mask_422_w64_8bpc_neon: 15689.4 10039.8 13806.1 9937.7 5535.3 6099.8
w_mask_422_w128_8bpc_neon: 40754.4 25033.3 41390.5 25683.7 14668.8 15537.1
w_mask_444_w4_8bpc_neon: 194.9 107.4 162.0 135.4 77.1 81.4
w_mask_444_w8_8bpc_neon: 481.1 300.2 422.0 319.1 187.6 214.6
w_mask_444_w16_8bpc_neon: 1542.6 956.1 1137.7 988.4 572.4 677.5
w_mask_444_w32_8bpc_neon: 5896.1 3766.1 4731.9 3801.2 2322.9 2521.8
w_mask_444_w64_8bpc_neon: 14814.0 9084.7 13515.4 9311.0 5497.3 5896.3
w_mask_444_w128_8bpc_neon: 38587.7 22615.2 41389.9 24639.4 17705.8 15244.3
b0d00020 -
Henrik Gramner authored
6c3e85de -
Henrik Gramner authored
fa32f2de -
Henrik Gramner authored
Explicitly take advantage of the fact that certain probabilities are zero instead of loading zeros from the CDF padding. The current code works just fine, but only because those values happen to be zero due to what is essentially an implementation detail.
d8799d94 -
Luc Trudeau authored
5a4ae342 -
Luc Trudeau authored
ad0c0412 -
Luc Trudeau authored
42ea146f -
James Almer authored
dav1dplay shouldn't be built by default: it's an example more than a tool.
3a77c57b -
James Almer authored
dff0a08c -
Henrik Gramner authored
a819653e -
Henrik Gramner authored
* Eliminate the trailing zero after the CDF probabilities. We can reuse the count value as a terminator instead. This reduces the size of the CDF context by around 8%.
* Align the CDF arrays.
* Various other minor optimizations.
e29fd5c0 -
Henrik Gramner authored
This particular sequence is executed often enough to justify having a separate slightly more optimized code path instead of just chaining multiple generic symbol decoding function calls together.
61dcd11b -
Henrik Gramner authored
0f4edbff -
Michael Bradshaw authored
d20d70e8 -
B Krishnan Iyer authored
A73 A53
blend_h_w2_8bpc_c: 184.7 301.5
blend_h_w2_8bpc_neon: 58.8 104.1
blend_h_w4_8bpc_c: 291.4 507.3
blend_h_w4_8bpc_neon: 48.7 108.9
blend_h_w8_8bpc_c: 510.1 992.7
blend_h_w8_8bpc_neon: 66.5 99.3
blend_h_w16_8bpc_c: 972 1835.3
blend_h_w16_8bpc_neon: 82.7 145.2
blend_h_w32_8bpc_c: 776.7 912.9
blend_h_w32_8bpc_neon: 155.1 266.9
blend_h_w64_8bpc_c: 1424.3 1635.4
blend_h_w64_8bpc_neon: 273.4 480.9
blend_h_w128_8bpc_c: 3318.1 3774
blend_h_w128_8bpc_neon: 614.1 1097.9
blend_v_w2_8bpc_c: 278.8 427.5
blend_v_w2_8bpc_neon: 113.7 170.4
blend_v_w4_8bpc_c: 960.2 1597.7
blend_v_w4_8bpc_neon: 222.9 351.4
blend_v_w8_8bpc_c: 1694.2 3333.5
blend_v_w8_8bpc_neon: 200.9 333.6
blend_v_w16_8bpc_c: 3115.2 5971.6
blend_v_w16_8bpc_neon: 233.2 494.8
blend_v_w32_8bpc_c: 3949.7 6070.6
blend_v_w32_8bpc_neon: 460.4 841.6
blend_w4_8bpc_c: 244.2 388.3
blend_w4_8bpc_neon: 25.5 66.7
blend_w8_8bpc_c: 616.3 1120.8
blend_w8_8bpc_neon: 46 110.7
blend_w16_8bpc_c: 2193.1 4056.4
blend_w16_8bpc_neon: 140.7 299.3
blend_w32_8bpc_c: 2502.8 2998.5
blend_w32_8bpc_neon: 381.4 725.3
1dc2dc7d -
B Krishnan Iyer authored
A73 A53
w_mask_420_w4_8bpc_c: 818 1082.9
w_mask_420_w4_8bpc_neon: 79 126.6
w_mask_420_w8_8bpc_c: 2486 3399.8
w_mask_420_w8_8bpc_neon: 200.2 343.7
w_mask_420_w16_8bpc_c: 8022.3 10989.6
w_mask_420_w16_8bpc_neon: 528.1 889
w_mask_420_w32_8bpc_c: 31851.8 42808.6
w_mask_420_w32_8bpc_neon: 2062.5 3380.8
w_mask_420_w64_8bpc_c: 79268.5 102683.9
w_mask_420_w64_8bpc_neon: 5252.9 8575.4
w_mask_420_w128_8bpc_c: 193704.1 255586.5
w_mask_420_w128_8bpc_neon: 14602.3 22167.7
w_mask_422_w4_8bpc_c: 777.3 1038.5
w_mask_422_w4_8bpc_neon: 72.1 112.9
w_mask_422_w8_8bpc_c: 2405.7 3168
w_mask_422_w8_8bpc_neon: 191.9 314.1
w_mask_422_w16_8bpc_c: 7783.7 10543.9
w_mask_422_w16_8bpc_neon: 559.8 835.5
w_mask_422_w32_8bpc_c: 30895.7 41141.2
w_mask_422_w32_8bpc_neon: 2089.7 3187.2
w_mask_422_w64_8bpc_c: 75500.2 98766.3
w_mask_422_w64_8bpc_neon: 5379 8208.2
w_mask_422_w128_8bpc_c: 186967.1 245809.1
w_mask_422_w128_8bpc_neon: 15159.9 21474.5
w_mask_444_w4_8bpc_c: 850.1 1136.6
w_mask_444_w4_8bpc_neon: 66.5 104.7
w_mask_444_w8_8bpc_c: 2373.5 3262.9
w_mask_444_w8_8bpc_neon: 180.5 290.2
w_mask_444_w16_8bpc_c: 7291.6 10590.7
w_mask_444_w16_8bpc_neon: 550.9 809.7
w_mask_444_w32_8bpc_c: 8048.3 10140.8
w_mask_444_w32_8bpc_neon: 2136.2 3095
w_mask_444_w64_8bpc_c: 18055.3 23060
w_mask_444_w64_8bpc_neon: 5522.5 8124.8
w_mask_444_w128_8bpc_c: 42754.3 56072
w_mask_444_w128_8bpc_neon: 15569.5 21531.5
3d94fb9a -
Henrik Gramner authored
When compiling in release mode, instead of just deleting assertions, use them to give hints to the compiler. This allows for slightly better code generation in some cases.
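As an illustration of the idea, here is a minimal sketch (not dav1d's actual macro; names and the GCC/Clang builtin choice are assumptions) of an assertion that, instead of being compiled away in release builds, becomes an optimizer invariant:

```c
#include <assert.h>

/* Hypothetical sketch: in release builds (NDEBUG), feed the asserted
 * condition to the optimizer instead of discarding it entirely.
 * Assumes a compiler providing __builtin_unreachable() (GCC/Clang). */
#ifdef NDEBUG
#define my_assert(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define my_assert(cond) assert(cond)
#endif

int mod8(int x) {
    my_assert(x >= 0);
    /* Knowing x >= 0, the compiler can lower x % 8 to x & 7. */
    return x % 8;
}
```

With the invariant in place, the compiler no longer has to emit the sign-handling branch that `%` requires for negative operands.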
6751c980 -
Henrik Gramner authored
Eliminates some sign extensions.
6757cab9 -
Henrik Gramner authored
a62c445d -
Henrik Gramner authored
70b66ff1 -
Henrik Gramner authored
eeca6f25 -
Henrik Gramner authored
Fixes integer overflows with very large frame sizes. Credit to OSS-Fuzz.
2c1467b4 -
Ronald S. Bultje authored
Otherwise the table can get out of sync when the frame size and tile count stays the same, but the tile coordinates change. Fixes #266.
37a03fc7 -
Martin Storsjö authored
c3e5ad04 -
Martin Storsjö authored
Use the so far unused lr register instead of r10.
f01bbbdd -
B Krishnan Iyer authored
cfd6fe6d -
Ronald S. Bultje authored
- calculate chroma grain based on src (not dst) luma pixels;
- division should precede multiplication in delta calculation.

Together, these fix differences in film grain reconstruction between libaom and dav1d for various generated samples.
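The second point is about integer truncation order: with integer arithmetic, dividing (or shifting) before multiplying truncates earlier and can yield a different value than multiplying first. The numbers below are purely illustrative, not from the film grain code:

```c
/* Illustrative only: the same inputs produce different results depending
 * on whether the shift (a truncating division by 2) happens before or
 * after the multiplication. */
int scale_after_shift(int v) { return (v >> 1) * 3; } /* truncate first */
int shift_after_scale(int v) { return (v * 3) >> 1; } /* truncate last  */
```

For `v = 5`, the first form gives `2 * 3 = 6` while the second gives `15 >> 1 = 7`, so matching libaom requires matching its order of operations exactly.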
c09f1072 -
Ronald S. Bultje authored
1ffbeda0 -
Ronald S. Bultje authored
Fixes libaom/dav1d mismatch in av1-1-b10-23-film_grain-50.ivf.
91b0af2f -
Janne Grunau authored
bfc9f72a -
Janne Grunau authored
The chroma part of pal_idx potentially conflicts during intra reconstruction with edge_{8,16}bpc. Fixes out-of-range pixel values caused by invalid palette indices in clusterfuzz-testcase-minimized-dav1d_fuzzer_mt-5076736684851200. Fixes #294.

Reported as integer overflows in boxsum5sqr with undefined behavior sanitizer. Credits to oss-fuzz.
863c3731 -
Janne Grunau authored
This large constant needs a movw instruction, which newer binutils can figure out, but older versions need it stated explicitly. This fixes #296.
e65abadf -
Henrik Gramner authored
clang-cl doesn't like function calls in __assume statements, even trivial inline ones.
666c71a0 -
Henrik Gramner authored
__assume() doesn't work correctly in clang-cl versions prior to 7.0.0, which causes bogus warnings about use of uninitialized variables to be printed. Avoid that by using __builtin_unreachable() instead.
c0e1988b -
Martin Storsjö authored
See issue #295, this fixes it for arm64.

Before: Cortex A53 A72 A73
inv_txfm_add_4x4_adst_adst_1_8bpc_neon: 103.0 63.2 65.2
inv_txfm_add_4x8_adst_adst_1_8bpc_neon: 197.0 145.0 134.2
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 332.0 248.0 247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1676.8 1197.0 1186.8
After:
inv_txfm_add_4x4_adst_adst_1_8bpc_neon: 103.0 76.4 67.0
inv_txfm_add_4x8_adst_adst_1_8bpc_neon: 205.0 155.0 143.8
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 358.0 269.0 276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1785.2 1347.8 1312.1

This would probably only be needed for adst in the first pass, but the additional code complexity from splitting the implementations (as we currently don't have transforms differentiated between first and second pass) isn't necessarily worth it (the speedup over C code is still 8-10x).
e2702eaf -
Henrik Gramner authored
16-bit precision is sufficient for the second pass, but the first pass requires 32-bit precision to correctly handle some esoteric edge cases.
a9315f5f -
Henrik Gramner authored
For w <= 32 we can't process more than two rows per loop iteration. Credit to OSS-Fuzz.
69dae683 -
Henrik Gramner authored
For some reason the MSVC CRT _wassert() function is not flagged as __declspec(noreturn), so when using those headers the compiler will expect execution to continue after an assertion has been triggered and will therefore complain about the use of uninitialized variables when compiled in debug mode in certain code paths. Reorder some case statements as a workaround.
acad1a99 -
James Almer authored
Both values can be independently coded in the bitstream, and are not always equal to frame_width and frame_height.
79c4aa95 -
Ronald S. Bultje authored
b9d4630c -
Ronald S. Bultje authored
04ca7112 -
Ronald S. Bultje authored
fgy_32x32xn_8bpc_c: 16181.8
fgy_32x32xn_8bpc_avx2: 3231.4
gen_grain_y_ar0_8bpc_c: 108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c: 168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c: 266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c: 448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
99307bf3 -
Ronald S. Bultje authored
This would affect the output in samples with an odd width and horizontal chroma subsampling. The check does not exist in libaom, and might cause mismatches. This causes issues in the sample from #210, which uses super-resolution and has odd width. To work around this, make super-resolution's resize() always write an even number of pixels. This should not interfere with SIMD in the future.
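The odd-width interaction with horizontal chroma subsampling comes from the chroma plane dimensions rounding up. A minimal sketch (the helper name is illustrative, not dav1d's actual code):

```c
/* Sketch: with horizontal subsampling (ss_hor = 1 for 4:2:0/4:2:2),
 * an odd luma width rounds up, so a luma width of 7 covers 4 chroma
 * pixels; an even luma width of 8 also covers 4. */
static int chroma_width(int luma_w, int ss_hor) {
    return (luma_w + ss_hor) >> ss_hor;
}
```

This is why making resize() always write an even number of pixels sidesteps the last-column special case entirely.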
6d363223 -
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c: 8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2: 1001.6
fguv_32x32xn_8bpc_420_csfl1_c: 6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2: 1299.5
556890be -
Ronald S. Bultje authored
x86_64:
lpf_h_sb_uv_w4_8bpc_c: 430.6
lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
lpf_h_sb_uv_w4_8bpc_avx2: 200.4

lpf_h_sb_uv_w6_8bpc_c: 981.9
lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
lpf_h_sb_uv_w6_8bpc_avx2: 270.0

lpf_h_sb_y_w4_8bpc_c: 3001.7
lpf_h_sb_y_w4_8bpc_ssse3: 466.3
lpf_h_sb_y_w4_8bpc_avx2: 383.1

lpf_h_sb_y_w8_8bpc_c: 4457.7
lpf_h_sb_y_w8_8bpc_ssse3: 818.9
lpf_h_sb_y_w8_8bpc_avx2: 537.0

lpf_h_sb_y_w16_8bpc_c: 1967.9
lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
lpf_h_sb_y_w16_8bpc_avx2: 1078.2

lpf_v_sb_uv_w4_8bpc_c: 369.4
lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
lpf_v_sb_uv_w4_8bpc_avx2: 58.1

lpf_v_sb_uv_w6_8bpc_c: 769.6
lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
lpf_v_sb_uv_w6_8bpc_avx2: 117.8

lpf_v_sb_y_w4_8bpc_c: 772.4
lpf_v_sb_y_w4_8bpc_ssse3: 179.8
lpf_v_sb_y_w4_8bpc_avx2: 173.6

lpf_v_sb_y_w8_8bpc_c: 1660.2
lpf_v_sb_y_w8_8bpc_ssse3: 468.3
lpf_v_sb_y_w8_8bpc_avx2: 345.8

lpf_v_sb_y_w16_8bpc_c: 1889.6
lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
lpf_v_sb_y_w16_8bpc_avx2: 568.1
1e4e6c7a -
Victorien Le Couviour--Tuffet authored
x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4

x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6

x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7

x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9

x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0

x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0

x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2

x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7

x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9

x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
c0865f35 -
Niklas Haas authored
Only meaningful with libplacebo. The defaults are higher quality than SDL so it's an unfair comparison and definitely too much for slow iGPUs at 4K res. Make the defaults fast/dumb processing only, and guard the debanding/dithering/upscaling/etc. behind a new --highquality flag.
f6ae8c9c -
Niklas Haas authored
Useful to test the effects of performance changes to the decoding/rendering loop as a whole.
3f35ef1f -
Niklas Haas authored
Right now this just allocates a new buffer for every frame, uses it, then discards it immediately. This is not optimal; either dav1d should start reusing buffers internally, or we need to pool them in dav1dplay. As it stands, this is not really a performance gain. I'll have to investigate why, but my suspicion is that seeing any gains might require reusing buffers somewhere.

Note: thrashing buffers is not as bad an approach as it might initially seem. Not only does libplacebo pool and reuse GPU memory and buffer state objects internally, but creating, using and immediately destroying buffers also absolves us from having to do any manual polling to figure out when a buffer is reusable again. It's entirely possible that this is only bad because of lock contention. As said, I'll have to investigate further...
490a1420 -
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before: Cortex A53 A72 A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 356.0 279.2 278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1785.0 1329.5 1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 360.0 253.2 269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1793.1 1300.9 1254.0

(In this particular case it seems to be a minor regression on A53, probably due to having to change the ordering of some instructions, since smull+smlal+smull2+smlal2 overwrites the second output register sooner than addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on A53 as well.)
a4950bce -
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
0ed3ad19 -
Martin Storsjö authored
Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed to be clipped. This fixes mismatches for some samples, see issue #299.

Before: Cortex A53 A72 A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon: 93.0 52.8 49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon: 260.0 186.0 196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon: 1371.0 953.4 1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon: 7363.2 4887.5 5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon: 25029.0 17492.3 18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon: 105.0 58.7 55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon: 294.0 211.5 209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon: 1495.8 1050.4 1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon: 7866.7 5197.8 5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon: 25807.2 18619.3 18526.9
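The underlying issue can be shown in scalar C (illustrative values, not from the transform code): summing two 16-bit coefficients in 16-bit arithmetic wraps around, while widening to 32 bits first preserves the exact sum.

```c
#include <stdint.h>

/* Illustrative: a 16-bit add of two large int16_t coefficients wraps,
 * while widening to int32_t before adding keeps the exact result. */
static int16_t add16(int16_t a, int16_t b) { return (int16_t)(a + b); }
static int32_t add32(int16_t a, int16_t b) { return (int32_t)a + b; }
```

For `a = b = 30000`, `add32` yields 60000 while `add16` wraps modulo 65536, which is exactly the kind of mismatch the commit fixes for unclipped intermediate sums.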
713aa34c -
Victorien Le Couviour--Tuffet authored
x86_64: warp_8x8_8bpc_c: 1773.4
x86_32: warp_8x8_8bpc_c: 1740.4
x86_64: warp_8x8_8bpc_ssse3: 317.5
x86_32: warp_8x8_8bpc_ssse3: 378.4
x86_64: warp_8x8_8bpc_sse4: 303.7
x86_32: warp_8x8_8bpc_sse4: 367.7
x86_64: warp_8x8_8bpc_avx2: 224.9

x86_64: warp_8x8t_8bpc_c: 1664.6
x86_32: warp_8x8t_8bpc_c: 1674.0
x86_64: warp_8x8t_8bpc_ssse3: 320.7
x86_32: warp_8x8t_8bpc_ssse3: 379.5
x86_64: warp_8x8t_8bpc_sse4: 304.8
x86_32: warp_8x8t_8bpc_sse4: 369.8
x86_64: warp_8x8t_8bpc_avx2: 228.5
a91a03b0 -
Martin Storsjö authored
Relative speedups over the C code:

Cortex A53 A72 A73
intra_pred_dc_128_w4_8bpc_neon: 2.08 1.47 2.17
intra_pred_dc_128_w8_8bpc_neon: 3.33 2.49 4.03
intra_pred_dc_128_w16_8bpc_neon: 3.93 3.86 3.75
intra_pred_dc_128_w32_8bpc_neon: 3.14 3.79 2.90
intra_pred_dc_128_w64_8bpc_neon: 3.68 1.97 2.42
intra_pred_dc_left_w4_8bpc_neon: 2.41 1.70 2.23
intra_pred_dc_left_w8_8bpc_neon: 3.53 2.41 3.32
intra_pred_dc_left_w16_8bpc_neon: 3.87 3.54 3.34
intra_pred_dc_left_w32_8bpc_neon: 4.10 3.60 2.76
intra_pred_dc_left_w64_8bpc_neon: 3.72 2.00 2.39
intra_pred_dc_top_w4_8bpc_neon: 2.27 1.66 2.07
intra_pred_dc_top_w8_8bpc_neon: 3.83 2.69 3.43
intra_pred_dc_top_w16_8bpc_neon: 3.66 3.60 3.20
intra_pred_dc_top_w32_8bpc_neon: 3.92 3.54 2.66
intra_pred_dc_top_w64_8bpc_neon: 3.60 1.98 2.30
intra_pred_dc_w4_8bpc_neon: 2.29 1.42 2.16
intra_pred_dc_w8_8bpc_neon: 3.56 2.83 3.05
intra_pred_dc_w16_8bpc_neon: 3.46 3.37 3.15
intra_pred_dc_w32_8bpc_neon: 3.79 3.41 2.74
intra_pred_dc_w64_8bpc_neon: 3.52 2.01 2.41
intra_pred_h_w4_8bpc_neon: 10.34 5.74 5.94
intra_pred_h_w8_8bpc_neon: 12.13 6.33 6.43
intra_pred_h_w16_8bpc_neon: 10.66 7.31 5.85
intra_pred_h_w32_8bpc_neon: 6.28 4.18 2.88
intra_pred_h_w64_8bpc_neon: 3.96 1.85 1.75
intra_pred_v_w4_8bpc_neon: 11.44 6.12 7.57
intra_pred_v_w8_8bpc_neon: 14.76 7.58 7.95
intra_pred_v_w16_8bpc_neon: 11.34 6.28 5.88
intra_pred_v_w32_8bpc_neon: 6.56 3.33 3.34
intra_pred_v_w64_8bpc_neon: 4.57 1.24 1.97
f7743da1 -
Ronald S. Bultje authored
f6a8cc0c -
Henrik Gramner authored
16e0741a -
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
f404c722 -
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations which is insufficient for some esoteric edge cases.
de561b3b -
Henrik Gramner authored
d4dfa85c -
Martin Storsjö authored
a4ceff6f -
Luc Trudeau authored
Prior checks were done at the sbrow level. This now allows calling dav1d_lr_sbrow and dav1d_lr_copy_lpf only when there's something for them to do.
e570088d -
Martin Storsjö authored
Before: Cortex A53 A72 A73
warp_8x8_8bpc_neon: 1997.3 1170.1 1199.9
warp_8x8t_8bpc_neon: 1982.4 1171.5 1192.6
After:
warp_8x8_8bpc_neon: 1954.6 1159.2 1153.3
warp_8x8t_8bpc_neon: 1938.5 1146.2 1136.7
ff41197b -
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c: 30131.8
gen_grain_uv_ar0_8bpc_420_avx2: 6600.4
gen_grain_uv_ar1_8bpc_420_c: 46110.5
gen_grain_uv_ar1_8bpc_420_avx2: 17887.2
gen_grain_uv_ar2_8bpc_420_c: 73593.2
gen_grain_uv_ar2_8bpc_420_avx2: 26918.6
gen_grain_uv_ar3_8bpc_420_c: 114499.3
gen_grain_uv_ar3_8bpc_420_avx2: 29804.6
4e22ef3a -
Luc Trudeau authored
d2c94ee1 -
Martin Storsjö authored
Instead of apply_sign(imin(abs(diff), clip), diff), do imax(imin(diff, clip), -clip).

Before: Cortex A53 A72 A73
cdef_filter_4x4_8bpc_neon: 592.7 374.5 384.5
cdef_filter_4x8_8bpc_neon: 1093.0 704.4 706.6
cdef_filter_8x8_8bpc_neon: 1962.6 1239.4 1252.1
After:
cdef_filter_4x4_8bpc_neon: 593.7 355.5 373.2
cdef_filter_4x8_8bpc_neon: 1091.6 663.2 685.3
cdef_filter_8x8_8bpc_neon: 1964.2 1182.5 1210.8
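The two clipping forms from the message are equivalent for non-negative clip values; the rewrite just avoids the abs/apply_sign round trip. A scalar C sketch (helper definitions here are illustrative, not dav1d's):

```c
/* Equivalent for clip >= 0:
 * apply_sign(imin(abs(diff), clip), diff) == imax(imin(diff, clip), -clip) */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }
static int iabs(int a)        { return a < 0 ? -a : a; }
static int apply_sign(int v, int s) { return s < 0 ? -v : v; }

static int clip_old(int diff, int clip) {
    return apply_sign(imin(iabs(diff), clip), diff);
}
static int clip_new(int diff, int clip) {
    return imax(imin(diff, clip), -clip);
}
```

The second form maps directly onto a min/max instruction pair, which is why it's cheaper in NEON.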
bc26e300 -
Martin Storsjö authored
1f835750 -
Martin Storsjö authored
As there's only two individual parameters, we can insert them into the same vector, reducing the number of actual calculation instructions, but adding a few more instructions to dup the results to the final vectors instead.
fa6a0924 -
Martin Storsjö authored
Only add .4h elements to the upper half of sum_alt, as only 11 elements are needed, and .8h + .4h gives 12 in total. Fuse two consecutive ext #8 + ext #2 into ext #10. Move a few stores further away from where they are calculated.

Before: Cortex A53 A72 A73
cdef_dir_8bpc_neon: 404.0 278.2 302.4
After:
cdef_dir_8bpc_neon: 400.0 269.3 282.5
dfaa2a10 -
Luc Trudeau authored
7bbc5e3d -
Martin Storsjö authored
32ae5dd0 -
Martin Storsjö authored
The relative speedup ranges from 2.5 to 3.8x for find_dir and around 5 to 10x for filter. The find_dir function is a bit restricted by barely having enough registers, leaving very few ones for temporaries, so less can be done in parallel and many instructions end up depending on the result of the preceding instruction. The ported functions end up slightly slower than the corresponding ARM64 ones, but only marginally:

ARM64: Cortex A53 A72 A73
cdef_dir_8bpc_neon: 400.0 268.8 282.2
cdef_filter_4x4_8bpc_neon: 596.3 359.9 379.7
cdef_filter_4x8_8bpc_neon: 1091.0 670.4 698.5
cdef_filter_8x8_8bpc_neon: 1998.7 1207.2 1218.4
ARM32:
cdef_dir_8bpc_neon: 528.5 329.1 337.4
cdef_filter_4x4_8bpc_neon: 632.5 482.5 432.2
cdef_filter_4x8_8bpc_neon: 1107.2 854.8 782.3
cdef_filter_8x8_8bpc_neon: 1984.8 1381.0 1414.4

Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
cdef_dir_8bpc_neon: 2.92 2.54 2.67 3.87 3.37 3.83
cdef_filter_4x4_8bpc_neon: 5.09 7.61 6.10 6.85 4.94 7.41
cdef_filter_4x8_8bpc_neon: 5.53 8.23 6.77 7.67 5.60 8.01
cdef_filter_8x8_8bpc_neon: 6.26 10.14 8.49 8.54 6.94 4.27
3489a9c1 -
Martin Storsjö authored
Before: Cortex A53 A72 A73
warp_8x8_8bpc_neon: 1952.8 1161.3 1151.1
warp_8x8t_8bpc_neon: 1937.1 1147.5 1139.0
After:
warp_8x8_8bpc_neon: 1860.8 1068.6 1105.8
warp_8x8t_8bpc_neon: 1846.9 1056.4 1099.8
5647a57e -
Martin Storsjö authored
Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
warp_8x8_8bpc_neon: 2.79 5.45 4.18 3.96 4.16 4.51
warp_8x8t_8bpc_neon: 2.79 5.33 4.18 3.98 4.22 4.25

Comparison to original ARM64 assembly:
ARM64: Cortex A53 A72 A73
warp_8x8_8bpc_neon: 1854.6 1072.5 1102.5
warp_8x8t_8bpc_neon: 1839.6 1069.4 1089.5
ARM32:
warp_8x8_8bpc_neon: 2132.5 1160.3 1218.0
warp_8x8t_8bpc_neon: 2113.7 1148.0 1209.1
61442bee -
Jean-Baptiste Kempf authored
3e0f1508 -
Michail Alvanos authored
be60b142 -
Jean-Baptiste Kempf authored
c688d5b2 -
James Almer authored
The uv argument is normally in a gpr, but in checkasm it's forcefully loaded from stack.
a7c024ce -
Henrik Gramner authored
dfadb6df -
Henrik Gramner authored
afe901a6 -
Henrik Gramner authored
ea9fc9d9 -
Martin Storsjö authored
Relative speedups over the C code:
Cortex A53 A72 A73
intra_pred_paeth_w4_8bpc_neon: 8.36 6.55 7.27
intra_pred_paeth_w8_8bpc_neon: 15.24 11.36 11.34
intra_pred_paeth_w16_8bpc_neon: 16.63 13.20 14.17
intra_pred_paeth_w32_8bpc_neon: 10.83 9.21 9.87
intra_pred_paeth_w64_8bpc_neon: 8.37 7.07 7.45
8ab69afb -
Martin Storsjö authored
Relative speedups over the C code:
Cortex A53 A72 A73
intra_pred_smooth_h_w4_8bpc_neon: 8.02 4.53 7.09
intra_pred_smooth_h_w8_8bpc_neon: 16.59 5.91 9.32
intra_pred_smooth_h_w16_8bpc_neon: 18.80 5.54 10.10
intra_pred_smooth_h_w32_8bpc_neon: 5.07 4.43 4.60
intra_pred_smooth_h_w64_8bpc_neon: 5.03 4.26 4.34
intra_pred_smooth_v_w4_8bpc_neon: 9.11 5.51 7.75
intra_pred_smooth_v_w8_8bpc_neon: 17.07 6.86 10.55
intra_pred_smooth_v_w16_8bpc_neon: 17.98 6.38 11.52
intra_pred_smooth_v_w32_8bpc_neon: 11.69 5.66 8.09
intra_pred_smooth_v_w64_8bpc_neon: 8.44 4.34 5.72
intra_pred_smooth_w4_8bpc_neon: 9.81 4.85 6.93
intra_pred_smooth_w8_8bpc_neon: 16.05 5.60 9.26
intra_pred_smooth_w16_8bpc_neon: 14.01 5.02 8.96
intra_pred_smooth_w32_8bpc_neon: 9.29 5.02 7.25
intra_pred_smooth_w64_8bpc_neon: 6.53 3.94 5.26
4318600e -
Martin Storsjö authored
Relative speedups over the C code:
Cortex A53 A72 A73
pal_pred_w4_8bpc_neon: 8.75 6.15 7.60
pal_pred_w8_8bpc_neon: 19.93 11.79 10.98
pal_pred_w16_8bpc_neon: 24.68 13.28 16.06
pal_pred_w32_8bpc_neon: 23.56 11.81 16.74
pal_pred_w64_8bpc_neon: 23.16 12.19 17.60
4f14573c -
Martin Storsjö authored
Use a different layout of the filter_intra_taps depending on architecture; the current one is optimized for the x86 SIMD implementation.

Relative speedups over the C code:
Cortex A53 A72 A73
intra_pred_filter_w4_8bpc_neon: 6.38 2.81 4.43
intra_pred_filter_w8_8bpc_neon: 9.30 3.62 5.71
intra_pred_filter_w16_8bpc_neon: 9.85 3.98 6.42
intra_pred_filter_w32_8bpc_neon: 10.77 4.08 7.09
d322d451 -
Martin Storsjö authored
Relative speedup over the C code:
Cortex A53 A72 A73
cfl_pred_cfl_128_w4_8bpc_neon: 10.81 7.90 9.80
cfl_pred_cfl_128_w8_8bpc_neon: 18.38 11.15 13.24
cfl_pred_cfl_128_w16_8bpc_neon: 16.52 10.83 16.00
cfl_pred_cfl_128_w32_8bpc_neon: 3.27 3.60 3.70
cfl_pred_cfl_left_w4_8bpc_neon: 9.82 7.38 8.76
cfl_pred_cfl_left_w8_8bpc_neon: 17.22 10.63 11.97
cfl_pred_cfl_left_w16_8bpc_neon: 16.03 10.49 15.66
cfl_pred_cfl_left_w32_8bpc_neon: 3.28 3.61 3.72
cfl_pred_cfl_top_w4_8bpc_neon: 9.74 7.39 9.29
cfl_pred_cfl_top_w8_8bpc_neon: 17.48 10.89 12.58
cfl_pred_cfl_top_w16_8bpc_neon: 16.01 10.62 15.31
cfl_pred_cfl_top_w32_8bpc_neon: 3.25 3.62 3.75
cfl_pred_cfl_w4_8bpc_neon: 8.39 6.34 8.04
cfl_pred_cfl_w8_8bpc_neon: 15.99 10.12 12.42
cfl_pred_cfl_w16_8bpc_neon: 15.25 10.40 15.12
cfl_pred_cfl_w32_8bpc_neon: 3.23 3.58 3.71

The C code gets autovectorized for w >= 32, which is why the relative speedup looks strange (but the performance of the NEON functions is completely as expected).
c7693386 -
Martin Storsjö authored
Relative speedup over the C code:
Cortex A53 A72 A73
cfl_ac_420_w4_8bpc_neon: 7.73 6.48 9.22
cfl_ac_420_w8_8bpc_neon: 6.70 5.56 6.95
cfl_ac_420_w16_8bpc_neon: 6.51 6.93 6.67
cfl_ac_422_w4_8bpc_neon: 9.25 7.70 9.75
cfl_ac_422_w8_8bpc_neon: 8.53 5.95 7.13
cfl_ac_422_w16_8bpc_neon: 7.08 6.87 6.06
57dd0aae -
Luc Trudeau authored
b7d7c8ce -
Martin Storsjö authored
5d014b41 -
Jean-Baptiste Kempf authored
Showing
examples/meson.build (new file, mode 0 → 100644)
src/arm/32/cdef.S (new file, mode 0 → 100644)