Commits on Source (97)
-
dav1d_init_get_bits() initializes c->eof to 0, which implies c->ptr < c->ptr_end, or equivalently sz > 0.
1efea985 -
b14711ca
-
54f92068
-
Ronald S. Bultje authored
The shift-amount can be up to 56, and left-shifting 32-bit integers by values >=32 is undefined behaviour. Therefore, use 64-bit integers instead. Also slightly rewrite so we only call dav1d_get_bits() once for the combined more|bits value, and mask the relevant portions out instead of reading twice. Lastly, move the overflow check out of the loop (as suggested by @wtc) Fixes #341.
47daa4df -
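The 64-bit accumulation described in 47daa4df can be illustrated with a minimal sketch. It assumes a plain byte buffer instead of dav1d's bit reader; `read_uleb128` and its parameters are hypothetical names for illustration, not dav1d's API.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not dav1d's API): decode one LEB128 value from buf.
 * Returns 0 and stores the value in *out on success, -1 on error. */
static int read_uleb128(const uint8_t *buf, size_t sz, unsigned *out) {
    uint64_t val = 0;   /* 64-bit accumulator: shifts of 32 or more stay defined */
    unsigned shift = 0, more = 0;
    size_t pos = 0;

    do {
        if (pos >= sz) return -1;             /* ran out of input */
        const unsigned v = buf[pos++];
        more = v & 0x80;                      /* continuation flag */
        val |= (uint64_t)(v & 0x7F) << shift; /* widen before shifting */
        shift += 7;
    } while (more && shift < 56);

    /* Overflow check done once, after the loop. */
    if (more || val > UINT_MAX) return -1;
    *out = (unsigned)val;
    return 0;
}
```

Casting the 7-bit payload to `uint64_t` before the shift is what keeps the larger shift amounts well defined; the range check against `UINT_MAX` then rejects values that do not fit the return type.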
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
8ec5ff0e -
c19484d8
-
ac5f7d0c
-
464ca6c2
-
be1fe18e
-
5895809e
-
Since 46d092ae the demuxer is no longer detected from extension but rather by probing.
9ad9b326 -
Meson does not yet normalise arm64 to the aarch64 listed in its reference table. To work around this, check the cpu field in addition to cpu_family.
a7b92c76 -
%{:} macro operand ranges were broken in nasm 2.15, which causes errors when compiling, so avoid using those for now. Some new warnings regarding use of empty macro parameters have also been added; adjust some x86inc code to silence those.
2b567aaa -
5fe20ec7
-
Martin Storsjö authored
8e004039
-
Martin Storsjö authored
a26882d2
-
Martin Storsjö authored
Before:                          Cortex A53     A72     A73
intra_pred_filter_w16_8bpc_neon:      540.2   573.8   580.2
intra_pred_filter_w32_8bpc_neon:     1223.1  1364.1  1292.9
After:
intra_pred_filter_w16_8bpc_neon:      531.4   559.8   565.4
intra_pred_filter_w32_8bpc_neon:     1243.0  1308.6  1270.9

This does give a minor slowdown for the w32 case on A53, but helps on w16 and quite notably in all cases on A72 and A73.

Doing the same modification on ipred16.S doesn't give quite as clear gains (the gains on A72 and A73 are smaller, and the regression on A53 on w32 is a bit bigger), so not doing the same adjustment there.
2e36a3be -
Martin Storsjö authored
It became unused in 38629906.
a903642a -
Martin Storsjö authored
The branch target is directly afterwards, so the branch isn't needed.
2e271c49 -
Martin Storsjö authored
Relative speedup over C code:
                          Cortex A53     A72     A73
cfl_ac_444_w4_8bpc_neon:        8.72    8.75   10.50
cfl_ac_444_w8_8bpc_neon:       13.10   10.77   11.23
cfl_ac_444_w16_8bpc_neon:      13.08    9.95   10.49
cfl_ac_444_w32_8bpc_neon:      12.58    9.43   10.63
9b40bb95 -
Martin Storsjö authored
Relative speedup over C code:
                           Cortex A53     A72     A73
cfl_ac_444_w4_16bpc_neon:        8.03    9.41   10.48
cfl_ac_444_w8_16bpc_neon:       10.17   10.54   10.38
cfl_ac_444_w16_16bpc_neon:      10.73   10.38    9.73
cfl_ac_444_w32_16bpc_neon:      10.18    9.43    9.77
72db6607 -
Martin Storsjö authored
This speeds things up a bit on older cores.

Also do a load that duplicates the input over the whole register instead of just loading a single lane in iprev_v_w4. This can be a bit faster on Cortex A8.

Before:                            Cortex A7      A8      A9     A53     A72     A73
intra_pred_v_w4_8bpc_neon:              54.0    38.4    46.4    47.7    20.4    18.1
intra_pred_h_w4_8bpc_neon:              66.3    43.1    55.0    57.0    27.9    22.2
intra_pred_h_w8_8bpc_neon:              81.0    60.2    76.7    66.5    31.1    30.1
intra_pred_dc_left_w4_8bpc_neon:        91.0    49.0    72.8    77.7    35.4    38.5
intra_pred_dc_left_w8_8bpc_neon:       103.8    73.5    90.2    84.7    42.8    47.1
intra_pred_dc_left_w16_8bpc_neon:      156.1   101.8   186.1   119.4    77.7    92.6
intra_pred_dc_left_w32_8bpc_neon:      270.5   200.5   381.6   191.7   152.6   170.3
intra_pred_dc_left_w64_8bpc_neon:      560.7   439.1   877.0   375.4   333.5   343.6
After:
intra_pred_v_w4_8bpc_neon:              53.9    38.0    46.4    47.7    19.8    19.2
intra_pred_h_w4_8bpc_neon:              66.5    39.2    52.6    57.0    27.7    22.2
intra_pred_h_w8_8bpc_neon:              80.5    55.8    72.9    66.5    31.4    30.1
intra_pred_dc_left_w4_8bpc_neon:        91.0    48.2    71.8    77.7    34.9    38.6
intra_pred_dc_left_w8_8bpc_neon:       103.8    69.6    89.2    84.7    43.2    47.3
intra_pred_dc_left_w16_8bpc_neon:      182.3    99.9   184.9   118.8    77.7    85.8
intra_pred_dc_left_w32_8bpc_neon:      355.4   198.9   380.1   190.6   152.9   161.0
intra_pred_dc_left_w64_8bpc_neon:      517.5   437.4   876.9   375.7   333.3   347.7
74d5cf57 -
Martin Storsjö authored
These came from matching some parts too closely to the arm64 version (where the summation can be done efficiently with uaddlv by zeroing the upper half of the register).

Before:                      Cortex A7      A8      A9     A53     A72     A73
intra_pred_dc_w4_8bpc_neon:      124.5    65.1    90.2   100.4    48.1    50.4
After:
intra_pred_dc_w4_8bpc_neon:      120.3    60.7    83.6    94.0    44.1    47.9
d00a0227 -
Martin Storsjö authored
This matches the arm64 original. The comment isn't about the condition, but about the state after the conditional branch.
f4a0127a -
Martin Storsjö authored
8fd0bc90
-
Martin Storsjö authored
Do the horizontal summing in the same way as for other cases of 32 pixel summing. This doesn't seem to affect the runtime significantly though (checkasm benchmarks vary by a couple cycles), but it's 5 instructions shorter at least.
b4291523 -
Martin Storsjö authored
This matches what is implemented for arm64 so far.

Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built with Clang):
                                    Cortex A7      A8      A9     A53     A72     A73
intra_pred_paeth_w4_8bpc_neon:           4.81    7.61    5.82    5.50    5.61    6.94
intra_pred_paeth_w8_8bpc_neon:           7.83   11.95    9.51   11.05    8.90   10.51
intra_pred_paeth_w16_8bpc_neon:          4.86    4.49    3.90    4.60    3.76    3.54
intra_pred_paeth_w32_8bpc_neon:          4.55    4.03    3.52    4.27    3.30    3.21
intra_pred_paeth_w64_8bpc_neon:          4.38    3.72    3.32    3.95    3.08    3.00
intra_pred_smooth_h_w4_8bpc_neon:        5.74   10.80    5.32    6.79    4.77    6.48
intra_pred_smooth_h_w8_8bpc_neon:       10.59   17.95    9.39   16.03    6.94    8.98
intra_pred_smooth_h_w16_8bpc_neon:       2.81    3.19    2.12    3.70    2.90    3.59
intra_pred_smooth_h_w32_8bpc_neon:       2.63    2.41    1.86    3.44    2.24    2.66
intra_pred_smooth_h_w64_8bpc_neon:       2.42    2.52    1.79    3.24    1.81    2.11
intra_pred_smooth_v_w4_8bpc_neon:        4.15    7.99    3.46    4.63    3.83    4.39
intra_pred_smooth_v_w8_8bpc_neon:        7.31   12.42    7.04   10.00    4.26    6.20
intra_pred_smooth_v_w16_8bpc_neon:       3.70    3.44    2.53    3.33    2.76    3.21
intra_pred_smooth_v_w32_8bpc_neon:       3.91    3.74    2.70    3.51    2.50    2.96
intra_pred_smooth_v_w64_8bpc_neon:       4.03    3.94    2.80    3.64    2.36    2.80
intra_pred_smooth_w4_8bpc_neon:          4.09    7.74    4.54    4.79    3.26    5.10
intra_pred_smooth_w8_8bpc_neon:          5.63    8.93    6.62    8.28    3.73    6.04
intra_pred_smooth_w16_8bpc_neon:         3.97    3.40    3.32    3.74    3.01    3.77
intra_pred_smooth_w32_8bpc_neon:         3.75    3.14    3.07    3.28    2.65    3.17
intra_pred_smooth_w64_8bpc_neon:         3.60    3.04    2.93    2.97    2.35    2.85
intra_pred_filter_w4_8bpc_neon:          5.54    6.43    4.90    7.26    3.44    4.61
intra_pred_filter_w8_8bpc_neon:          7.05    7.15    5.50   10.05    4.29    6.02
intra_pred_filter_w16_8bpc_neon:         7.36    6.46    5.27   11.51    4.75    6.70
intra_pred_filter_w32_8bpc_neon:         7.56    6.32    5.01   12.34    4.47    6.97
pal_pred_w4_8bpc_neon:                   5.47    7.76    4.40    5.20    8.32    7.03
pal_pred_w8_8bpc_neon:                  11.11   14.12    8.44   13.95   11.88   12.43
pal_pred_w16_8bpc_neon:                 14.38   20.95    9.84   17.43   14.77   13.56
pal_pred_w32_8bpc_neon:                 12.91   19.85   10.87   19.03   14.63   14.62
pal_pred_w64_8bpc_neon:                 14.01   19.23   10.82   19.82   16.23   16.32
cfl_ac_420_w4_8bpc_neon:                 8.11   13.41    7.92    9.26   10.55    9.36
cfl_ac_420_w8_8bpc_neon:                 7.77   15.71    7.69    8.94    9.76    8.56
cfl_ac_420_w16_8bpc_neon:                7.72   13.71    8.30    9.05    9.81    9.02
cfl_ac_422_w4_8bpc_neon:                 8.85   15.80    8.26   10.97   13.04   10.00
cfl_ac_422_w8_8bpc_neon:                 8.77   16.96    7.57   10.46   12.16    9.92
cfl_ac_422_w16_8bpc_neon:                8.28   14.91    7.16    9.69   10.57    9.18
cfl_ac_444_w4_8bpc_neon:                 7.47   14.13    7.50    9.76   11.11    9.39
cfl_ac_444_w8_8bpc_neon:                 6.81   15.46    5.27    9.11   12.09    9.76
cfl_ac_444_w16_8bpc_neon:                6.11   13.68    4.62    8.17   10.78    8.92
cfl_ac_444_w32_8bpc_neon:                5.71   12.11    4.28    7.53    9.53    8.52
cfl_pred_cfl_128_w4_8bpc_neon:           7.46   12.63    8.48    8.03    7.64    9.29
cfl_pred_cfl_128_w8_8bpc_neon:           5.05    5.16    3.79    4.64    5.07    4.42
cfl_pred_cfl_128_w16_8bpc_neon:          4.44    5.17    3.65    4.20    4.41    4.74
cfl_pred_cfl_128_w32_8bpc_neon:          4.51    5.25    3.67    4.29    4.39    4.73
cfl_pred_cfl_left_w4_8bpc_neon:          6.60   11.74    7.75    6.91    7.44    9.14
cfl_pred_cfl_left_w8_8bpc_neon:          4.92    5.15    3.80    4.41    5.44    4.81
cfl_pred_cfl_left_w16_8bpc_neon:         4.40    5.26    3.66    4.10    4.63    4.94
cfl_pred_cfl_left_w32_8bpc_neon:         4.50    5.31    3.68    4.25    4.43    4.82
cfl_pred_cfl_top_w4_8bpc_neon:           7.00   11.88    7.88    7.50    7.43    9.68
cfl_pred_cfl_top_w8_8bpc_neon:           4.96    5.07    3.78    4.51    5.31    4.75
cfl_pred_cfl_top_w16_8bpc_neon:          4.42    5.31    3.69    4.16    4.60    4.93
cfl_pred_cfl_top_w32_8bpc_neon:          4.52    5.36    3.71    4.29    4.47    4.83
cfl_pred_cfl_w4_8bpc_neon:               5.92   10.54    7.25    6.21    6.79    8.33
cfl_pred_cfl_w8_8bpc_neon:               4.67    5.16    3.77    4.14    5.20    4.71
cfl_pred_cfl_w16_8bpc_neon:              4.29    5.29    3.70    3.97    4.53    4.86
cfl_pred_cfl_w32_8bpc_neon:              4.47    5.34    3.72    4.20    4.42    4.83
8dd9c651 -
Jean-Baptiste Kempf authored
Removes files from top-level
f116e076 -
Janne Grunau authored
Fixes #345.
725f3768 -
Nico Weber authored
This matches the `.hidden` already used for ELF outputs. This is needed for Chromium's mac/arm64 build. Chromium has a build step that verifies that Chromium Framework only exports a small, fixed set of symbols. The dav1d symbols showed up unexpectedly. This fixes that.
ebc8e4d9 -
Nico Weber authored
This is a follow-up to ebc8e4d9. dav1d doesn't currently use this `const` macro, but rav1e does.
dfb22e57 -
Matthias Dressel authored
- Fix small typos
- Add link to doxygen documentation
- Add high bit-depth asm goals
1b9792f3 -
A bitstream may contain values larger than the currently defined entries, but it's technically UB to put such values into an enum. Discovered in Firefox through fuzzing with UBSan.
d69fc655 -
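One way to avoid the problem described in d69fc655 is to keep the raw bitstream value in a plain integer field and convert it to an enum only after a range check. The sketch below is an illustration under that assumption; the enum, struct, and function names are hypothetical and are not dav1d's types.

```c
#include <stdint.h>

/* Hypothetical enum for illustration only; not one of dav1d's types. */
enum TransferKind { TK_LINEAR = 0, TK_GAMMA = 1, TK_LOG = 2 };
#define TK_NB 3 /* number of currently defined entries */

struct ParsedHeader {
    /* Raw 8-bit field from the bitstream. Any value 0..255 may be stored
     * here safely, including ones with no matching enum entry. */
    uint8_t transfer_raw;
};

/* Returns nonzero if the raw value maps to a defined enum entry. */
static int transfer_is_known(const struct ParsedHeader *h) {
    return h->transfer_raw < TK_NB;
}

/* Converts to the enum only after validation; unknown values fall back
 * to a default instead of being forced into the enum type. */
static enum TransferKind transfer_or_default(const struct ParsedHeader *h) {
    return transfer_is_known(h) ? (enum TransferKind)h->transfer_raw
                                : TK_LINEAR;
}
```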
Matthias Dressel authored
1317e619
-
Marvin Scholz authored
f55cd4c6
-
Henrik Gramner authored
6cf58c8e
-
Victorien Le Couviour--Tuffet authored
652e5b38
-
Victorien Le Couviour--Tuffet authored
06f12a89
-
This gives a clearer indication about what is wrong, instead of running into illegal instruction errors in the individual tests. On ARM and AArch64, access to the cycle counter register is forbidden in user mode code by default on Linux and Darwin.
95a19254 -
The signal handler does a longjmp back to the location of declare_func when there's a signal. If declare_func is located within the check_func block, it will just end up in an endless loop, retrying running the failing tests again. On linux, after resuming from the signal handler, the second signal wouldn't trigger the signal handler but forcibly exit the process, while on darwin, it would get stuck in an endless loop. msac_decode_bool seems to be the only checkasm test with declare_func within the check_func block.
c3a12884 -
The cycle counter instructions aren't accessible on iOS/macOS on ARM. The mach_absolute_time() function has much coarser precision, but is the least bad option available.
7c4cbbf8 -
James Almer authored
Cosmetic change.
-
James Almer authored
The relevant structs are filled immediately after them.
-
Martin Storsjö authored
Move the declaration of func_ref/func_new into declare_func. This enforces that declare_func is a scope outside of/before check_func. This ensures that if the signal handler is triggered, we rewind to a scope outside of check_func, where check_func makes sure we don't rerun the test that just triggered the signal handler.
0b824944 -
Martin Storsjö authored
This fixes building in configurations where no readtime implementation is available at all, such as MSVC targeting 32 bit ARM. This was missed when the check was added in 95a19254.
5bbd9632 -
9435be18
-
4cd2f82d
-
Makes using unmodified upstream x86inc.asm possible.
9a2d1658 -
d0e50cac
-
Janne Grunau authored
acc92406
-
Janne Grunau authored
Also changes the type intptr_t to make adding variable size members more convenient.
89c57ce3 -
Janne Grunau authored
e2d22c01
-
Janne Grunau authored
6f3a8fb9
-
Janne Grunau authored
791c4697
-
Janne Grunau authored
Errors on C11 features like anonymous structs/unions.
1bcc5ecd -
Martin Storsjö authored
We can't compare the decoding speed with the intended decoding rate, but the frame rate alone is still useful.
f57189e3 -
Martin Storsjö authored
                        Cortex A7       A8       A9      A53      A72      A73
avg_w4_16bpc_neon:          131.4     81.8    117.3    111.0     50.9     58.8
avg_w8_16bpc_neon:          291.9    173.1    293.1    230.9    114.7    128.8
avg_w16_16bpc_neon:         803.3    480.1    821.4    645.8    345.7    384.9
avg_w32_16bpc_neon:        3350.0   1833.1   3188.1   2343.5   1343.9   1500.6
avg_w64_16bpc_neon:        8185.9   4390.6  10448.2   6078.8   3303.6   3466.7
avg_w128_16bpc_neon:      22384.3  10901.2  33721.9  16782.7   8165.1   8416.5
w_avg_w4_16bpc_neon:        251.3    165.8    203.9    158.3     99.6    106.9
w_avg_w8_16bpc_neon:        638.4    427.8    555.7    365.1    283.2    277.4
w_avg_w16_16bpc_neon:      1912.3   1257.5   1623.4   1056.5    879.5    841.8
w_avg_w32_16bpc_neon:      7461.3   4889.6   6383.8   3966.3   3286.8   3296.8
w_avg_w64_16bpc_neon:     18689.3  11698.1  18487.3  10134.1   8156.2   7939.5
w_avg_w128_16bpc_neon:    48776.6  28989.0  53203.3  26004.1  20055.2  20049.4
mask_w4_16bpc_neon:         298.6    189.2    242.3    191.6    115.2    129.6
mask_w8_16bpc_neon:         768.6    501.5    646.1    432.4    302.9    326.8
mask_w16_16bpc_neon:       2320.5   1480.9   1873.0   1270.2    932.2    976.1
mask_w32_16bpc_neon:       9412.0   5791.9   7348.5   4875.1   3896.4   3821.1
mask_w64_16bpc_neon:      23385.9  13875.6  21383.8  12235.9   9469.2   9160.2
mask_w128_16bpc_neon:     60466.4  34762.6  61055.9  31214.0  23299.0  23324.5

For comparison, the corresponding numbers for the existing arm64 implementation:
avg_w4_16bpc_neon:           78.0     38.5     50.0
avg_w8_16bpc_neon:          198.3    105.4    117.8
avg_w16_16bpc_neon:         614.9    339.9    376.7
avg_w32_16bpc_neon:        2313.8   1391.1   1487.7
avg_w64_16bpc_neon:        5733.3   3269.1   3648.4
avg_w128_16bpc_neon:      15105.9   8143.5   8970.4
w_avg_w4_16bpc_neon:        119.2     87.7     92.9
w_avg_w8_16bpc_neon:        322.9    252.3    263.5
w_avg_w16_16bpc_neon:      1016.8    794.0    828.6
w_avg_w32_16bpc_neon:      3910.9   3159.6   3308.3
w_avg_w64_16bpc_neon:      9499.6   7933.9   8026.5
w_avg_w128_16bpc_neon:    24508.3  19502.0  20389.8
mask_w4_16bpc_neon:         138.9     98.7    106.7
mask_w8_16bpc_neon:         375.5    301.1    302.7
mask_w16_16bpc_neon:       1217.2   1064.6    954.4
mask_w32_16bpc_neon:       4821.0   4018.4   3825.7
mask_w64_16bpc_neon:      12262.7   9471.3   9169.7
mask_w128_16bpc_neon:     31356.6  22657.6  23324.5
80aa7823 -
This adds A<W>:<H> to the Y4M header, to preserve the intended aspect ratio for anamorphic video.
484d6595 -
The previous floating-point implementation produced results that were sometimes slightly off due to rounding errors. For example, a frame size of 432x240 with a render size of 176x240 previously resulted in a PAR of 98:240 instead of the correct 11:27. Also reduce fractions to produce more readable numbers.
3bfe8c7c -
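The integer approach described in 3bfe8c7c can be sketched as follows. The formula PAR = (render_w * frame_h) : (render_h * frame_w) is an assumption that is consistent with the 432x240 / 176x240 example in the commit message; the function names are hypothetical.

```c
#include <stdint.h>

/* Greatest common divisor by the Euclidean algorithm. */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b) {
        const uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: computes the pixel aspect ratio as an exact,
 * reduced fraction num:den using only integer arithmetic. */
static void pixel_aspect_ratio(unsigned frame_w, unsigned frame_h,
                               unsigned render_w, unsigned render_h,
                               uint64_t *num, uint64_t *den)
{
    /* Widen to 64 bits so the products cannot overflow. */
    uint64_t n = (uint64_t)render_w * frame_h;
    uint64_t d = (uint64_t)render_h * frame_w;
    const uint64_t g = gcd_u64(n, d);
    if (g) { /* g is 0 only if both products are 0 */
        n /= g;
        d /= g;
    }
    *num = n;
    *den = d;
}
```

With a frame size of 432x240 and a render size of 176x240, the products are 42240 and 103680, which reduce to 11:27.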
Martin Storsjö authored
For loads of a half/full register, the actual size of the elements doesn't matter, but it makes the code more readable and understandable.
13fad75d -
Martin Storsjö authored
The previous form was a leftover from how it had to be written on aarch64.
ea7e13e7 -
Martin Storsjö authored
458273ed
-
Martin Storsjö authored
This matches how the same logic is written for w4 and above.
65a1aafd -
Martin Storsjö authored
Narrowing the intermediates from the horizontal pass is beneficial (on most cores, but a small slowdown on A53) here as well. This increases consistency in the code between the cases.

(The corresponding change in the upcoming arm32 version is beneficial on all tested cores except for on A53 - it helps, on some cores a lot, on A7, A8, A9, A72, A73 and only makes it marginally slower on A53.)

Before:                            Cortex A53     A72     A73
mc_8tap_regular_w2_hv_16bpc_neon:       457.7   301.0   317.1
After:
mc_8tap_regular_w2_hv_16bpc_neon:       472.0   276.0   284.3
4ae3f5f7 -
Martin Storsjö authored
Examples of checkasm benchmarks:
                                    Cortex A7      A8      A9     A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:        158.7   106.2   167.0   127.9    55.0    77.2
mc_8tap_regular_w8_h_16bpc_neon:       1000.8   557.5   749.2   609.2   401.4   485.4
mc_8tap_regular_w8_hv_16bpc_neon:      2278.9  1255.4  1352.5  1277.2   867.8   915.9
mc_8tap_regular_w8_v_16bpc_neon:       1060.0   393.6   485.5   448.3   298.0   298.2
mc_bilinear_w8_0_16bpc_neon:            159.7    96.6   161.1   123.7    55.4    74.7
mc_bilinear_w8_h_16bpc_neon:            342.3   250.8   352.9   239.0   158.4   165.1
mc_bilinear_w8_hv_16bpc_neon:           587.7   373.8   469.0   339.8   244.4   247.5
mc_bilinear_w8_v_16bpc_neon:            285.8   189.3   284.9   180.4   103.4   100.9
mct_8tap_regular_w8_0_16bpc_neon:       233.0   136.6   229.3   169.3    86.2    98.3
mct_8tap_regular_w8_h_16bpc_neon:      1106.8   588.3   817.9   654.1   406.4   489.8
mct_8tap_regular_w8_hv_16bpc_neon:     2473.3  1326.3  1428.2  1373.7   903.3   951.1
mct_8tap_regular_w8_v_16bpc_neon:      1266.0   474.1   581.3   505.9   382.0   373.4
mct_bilinear_w8_0_16bpc_neon:           232.9   126.2   225.0   166.3    86.2    91.7
mct_bilinear_w8_h_16bpc_neon:           380.6   270.6   386.0   259.7   154.1   151.9
mct_bilinear_w8_hv_16bpc_neon:          631.4   409.2   509.4   372.1   243.1   244.1
mct_bilinear_w8_v_16bpc_neon:           349.5   233.5   347.9   212.4   138.7   138.4

For comparison, the corresponding numbers for the existing arm64 implementation:
                                   Cortex A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:         94.1    48.9    62.3
mc_8tap_regular_w8_h_16bpc_neon:        570.4   388.1   467.3
mc_8tap_regular_w8_hv_16bpc_neon:      1035.8   775.0   891.2
mc_8tap_regular_w8_v_16bpc_neon:        399.8   284.5   278.2
mc_bilinear_w8_0_16bpc_neon:             90.0    44.3    57.4
mc_bilinear_w8_h_16bpc_neon:            191.7   158.7   156.4
mc_bilinear_w8_hv_16bpc_neon:           295.6   235.0   244.9
mc_bilinear_w8_v_16bpc_neon:            147.2    99.0    88.8
mct_8tap_regular_w8_0_16bpc_neon:       139.4    78.4    84.9
mct_8tap_regular_w8_h_16bpc_neon:       612.3   395.9   478.6
mct_8tap_regular_w8_hv_16bpc_neon:     1113.0   804.3   963.5
mct_8tap_regular_w8_v_16bpc_neon:       462.1   370.8   353.3
mct_bilinear_w8_0_16bpc_neon:           135.6    77.0    80.5
mct_bilinear_w8_h_16bpc_neon:           210.8   159.2   141.7
mct_bilinear_w8_hv_16bpc_neon:          325.7   238.4   227.3
mct_bilinear_w8_v_16bpc_neon:           180.7   136.7   129.5
856662b4 -
8c2a8976
-
Wan-Teh Chang authored
If c->operating_point_idc is nonzero and either bits 0-7 or bits 8-11 in it are all 0s, it will cause dav1d_parse_obus() to drop all layer-specific OBUs. Prohibit any op->idc with such properties because it could be selected as c->operating_point_idc.
50e876c6 -
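The condition described in 50e876c6 can be written as a small predicate. This is a sketch only: the function name is hypothetical, and it checks just the bit ranges named in the commit message (bits 0-7 and bits 8-11).

```c
/* Hypothetical predicate: returns nonzero if idc is acceptable.
 * An idc of 0 is allowed; a nonzero idc must have at least one bit set
 * in bits 0-7 and at least one bit set in bits 8-11, since otherwise
 * all layer-specific OBUs would be dropped. */
static int operating_point_idc_is_usable(unsigned idc) {
    if (!idc) return 1;
    const unsigned low8 = idc & 0xFF;       /* bits 0-7  */
    const unsigned mid4 = (idc >> 8) & 0xF; /* bits 8-11 */
    return low8 != 0 && mid4 != 0;
}
```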
5173de30
-
Janne Grunau authored
Fixes #350.
d85fdf52 -
Martin Storsjö authored
Don't pass the .S assembly sources as C source files in this case, as e.g. MSVC doesn't support them (and meson knows it doesn't, so it refuses to proceed with an MSVC/gas-preprocessor wrapper script, as meson detects it as MSVC - unless meson is hacked to allow passing .S files to MSVC).

This allows building dav1d with MSVC for ARM targets without hacks to meson. (Building in a pure MSVC setup with no other compilers available does require a few new patches to gas-preprocessor though.)

This has been postponed for quite some time, as compiling with MSVC for non-x86 targets in meson has been problematic: meson used to require a working compiler for the build system as well, the MSVC compiler for every target is named cl.exe, and you can't have the one for the cross target and the one for the build machine first in the path at the same time. This was recently fixed though, see https://github.com/mesonbuild/meson/issues/4402 and https://github.com/mesonbuild/meson/pull/6512.

This matches how gas-preprocessor is hooked up for e.g. OpenH264 in https://github.com/cisco/openh264/commit/013c4566a219a1f0fd50a8186f2b11fd8c3efcfb.
d68a2fc1 -
This avoids lots of warnings about unsupported warning options.
a5e45517 -
Makes !1078 redundant.
f90ada0d -
Martin Storsjö authored
77b3b25c
-
Martin Storsjö authored
c3c4e3ab
-
Martin Storsjö authored
8486bffe
-
Martin Storsjö authored
The vext.8 instructions only need to produce a single d register each, making more registers available as scratch space, allowing to hide latencies more, and group the vmul/vmla in the form that is beneficial for in-order cores (with a special forwarding path for such patterns).
41f59b02 -
Martin Storsjö authored
911942ca
-
Martin Storsjö authored
Before:                    Cortex A53        A72        A73
wiener_chroma_10bpc_neon:    177063.6   129197.3   127987.9
wiener_chroma_12bpc_neon:    177034.4   129206.8   128409.5
wiener_luma_10bpc_neon:      177072.6   129198.1   127931.8
wiener_luma_12bpc_neon:      177052.4   129196.0   127955.2
After:
wiener_chroma_10bpc_neon:    176319.7   125992.1   128162.4
wiener_chroma_12bpc_neon:    176386.2   125986.4   128343.8
wiener_luma_10bpc_neon:      176174.0   126001.7   128227.8
wiener_luma_12bpc_neon:      176176.5   125992.1   128204.8

This gives a small speedup on A53, a bit larger one on A72 and little change (mostly noise?) on A73.
7ebcb777 -
Martin Storsjö authored
Checkasm benchmarks:
                            Cortex A7         A8        A53        A72        A73
wiener_chroma_10bpc_neon:    385312.5   165772.7   184308.2   122311.2   126050.2
wiener_chroma_12bpc_neon:    385296.7   165538.0   184438.2   122290.5   126205.3
wiener_luma_10bpc_neon:      385318.5   165985.3   184147.4   122311.1   126168.4
wiener_luma_12bpc_neon:      385316.3   165819.1   184484.7   122304.4   125982.4

The corresponding numbers for arm64 for comparison:
                           Cortex A53        A72        A73
wiener_chroma_10bpc_neon:    176319.7   125992.1   128162.4
wiener_chroma_12bpc_neon:    176386.2   125986.4   128343.8
wiener_luma_10bpc_neon:      176174.0   126001.7   128227.8
wiener_luma_12bpc_neon:      176176.5   125992.1   128204.8

The arm32 version actually seems to run marginally faster than the arm64 one on A72 and A73. I believe this is because the arm64 code is tuned for A53 (which makes it a bit slower on other cores), but the arm32 code can't be tuned exactly the same way due to fewer registers being available.
2c09aaa4 -
ac1cb28d
-
0243c3ff
-
Luc Trudeau authored
Prints out values and offsets for content light level and mastering display color volume
a902d6e3 -
Luc Trudeau authored
long is 32 bits on Win64, so %ld is replaced with %td. For SEQHDR, %ld was used but the actual value is a 32-bit unsigned, so %u is enough.
901704e8 -
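The format-specifier choice in 901704e8 can be illustrated with a short sketch, assuming the offsets being printed are pointer differences (`ptrdiff_t`). `format_offset` and its parameters are hypothetical names, not part of dav1d.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a byte offset and a 32-bit value.
 * %td matches ptrdiff_t on every platform, whereas %ld would be wrong
 * where long is narrower than a pointer difference (e.g. Win64). */
static int format_offset(char *dst, size_t n,
                         const uint8_t *base, const uint8_t *cur,
                         uint32_t value)
{
    const ptrdiff_t off = cur - base;
    /* The cast makes the argument type match %u exactly. */
    return snprintf(dst, n, "off=%td val=%u", off, (unsigned)value);
}
```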
Victorien Le Couviour--Tuffet authored
This could cause a frame waiting on the current one to not be notified on error. Fixes #351.
a40d3b5f -
oddstone authored
The first index to task_idx_to_sby_and_tile_idx is task_idx not tile_idx
ffd052bd -
Luc Trudeau authored
bcebc7bd
-
Martin Storsjö authored
Since 3e6fbde94c1cb8d4e01b7daf0282c315ff0e6c7d in meson (past the 0.56 release), the b_lto option was changed from a bool to a tristate option (false/true/thin). One could just compare the b_lto option against 'false', but that causes warnings on older meson versions (on all existing releases).
920079ed -
Reuse buffers allocated for picture data instead of constantly freeing and allocating new ones. The impact of this can vary significantly between different systems, in particular it's highly beneficial on Windows where it can result in an overall performance increase of up to 10% in some cases.
9057d286 -
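The buffer reuse described in 9057d286 can be sketched as a free list. This is a deliberately simplified, single-threaded illustration with hypothetical names; it is not dav1d's implementation and ignores the locking and SIMD alignment a real decoder would need.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every pooled buffer (hypothetical layout). */
typedef struct PoolBuf {
    struct PoolBuf *next; /* next free buffer in the pool */
    size_t size;          /* usable bytes following this header */
} PoolBuf;

typedef struct {
    PoolBuf *free_list;   /* buffers returned by pool_put() */
} Pool;

/* Returns a buffer of at least `size` bytes, reusing a pooled one when
 * the buffer at the head of the free list is large enough. */
static void *pool_get(Pool *p, size_t size) {
    PoolBuf *b = p->free_list;
    if (b) {
        p->free_list = b->next;
        if (b->size >= size) return b + 1; /* reuse, no allocation */
        free(b);                           /* too small: discard it */
    }
    b = malloc(sizeof(*b) + size);
    if (!b) return NULL;
    b->size = size;
    return b + 1; /* memory right after the header */
}

/* Hands a buffer back to the pool instead of freeing it. */
static void pool_put(Pool *p, void *data) {
    PoolBuf *b = (PoolBuf *)data - 1;
    b->next = p->free_list;
    p->free_list = b;
}

/* Frees every buffer still held by the pool. */
static void pool_drain(Pool *p) {
    while (p->free_list) {
        PoolBuf *b = p->free_list;
        p->free_list = b->next;
        free(b);
    }
}
```

Returning a buffer with `pool_put` and requesting one of the same or smaller size afterwards yields the same memory again, which is the allocation round trip the commit aims to avoid.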
Luc Trudeau authored
Makes the C code more like the ASM
e413c8ed -
3fdf468e
-
Matthias Dressel authored
ba875b96
-
Martin Storsjö authored
c48ea15f
-
Martin Storsjö authored
e41a2a1f
-
Martin Storsjö authored
Use a shared template file for assembly functions that can be templated into 8 and 16 bpc forms, just like in the arm64 version.

Checkasm benchmarks:
                             Cortex A7      A8     A53     A72     A73
cdef_dir_16bpc_neon:             975.9   853.2   555.2   378.7   386.9
cdef_filter_4x4_16bpc_neon:      746.9   521.7   481.2   333.0   340.8
cdef_filter_4x8_16bpc_neon:     1300.0   885.5   816.3   582.7   599.5
cdef_filter_8x8_16bpc_neon:     2282.5  1415.0  1417.6  1059.0  1076.3

Corresponding numbers for arm64, for comparison:
                            Cortex A53     A72     A73
cdef_dir_16bpc_neon:             418.0   306.7   310.7
cdef_filter_4x4_16bpc_neon:      453.4   282.9   297.4
cdef_filter_4x8_16bpc_neon:      807.5   514.2   533.8
cdef_filter_8x8_16bpc_neon:     1425.2   924.4   942.0
018e64e7 -
Martin Storsjö authored
Checkasm benchmarks:
                       Cortex A7      A8     A53     A72     A73
warp_8x8_16bpc_neon:      4062.6  2109.4  2462.0  1338.9  1391.1
warp_8x8t_16bpc_neon:     3996.3  2102.4  2412.0  1273.8  1368.9

Corresponding numbers for arm64, for comparison:
                      Cortex A53     A72     A73
warp_8x8_16bpc_neon:      2037.0  1148.8  1222.0
warp_8x8t_16bpc_neon:     2008.0  1120.4  1200.9
dc98fff8 -
Add buffer pools for miscellaneous smaller buffers that are repeatedly being freed and reallocated. Also improve dav1d_ref_create() by consolidating two separate memory allocations into a single one.
236e1122 -
Jean-Baptiste Kempf authored
File moved
src/arm/32/cdef16.S
0 → 100644
src/arm/32/cdef_tmpl.S
0 → 100644
src/arm/32/looprestoration16.S
0 → 100644
src/arm/32/mc16.S
0 → 100644