- Jul 20, 2020
-
-
Marvin Scholz authored
-
- Jul 13, 2020
-
-
Matthias Dressel authored
-
A bitstream may contain values larger than the currently defined entries, but it's technically UB to put such values into an enum. Discovered in Firefox through fuzzing with UBSan.
-
Matthias Dressel authored
- Fix small typos
- Add link to doxygen documentation
- Add high bit-depth asm goals
-
- Jul 10, 2020
-
-
Nico Weber authored
This is a follow-up to ebc8e4d9. dav1d doesn't currently use this `const` macro, but rav1e does.
-
Nico Weber authored
This matches the `.hidden` already used for ELF outputs. This is needed for Chromium's mac/arm64 build. Chromium has a build step that verifies that Chromium Framework only exports a small, fixed set of symbols. The dav1d symbols showed up unexpectedly. This fixes that.
-
- Jul 09, 2020
-
-
Janne Grunau authored
Fixes #345.
-
- Jul 04, 2020
-
-
Jean-Baptiste Kempf authored
Removes files from top-level
-
- Jul 02, 2020
-
-
Martin Storsjö authored
This matches what is implemented for arm64 so far. Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built with Clang):

                                   Cortex A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:         4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:         7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:        4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:        4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:        4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:      5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:     10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:     2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:     2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:     2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:      4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:      7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:     3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:     3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:     4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:        4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:        5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:       3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:       3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:       3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:        5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:        7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:       7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:       7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:                 5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:                11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:               14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:               12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:               14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:               8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:               7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:              7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:               8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:               8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:              8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:               7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:               6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:              6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:              5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:         7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:         5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:        4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:        4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:        6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:        4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:       4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:       4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:         7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:         4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:        4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:        4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:             5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:             4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:            4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:            4.47   5.34   3.72   4.20   4.42   4.83
-
Martin Storsjö authored
Do the horizontal summing in the same way as for the other cases of 32-pixel summing. This doesn't seem to affect the runtime significantly (checkasm benchmarks vary by a couple of cycles), but it's 5 instructions shorter at least.
-
Martin Storsjö authored
-
Martin Storsjö authored
This matches the arm64 original. The comment isn't about the condition, but about the state after the conditional branch.
-
Martin Storsjö authored
These came from matching some parts too closely to the arm64 version (where the summation can be done efficiently with uaddlv by zeroing the upper half of the register).

Before:                      Cortex A7     A8     A9    A53    A72    A73
intra_pred_dc_w4_8bpc_neon:       124.5   65.1   90.2  100.4   48.1   50.4
After:
intra_pred_dc_w4_8bpc_neon:       120.3   60.7   83.6   94.0   44.1   47.9
-
Martin Storsjö authored
This speeds things up a bit on older cores. Also do a load that duplicates the input over the whole register instead of just loading a single lane in ipred_v_w4; this can be a bit faster on Cortex A8.

Before:                            Cortex A7     A8     A9    A53    A72    A73
intra_pred_v_w4_8bpc_neon:              54.0   38.4   46.4   47.7   20.4   18.1
intra_pred_h_w4_8bpc_neon:              66.3   43.1   55.0   57.0   27.9   22.2
intra_pred_h_w8_8bpc_neon:              81.0   60.2   76.7   66.5   31.1   30.1
intra_pred_dc_left_w4_8bpc_neon:        91.0   49.0   72.8   77.7   35.4   38.5
intra_pred_dc_left_w8_8bpc_neon:       103.8   73.5   90.2   84.7   42.8   47.1
intra_pred_dc_left_w16_8bpc_neon:      156.1  101.8  186.1  119.4   77.7   92.6
intra_pred_dc_left_w32_8bpc_neon:      270.5  200.5  381.6  191.7  152.6  170.3
intra_pred_dc_left_w64_8bpc_neon:      560.7  439.1  877.0  375.4  333.5  343.6
After:
intra_pred_v_w4_8bpc_neon:              53.9   38.0   46.4   47.7   19.8   19.2
intra_pred_h_w4_8bpc_neon:              66.5   39.2   52.6   57.0   27.7   22.2
intra_pred_h_w8_8bpc_neon:              80.5   55.8   72.9   66.5   31.4   30.1
intra_pred_dc_left_w4_8bpc_neon:        91.0   48.2   71.8   77.7   34.9   38.6
intra_pred_dc_left_w8_8bpc_neon:       103.8   69.6   89.2   84.7   43.2   47.3
intra_pred_dc_left_w16_8bpc_neon:      182.3   99.9  184.9  118.8   77.7   85.8
intra_pred_dc_left_w32_8bpc_neon:      355.4  198.9  380.1  190.6  152.9  161.0
intra_pred_dc_left_w64_8bpc_neon:      517.5  437.4  876.9  375.7  333.3  347.7
-
Martin Storsjö authored
Relative speedup over C code:

                           Cortex A53    A72    A73
cfl_ac_444_w4_16bpc_neon:        8.03   9.41  10.48
cfl_ac_444_w8_16bpc_neon:       10.17  10.54  10.38
cfl_ac_444_w16_16bpc_neon:      10.73  10.38   9.73
cfl_ac_444_w32_16bpc_neon:      10.18   9.43   9.77
-
Martin Storsjö authored
Relative speedup over C code:

                          Cortex A53    A72    A73
cfl_ac_444_w4_8bpc_neon:        8.72   8.75  10.50
cfl_ac_444_w8_8bpc_neon:       13.10  10.77  11.23
cfl_ac_444_w16_8bpc_neon:      13.08   9.95  10.49
cfl_ac_444_w32_8bpc_neon:      12.58   9.43  10.63
-
Martin Storsjö authored
The branch target is directly afterwards, so the branch isn't needed.
-
Martin Storsjö authored
It became unused in 38629906.
-
Martin Storsjö authored
Before:                           Cortex A53     A72     A73
intra_pred_filter_w16_8bpc_neon:       540.2   573.8   580.2
intra_pred_filter_w32_8bpc_neon:      1223.1  1364.1  1292.9
After:
intra_pred_filter_w16_8bpc_neon:       531.4   559.8   565.4
intra_pred_filter_w32_8bpc_neon:      1243.0  1308.6  1270.9

This does give a minor slowdown for the w32 case on A53, but helps on w16 and quite notably in all cases on A72 and A73. Doing the same modification on ipred16.S doesn't give quite as clear gains (the gains on A72 and A73 are smaller, and the regression on A53 on w32 is a bit bigger), so the same adjustment isn't made there.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Jul 01, 2020
-
-
-
%{:} macro operand ranges were broken in nasm 2.15, which causes errors when compiling, so avoid using those for now. Some new warnings regarding the use of empty macro parameters have also been added; adjust some x86inc code to silence those.
-
- Jun 29, 2020
-
-
Meson does not yet normalise arm64 to aarch64 in the reference table. To work around this, check the cpu field in addition to cpu_family.
-
Since 46d092ae the demuxer is no longer detected from the file extension but rather by probing.
-
-
- Jun 25, 2020
-
-
- Jun 24, 2020
-
-
- Jun 23, 2020
-
-
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax however requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
-
Ronald S. Bultje authored
The shift amount can be up to 56, and left-shifting 32-bit integers by values >= 32 is undefined behaviour, so use 64-bit integers instead. Also slightly rewrite so we only call dav1d_get_bits() once for the combined more|bits value, and mask the relevant portions out instead of reading twice. Lastly, move the overflow check out of the loop (as suggested by @wtc). Fixes #341.
-
- Jun 21, 2020
-
-
-
-
dav1d_init_get_bits() initializes c->eof to 0, which implies c->ptr < c->ptr_end, or equivalently sz > 0.
-
- Jun 20, 2020
-
-
Jean-Baptiste Kempf authored
-
Luc Trudeau authored
-
- Jun 19, 2020
-
-
Some specific Haswell CPUs have a hardware bug where the popcnt instruction doesn't set the zero flag correctly, which causes the wrong branch to be taken. popcnt also has a 3-cycle latency on Intel CPUs, so branching on the input value instead of the output reduces the amount of time wasted going down the wrong code path in case of branch mispredictions.
-
Martin Storsjö authored
Only use this in the cases when NEON can be used unconditionally without runtime detection (when __ARM_NEON is defined). The speedup over the C code is very modest for the smaller functions (and the NEON version actually is a little slower than the C code on Cortex A7 for adapt4), but the speedup is around 2x for adapt16.

                                 Cortex A7     A8     A9    A53    A72    A73
msac_decode_bool_c:                   41.1   43.0   43.0   37.3   26.2   31.3
msac_decode_bool_neon:                40.2   42.0   37.2   32.8   19.9   25.5
msac_decode_bool_adapt_c:             65.1   70.4   58.5   54.3   33.2   40.8
msac_decode_bool_adapt_neon:          56.8   52.4   49.3   42.6   27.1   33.7
msac_decode_bool_equi_c:              36.9   37.2   42.8   32.6   22.7   42.3
msac_decode_bool_equi_neon:           34.9   35.1   36.4   29.7   19.5   36.4
msac_decode_symbol_adapt4_c:         114.2  139.0  111.6   99.9   65.5   83.5
msac_decode_symbol_adapt4_neon:      119.2  128.3   95.7   82.2   58.2   57.5
msac_decode_symbol_adapt8_c:         176.0  207.9  164.0  154.4   88.0  117.0
msac_decode_symbol_adapt8_neon:      128.3  130.3  110.7   85.1   59.9   61.4
msac_decode_symbol_adapt16_c:        292.1  320.5  256.4  246.4  129.1  173.3
msac_decode_symbol_adapt16_neon:     162.2  144.3  129.0  104.2   69.2   69.9

(Omitting msac_decode_hi_tok from the benchmark, as the "C" version measured there uses the NEON version of msac_decode_symbol_adapt4.)
-
- Jun 18, 2020
-
-
Martin Storsjö authored
The speedup (over the normal version, which just calls the existing assembly version of symbol_adapt4) is not very impressive on bigger cores, but looks decent on small cores. It's an improvement in any case.

                       Cortex A53    A72    A73
msac_decode_hi_tok_c:       175.7  136.2  138.1
msac_decode_hi_tok_neon:    146.8  129.4  125.9
-