- Sep 01, 2020
-
-
The previous floating-point implementation produced results that were sometimes slightly off due to rounding errors. For example, a frame size of 432x240 with a render size of 176x240 previously resulted in a PAR of 98:240 instead of the correct 11:27. Also reduce fractions to produce more readable numbers.
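A minimal sketch of an integer-only PAR computation, consistent with the example above (432x240 frame, 176x240 render size, PAR 11:27). The helper names `gcd_u64` and `pixel_aspect_ratio` are hypothetical and only illustrate the idea of cross-multiplying the dimensions and reducing the fraction with a GCD; this is not necessarily how the commit implements it.

```c
#include <stdint.h>

/* Euclid's algorithm; returns a when b is 0. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b) {
        const uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: computes the pixel aspect ratio as a reduced
 * fraction using integers only, so no rounding error can occur.
 * PAR = (render_w * frame_h) : (render_h * frame_w). */
static void pixel_aspect_ratio(uint32_t frame_w, uint32_t frame_h,
                               uint32_t render_w, uint32_t render_h,
                               uint64_t *par_num, uint64_t *par_den)
{
    /* Widen to 64 bits before multiplying to avoid overflow. */
    const uint64_t num = (uint64_t)render_w * frame_h;
    const uint64_t den = (uint64_t)render_h * frame_w;
    uint64_t g = gcd_u64(num, den);
    if (g == 0) /* both products are zero; avoid dividing by zero */
        g = 1;
    *par_num = num / g;
    *par_den = den / g;
}
```

For the frame and render sizes quoted above, the products are 42240 and 103680, whose GCD is 3840, which gives 11:27.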
-
- Aug 30, 2020
-
-
This adds A<W>:<H> to the Y4M header, to preserve the intended aspect ratio for anamorphic video.
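As an illustration, a Y4M stream header carrying the aspect field could be formatted as in the sketch below. The helper `format_y4m_header` is hypothetical, and the fixed `Ip` (progressive) and `C420jpeg` fields are assumptions made to keep the example short; the actual writer may emit different or additional fields.

```c
#include <stdio.h>

/* Hypothetical helper (not the project's writer): formats a Y4M stream
 * header for progressive 4:2:0 video, including the A<W>:<H> field.
 * Returns the snprintf() result: the header length, or a negative
 * value on error. */
static int format_y4m_header(char *buf, size_t size,
                             int width, int height,
                             int fps_num, int fps_den,
                             int par_num, int par_den)
{
    return snprintf(buf, size,
                    "YUV4MPEG2 W%d H%d F%d:%d Ip A%d:%d C420jpeg\n",
                    width, height, fps_num, fps_den, par_num, par_den);
}
```

A player that honors the `A` field can then scale anamorphic video back to its intended display shape.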
-
- Aug 29, 2020
-
-
Martin Storsjö authored
Cortex                         A7       A8       A9      A53      A72      A73
avg_w4_16bpc_neon:          131.4     81.8    117.3    111.0     50.9     58.8
avg_w8_16bpc_neon:          291.9    173.1    293.1    230.9    114.7    128.8
avg_w16_16bpc_neon:         803.3    480.1    821.4    645.8    345.7    384.9
avg_w32_16bpc_neon:        3350.0   1833.1   3188.1   2343.5   1343.9   1500.6
avg_w64_16bpc_neon:        8185.9   4390.6  10448.2   6078.8   3303.6   3466.7
avg_w128_16bpc_neon:      22384.3  10901.2  33721.9  16782.7   8165.1   8416.5
w_avg_w4_16bpc_neon:        251.3    165.8    203.9    158.3     99.6    106.9
w_avg_w8_16bpc_neon:        638.4    427.8    555.7    365.1    283.2    277.4
w_avg_w16_16bpc_neon:      1912.3   1257.5   1623.4   1056.5    879.5    841.8
w_avg_w32_16bpc_neon:      7461.3   4889.6   6383.8   3966.3   3286.8   3296.8
w_avg_w64_16bpc_neon:     18689.3  11698.1  18487.3  10134.1   8156.2   7939.5
w_avg_w128_16bpc_neon:    48776.6  28989.0  53203.3  26004.1  20055.2  20049.4
mask_w4_16bpc_neon:         298.6    189.2    242.3    191.6    115.2    129.6
mask_w8_16bpc_neon:         768.6    501.5    646.1    432.4    302.9    326.8
mask_w16_16bpc_neon:       2320.5   1480.9   1873.0   1270.2    932.2    976.1
mask_w32_16bpc_neon:       9412.0   5791.9   7348.5   4875.1   3896.4   3821.1
mask_w64_16bpc_neon:      23385.9  13875.6  21383.8  12235.9   9469.2   9160.2
mask_w128_16bpc_neon:     60466.4  34762.6  61055.9  31214.0  23299.0  23324.5

For comparison, the corresponding numbers for the existing arm64 implementation:

Cortex                        A53      A72      A73
avg_w4_16bpc_neon:           78.0     38.5     50.0
avg_w8_16bpc_neon:          198.3    105.4    117.8
avg_w16_16bpc_neon:         614.9    339.9    376.7
avg_w32_16bpc_neon:        2313.8   1391.1   1487.7
avg_w64_16bpc_neon:        5733.3   3269.1   3648.4
avg_w128_16bpc_neon:      15105.9   8143.5   8970.4
w_avg_w4_16bpc_neon:        119.2     87.7     92.9
w_avg_w8_16bpc_neon:        322.9    252.3    263.5
w_avg_w16_16bpc_neon:      1016.8    794.0    828.6
w_avg_w32_16bpc_neon:      3910.9   3159.6   3308.3
w_avg_w64_16bpc_neon:      9499.6   7933.9   8026.5
w_avg_w128_16bpc_neon:    24508.3  19502.0  20389.8
mask_w4_16bpc_neon:         138.9     98.7    106.7
mask_w8_16bpc_neon:         375.5    301.1    302.7
mask_w16_16bpc_neon:       1217.2   1064.6    954.4
mask_w32_16bpc_neon:       4821.0   4018.4   3825.7
mask_w64_16bpc_neon:      12262.7   9471.3   9169.7
mask_w128_16bpc_neon:     31356.6  22657.6  23324.5
-
- Aug 28, 2020
-
-
Martin Storsjö authored
We can't compare the decoding speed with the intended decoding rate, but the frame rate alone is still useful.
-
- Aug 22, 2020
-
-
Janne Grunau authored
Errors on C11 features like anonymous structs/unions.
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
Also changes the type to intptr_t to make adding variable size members more convenient.
-
- Aug 21, 2020
-
-
Janne Grunau authored
-
-
Makes using unmodified upstream x86inc.asm possible.
-
-
-
- Aug 07, 2020
-
-
Martin Storsjö authored
This fixes building in configurations where no readtime implementation is available at all, such as MSVC targeting 32-bit ARM. This was missed when the check was added in 95a19254.
-
Martin Storsjö authored
Move the declaration of func_ref/func_new into declare_func. This enforces that declare_func is a scope outside of/before check_func. This ensures that if the signal handler is triggered, we rewind to a scope outside of check_func, where check_func makes sure we don't rerun the test that just triggered the signal handler.
-
- Aug 06, 2020
-
-
James Almer authored
The relevant structs are filled immediately after them.
-
James Almer authored
Cosmetic change.
-
The cycle counter instructions aren't accessible on iOS/macOS on ARM. The mach_absolute_time() function has much coarser precision, but is the least bad option available.
-
The signal handler does a longjmp back to the location of declare_func when there's a signal. If declare_func is located within the check_func block, it will just end up in an endless loop, rerunning the failing test again and again. On Linux, after resuming from the signal handler, the second signal wouldn't trigger the signal handler but would forcibly exit the process, while on Darwin, it would get stuck in an endless loop. msac_decode_bool seems to be the only checkasm test with declare_func within the check_func block.
-
This gives a clearer indication about what is wrong, instead of running into illegal instruction errors in the individual tests. On ARM and AArch64, access to the cycle counter register is forbidden in user mode code by default on Linux and Darwin.
-
Victorien Le Couviour--Tuffet authored
-
- Aug 05, 2020
-
-
Victorien Le Couviour--Tuffet authored
-
- Jul 20, 2020
-
-
Henrik Gramner authored
-
Marvin Scholz authored
-
- Jul 13, 2020
-
-
Matthias Dressel authored
-
A bitstream may contain values larger than the currently defined entries, but it's technically UB to put such values into an enum. Discovered in Firefox through fuzzing with UBSan.
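One way to sidestep this class of problem, shown here only as a sketch and not necessarily the fix applied in this commit, is to keep the raw bitstream value in a plain integer and convert it to the enum only after checking it against the known entries. The enum and function names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical enum with gaps, standing in for a bitstream-defined
 * field where only some values are currently assigned. */
enum ColorPrimaries {
    COLOR_PRI_BT709       = 1,
    COLOR_PRI_UNSPECIFIED = 2,
    COLOR_PRI_BT470M      = 4
};

/* Converts a raw bitstream value to the enum. Unknown or reserved
 * values are mapped to COLOR_PRI_UNSPECIFIED, so the enum variable
 * never holds a value outside its defined entries. */
static enum ColorPrimaries color_primaries_from_bits(uint8_t raw)
{
    switch (raw) {
    case COLOR_PRI_BT709:
    case COLOR_PRI_BT470M:
        return (enum ColorPrimaries)raw;
    default:
        return COLOR_PRI_UNSPECIFIED;
    }
}
```

The trade-off is that the original reserved value is lost; a caller that needs to pass it through unchanged would have to store the raw integer alongside the enum, or the enum would need explicit reserved entries covering the full bitstream range.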
-
Matthias Dressel authored
- Fix small typos
- Add link to doxygen documentation
- Add high bit-depth asm goals
-
- Jul 10, 2020
-
-
Nico Weber authored
This is a follow-up to ebc8e4d9. dav1d doesn't currently use this `const` macro, but rav1e does.
-
Nico Weber authored
This matches the `.hidden` already used for ELF outputs. This is needed for Chromium's mac/arm64 build. Chromium has a build step that verifies that Chromium Framework only exports a small, fixed set of symbols. The dav1d symbols showed up unexpectedly. This fixes that.
-
- Jul 09, 2020
-
-
Janne Grunau authored
Fixes #345.
-
- Jul 04, 2020
-
-
Jean-Baptiste Kempf authored
Removes files from top-level
-
- Jul 02, 2020
-
-
Martin Storsjö authored
This matches what is implemented for arm64 so far. Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built with Clang):

Cortex                                  A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:        4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:        7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:       4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:       4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:       4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:     5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:    10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:    2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:    2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:    2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:     4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:     7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:    3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:    3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:    4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:       4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:       5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:      3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:      3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:      3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:       5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:       7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:      7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:      7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:                5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:               11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:              14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:              12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:              14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:              8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:              7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:             7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:              8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:              8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:             8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:              7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:              6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:             6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:             5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:        7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:        5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:       4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:       4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:       6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:       4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:      4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:      4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:        7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:        4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:       4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:       4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:            5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:            4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:           4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:           4.47   5.34   3.72   4.20   4.42   4.83
-
Martin Storsjö authored
Do the horizontal summing in the same way as for other cases of 32 pixel summing. This doesn't seem to affect the runtime significantly though (checkasm benchmarks vary by a couple cycles), but it's 5 instructions shorter at least.
-
Martin Storsjö authored
-
Martin Storsjö authored
This matches the arm64 original. The comment isn't about the condition, but about the state after the conditional branch.
-
Martin Storsjö authored
These came from matching some parts too closely to the arm64 version (where the summation can be done efficiently with uaddlv by zeroing the upper half of the register).

Before:
Cortex                            A7      A8      A9     A53     A72     A73
intra_pred_dc_w4_8bpc_neon:    124.5    65.1    90.2   100.4    48.1    50.4
After:
intra_pred_dc_w4_8bpc_neon:    120.3    60.7    83.6    94.0    44.1    47.9
-
Martin Storsjö authored
This speeds things up a bit on older cores. Also do a load that duplicates the input over the whole register instead of just loading a single lane in iprev_v_w4. This can be a bit faster on Cortex A8.

Before:
Cortex                                  A7      A8      A9     A53     A72     A73
intra_pred_v_w4_8bpc_neon:            54.0    38.4    46.4    47.7    20.4    18.1
intra_pred_h_w4_8bpc_neon:            66.3    43.1    55.0    57.0    27.9    22.2
intra_pred_h_w8_8bpc_neon:            81.0    60.2    76.7    66.5    31.1    30.1
intra_pred_dc_left_w4_8bpc_neon:      91.0    49.0    72.8    77.7    35.4    38.5
intra_pred_dc_left_w8_8bpc_neon:     103.8    73.5    90.2    84.7    42.8    47.1
intra_pred_dc_left_w16_8bpc_neon:    156.1   101.8   186.1   119.4    77.7    92.6
intra_pred_dc_left_w32_8bpc_neon:    270.5   200.5   381.6   191.7   152.6   170.3
intra_pred_dc_left_w64_8bpc_neon:    560.7   439.1   877.0   375.4   333.5   343.6
After:
intra_pred_v_w4_8bpc_neon:            53.9    38.0    46.4    47.7    19.8    19.2
intra_pred_h_w4_8bpc_neon:            66.5    39.2    52.6    57.0    27.7    22.2
intra_pred_h_w8_8bpc_neon:            80.5    55.8    72.9    66.5    31.4    30.1
intra_pred_dc_left_w4_8bpc_neon:      91.0    48.2    71.8    77.7    34.9    38.6
intra_pred_dc_left_w8_8bpc_neon:     103.8    69.6    89.2    84.7    43.2    47.3
intra_pred_dc_left_w16_8bpc_neon:    182.3    99.9   184.9   118.8    77.7    85.8
intra_pred_dc_left_w32_8bpc_neon:    355.4   198.9   380.1   190.6   152.9   161.0
intra_pred_dc_left_w64_8bpc_neon:    517.5   437.4   876.9   375.7   333.3   347.7
-
Martin Storsjö authored
Relative speedup over C code:

Cortex                         A53    A72    A73
cfl_ac_444_w4_16bpc_neon:     8.03   9.41  10.48
cfl_ac_444_w8_16bpc_neon:    10.17  10.54  10.38
cfl_ac_444_w16_16bpc_neon:   10.73  10.38   9.73
cfl_ac_444_w32_16bpc_neon:   10.18   9.43   9.77
-
Martin Storsjö authored
Relative speedup over C code:

Cortex                        A53    A72    A73
cfl_ac_444_w4_8bpc_neon:     8.72   8.75  10.50
cfl_ac_444_w8_8bpc_neon:    13.10  10.77  11.23
cfl_ac_444_w16_8bpc_neon:   13.08   9.95  10.49
cfl_ac_444_w32_8bpc_neon:   12.58   9.43  10.63
-