- 31 Jan, 2023 2 commits
-
-
Martin Storsjö authored
Add an option for selecting the core where the single thread of checkasm runs. This allows benchmarking on specific CPU cores on heterogeneous CPUs, like ARM big.LITTLE configurations.

On Linux, one can easily wrap an invocation of checkasm with "taskset -c <n> [...]", so this option isn't essential there; however, it is quite useful on Windows.

On Windows, it is somewhat possible to do the same by launching the tool with "start /B /affinity <hexmask> [...]", but that doesn't work well with scripting ("start" returns before the command has finished running, and it's not obvious how to invoke "start" from within WSL).

Using "taskset" to launch processes on specific cores within WSL on Windows doesn't work: regardless of the Linux-level affinity, the process ends up running on the performance cores anyway.
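As a rough illustration of what pinning a thread to one core involves, here is a minimal sketch. It is not the checkasm implementation; the names `example_core_mask` and `example_pin_current_thread` are hypothetical. The mask helper produces the same kind of value that "start /affinity" takes as `<hexmask>`, and the Windows branch uses the Win32 call `SetThreadAffinityMask`.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: a bit mask with only bit `core` set. Written in
 * hexadecimal, this is the kind of value "start /affinity" expects. */
uint64_t example_core_mask(unsigned core)
{
    assert(core < 64); /* a 64-bit mask can describe at most 64 cores */
    return UINT64_C(1) << core;
}

/* Hypothetical helper: pin the calling thread to `core`.
 * Returns 0 on success, -1 on failure or on platforms not handled here. */
int example_pin_current_thread(unsigned core)
{
#ifdef _WIN32
    /* DWORD_PTR is only 32 bits wide on 32-bit Windows. */
    DWORD_PTR mask = (DWORD_PTR)example_core_mask(core);
    return SetThreadAffinityMask(GetCurrentThread(), mask) ? 0 : -1;
#else
    (void)core; /* on Linux, wrapping the run in "taskset -c <n>" works */
    return -1;
#endif
}
```

For example, core 5 corresponds to the mask 0x20.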
-
Martin Storsjö authored
The implementation is a hybrid between two approaches. One is generic (but non-ideal), for cases with large max_base_y; it fills two pixel columns at a time, i.e. it loops over pixels first vertically, then horizontally, which is a non-optimal access order. For cases with smaller max_base_y, it does two rows at a time, essentially doing gathers with the TBX instruction.

Relative speedup over the C code:
                              Cortex A53     A55     A72     A73     A76   Apple M1
intra_pred_z3_w4_8bpc_neon:         3.32    2.89    2.78    3.52    2.52       9.67
intra_pred_z3_w8_8bpc_neon:         6.24    5.55    4.76    5.60    4.11       6.40
intra_pred_z3_w16_8bpc_neon:        7.64    7.07    4.37    6.23    4.18       8.60
intra_pred_z3_w32_8bpc_neon:        7.51    7.21    4.34    5.92    4.27       7.88
intra_pred_z3_w64_8bpc_neon:        6.82    6.25    4.08    5.83    3.52       7.31
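To make "gathers with the TBX instruction" concrete, the following is a scalar C model of what a 16-lane TBX does. It is only a sketch of the instruction's semantics, not the NEON code from this commit, and `example_tbx16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-lane TBX: a lane whose index falls inside the table
 * is replaced by table[index]; a lane with an out-of-range index keeps its
 * previous value in dst (TBL would write 0 to such a lane instead). */
void example_tbx16(uint8_t dst[16], const uint8_t *table, size_t table_len,
                   const uint8_t idx[16])
{
    assert(dst && table && idx);
    for (int i = 0; i < 16; i++) {
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
    }
}
```

Because out-of-range lanes are left untouched, a destination pre-filled with a padding value keeps that padding wherever the index runs past the end of the table.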
-
- 27 Jan, 2023 2 commits
-
-
Martin Storsjö authored
Relative speedup over the C code:
                              Cortex A53     A55     A72     A73     A76   Apple M1
intra_pred_z1_w4_8bpc_neon:         4.09    3.15    3.63    4.16    3.27      13.00
intra_pred_z1_w8_8bpc_neon:         6.93    5.66    5.57    6.76    5.51       5.50
intra_pred_z1_w16_8bpc_neon:        7.81    6.85    6.24    7.78    6.59       9.00
intra_pred_z1_w32_8bpc_neon:       10.56    9.95    8.72   10.95    8.28      13.33
intra_pred_z1_w64_8bpc_neon:       11.00   11.38    9.11   11.62    8.65      14.61

(The speedup numbers for M1 are kinda noisy due to the very coarse granularity of the timer used there.)
-
Martin Storsjö authored
These functions contain a number of different codepaths; try to make sure that we hit most codepaths for each size combination. This gives better test coverage in a single run of checkasm, and should also give a better averaged runtime in benchmarks.
-
- 26 Jan, 2023 2 commits
-
-
Henrik Gramner authored
-
Victorien Le Couviour--Tuffet authored
Fixes #416.
-
- 12 Jan, 2023 1 commit
-
-
Henrik Gramner authored
The intent was good, but in practice it results in a significant number of problems due to various compiler bugs, for negligible gains.
-
- 14 Dec, 2022 4 commits
-
-
James Almer authored
Should be useful for scenarios like wanting only keyframes to quickly generate a set of preview images of the whole stream.
-
James Almer authored
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- 13 Dec, 2022 3 commits
-
-
Henrik Gramner authored
bits_left could underflow after reaching EOB. Credit to OSS-Fuzz.
-
Henrik Gramner authored
-
-
- 09 Dec, 2022 5 commits
-
-
Henrik Gramner authored
-
Henrik Gramner authored
A length of 1 is by far the most common case, and having a special case for that is not only slightly faster but also reduces code size by a decent amount due to not having to pass a length argument every time.
-
Henrik Gramner authored
The Dav1dSequenceHeader struct is already zero-initialized, so zeroing individual values a second time is redundant.
-
Henrik Gramner authored
According to section 6.4.1 of the AV1 specification, the value should be equal to BUFFER_POOL_MAX_SIZE (10) when not explicitly signaled.
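The "default when not explicitly signaled" rule can be sketched as follows. This is only an illustration: the helper name and its parameters are hypothetical, and the snippet does not name the actual syntax element or the dav1d field involved.

```c
/* Mirrors BUFFER_POOL_MAX_SIZE (10) from the AV1 specification. */
#define EXAMPLE_BUFFER_POOL_MAX_SIZE 10

/* Hypothetical helper: use the coded value when the bitstream signals it,
 * otherwise fall back to the default described in the commit message. */
int example_value_or_default(int is_signaled, int coded_value)
{
    return is_signaled ? coded_value : EXAMPLE_BUFFER_POOL_MAX_SIZE;
}
```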
-
James Almer authored
Fixes segfaults if you run the CLI with an invalid argument for --inloopfilters
-
- 04 Dec, 2022 1 commit
-
-
Luca Barbato authored
Fixes: #412
-
- 21 Nov, 2022 2 commits
-
-
Luca Barbato authored
It mirrors what is done with neon as well. Fixes: #413
-
Luca Barbato authored
clang-15 doesn't consider it compile-time-constant anymore.
-
- 10 Nov, 2022 1 commit
-
-
- 30 Oct, 2022 2 commits
- 27 Oct, 2022 1 commit
-
-
Victorien Le Couviour--Tuffet authored
-
- 26 Oct, 2022 1 commit
-
-
Martin Storsjö authored
This fixes building with MSVC (and older GCC versions) after 3e7886db.
-
- 20 Oct, 2022 1 commit
-
-
Victorien Le Couviour--Tuffet authored
The completion of the first frame to decode while an async reset request on that same frame is pending will render that request stale. The processing of such a stale request is likely to result in a hang.

One reason this happens is the skip condition at the beginning of reset_task_cur().
=> Consume the async request before that check.

Another reason is several threads producing async reset requests in parallel: an async request for the first frame could cascade through the other threads (other frames) during completion of that frame, so that it is not caught by the last synchronous reset_task_cur() after signaling the main thread and before releasing the lock.
=> To solve this we need to add protections at the racy locations. That means after we increase first, before returning from reset_task_cur_async(), and after consuming the async request.
-
- 10 Oct, 2022 1 commit
-
-
Sebastian Dröge authored
Despite not being documented in Meson's list of canonical system names, Meson does accept 'ios', mostly as a synonym for 'darwin'. Using 'ios' instead of 'darwin' allows distinguishing between the two in the cases where that is necessary. Therefore, within dav1d, allow using the 'ios' name as an alias for 'darwin' as the system name, to allow using cross files that make this distinction. Meson itself also allows 'tvos' in addition to 'ios' in the internal `is_darwin()` function; as such, all 3 are handled the same here.
-
- 30 Sep, 2022 2 commits
-
-
Henrik Gramner authored
-
Henrik Gramner authored
'-fvisibility=hidden' only applies to definitions, not declarations, so the compiler has to be conservative about how references to global data symbols are performed. Explicitly specifying the visibility allows for better code generation.
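A minimal sketch of putting the visibility on the declaration itself, for GCC-compatible compilers. The macro and symbol names are hypothetical and are not the ones used in dav1d; in a real project the `extern` declaration would live in a header.

```c
#include <assert.h>

/* Hypothetical macro: expands to the GCC/Clang visibility attribute and to
 * nothing elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_HIDDEN __attribute__((visibility("hidden")))
#else
#define EXAMPLE_HIDDEN
#endif

/* The declaration itself carries the hidden visibility, so code that only
 * sees this declaration can already assume the symbol is local to the
 * shared object. */
extern EXAMPLE_HIDDEN const int example_table[4];

/* The definition inherits the visibility from the declaration above. */
const int example_table[4] = { 10, 20, 30, 40 };

int example_lookup(int i)
{
    assert(i >= 0 && i < 4);
    return example_table[i];
}
```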
-
- 28 Sep, 2022 3 commits
-
-
Henrik Gramner authored
Whitespace is added to the result if compiling with MSVC using /std:c11, which breaks various things. Adding strip() fixes the problem.
-
Henrik Gramner authored
-
Henrik Gramner authored
Use explicit parameter type detection and manually clobber the upper bits instead of relying on internal compiler behavior.
-
- 26 Sep, 2022 1 commit
-
-
Henrik Gramner authored
The 32-bit width parameter was used directly as a pointer offset, but the upper half is undefined. Fix it by replacing 'cmp' with 'sub' to explicitly zero those bits.
-
- 19 Sep, 2022 4 commits
-
-
Martin Storsjö authored
This fixes conformance with the argon test samples, in particular with these samples:
profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu

This gives a pretty notable slowdown to these transforms - some examples:

Before:
                                          Cortex A53       A72       A73   Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         365.7     290.2     299.8        0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      1865.2    1384.1    1457.5        2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:     33976.3   26817.0   24864.2       40.4

After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         397.7     322.2     335.1        0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      2121.9    1336.7    1664.6        2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:     38569.4   27622.6   28176.0       51.0

Thus, for the transforms alone, it makes them around 10-13% slower (the Apple M1 measurements are too noisy to be conclusive here). Measured on actual full decoding, it makes decoding of 10 bpc Chimera roughly 1% slower on an Apple M1, which is close to measurement noise anyway.
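The relative slowdown can be recomputed from the before/after figures above with a one-line helper. This is just a sketch of the arithmetic; `example_slowdown_percent` is a hypothetical name.

```c
#include <assert.h>

/* Percentage by which `after` exceeds `before`; a positive result means the
 * new code takes longer. */
double example_slowdown_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For the Cortex A53 column, 365.7 -> 397.7 works out to about 8.8% and 1865.2 -> 2121.9 to about 13.8%.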
-
Henrik Gramner authored
-
Henrik Gramner authored
Using smaller immediates also results in a small code size reduction in some cases, so apply those changes to the (10bpc-only) SSE code as well.
-
Henrik Gramner authored
Certain clips were incorrectly performed on negated values, which caused things to be off-by-one in both directions. Correct this by negating such values prior to clipping instead of afterwards.
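The off-by-one comes from the asymmetry of signed ranges. The sketch below shows the effect with a signed 16-bit range, which is chosen purely for illustration; the function names are hypothetical and the actual ranges in the assembly may differ.

```c
#include <stdint.h>

/* Clamp to the signed 16-bit range, which is asymmetric: [-32768, 32767]. */
int32_t example_clip_s16(int32_t v)
{
    return v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v;
}

/* Problematic order: clip first, negate afterwards. The result range becomes
 * [-32767, 32768], i.e. off by one at both ends. */
int32_t example_negate_after_clip(int32_t v)
{
    return -example_clip_s16(v);
}

/* Corrected order: negate first, then clip to the intended range. */
int32_t example_negate_before_clip(int32_t v)
{
    return example_clip_s16(-v);
}
```

With an input of 40000, negating after the clip yields -32767 while negating before it yields -32768; with -40000 the results are 32768 and 32767 respectively.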
-
- 15 Sep, 2022 1 commit
-
-
Martin Storsjö authored
Since meson 0.58.0 (released in May 2021), meson accepts adding '.S' assembly files as source files to the clang-cl compiler. If using an older version of meson, keep using gas-preprocessor just like for MSVC builds.
-