- 16 Apr, 2021 1 commit
-
-
James Almer authored
And a function to fetch them. Should be useful to signal changes in the bitstream the user may want to know about. Starting with two flags, DAV1D_EVENT_FLAG_NEW_SEQUENCE and DAV1D_EVENT_FLAG_NEW_OP_PARAMS_INFO, which signal the presence of an updated sequence header in the last returned (or to be returned) picture.
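A minimal sketch of how a caller might poll such flags after receiving a picture. The enum, struct, and function names here are stand-ins invented for illustration and are not the library's actual API; the read-and-clear behaviour is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, mirroring the two flags named in the commit. */
enum ExampleEventFlags {
    EXAMPLE_EVENT_FLAG_NEW_SEQUENCE       = 1 << 0,
    EXAMPLE_EVENT_FLAG_NEW_OP_PARAMS_INFO = 1 << 1
};

/* Hypothetical decoder state that accumulates events between fetches. */
typedef struct ExampleDecoder {
    unsigned event_flags;
} ExampleDecoder;

/* Called by the (imaginary) decoder when it parses a new sequence header. */
static void example_signal_event(ExampleDecoder *d, unsigned flags) {
    assert(d != NULL);
    d->event_flags |= flags;
}

/* Returns the accumulated flags and clears them (assumed semantics),
 * so each event is reported to the caller once. */
static unsigned example_fetch_event_flags(ExampleDecoder *d) {
    assert(d != NULL);
    const unsigned flags = d->event_flags;
    d->event_flags = 0;
    return flags;
}
```

A caller would fetch the flags right after obtaining a picture and test individual bits with `&` before reacting, for example by reconfiguring its output on a new sequence.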
-
- 14 Apr, 2021 5 commits
-
-
Martin Storsjö authored
This is the same as what was done for the fguv function, to reduce the amount of space used for it (and also to simplify the calling code). This gives no significant slowdown for the case currently benchmarked by checkasm, while shrinking the code produced by film_grain.S by 320 bytes.
-
Martin Storsjö authored
Relative speedup over C code:
                           Cortex    A53    A72    A73  Apple M1
fguv_32x32xn_8bpc_420_csfl0_neon:   4.51   2.87   3.88      6.51
fguv_32x32xn_8bpc_420_csfl1_neon:   3.74   2.96   2.96      3.49
fguv_32x32xn_8bpc_422_csfl0_neon:   4.49   3.18   4.07      5.00
fguv_32x32xn_8bpc_422_csfl1_neon:   3.74   3.03   3.04      2.67
fguv_32x32xn_8bpc_444_csfl0_neon:   6.68   4.24   5.66      5.02
fguv_32x32xn_8bpc_444_csfl1_neon:   5.40   3.69   4.22      3.61
-
Martin Storsjö authored
A static_assert is used if available, otherwise a custom construct.
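One common way to write such a construct is sketched here, assuming a C11 check on `__STDC_VERSION__` and a negative-array-size typedef as the fallback. The macro name `COMPILE_TIME_ASSERT` and the example struct are hypothetical and not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
/* C11 and later: use the standard static_assert from <assert.h>. */
#define COMPILE_TIME_ASSERT(name, cond) static_assert(cond, #name)
#else
/* Fallback: the array size is -1 when cond is false, which is a
 * compile error; when cond is true this is a harmless typedef. */
#define COMPILE_TIME_ASSERT(name, cond) \
    typedef char compile_time_assert_##name[(cond) ? 1 : -1]
#endif

/* Hypothetical struct whose layout some other code relies on. */
typedef struct ExampleCoef {
    int16_t  level;
    uint16_t index;
} ExampleCoef;

/* Fails the build, rather than misbehaving at run time, if the size changes. */
COMPILE_TIME_ASSERT(example_coef_is_4_bytes, sizeof(ExampleCoef) == 4);

static size_t example_coef_size(void) {
    return sizeof(ExampleCoef);
}
```

The `name` argument doubles as the diagnostic text in the C11 branch and as part of the typedef identifier in the fallback, so it has to be a valid identifier suffix.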
-
Martin Storsjö authored
Previously, only some combinations of overlap were tested in each run. Also benchmark with and without overlap.
-
Martin Storsjö authored
The fgy function already used the round2 helper function in this way.
-
- 12 Apr, 2021 1 commit
-
-
Matthias Dressel authored
meson 0.57.0 introduced an optimization [0] for `meson test` to only rebuild test dependencies. This does not cover changing the build configuration anymore.

[0] https://mesonbuild.com/Release-notes-for-0-57-0.html
-
- 16 Mar, 2021 1 commit
-
-
Martin Storsjö authored
The usual two-layer macro expansion for concatenation isn't needed here, as the parameters that need expanding (PIXEL_TYPE, COEF_TYPE) end up expanded by the intermediate checkasm_check() macro anyway.
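The preprocessor behaviour behind this can be sketched as follows. All macro and function names are hypothetical stand-ins, with `CHECK_VIA_INTERMEDIATE` playing the role of an intermediate macro that forwards its argument; only the expansion rules themselves come from the C standard.

```c
#include <assert.h>

/* Operands of ## are pasted as written, without macro expansion. */
#define CONCAT_DIRECT(a, b) a##b
/* The usual two-layer form: arguments are expanded before the inner paste. */
#define CONCAT_EXPANDED(a, b) CONCAT_DIRECT(a, b)

/* Stand-in for a type-selecting macro such as PIXEL_TYPE. */
#define PIXEL_TYPE uint8

static int check_uint8(void)      { return 8; }
static int check_PIXEL_TYPE(void) { return -1; }

/* An intermediate macro: `type` is not a ## operand here, so it is fully
 * expanded before being handed to CONCAT_DIRECT. */
#define CHECK_VIA_INTERMEDIATE(type) CONCAT_DIRECT(check_, type)()

static int direct_paste(void)     { return CONCAT_DIRECT(check_, PIXEL_TYPE)(); }
static int two_layer(void)        { return CONCAT_EXPANDED(check_, PIXEL_TYPE)(); }
static int via_intermediate(void) { return CHECK_VIA_INTERMEDIATE(PIXEL_TYPE); }

/* Self-check documenting which function each spelling resolves to. */
static void concat_examples(void) {
    assert(direct_paste() == -1);     /* pasted the literal token PIXEL_TYPE */
    assert(two_layer() == 8);         /* expanded to uint8 before pasting */
    assert(via_intermediate() == 8);  /* single layer suffices after forwarding */
}
```

Once an argument has passed through any macro that does not itself apply `#` or `##` to it, the single-layer paste sees the expanded token, which is why the extra layer becomes redundant.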
-
- 15 Mar, 2021 1 commit
-
-
Martin Storsjö authored
Relative speedup over C code:
                Cortex    A53    A72    A73  Apple M1
fgy_32x32xn_8bpc_neon:   4.48   2.84   3.73      5.64
-
- 07 Mar, 2021 1 commit
-
-
Matthias Dressel authored
Some AVX2 instructions cannot be macroed by x86inc.asm. Some instructions are valid in SSE4 but not in SSSE3, therefore checking both.
* Conroe is up to SSSE3
* Penryn is up to SSE4.1
See also: 4dd94315
-
- 21 Feb, 2021 1 commit
-
-
Jean-Baptiste Kempf authored
-
- 19 Feb, 2021 9 commits
-
-
Martin Storsjö authored
Relative speedup vs C for a few functions:
                                  Cortex     A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:     2.79   5.08   2.99   2.83   3.49   4.44
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:     5.74   9.43   5.72   7.19   6.73   6.92
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:     3.13   3.68   2.79   3.25   3.21   3.33
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:     7.09  10.41   7.00  10.55   8.06   9.02
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:   5.01   6.76   4.56   5.58   5.52   2.97
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   8.62  12.48  13.71  11.75  15.94  16.86
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   6.05   8.81   6.13   8.18   7.90  12.27
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:   2.90   3.90   2.16   2.63   3.56   2.74
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:  13.57  17.00  13.30  13.76  14.54  17.08
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:   8.29  10.54   8.05  10.68  12.75  14.36
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:   6.78   8.40   7.60  10.12   8.97  12.96
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:   6.48   6.74   6.00   7.38   7.67   9.70
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:   3.02   4.59   2.21   2.65   3.36   2.47
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:   9.86  11.30   9.14  13.80  12.46  14.83
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:   8.65   9.76   7.60  12.05  10.55  12.62
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:   7.78   8.65   6.98  10.63   9.15  11.73
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   6.61   7.01   5.52   8.41   8.33   9.69
-
Martin Storsjö authored
While these might not be needed in practice, add them for consistency.
-
Martin Storsjö authored
-
Martin Storsjö authored
This makes these instances consistent with the rest of similar cases.
-
Martin Storsjö authored
-
Martin Storsjö authored
In these cases, the function wrote a 64 pixel wide output, regardless of the actual width.
-
Martin Storsjö authored
-
Nathan Egge authored
Relative speed-ups over C code (compared with gcc-9.3.0):
                           C      AVX2
wiener_5tap_10bpc:  194892.0   14831.9   13.14x
wiener_5tap_12bpc:  194295.4   14828.9   13.10x
wiener_7tap_10bpc:  194391.7   19461.4    9.99x
wiener_7tap_12bpc:  194136.1   19418.7   10.00x
-
Nathan Egge authored
-
- 17 Feb, 2021 3 commits
-
-
Jean-Baptiste Kempf authored
-
Nathan Egge authored
Relative speed-ups over C code (compared with gcc-9.3.0):
                                     C     ASM
cdef_dir_16bpc_avx2:             534.2    72.5   7.36x
cdef_dir_16bpc_ssse3:            534.2   104.8   5.10x
cdef_dir_16bpc_ssse3 (x86-32):   854.1   116.2   7.35x
-
Nathan Egge authored
-
- 16 Feb, 2021 2 commits
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- 15 Feb, 2021 5 commits
-
-
Jean-Baptiste Kempf authored
-
Henrik Gramner authored
It's supposed to warn about const-correctness issues, but it doesn't handle arrays of pointers correctly and will cause false positive warnings when using memset() to zero such arrays for example.
-
Henrik Gramner authored
-
Henrik Gramner authored
Not having a quantizer matrix is the most common case, so it's worth having a separate code path for it that eliminates some calculations and table lookups. Without a qm, not only can we skip calculating dq * qm, but only Exp-Golomb-coded coefficients will have the potential to overflow, so we can also skip clipping for the vast majority of coefficients.
-
Henrik Gramner authored
Cache indices of non-zero coefficients during the AC token decoding loop in order to speed up the sign decoding/dequant loop later.
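The idea can be sketched with two small loops, assuming a simplified model in which token magnitudes and sign bits arrive as plain arrays. The function names, the `MAX_COEFS` limit, and the dequantization by a single `dq` factor are all hypothetical simplifications, not the decoder's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COEFS 16 /* hypothetical block size for this sketch */

/* Pass 1: store magnitudes and remember where the non-zero ones are.
 * Returns the number of non-zero coefficients found. */
static int decode_tokens(const uint8_t *tokens, int n,
                         int16_t *coef, uint8_t *nz_idx)
{
    assert(n >= 0 && n <= MAX_COEFS);
    int nz = 0;
    for (int i = 0; i < n; i++) {
        coef[i] = tokens[i];
        if (tokens[i])
            nz_idx[nz++] = (uint8_t)i; /* cache the position */
    }
    return nz;
}

/* Pass 2: visit only the cached positions instead of rescanning the
 * whole block and testing every coefficient for zero again. */
static void apply_sign_and_dequant(int16_t *coef, const uint8_t *nz_idx,
                                   int nz, const uint8_t *sign_bits, int dq)
{
    for (int k = 0; k < nz; k++) {
        const int i = nz_idx[k];
        const int mag = coef[i] * dq;
        coef[i] = (int16_t)(sign_bits[k] ? -mag : mag);
    }
}
```

The second loop runs once per non-zero coefficient, so sparse blocks do proportionally less work in the sign and dequant stage.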
-
- 13 Feb, 2021 1 commit
-
-
Henrik Gramner authored
Looprestoration SIMD code may overread the input buffers by a small amount. Pad the buffer to make sure this memory is valid to access.
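A minimal sketch of the padding approach, assuming a heap allocation and a hypothetical `OVERREAD_PADDING` of 64 bytes. The commit does not state the real amount or how the buffer is allocated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical upper bound on how far SIMD loads may read past the end. */
#define OVERREAD_PADDING 64

/* Allocates `size` usable bytes plus zeroed padding, so that a vector load
 * starting near the end of the buffer still touches valid memory.
 * Returns NULL on allocation failure or size overflow. */
static uint8_t *alloc_padded(size_t size) {
    if (size > SIZE_MAX - OVERREAD_PADDING)
        return NULL;
    uint8_t *buf = malloc(size + OVERREAD_PADDING);
    if (buf)
        memset(buf + size, 0, OVERREAD_PADDING);
    return buf;
}
```

The padding contents never affect the output; the bytes only need to be readable so that an overread cannot fault.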
-
- 12 Feb, 2021 3 commits
-
-
Marvin Scholz authored
-
Martin Storsjö authored
On Darwin, 32-bit parameters that are passed on the stack rather than in registers are packed tightly instead of each occupying an 8-byte slot.
-
Nathan Egge authored
-
- 11 Feb, 2021 5 commits
-
-
Emmanuel Gil Peyrot authored
-
Henrik Gramner authored
The previous implementation did multiple passes in the horizontal and vertical directions, with the intermediate values being stored in buffers on the stack. This caused bad cache thrashing. By interleaving all the different passes in combination with a ring buffer for storing only a few rows at a time the performance is improved by a significant amount. Also slightly speed up neighbor calculations by packing the a and b values into a single 32-bit unsigned integer which allows calculations on both values simultaneously.
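The packing trick can be sketched as follows, assuming an even 16/16 bit split between the two values. The actual field widths and helper names are not given in the commit and are hypothetical here.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a into the upper 16 bits and b into the lower 16 bits
 * (assumed layout for illustration). */
static uint32_t pack_ab(uint32_t a, uint32_t b) {
    assert(a <= 0xFFFF && b <= 0xFFFF);
    return (a << 16) | b;
}

static uint32_t unpack_a(uint32_t ab) { return ab >> 16; }
static uint32_t unpack_b(uint32_t ab) { return ab & 0xFFFF; }

/* Sum three packed neighbours. Each addition updates both fields at once.
 * This is only valid while the b sum stays below 1 << 16 (no carry into a)
 * and the a sum stays below 1 << 16 (no overflow out of the word). */
static uint32_t sum3_packed(const uint32_t *ab) {
    return ab[0] + ab[1] + ab[2];
}
```

Summing the packed words halves the number of additions compared with keeping the two values in separate arrays, at the cost of having to bound the range of each field.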
-
Henrik Gramner authored
Split the 5x5, 3x3, and mix cases into separate functions. Shrink some tables. Move some scalar calculations out of the DSP function. Make Wiener and SGR share the same function prototype to eliminate a branch in lr_stripe().
-
Henrik Gramner authored
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
-
Emmanuel Gil Peyrot authored
Neither --buildtype=plain nor --buildtype=debug set -ffast-math, so llround() is kept as a function call and isn’t optimised out into cvttsd2siq (on amd64), thus requiring the math lib to be linked. Note that even with -ffast-math, it isn’t guaranteed that a call to llround() will always be omitted (I have reproduced this on PowerPC), so this fix is correct even if we ever decide to enable -ffast-math in other build types.
-
- 10 Feb, 2021 1 commit
-
-
Martin Storsjö authored
Before:
                                  Cortex      A53      A72      A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       40.7     23.0     24.0
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0     71.5     78.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       85.7     50.7     53.8
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      287.0    203.5    215.2
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    255.7    129.1    140.4
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1401.4   1026.7   1039.2
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1913.2   1407.3   1479.6
After:
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       38.7     21.5     22.2
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0     71.3     77.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       76.7     44.7     43.5
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      278.0    203.0    203.9
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    236.9    106.2    116.2
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1368.7    999.7   1008.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1880.5   1381.2   1459.4
-