- Feb 08, 2021
-
-
The required 'xxhash.h' header can either be in a system include directory or be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):
  null: 72.5 s
  md5:  99.8 s
  xxh3: 73.8 s

Decoding Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:
  null: 27.8 s
  md5:  105.9 s
  xxh3: 28.3 s
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
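A minimal C sketch of the fix described above (a hypothetical helper, not dav1d's actual verify code): before the change, comparing only up to the length of the given string meant a short prefix could "verify" against a full digest, so the helper now requires the full digest length first.

```c
#include <string.h>
#include <strings.h>

/* Hypothetical sketch, not dav1d's actual code: reject candidate
 * strings shorter than the full digest before comparing, so e.g.
 * "d41d" can no longer "verify" against a 32-char MD5 digest. */
static int hash_verify(const char *computed, const char *expected,
                       size_t digest_hex_len)
{
    if (strlen(expected) < digest_hex_len)
        return 0; /* too short to be a real hash: verification fails */
    return strncasecmp(computed, expected, digest_hex_len) == 0;
}
```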
-
- Feb 06, 2021
-
-
Janne Grunau authored
-
Henrik Gramner authored
-
The arm32 version is less generic and has a bit more caveats, but still belongs as a shared utility in a header.
-
The current playback loop triggers a repaint on any single event, including spammy events such as SDL_MOUSEMOTION. Fix this by only repainting on SDL_WINDOWEVENT_EXPOSED, which is defined as the event sent when the window was damaged and needs to be repainted, as well as on new frames. Fixes videolan/dav1d#356
-
Upstream libplacebo added support for dav1d integration directly, allowing us to vastly simplify all of this code. In order to take advantage of the new optimizations, I had to allow update_frame to unref the Dav1dPicture. (This is fine, since a double unref is a no-op.) In addition, some of the functions we use were deprecated in recent libplacebo versions, so since we're taking a new dependency we might as well fix the deprecation warnings.
-
These functions are not thread-safe on GL, because they are not called from the thread holding the GL context. Work around this by simply disabling it. Not very optimal, but better than crashing.
-
- Feb 05, 2021
-
-
If the allocation of the postfilter tasks fails, a deadlock occurs.
-
Martin Storsjö authored
-
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8 bit mode:
  * Use smull instead of mul
  * Replace existing narrow and long instructions
  * Replace mov after calling with right shift

Before:                Cortex A55     A53     A72     A73
warp_8x8_8bpc_neon:        1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:       1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:       1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:      1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:        1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:       1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:       1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:      1747.3  1904.1  1101.5  1207.9
-
Avoid moving between 8 and 16-bit vectors where possible.
-
- Feb 04, 2021
-
-
Kyle Siefring authored
-
Use mla (8-bit -> 8-bit) instead of smlal (8-bit -> 16-bit).

Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       389.7   264.0   261.7
cdef_filter_4x8_8bpc_neon:       687.2   476.2   465.5
cdef_filter_8x8_8bpc_neon:      1152.9   752.1   789.5
After:
cdef_filter_4x4_8bpc_neon:       385.2   263.4   259.2
cdef_filter_4x8_8bpc_neon:       677.5   473.8   459.8
cdef_filter_8x8_8bpc_neon:      1134.4   744.6   774.6
-
- Feb 02, 2021
-
-
Martin Storsjö authored
-
- Feb 01, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 28, 2021
-
-
Signed-off-by: James Almer <jamrial@gmail.com>
-
Should make the code more readable.
-
Replace checks for INTER or SWITCH frames with a simple macro for increased readability and maintainability.
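A sketch of what such a macro can look like (names assumed for illustration, not dav1d's exact header). AV1 numbers its frame types KEY=0, INTER=1, INTRA_ONLY=2, SWITCH=3, so the two INTER-like types are exactly the odd values and a single test can replace the repeated two-way comparison:

```c
/* Hypothetical sketch of the macro (dav1d's real names may differ).
 * AV1 frame types: KEY=0, INTER=1, INTRA_ONLY=2, SWITCH=3, so the
 * two INTER-like types are exactly the odd values and one bit test
 * replaces the repeated "== INTER || == SWITCH" comparison. */
enum FrameType {
    FRAME_TYPE_KEY    = 0,
    FRAME_TYPE_INTER  = 1,
    FRAME_TYPE_INTRA  = 2,
    FRAME_TYPE_SWITCH = 3,
};

#define IS_INTER_OR_SWITCH(type) (((type) & 1) != 0)
```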
-
Martin Storsjö authored
Only doing this for 8bpc; for higher bitdepths, adding the input coefficients can overflow a signed 16 bit element.

Before:                 Cortex A53      A72      A73
wiener_7tap_8bpc_neon:     142985.0  94400.8  89959.3
After:
wiener_7tap_8bpc_neon:     136614.4  88828.3  86997.0
-
Martin Storsjö authored
This gives a minor speedup on 8 bpc and a somewhat larger speedup on 16 bpc. Sample speedups from arm64:

Before:                  Cortex A53       A72       A73
wiener_7tap_8bpc_neon:      143885.7  101571.5   96187.2
wiener_7tap_10bpc_neon:     171210.8  119410.4  122447.8
After:
wiener_7tap_8bpc_neon:      142985.0   94400.8   89959.3
wiener_7tap_10bpc_neon:     168818.4  113980.2  116662.0
-
Martin Storsjö authored
Use a variable mask for inserting padding, instead of fixed code paths for different padding widths. This allows simplifying the filtering logic to simply always process 8 pixels at a time. Also improve scheduling of the loop subtract instruction in all these cases.
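The variable-mask idea can be modeled in scalar C (an illustrative sketch only; the actual change is in the NEON assembly, and the helper name is invented):

```c
#include <stdint.h>

/* Scalar model of variable-mask padding (illustrative only; the real
 * change is in the arm looprestoration assembly). Instead of one code
 * path per padding width, build a per-pixel validity mask and select
 * between the loaded pixel and the replicated edge pixel, so the
 * filter can always consume 8 pixels at a time. */
static void pad_row8(uint8_t row[8], int valid_w)
{
    const uint8_t edge = row[valid_w - 1]; /* last valid pixel */
    for (int x = 0; x < 8; x++) {
        const uint8_t mask = x < valid_w ? 0xff : 0x00; /* variable mask */
        /* keep valid pixels, replicate the edge into the padding */
        row[x] = (uint8_t)((row[x] & mask) | (edge & ~mask));
    }
}
```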
-
Martin Storsjö authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
Calling dav1d_close already takes care of flushing the internal state. Calling it just before is superfluous.
-
- Jan 25, 2021
-
-
Matthias Dressel authored
Leftover from the code restructuring in 89ea92ba.
-
Matthias Dressel authored
-
- Jan 21, 2021
-
-
SGR uses edge detection to decide which pixels to modify, but if the input is pure random noise there aren't going to be many (if any) edges. As a result, the entire function call often ends up doing nothing, which isn't ideal when we want to test the code for correctness. Change the input randomization algorithm to generate a checkerboard pattern with limited noise applied to the flat areas.
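A scalar C sketch of the described approach (hypothetical helper, not checkasm's actual code): alternate dark and bright blocks so edge detection always has real edges to find, while the flat areas get only limited noise.

```c
#include <stdint.h>

/* Small deterministic PRNG so the sketch is self-contained. */
static unsigned lcg_next(unsigned *s)
{
    return *s = *s * 1664525u + 1013904223u;
}

/* Sketch of the described idea (hypothetical helper, not checkasm's
 * actual code): fill the test buffer with a checkerboard of two base
 * levels so edge detection always finds edges, and apply only limited
 * noise to the flat areas. */
static void init_checkerboard(uint8_t *buf, int w, int h,
                              int block, unsigned *seed)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            /* alternate dark/bright blocks of size block x block */
            int base = (((x / block) ^ (y / block)) & 1) ? 192 : 64;
            /* limited noise in [-8, 7] keeps flat areas nearly flat */
            int noise = (int)(lcg_next(seed) % 16) - 8;
            buf[y * w + x] = (uint8_t)(base + noise);
        }
}
```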
-
Victorien Le Couviour--Tuffet authored
-
- Jan 20, 2021
-
-
On Zen 2 and 3, vpermq is slower than vperm2i128. In some assembly we use the former to swap the lanes of a vector when we could be using the latter. On Zen 1 the relative costs are reversed, so this patch will be slower there. On current Intel CPUs these instructions are equally expensive, so there should be no impact.
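A scalar model of the shuffle in question (sketch only; the real code is x86 assembly). Swapping the two 128-bit halves of a 256-bit register can be encoded as either vpermq with imm8 0x4e or vperm2i128 with imm8 0x01; both produce this result:

```c
#include <stdint.h>

/* Scalar model of a 256-bit register as four 64-bit quadwords. */
typedef struct { uint64_t q[4]; } ymm_model;

/* Equivalent of either "vpermq dst, src, 0x4e" or
 * "vperm2i128 dst, src, src, 0x01": swap the two 128-bit lanes.
 * The patch prefers the vperm2i128 form, which is cheaper on Zen 2/3. */
static ymm_model swap_lanes(ymm_model v)
{
    ymm_model r = { { v.q[2], v.q[3], v.q[0], v.q[1] } };
    return r;
}
```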
-
Janne Grunau authored
oss-fuzz uses '-Denable_tools=false'.
-
Before:                Cortex A53     A55     A72     A73
cdef_dir_8bpc_neon:         400.0   391.2   269.7   282.9
cdef_dir_16bpc_neon:        417.7   413.0   303.8   313.6
After:
cdef_dir_8bpc_neon:         369.0   360.2   248.4   273.4
cdef_dir_16bpc_neon:        388.7   384.0   272.2   290.7
-
- Jan 18, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
Closes #203.
-
- Jan 15, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 11, 2021
-
-
Relative speed-ups compared with gcc-9.2.0:

                                  Before     After
mc_8tap_regular_w2_h_16bpc_c:      276.6     219.9
mc_8tap_regular_w4_h_16bpc_c:      489.5     374.5
mc_8tap_regular_w8_h_16bpc_c:      897.7     686.8
mc_8tap_regular_w16_h_16bpc_c:    2573.7    2314.2
mc_8tap_regular_w32_h_16bpc_c:    7647.3    7012.4
mc_8tap_regular_w64_h_16bpc_c:   28163.8   25057.4
mc_8tap_regular_w128_h_16bpc_c:  77678.4   73570.0
-
Remove half of the masks, since they are only used for cdef at an 8x8 level of granularity. Load the mask and combine the 16-bit sections into 32-bit sections outside of the inner cdef loop. This should save some registers. Results in mild performance improvements.
-