- Feb 09, 2021
-
-
Make them operate in a more cache-friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d. This also adds separate 5-tap versions of the filters and unrolls the vertical filter a bit more (which maybe could have been done without doing the rewrite). This does, however, increase the compiled code size by around 3.5 KB.

Before:                   Cortex A53       A72       A73
wiener_5tap_8bpc_neon:      136855.6   91446.2   87363.6
wiener_7tap_8bpc_neon:      136861.6   91454.9   87374.5
wiener_5tap_10bpc_neon:     167685.3  114720.3  116522.1
wiener_5tap_12bpc_neon:     167677.5  114724.7  116511.9
wiener_7tap_10bpc_neon:     167681.6  114738.5  116567.0
wiener_7tap_12bpc_neon:     167673.8  114720.8  116515.4
After:
wiener_5tap_8bpc_neon:       87102.1   60460.6   66803.8
wiener_7tap_8bpc_neon:      110831.7   78489.0   82015.9
wiener_5tap_10bpc_neon:     109999.2   90259.0   89238.0
wiener_5tap_12bpc_neon:     109978.3   90255.7   89220.7
wiener_7tap_10bpc_neon:     137877.6  107578.5  103435.6
wiener_7tap_12bpc_neon:     137868.8  107568.9  103390.4
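As a rough illustration of the interleaving idea (plain C with hypothetical names and shift amounts, rounding omitted; the real implementation is NEON assembly): instead of horizontally filtering the whole block into a large intermediate buffer, only a ring buffer of TAPS filtered rows is kept live, and the vertical filter consumes rows as soon as enough are available. Input is assumed edge-padded, and the filter taps are assumed to sum to 128 in each direction.

    #include <stddef.h>
    #include <stdint.h>

    #define TAPS  7
    #define MID   (TAPS / 2)
    #define MAX_W 256 /* illustrative max block width; ring is ~3.5 KB */

    /* Horizontally filter one edge-padded input row into dst. */
    static void hfilter_row(int16_t *dst, const uint8_t *row, int w,
                            const int16_t hf[TAPS])
    {
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int t = 0; t < TAPS; t++)
                sum += row[x + t - MID] * hf[t];
            dst[x] = (int16_t)(sum >> 3); /* illustrative mid shift */
        }
    }

    /* Keep only TAPS horizontally filtered rows alive at a time,
     * instead of buffering the whole intermediate block on the stack.
     * Row r always lives in slot r mod TAPS of the ring buffer. */
    static void wiener_interleaved(uint8_t *dst, ptrdiff_t dst_stride,
                                   const uint8_t *src, ptrdiff_t src_stride,
                                   int w, int h,
                                   const int16_t hf[TAPS],
                                   const int16_t vf[TAPS])
    {
        int16_t ring[TAPS][MAX_W];

        /* Prime the ring with the rows above the first output row. */
        for (int y = -MID; y < MID; y++)
            hfilter_row(ring[(y + TAPS) % TAPS], src + y * src_stride, w, hf);

        for (int y = 0; y < h; y++) {
            /* Filter the next input row, overwriting the oldest slot. */
            hfilter_row(ring[(y + MID) % TAPS],
                        src + (y + MID) * src_stride, w, hf);
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int t = 0; t < TAPS; t++)
                    sum += ring[(y + t - MID + TAPS) % TAPS][x] * vf[t];
                sum >>= 11; /* illustrative final shift */
                dst[y * dst_stride + x] =
                    (uint8_t)(sum < 0 ? 0 : sum > 255 ? 255 : sum);
            }
        }
    }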
-
Change the order of the multiply-accumulates to allow in-order cores to forward the results.
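The point, approximately, in intrinsics form (a sketch only; the actual change is in hand-scheduled assembly, where instruction order is controlled exactly): in-order cores such as the Cortex-A53 reportedly have a forwarding path that lets a multiply-accumulate issued immediately after another one targeting the same accumulator consume it without the full result latency, so the accumulator chain is kept back-to-back rather than interleaved with unrelated instructions.

    #include <arm_neon.h>

    /* Hypothetical 4-tap filter kernel. Keeping the four MACs in one
     * unbroken chain lets an in-order core forward 'acc' from each
     * instruction straight into the next without stalling. */
    static inline int32x4_t filter4(int16x4_t s0, int16x4_t s1,
                                    int16x4_t s2, int16x4_t s3,
                                    int16x4_t f)
    {
        int32x4_t acc = vmull_lane_s16(s0, f, 0);
        acc = vmlal_lane_s16(acc, s1, f, 1);
        acc = vmlal_lane_s16(acc, s2, f, 2);
        acc = vmlal_lane_s16(acc, s3, f, 3);
        return acc;
    }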
-
- Feb 08, 2021
-
-
Additionally reschedule instructions for loading, to reduce stalls on in-order cores. This applies the changes from a3b8157e to the arm32 version.

Before:                 Cortex A7     A8      A9     A53     A72     A73
warp_8x8_8bpc_neon:       3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
warp_8x8t_8bpc_neon:      3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
warp_8x8_16bpc_neon:      4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
warp_8x8t_16bpc_neon:     3973.9  2137.1  2299.6  2413.2  1282.8  1369.6
After:
warp_8x8_8bpc_neon:       2920.8  1269.8  1410.3  1767.3   860.2  1004.8
warp_8x8t_8bpc_neon:      2904.9  1283.9  1397.5  1743.7   863.6  1024.7
warp_8x8_16bpc_neon:      3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
warp_8x8t_16bpc_neon:     3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
-
We currently run 'git describe --match' to obtain the current version, but meson doesn't properly quote/escape the pattern string on Windows. As a result, "fatal: Not a valid object name .ninja_log" is printed when compiling on Windows systems. Compilation still works, but the warning is annoying and misleading. Currently we don't actually need the pattern matching functionality (which is why things still work), so simply remove it as a workaround.
-
Martin Storsjö authored
This silences the following warning:

    tools/output/xxhash.c(127): warning C4244: '=': conversion from 'unsigned long' to 'unsigned char', possible loss of data
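The usual fix for this class of warning is an explicit narrowing cast, stating that the truncation is intentional; a minimal illustration (hypothetical names, not the actual patch):

    /* MSVC warns about the implicit narrowing in "out = in;"; the
     * explicit cast documents the intended truncation and silences it. */
    static unsigned char low_byte(unsigned long in)
    {
        return (unsigned char)(in & 0xffu);
    }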
-
Martin Storsjö authored
This fixes bus errors due to missing alignment, when built with GCC 9 for arm32 with -mfpu=neon.
-
The required 'xxhash.h' header can either be in a system include directory or can be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):

null: 72.5 s
md5:  99.8 s
xxh3: 73.8 s

Decoding times for Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:

null:  27.8 s
md5:  105.9 s
xxh3:  28.3 s
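For reference, a minimal sketch of the streaming XXH3-128 API that such a muxer builds on (wrapper names are hypothetical; error handling trimmed):

    #include <inttypes.h>
    #include <stdio.h>
    #include "xxhash.h"

    static XXH3_state_t *state;

    static int hash_init(void) {
        state = XXH3_createState();
        return state && XXH3_128bits_reset(state) == XXH_OK ? 0 : -1;
    }

    static void hash_update(const void *data, size_t len) {
        XXH3_128bits_update(state, data, len);
    }

    static void hash_final(char out[33]) { /* 32 hex chars + NUL */
        const XXH128_hash_t h = XXH3_128bits_digest(state);
        snprintf(out, 33, "%016" PRIx64 "%016" PRIx64, h.high64, h.low64);
        XXH3_freeState(state);
    }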
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
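A sketch of the guard (hypothetical names): a 128-bit digest prints as 32 hex characters, so anything shorter can be rejected before comparing.

    #include <string.h>

    #define HASH_HEX_LEN 32 /* 128-bit hash as hex */

    static int verify_hash(const char *expected, const char *computed)
    {
        if (strlen(expected) < HASH_HEX_LEN)
            return -1; /* too short to be a real hash: fail, don't match */
        return strncmp(expected, computed, HASH_HEX_LEN) ? -1 : 0;
    }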
-
- Feb 06, 2021
-
-
Janne Grunau authored
-
Henrik Gramner authored
-
The arm32 version is less generic and has a bit more caveats, but still belongs as a shared utility in a header.
-
The current playback loop triggers a repaint on any single event, including spammy events such as SDL_MOUSEMOTION. Fix this by only repainting on SDL_WINDOWEVENT_EXPOSED, which is defined as the event sent when the window was damaged and needs to be repainted, as well as on new frames. Fixes #356
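In SDL2 terms, the loop ends up shaped roughly like this (a sketch; render_frame stands in for the player's repaint routine):

    #include <SDL.h>

    static void event_loop(void (*render_frame)(void))
    {
        SDL_Event e;
        while (SDL_WaitEvent(&e)) {
            switch (e.type) {
            case SDL_WINDOWEVENT:
                /* Only repaint when the window was actually damaged. */
                if (e.window.event == SDL_WINDOWEVENT_EXPOSED)
                    render_frame();
                break;
            case SDL_MOUSEMOTION: /* spammy; no repaint */
                break;
            case SDL_QUIT:
                return;
            }
        }
    }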
-
Upstream libplacebo added support for dav1d integration directly, allowing us to vastly simplify all of this code. In order to take advantage of new optimizations, I had to allow update_frame to unref the Dav1dPicture. (This is fine, since a double unref is a no-op.) In addition, some of the functions we use were deprecated in recent libplacebo versions, so since we're taking a new dependency anyway, we might as well fix the deprecation warnings.
-
These functions are not thread-safe on GL, because they are not called from the thread holding the GL context. Work around this by simply disabling it. Not very optimal, but better than crashing.
-
- Feb 05, 2021
-
-
If the postfilter task allocation failed, a deadlock would occur.
-
Martin Storsjö authored
-
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8-bit mode:
  * Use smull instead of mul
  * Replace existing narrow and long instructions
  * Replace mov after calling with a right shift

Before:                 Cortex A55     A53     A72     A73
warp_8x8_8bpc_neon:         1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:        1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:        1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:       1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:         1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:        1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:        1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:       1747.3  1904.1  1101.5  1207.9
-
Avoid moving between 8- and 16-bit vectors where possible.
-
- Feb 04, 2021
-
-
Kyle Siefring authored
-
Use mla (8-bit -> 8-bit) instead of smlal (8-bit -> 16-bit).

Before:                      Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:        389.7   264.0   261.7
cdef_filter_4x8_8bpc_neon:        687.2   476.2   465.5
cdef_filter_8x8_8bpc_neon:       1152.9   752.1   789.5
After:
cdef_filter_4x4_8bpc_neon:        385.2   263.4   259.2
cdef_filter_4x8_8bpc_neon:        677.5   473.8   459.8
cdef_filter_8x8_8bpc_neon:       1134.4   744.6   774.6
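The gist in intrinsics form (a standalone sketch, not the actual cdef kernel): when the values are known to fit in 8 bits, the non-widening mla processes twice as many elements per vector as the widening smlal.

    #include <arm_neon.h>

    /* Before: widening MAC (smlal), 8 elements per instruction. */
    static int16x8_t acc_wide(int16x8_t acc, int8x8_t a, int8x8_t b)
    {
        return vmlal_s8(acc, a, b);
    }

    /* After: non-widening MAC (mla), 16 elements per instruction. */
    static int8x16_t acc_narrow(int8x16_t acc, int8x16_t a, int8x16_t b)
    {
        return vmlaq_s8(acc, a, b);
    }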
-
- Feb 02, 2021
-
-
Martin Storsjö authored
-
- Feb 01, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 28, 2021
-
-
Signed-off-by: James Almer <jamrial@gmail.com>
-
Should make the code more readable.
-
Replace checks for INTER or SWITCH frames with a simple macro for increased readability and maintainability.
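The macro boils down to something like the following (sketched from the public dav1d headers; the in-tree name and placement may differ):

    #include "dav1d/headers.h"

    #define IS_INTER_OR_SWITCH(frame_hdr) \
        ((frame_hdr)->frame_type == DAV1D_FRAME_TYPE_INTER || \
         (frame_hdr)->frame_type == DAV1D_FRAME_TYPE_SWITCH)

    /* Usage: if (IS_INTER_OR_SWITCH(f->frame_hdr)) { ... } */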
-
Martin Storsjö authored
Only doing this for 8bpc; for higher bitdepths, adding the input coefficients can overflow a signed 16-bit element.

Before:                   Cortex A53      A72      A73
wiener_7tap_8bpc_neon:      142985.0  94400.8  89959.3
After:
wiener_7tap_8bpc_neon:      136614.4  88828.3  86997.0
-
Martin Storsjö authored
This gives a minor speedup on 8 bpc and a somewhat bigger speedup on 16 bpc. Sample speedups from arm64:

Before:                    Cortex A53       A72       A73
wiener_7tap_8bpc_neon:       143885.7  101571.5   96187.2
wiener_7tap_10bpc_neon:      171210.8  119410.4  122447.8
After:
wiener_7tap_8bpc_neon:       142985.0   94400.8   89959.3
wiener_7tap_10bpc_neon:      168818.4  113980.2  116662.0
-
Martin Storsjö authored
Use a variable mask for inserting padding, instead of fixed code paths for different padding widths. This allows simplifying the filtering logic to always process 8 pixels at a time. Also improve the scheduling of the loop-counter subtraction in all these cases.
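The masking trick, roughly, in intrinsics form (a sketch under the assumption of 8-pixel chunks; the real code is assembly):

    #include <arm_neon.h>

    /* Blend the padding value into the lanes beyond 'valid' so the
     * caller can always run the full 8-pixel filter path, whatever
     * the actual width. */
    static uint8x8_t pad_right(uint8x8_t px, uint8_t pad, int valid)
    {
        static const uint8_t idx[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        const uint8x8_t mask = vclt_u8(vld1_u8(idx),
                                       vdup_n_u8((uint8_t)valid));
        return vbsl_u8(mask, px, vdup_n_u8(pad));
    }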
-
Martin Storsjö authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
dav1d_close already takes care of flushing the internal state, so calling dav1d_flush just before it is superfluous.
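In caller terms (per the public dav1d API):

    #include "dav1d/dav1d.h"

    static void close_decoder(Dav1dContext **c)
    {
        /* dav1d_flush(*c);  -- superfluous: dav1d_close() flushes */
        dav1d_close(c); /* flushes, frees, and NULLs *c */
    }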
-
- Jan 25, 2021
-
-
Matthias Dressel authored
Leftover from the code restructuring in 89ea92ba.
-
Matthias Dressel authored
-
- Jan 21, 2021
-
-
-
-
SGR uses edge detection to decide which pixels to modify, but if the input is pure random noise there aren't going to be many (if any) edges. As a result, the entire function call often ends up doing nothing, which isn't ideal when we want to test the code for correctness. Change the input randomization algorithm to generate a checkerboard pattern with limited noise applied to the flat areas.
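A sketch of such a pattern generator (hypothetical names and constants):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Checkerboard with limited noise: the square boundaries give the
     * edge detection real edges to find, while the +/-noise on the
     * flat areas still exercises the filtering paths. */
    static void fill_checkerboard(uint8_t *buf, ptrdiff_t stride,
                                  int w, int h, int square, int noise)
    {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                const int base =
                    (((x / square) ^ (y / square)) & 1) ? 192 : 64;
                buf[y * stride + x] =
                    (uint8_t)(base + rand() % (2 * noise + 1) - noise);
            }
    }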
-
Victorien Le Couviour--Tuffet authored
-
- Jan 20, 2021
-
-
On Zen 2 and Zen 3, vpermq is slower than vperm2i128. In some assembly, we use the former to swap the lanes of a vector when we could be using the latter. On Zen 1 the costs are reversed (vperm2i128 is the more expensive of the two), so this patch will be slower there. On current Intel CPUs, the two instructions are equally expensive, so there should be no impact there.
-
Janne Grunau authored
oss-fuzz uses '-Denable_tools=false'.
-