- Feb 08, 2021
-
-
The required 'xxhash.h' header can either be in a system include directory or be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):
  null: 72.5 s
  md5:  99.8 s
  xxh3: 73.8 s

Decoding Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:
  null: 27.8 s
  md5:  105.9 s
  xxh3: 28.3 s
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
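A minimal C sketch of the fix described above (a hypothetical helper, not dav1d's actual verify code): before the change, comparing only up to the length of the given string meant a short prefix could "verify" against a full digest, so the helper now requires the full digest length first.

```c
#include <string.h>
#include <strings.h>

/* Hypothetical sketch, not dav1d's actual code: reject candidate
 * strings shorter than the full digest before comparing, so e.g.
 * "d41d" can no longer "verify" against a 32-char MD5 digest. */
static int hash_verify(const char *computed, const char *expected,
                       size_t digest_hex_len)
{
    if (strlen(expected) < digest_hex_len)
        return 0; /* too short to be a real hash: verification fails */
    return strncasecmp(computed, expected, digest_hex_len) == 0;
}
```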
-
- Feb 06, 2021
-
-
Janne Grunau authored
-
Henrik Gramner authored
-
The arm32 version is less generic and has a bit more caveats, but still belongs as a shared utility in a header.
-
The current playback loop triggers a repaint on any single event, including spammy events such as SDL_MOUSEMOTION. Fix this by only repainting on SDL_WINDOWEVENT_EXPOSED, which is defined as the event sent when the window was damaged and needs to be repainted, as well as on new frames. Fixes videolan/dav1d#356
-
Upstream libplacebo added support for dav1d integration directly, allowing us to vastly simplify all of this code. In order to take advantage of the new optimizations, I had to allow update_frame to unref the Dav1dPicture. (This is fine, since a double unref is a no-op.) In addition, some of the functions we use were deprecated in recent libplacebo versions, so since we're taking a new dependency we might as well fix the deprecation warnings.
-
These functions are not thread-safe on GL, because they are not called from the thread holding the GL context. Work around this by simply disabling it. Not very optimal, but better than crashing.
-
- Feb 05, 2021
-
-
If the allocation of the postfilter tasks fails, a deadlock occurs.
-
Martin Storsjö authored
-
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8 bit mode:
  * Use smull instead of mul
  * Replace existing narrow and long instructions
  * Replace mov after calling with right shift

Before:                Cortex A55     A53     A72     A73
warp_8x8_8bpc_neon:        1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:       1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:       1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:      1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:        1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:       1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:       1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:      1747.3  1904.1  1101.5  1207.9
-
Avoid moving between 8 and 16-bit vectors where possible.
-
- Feb 04, 2021
-
-
Kyle Siefring authored
-
Use mla (8-bit -> 8-bit) instead of smlal (8-bit -> 16-bit).

Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       389.7   264.0   261.7
cdef_filter_4x8_8bpc_neon:       687.2   476.2   465.5
cdef_filter_8x8_8bpc_neon:      1152.9   752.1   789.5
After:
cdef_filter_4x4_8bpc_neon:       385.2   263.4   259.2
cdef_filter_4x8_8bpc_neon:       677.5   473.8   459.8
cdef_filter_8x8_8bpc_neon:      1134.4   744.6   774.6
-
- Feb 02, 2021
-
-
Martin Storsjö authored
-
- Feb 01, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 28, 2021
-
-
Signed-off-by: James Almer <jamrial@gmail.com>
-
Should make the code more readable.
-
Replace checks for INTER or SWITCH frames with a simple macro for increased readability and maintainability.
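A sketch of what such a macro can look like (names assumed for illustration, not dav1d's exact header). AV1 numbers its frame types KEY=0, INTER=1, INTRA_ONLY=2, SWITCH=3, so the two INTER-like types are exactly the odd values and a single test can replace the repeated two-way comparison:

```c
/* Hypothetical sketch of the macro (dav1d's real names may differ).
 * AV1 frame types: KEY=0, INTER=1, INTRA_ONLY=2, SWITCH=3, so the
 * two INTER-like types are exactly the odd values and one bit test
 * replaces the repeated "== INTER || == SWITCH" comparison. */
enum FrameType {
    FRAME_TYPE_KEY    = 0,
    FRAME_TYPE_INTER  = 1,
    FRAME_TYPE_INTRA  = 2,
    FRAME_TYPE_SWITCH = 3,
};

#define IS_INTER_OR_SWITCH(type) (((type) & 1) != 0)
```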
-
Martin Storsjö authored
Only doing this for 8bpc; for higher bitdepths, adding the input coefficients can overflow a signed 16 bit element.

Before:                 Cortex A53      A72      A73
wiener_7tap_8bpc_neon:     142985.0  94400.8  89959.3
After:
wiener_7tap_8bpc_neon:     136614.4  88828.3  86997.0
-
Martin Storsjö authored
This gives a minor speedup on 8 bpc and a somewhat larger speedup on 16 bpc. Sample speedups from arm64:

Before:                  Cortex A53       A72       A73
wiener_7tap_8bpc_neon:      143885.7  101571.5   96187.2
wiener_7tap_10bpc_neon:     171210.8  119410.4  122447.8
After:
wiener_7tap_8bpc_neon:      142985.0   94400.8   89959.3
wiener_7tap_10bpc_neon:     168818.4  113980.2  116662.0
-
Martin Storsjö authored
Use a variable mask for inserting padding, instead of fixed code paths for different padding widths. This allows simplifying the filtering logic to simply always process 8 pixels at a time. Also improve scheduling of the loop subtract instruction in all these cases.
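The variable-mask idea can be modeled in scalar C (an illustrative sketch only; the actual change is in the NEON assembly, and the helper name is invented):

```c
#include <stdint.h>

/* Scalar model of variable-mask padding (illustrative only; the real
 * change is in the arm looprestoration assembly). Instead of one code
 * path per padding width, build a per-pixel validity mask and select
 * between the loaded pixel and the replicated edge pixel, so the
 * filter can always consume 8 pixels at a time. */
static void pad_row8(uint8_t row[8], int valid_w)
{
    const uint8_t edge = row[valid_w - 1]; /* last valid pixel */
    for (int x = 0; x < 8; x++) {
        const uint8_t mask = x < valid_w ? 0xff : 0x00; /* variable mask */
        /* keep valid pixels, replicate the edge into the padding */
        row[x] = (uint8_t)((row[x] & mask) | (edge & ~mask));
    }
}
```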
-
Martin Storsjö authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
Calling dav1d_close already takes care of flushing the internal state. Calling it just before is superfluous.
-
- Jan 25, 2021
-
-
Matthias Dressel authored
Leftover from the code restructuring in 89ea92ba.
-
Matthias Dressel authored
-
- Jan 21, 2021
-
-
SGR uses edge detection to decide which pixels to modify, but if the input is pure random noise there aren't going to be many (if any) edges. As a result, the entire function call often ends up doing nothing, which isn't ideal when we want to test the code for correctness. Change the input randomization algorithm to generate a checkerboard pattern with limited noise applied to the flat areas.
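A scalar C sketch of the described approach (hypothetical helper, not checkasm's actual code): alternate dark and bright blocks so edge detection always has real edges to find, while the flat areas get only limited noise.

```c
#include <stdint.h>

/* Small deterministic PRNG so the sketch is self-contained. */
static unsigned lcg_next(unsigned *s)
{
    return *s = *s * 1664525u + 1013904223u;
}

/* Sketch of the described idea (hypothetical helper, not checkasm's
 * actual code): fill the test buffer with a checkerboard of two base
 * levels so edge detection always finds edges, and apply only limited
 * noise to the flat areas. */
static void init_checkerboard(uint8_t *buf, int w, int h,
                              int block, unsigned *seed)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            /* alternate dark/bright blocks of size block x block */
            int base = (((x / block) ^ (y / block)) & 1) ? 192 : 64;
            /* limited noise in [-8, 7] keeps flat areas nearly flat */
            int noise = (int)(lcg_next(seed) % 16) - 8;
            buf[y * w + x] = (uint8_t)(base + noise);
        }
}
```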
-
Victorien Le Couviour--Tuffet authored
-
- Jan 20, 2021
-
-
On Zen 2 and 3, vpermq is slower than vperm2i128. In some assembly we use the former to swap the lanes of a vector when we could be using the latter. On Zen 1 the relative costs are reversed, so this patch will be slower there. On current Intel CPUs these instructions are equally expensive, so there should be no impact.
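A scalar model of the shuffle in question (sketch only; the real code is x86 assembly). Swapping the two 128-bit halves of a 256-bit register can be encoded as either vpermq with imm8 0x4e or vperm2i128 with imm8 0x01; both produce this result:

```c
#include <stdint.h>

/* Scalar model of a 256-bit register as four 64-bit quadwords. */
typedef struct { uint64_t q[4]; } ymm_model;

/* Equivalent of either "vpermq dst, src, 0x4e" or
 * "vperm2i128 dst, src, src, 0x01": swap the two 128-bit lanes.
 * The patch prefers the vperm2i128 form, which is cheaper on Zen 2/3. */
static ymm_model swap_lanes(ymm_model v)
{
    ymm_model r = { { v.q[2], v.q[3], v.q[0], v.q[1] } };
    return r;
}
```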
-
Janne Grunau authored
oss-fuzz uses '-Denable_tools=false'.
-
Before:                Cortex A53     A55     A72     A73
cdef_dir_8bpc_neon:         400.0   391.2   269.7   282.9
cdef_dir_16bpc_neon:        417.7   413.0   303.8   313.6
After:
cdef_dir_8bpc_neon:         369.0   360.2   248.4   273.4
cdef_dir_16bpc_neon:        388.7   384.0   272.2   290.7
-
- Jan 18, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
Closes #203.
-
- Jan 15, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 11, 2021
-
-
Relative speed-ups compared with gcc-9.2.0:

                                  Before     After
mc_8tap_regular_w2_h_16bpc_c:      276.6     219.9
mc_8tap_regular_w4_h_16bpc_c:      489.5     374.5
mc_8tap_regular_w8_h_16bpc_c:      897.7     686.8
mc_8tap_regular_w16_h_16bpc_c:    2573.7    2314.2
mc_8tap_regular_w32_h_16bpc_c:    7647.3    7012.4
mc_8tap_regular_w64_h_16bpc_c:   28163.8   25057.4
mc_8tap_regular_w128_h_16bpc_c:  77678.4   73570.0
-
Remove half of the masks, since they are only used for cdef at an 8x8 level of granularity. Load the mask and combine the 16-bit sections into 32-bit sections outside of the inner cdef loop. This should save some registers. Results in mild performance improvements.
-