- Feb 09, 2021
-
-
Make them operate in a more cache-friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d. This also adds separate 5-tap versions of the filters and unrolls the vertical filter a bit more (which maybe could have been done without doing the rewrite). This does, however, increase the compiled code size by around 3.5 KB.

Before:                   Cortex A53       A72       A73
wiener_5tap_8bpc_neon:      136855.6   91446.2   87363.6
wiener_7tap_8bpc_neon:      136861.6   91454.9   87374.5
wiener_5tap_10bpc_neon:     167685.3  114720.3  116522.1
wiener_5tap_12bpc_neon:     167677.5  114724.7  116511.9
wiener_7tap_10bpc_neon:     167681.6  114738.5  116567.0
wiener_7tap_12bpc_neon:     167673.8  114720.8  116515.4
After:
wiener_5tap_8bpc_neon:       87102.1   60460.6   66803.8
wiener_7tap_8bpc_neon:      110831.7   78489.0   82015.9
wiener_5tap_10bpc_neon:     109999.2   90259.0   89238.0
wiener_5tap_12bpc_neon:     109978.3   90255.7   89220.7
wiener_7tap_10bpc_neon:     137877.6  107578.5  103435.6
wiener_7tap_12bpc_neon:     137868.8  107568.9  103390.4
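As a rough illustration of the interleaving idea (plain C with hypothetical names and shift amounts, rounding omitted; the real implementation is NEON assembly): instead of horizontally filtering the whole block into a large intermediate buffer, only a ring buffer of TAPS filtered rows is kept live, and the vertical filter consumes rows as soon as enough are available. Input is assumed edge-padded, and the filter taps are assumed to sum to 128 in each direction.

    #include <stddef.h>
    #include <stdint.h>

    #define TAPS  7
    #define MID   (TAPS / 2)
    #define MAX_W 256 /* illustrative max block width; ring is ~3.5 KB */

    /* Horizontally filter one edge-padded input row into dst. */
    static void hfilter_row(int16_t *dst, const uint8_t *row, int w,
                            const int16_t hf[TAPS])
    {
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int t = 0; t < TAPS; t++)
                sum += row[x + t - MID] * hf[t];
            dst[x] = (int16_t)(sum >> 3); /* illustrative mid shift */
        }
    }

    /* Keep only TAPS horizontally filtered rows alive at a time,
     * instead of buffering the whole intermediate block on the stack.
     * Row r always lives in slot r mod TAPS of the ring buffer. */
    static void wiener_interleaved(uint8_t *dst, ptrdiff_t dst_stride,
                                   const uint8_t *src, ptrdiff_t src_stride,
                                   int w, int h,
                                   const int16_t hf[TAPS],
                                   const int16_t vf[TAPS])
    {
        int16_t ring[TAPS][MAX_W];

        /* Prime the ring with the rows above the first output row. */
        for (int y = -MID; y < MID; y++)
            hfilter_row(ring[(y + TAPS) % TAPS], src + y * src_stride, w, hf);

        for (int y = 0; y < h; y++) {
            /* Filter the next input row, overwriting the oldest slot. */
            hfilter_row(ring[(y + MID) % TAPS],
                        src + (y + MID) * src_stride, w, hf);
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int t = 0; t < TAPS; t++)
                    sum += ring[(y + t - MID + TAPS) % TAPS][x] * vf[t];
                sum >>= 11; /* illustrative final shift */
                dst[y * dst_stride + x] =
                    (uint8_t)(sum < 0 ? 0 : sum > 255 ? 255 : sum);
            }
        }
    }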
-
Change the order of the multiply-accumulates to allow in-order cores to forward the results.
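The point, approximately, in intrinsics form (a sketch only; the actual change is in hand-scheduled assembly, where instruction order is controlled exactly): in-order cores such as the Cortex-A53 reportedly have a forwarding path that lets a multiply-accumulate issued immediately after another one targeting the same accumulator consume it without the full result latency, so the accumulator chain is kept back-to-back rather than interleaved with unrelated instructions.

    #include <arm_neon.h>

    /* Hypothetical 4-tap filter kernel. Keeping the four MACs in one
     * unbroken chain lets an in-order core forward 'acc' from each
     * instruction straight into the next without stalling. */
    static inline int32x4_t filter4(int16x4_t s0, int16x4_t s1,
                                    int16x4_t s2, int16x4_t s3,
                                    int16x4_t f)
    {
        int32x4_t acc = vmull_lane_s16(s0, f, 0);
        acc = vmlal_lane_s16(acc, s1, f, 1);
        acc = vmlal_lane_s16(acc, s2, f, 2);
        acc = vmlal_lane_s16(acc, s3, f, 3);
        return acc;
    }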
-
- Feb 08, 2021
-
-
Additionally reschedule instructions for loading, to reduce stalls on in-order cores. This applies the changes from a3b8157e to the arm32 version.

Before:                 Cortex A7     A8      A9     A53     A72     A73
warp_8x8_8bpc_neon:       3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
warp_8x8t_8bpc_neon:      3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
warp_8x8_16bpc_neon:      4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
warp_8x8t_16bpc_neon:     3973.9  2137.1  2299.6  2413.2  1282.8  1369.6
After:
warp_8x8_8bpc_neon:       2920.8  1269.8  1410.3  1767.3   860.2  1004.8
warp_8x8t_8bpc_neon:      2904.9  1283.9  1397.5  1743.7   863.6  1024.7
warp_8x8_16bpc_neon:      3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
warp_8x8t_16bpc_neon:     3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
-
We currently run 'git describe --match' to obtain the current version, but meson doesn't properly quote/escape the pattern string on Windows. As a result, "fatal: Not a valid object name .ninja_log" is printed when compiling on Windows systems. Compilation still works, but the warning is annoying and misleading. Currently we don't actually need the pattern matching functionality (which is why things still work), so simply remove it as a workaround.
-
Martin Storsjö authored
This silences the following warning:

    tools/output/xxhash.c(127): warning C4244: '=': conversion from 'unsigned long' to 'unsigned char', possible loss of data
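The usual fix for this class of warning is an explicit narrowing cast, stating that the truncation is intentional; a minimal illustration (hypothetical names, not the actual patch):

    /* MSVC warns about the implicit narrowing in "out = in;"; the
     * explicit cast documents the intended truncation and silences it. */
    static unsigned char low_byte(unsigned long in)
    {
        return (unsigned char)(in & 0xffu);
    }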
-
Martin Storsjö authored
This fixes bus errors due to missing alignment, when built with GCC 9 for arm32 with -mfpu=neon.
-
The required 'xxhash.h' header can either be in a system include directory or can be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):

null: 72.5 s
md5:  99.8 s
xxh3: 73.8 s

Decoding times for Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:

null:  27.8 s
md5:  105.9 s
xxh3:  28.3 s
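For reference, a minimal sketch of the streaming XXH3-128 API that such a muxer builds on (wrapper names are hypothetical; error handling trimmed):

    #include <inttypes.h>
    #include <stdio.h>
    #include "xxhash.h"

    static XXH3_state_t *state;

    static int hash_init(void) {
        state = XXH3_createState();
        return state && XXH3_128bits_reset(state) == XXH_OK ? 0 : -1;
    }

    static void hash_update(const void *data, size_t len) {
        XXH3_128bits_update(state, data, len);
    }

    static void hash_final(char out[33]) { /* 32 hex chars + NUL */
        const XXH128_hash_t h = XXH3_128bits_digest(state);
        snprintf(out, 33, "%016" PRIx64 "%016" PRIx64, h.high64, h.low64);
        XXH3_freeState(state);
    }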
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
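A sketch of the guard (hypothetical names): a 128-bit digest prints as 32 hex characters, so anything shorter can be rejected before comparing.

    #include <string.h>

    #define HASH_HEX_LEN 32 /* 128-bit hash as hex */

    static int verify_hash(const char *expected, const char *computed)
    {
        if (strlen(expected) < HASH_HEX_LEN)
            return -1; /* too short to be a real hash: fail, don't match */
        return strncmp(expected, computed, HASH_HEX_LEN) ? -1 : 0;
    }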
-
- Feb 06, 2021
-
-
Janne Grunau authored
-
Henrik Gramner authored
-
The arm32 version is less generic and has a bit more caveats, but still belongs as a shared utility in a header.
-
The current playback loop triggers a repaint on any single event, including spammy events such as SDL_MOUSEMOTION. Fix this by only repainting on SDL_WINDOWEVENT_EXPOSED, which is defined as the event sent when the window was damaged and needs to be repainted, as well as on new frames. Fixes #356
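In SDL2 terms, the loop ends up shaped roughly like this (a sketch; render_frame stands in for the player's repaint routine):

    #include <SDL.h>

    static void event_loop(void (*render_frame)(void))
    {
        SDL_Event e;
        while (SDL_WaitEvent(&e)) {
            switch (e.type) {
            case SDL_WINDOWEVENT:
                /* Only repaint when the window was actually damaged. */
                if (e.window.event == SDL_WINDOWEVENT_EXPOSED)
                    render_frame();
                break;
            case SDL_MOUSEMOTION: /* spammy; no repaint */
                break;
            case SDL_QUIT:
                return;
            }
        }
    }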
-
Upstream libplacebo added support for dav1d integration directly, allowing us to vastly simplify all of this code. In order to take advantage of new optimizations, I had to allow update_frame to unref the Dav1dPicture. (This is fine, since a double unref is a no-op.) In addition, some of the functions we use were deprecated in recent libplacebo versions, so since we're taking a new dependency anyway, we might as well fix the deprecation warnings.
-
These functions are not thread-safe on GL, because they are not called from the thread holding the GL context. Work around this by simply disabling it. Not very optimal, but better than crashing.
-
- Feb 05, 2021
-
-
If the postfilter task allocation failed, a deadlock would occur.
-
Martin Storsjö authored
-
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8-bit mode:
  * Use smull instead of mul
  * Replace existing narrow and long instructions
  * Replace mov after calling with a right shift

Before:                 Cortex A55     A53     A72     A73
warp_8x8_8bpc_neon:         1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:        1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:        1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:       1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:         1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:        1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:        1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:       1747.3  1904.1  1101.5  1207.9
-
Avoid moving between 8- and 16-bit vectors where possible.
-
- Feb 04, 2021
-
-
Kyle Siefring authored
-
Use mla (8-bit -> 8-bit) instead of smlal (8-bit -> 16-bit).

Before:                      Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:        389.7   264.0   261.7
cdef_filter_4x8_8bpc_neon:        687.2   476.2   465.5
cdef_filter_8x8_8bpc_neon:       1152.9   752.1   789.5
After:
cdef_filter_4x4_8bpc_neon:        385.2   263.4   259.2
cdef_filter_4x8_8bpc_neon:        677.5   473.8   459.8
cdef_filter_8x8_8bpc_neon:       1134.4   744.6   774.6
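The gist in intrinsics form (a standalone sketch, not the actual cdef kernel): when the values are known to fit in 8 bits, the non-widening mla processes twice as many elements per vector as the widening smlal.

    #include <arm_neon.h>

    /* Before: widening MAC (smlal), 8 elements per instruction. */
    static int16x8_t acc_wide(int16x8_t acc, int8x8_t a, int8x8_t b)
    {
        return vmlal_s8(acc, a, b);
    }

    /* After: non-widening MAC (mla), 16 elements per instruction. */
    static int8x16_t acc_narrow(int8x16_t acc, int8x16_t a, int8x16_t b)
    {
        return vmlaq_s8(acc, a, b);
    }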
-
- Feb 02, 2021
-
-
Martin Storsjö authored
-
- Feb 01, 2021
-
-
Victorien Le Couviour--Tuffet authored
-
- Jan 28, 2021
-
-
Signed-off-by: James Almer <jamrial@gmail.com>
-
Should make the code more readable.
-
Replace checks for INTER or SWITCH frames with a simple macro for increased readability and maintainability.
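The macro boils down to something like the following (sketched from the public dav1d headers; the in-tree name and placement may differ):

    #include "dav1d/headers.h"

    #define IS_INTER_OR_SWITCH(frame_hdr) \
        ((frame_hdr)->frame_type == DAV1D_FRAME_TYPE_INTER || \
         (frame_hdr)->frame_type == DAV1D_FRAME_TYPE_SWITCH)

    /* Usage: if (IS_INTER_OR_SWITCH(f->frame_hdr)) { ... } */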
-
Martin Storsjö authored
Only doing this for 8bpc; for higher bitdepths, adding the input coefficients can overflow a signed 16-bit element.

Before:                   Cortex A53      A72      A73
wiener_7tap_8bpc_neon:      142985.0  94400.8  89959.3
After:
wiener_7tap_8bpc_neon:      136614.4  88828.3  86997.0
-
Martin Storsjö authored
This gives a minor speedup on 8 bpc and a somewhat bigger speedup on 16 bpc. Sample speedups from arm64:

Before:                    Cortex A53       A72       A73
wiener_7tap_8bpc_neon:       143885.7  101571.5   96187.2
wiener_7tap_10bpc_neon:      171210.8  119410.4  122447.8
After:
wiener_7tap_8bpc_neon:       142985.0   94400.8   89959.3
wiener_7tap_10bpc_neon:      168818.4  113980.2  116662.0
-
Martin Storsjö authored
Use a variable mask for inserting padding, instead of fixed code paths for different padding widths. This allows simplifying the filtering logic to always process 8 pixels at a time. Also improve the scheduling of the loop-counter subtraction in all these cases.
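The masking trick, roughly, in intrinsics form (a sketch under the assumption of 8-pixel chunks; the real code is assembly):

    #include <arm_neon.h>

    /* Blend the padding value into the lanes beyond 'valid' so the
     * caller can always run the full 8-pixel filter path, whatever
     * the actual width. */
    static uint8x8_t pad_right(uint8x8_t px, uint8_t pad, int valid)
    {
        static const uint8_t idx[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        const uint8x8_t mask = vclt_u8(vld1_u8(idx),
                                       vdup_n_u8((uint8_t)valid));
        return vbsl_u8(mask, px, vdup_n_u8(pad));
    }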
-
Martin Storsjö authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
dav1d_close already takes care of flushing the internal state, so calling dav1d_flush just before it is superfluous.
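In caller terms (per the public dav1d API):

    #include "dav1d/dav1d.h"

    static void close_decoder(Dav1dContext **c)
    {
        /* dav1d_flush(*c);  -- superfluous: dav1d_close() flushes */
        dav1d_close(c); /* flushes, frees, and NULLs *c */
    }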
-
- Jan 25, 2021
-
-
Matthias Dressel authored
Leftover from the code restructuring in 89ea92ba.
-
Matthias Dressel authored
-
- Jan 21, 2021
-
-
-
-
SGR uses edge detection to decide which pixels to modify, but if the input is pure random noise there aren't going to be many (if any) edges. As a result, the entire function call often ends up doing nothing, which isn't ideal when we want to test the code for correctness. Change the input randomization algorithm to generate a checkerboard pattern with limited noise applied to the flat areas.
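A sketch of such a pattern generator (hypothetical names and constants):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Checkerboard with limited noise: the square boundaries give the
     * edge detection real edges to find, while the +/-noise on the
     * flat areas still exercises the filtering paths. */
    static void fill_checkerboard(uint8_t *buf, ptrdiff_t stride,
                                  int w, int h, int square, int noise)
    {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                const int base =
                    (((x / square) ^ (y / square)) & 1) ? 192 : 64;
                buf[y * stride + x] =
                    (uint8_t)(base + rand() % (2 * noise + 1) - noise);
            }
    }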
-
Victorien Le Couviour--Tuffet authored
-
- Jan 20, 2021
-
-
On Zen 2 and Zen 3, vpermq is slower than vperm2i128. In some assembly, we use the former to swap the lanes of a vector when we could be using the latter. On Zen 1 the costs are reversed (vperm2i128 is the more expensive of the two), so this patch will be slower there. On current Intel CPUs, the two instructions are equally expensive, so there should be no impact there.
-
Janne Grunau authored
oss-fuzz uses '-Denable_tools=false'.
-