  1. Dec 20, 2024
    • arm32: looprestoration: Rewrite the wiener functions · 2ba57aa5
      Martin Storsjö authored
      Switch to the same cache-friendly algorithm as was done for arm64
      in 2e73051c and for the reference
      C code in 8291a66e.
      
      Unlike the arm64 implementation, this uses a main loop in C
      (very similar to the one in the main C implementation in
      8291a66e) rather than in assembly;
      this adds a bit of overhead to each function call, but
      it shouldn't affect the big picture much.
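
      A rough sketch of what such a C main loop can look like (the helper
      names, signatures, padding and rounding below are hypothetical, not
      dav1d's actual internal API): a small ring of intermediate rows is
      rotated, and per-row filter helpers (NEON assembly in the real code)
      are called from C.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Hypothetical per-row helpers; in dav1d these would be NEON asm.
       * filter_h: horizontal 7-tap filter of one padded source row into a
       *           row of 16 bit intermediates.
       * filter_v: vertical 7-tap filter across seven intermediate rows,
       *           producing one output row of pixels. */
      typedef void (*wiener_h_fn)(int16_t *mid, const uint8_t *src, int w,
                                  const int16_t fh[7]);
      typedef void (*wiener_v_fn)(uint8_t *dst, int16_t *const *mid, int w,
                                  const int16_t fv[7]);

      /* Illustrative driver: src is assumed to be padded so that source
       * rows y..y+6 are the taps for output row y; edge handling and
       * rounding are omitted. */
      static void wiener_driver(uint8_t *dst, ptrdiff_t dst_stride,
                                const uint8_t *src, ptrdiff_t src_stride,
                                int w, int h,
                                const int16_t fh[7], const int16_t fv[7],
                                wiener_h_fn filter_h, wiener_v_fn filter_v,
                                int16_t *row_buf /* 7 rows of w values */)
      {
          int16_t *rows[7];
          for (int i = 0; i < 7; i++)
              rows[i] = row_buf + i * w;

          /* Prime the window with the first six intermediate rows. */
          for (int i = 0; i < 6; i++)
              filter_h(rows[i], src + i * src_stride, w, fh);

          for (int y = 0; y < h; y++) {
              /* Filter one new row into the last slot of the window... */
              filter_h(rows[6], src + (y + 6) * src_stride, w, fh);
              /* ...produce one output row from the 7-row window... */
              filter_v(dst + y * dst_stride, rows, w, fv);
              /* ...and rotate the window by one row. */
              int16_t *const oldest = rows[0];
              memmove(&rows[0], &rows[1], 6 * sizeof(rows[0]));
              rows[6] = oldest;
          }
      }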
      
      Performance-wise, this doesn't make much of a difference; it makes
      things a little bit faster on some cores, and a little bit slower
      on others:
      
      Before:                 Cortex A7        A8       A53       A72       A73
      wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
      wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
      After:
      wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
      wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0
      
      This is mostly in line with the results on arm64 in
      2e73051c. On arm64, there was a
      bit larger speedup for the 7tap case, mostly attributed to
      unrolling the vertical filter (and the new filter_hv function) to
      operate on 16 pixels at a time. On arm32, there aren't enough
      registers to do that, so we can't get the same gains from unrolling.
      (Reducing the unrolling on the arm64 version to match the case
      on arm32 also shows similar performance numbers as on arm32 here.)
      
      In the arm64 version, we also added separate 5tap versions of all
      functions; not doing that for arm32 at this point.
      
      This increases the binary size by 2 KB.
      
      This doesn't have any immediate effect on how much stack space
      dav1d requires in total, since the largest stack users on arm
      currently are the 8tap_scaled functions.
      2ba57aa5
  2. Dec 19, 2024
    • looprestoration: Use only 6 row buffer for wiener, like NEON/x86 · 8291a66e
      Martin Storsjö authored
      This uses a separate function for combined horizontal and vertical
      filtering, without needing to write the intermediate results
      back to memory in between.
      
      This mostly serves as an example of how to adjust the logic for
      that case; unless we actually merge the horizontal and vertical
      filtering within the _hv function, we still need space for a
      7th row on the stack within that function (which means we use just
      as much stack as before), but we also need one extra memcpy to
      write it into the right destination.
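
      As a sketch of that scheme (hypothetical names, signatures and
      rounding, not the actual dav1d functions): the combined _hv helper
      horizontally filters the newest row into a scratch row on its own
      stack, runs the vertical pass over the six stored rows plus that
      scratch row, and then copies the scratch row back out - the 7th row
      and the extra memcpy mentioned above.

      #include <stdint.h>
      #include <string.h>

      #define MAX_UNIT_W 384  /* illustrative bound on the row width */

      static inline uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

      /* The caller keeps 6 already horizontally-filtered rows in mid[];
       * this helper filters the newest source row on its own stack (the
       * "7th row"), runs the vertical 7-tap filter over all 7 rows, and
       * hands the new row back so it can join the caller's 6-row window. */
      static void wiener_filter_hv(uint8_t *dst, const uint8_t *src,
                                   int16_t *const mid[6], int16_t *mid_out,
                                   int w,
                                   const int16_t fh[7], const int16_t fv[7])
      {
          int16_t new_row[MAX_UNIT_W];

          /* Horizontal 7-tap pass into the local scratch row
           * (left/right edge padding elided). */
          for (int x = 0; x < w; x++) {
              int sum = 0;
              for (int t = 0; t < 7; t++)
                  sum += src[x + t - 3] * fh[t];
              new_row[x] = (int16_t)(sum >> 3);      /* illustrative rounding */
          }

          /* Vertical 7-tap pass over the 6 stored rows plus the new one. */
          for (int x = 0; x < w; x++) {
              int sum = 0;
              for (int t = 0; t < 6; t++)
                  sum += mid[t][x] * fv[t];
              sum += new_row[x] * fv[6];
              dst[x] = clip8((sum + 64) >> 7);       /* illustrative rounding */
          }

          /* The extra memcpy: write the freshly filtered row into the slot
           * the caller wants it in, so the stored window stays 6 rows. */
          memcpy(mid_out, new_row, w * sizeof(*new_row));
      }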
      
      In a build where the compiler is allowed to vectorize and inline
      the wiener functions into each other, this change actually reduces
      the final binary size by 4 KB, if the C version of the wiener filter
      is retained.
      
      This change makes the vectorized C code as fast as it was before
      with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.
      
      Unfortunately, with GCC, this change makes the code a bit slower
      again.
      8291a66e
    • looprestoration: Make the C wiener h filter more optimizable for the compiler · a149f5c3
      Martin Storsjö authored
      This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
      if the C version of the filter is retained (which it isn't
      by default).
      
      With GCC, this makes the vectorized C code roughly as fast as it was
      before the rewrite; with Clang it also becomes 1.3x-2.0x faster,
      while still being slower than it was initially.
      a149f5c3
    • looprestoration: Rewrite the C version of the wiener filter · 9da303e9
      Martin Storsjö authored
      This reduces the stack usage of these functions (the C version)
      significantly.
      
      These C versions aren't used on architectures that already have
      wiener filters implemented in assembly, but they matter both when
      running with assembly disabled (e.g. for sanitizer builds), and as
      an example of how to do a cache efficient SIMD implementation.
      
      This roughly matches how these functions are implemented in the
      aarch64 assembly (although that implementation uses a main loop
      function written in assembly, and custom calling conventions
      between the functions).
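
      As a back-of-the-envelope illustration of where the stack savings come
      from (the sizes, padding and types below are simplified and do not
      match dav1d's actual buffers exactly):

      #include <stddef.h>
      #include <stdint.h>

      enum {
          UNIT_W      = 384,
          UNIT_H      = 64,
          PAD         = 3,                 /* 7-tap filter: 3 extra rows per side */
          FULL_ROWS   = UNIT_H + 2 * PAD,  /* 70 rows kept by the old approach    */
          WINDOW_ROWS = 7                  /* rows kept by a sliding window       */
      };

      /* ~52.5 KB of intermediate data when filtering a whole unit at once... */
      static const size_t full_buf_bytes   = FULL_ROWS   * UNIT_W * sizeof(int16_t);
      /* ...versus ~5.3 KB when only a 7-row window is kept alive. */
      static const size_t window_buf_bytes = WINDOW_ROWS * UNIT_W * sizeof(int16_t);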
      
      With this in place, dav1d can run with around 76 KB of stack
      with assembly disabled.
      
      This increases the binary size by around 14 KB (in the case of
      aarch64 with Xcode Clang 16), unless built with the default
      -Dtrim_dsp=true, in which case the C version of the wiener filter
      gets skipped entirely.
      
      On 32 bit arm, the assembly wiener implementation still uses large
      buffers on the stack, but since other functions there now use less
      stack, dav1d can still run with 72 KB of stack.
      
      Unfortunately, this change also makes the functions slower, depending
      on how well the compiler was able to optimize the previous version.
      On GCC (which didn't manage to vectorize the functions so well before),
      it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
      (where it was very well vectorized before).
      
      Most of this performance can be gained back with later changes on
      top, though.
      9da303e9
  3. Dec 02, 2024
  4. Nov 28, 2024
    • flush: Reset f->task_thread.error · 575af258
      Victorien Le Couviour--Tuffet authored
      f->task_thread.error can be set during flushing; not resetting it can
      lead to c->task_thread.first being increased after a frame has already
      been submitted post flushing. That's fine if it happens on the very
      first frame, but if it happens on any subsequent frame it will cause
      a wrong frame ordering.
      Now that a non-first frame will be considered as such, its tasks won't
      be able to execute (since they depend on a truly previous frame that is
      considered as coming after it), and c->task_thread.cur will be increased
      past that frame, with no way of being reset, eventually leading to a hang.
      575af258
  5. Nov 26, 2024
  6. Nov 21, 2024
  7. Nov 19, 2024
    • arm32: looprestoration: Rewrite the SGR functions · 30c3dd8e
      Martin Storsjö authored
      Switch to the same cache-friendly algorithm as was done for arm64
      in c121b831.
      
      This uses much less stack memory, and is much more cache friendly.
      In this form, most of the individual asm functions only operate on
      a single row of data at a time.
      
      Some of the functions used to be unrolled to operate on two rows
      at a time, while they now only operate on one at a time. In practice,
      this is still a large performance win, as data is accessed in a
      much more cache friendly manner.
      
      This gives a 2-37% speedup, and reduces the peak amount of stack
      used for these functions from 255 KB to 33 KB.
      
      Before:              Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
      sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
      sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
      sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
      sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
      sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
      After:
      sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
      sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
      sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
      sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
      sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
      sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
      
      With this change in place, dav1d can run with around 72 KB of stack
      on arm targets.
      
      Not all functions have been merged in the same way as they were
      for arm64 in c121b831, so some
      minor differences remain; it's possible to incrementally optimize
      this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse
      finish_filter_row1/2 with sgr_weighted_row1, and make a version of
      finish_filter_row1 that produces 2 rows, as is done for arm64.
      
      It's also possible to rewrite the logic for calculating sgr_x_by_x
      in the same way as was done for arm64 in
      79db1624.
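
      To illustrate the row-at-a-time structure, here is a loose C-level
      sketch (the helper names echo the ones mentioned above, but the
      signatures, types and row bookkeeping are invented for illustration;
      the real code also reads the source a few rows ahead of the output,
      which is elided here):

      #include <stddef.h>
      #include <stdint.h>

      /* Empty stubs standing in for the per-row NEON helpers. */
      static void box3_row(int32_t *sumsq, int16_t *sum,
                           const uint8_t *src, int w) {}
      static void calc_row_ab1(int32_t *A, int16_t *B, const int32_t *sumsq,
                               const int16_t *sum, int w, unsigned strength) {}
      static void finish_filter_row1(int16_t *out, const uint8_t *src,
                                     int32_t *const A[3], int16_t *const B[3],
                                     int w) {}
      static void sgr_weighted_row1(uint8_t *dst, const uint8_t *src,
                                    const int16_t *filtered, int w, int w1) {}

      /* Illustrative 3x3 driver: every helper touches a single row, and only
       * a 3-row ring of A/B data stays alive, which is what keeps the stack
       * footprint and the cache working set small. */
      static void sgr_3x3_rows(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int w, int h, unsigned strength, int w1,
                               int32_t *sumsq_row, int16_t *sum_row,
                               int32_t *A_rows[3], int16_t *B_rows[3],
                               int16_t *filtered_row)
      {
          for (int y = 0; y < h; y++) {
              const uint8_t *const row = src + y * src_stride;

              box3_row(sumsq_row, sum_row, row, w);          /* box sums    */
              calc_row_ab1(A_rows[2], B_rows[2],             /* a/b coeffs  */
                           sumsq_row, sum_row, w, strength);
              finish_filter_row1(filtered_row, row,          /* 3x3 filter  */
                                 A_rows, B_rows, w);
              sgr_weighted_row1(dst + y * dst_stride, row,   /* final blend */
                                filtered_row, w, w1);

              /* Rotate the 3-row ring so the next iteration overwrites the
               * oldest row (top/bottom edge handling elided). */
              int32_t *const A0 = A_rows[0];
              int16_t *const B0 = B_rows[0];
              A_rows[0] = A_rows[1]; A_rows[1] = A_rows[2]; A_rows[2] = A0;
              B_rows[0] = B_rows[1]; B_rows[1] = B_rows[2]; B_rows[2] = B0;
          }
      }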
      30c3dd8e
  8. Nov 18, 2024
    • arm32: looprestoration: Apply simplifications to align with C code · 1b7f1263
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      and the arm64 assembly in ce80e6da,
      to the arm32 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:             Cortex A7         A8        A53        A72        A73
      sgr_3x3_8bpc_neon:   926600.0   753468.3   553704.1   399379.1   369674.4
      sgr_5x5_8bpc_neon:   621722.9   540412.7   357275.9   274474.3   254996.0
      sgr_mix_8bpc_neon:  1529715.1  1171282.5   894982.9   659996.6   610407.2
      After:
      sgr_3x3_8bpc_neon:   899020.3   697278.6   541569.9   382824.3   353891.8
      sgr_5x5_8bpc_neon:   602183.2   498322.9   348974.5   264833.9   243837.7
      sgr_mix_8bpc_neon:  1497870.8  1182121.3   880470.9   635939.3   590909.3
      1b7f1263
    • Martin Storsjö · c43debf1
    • arm: looprestoration: Fix the single line loop in sgr_weighted2 · 1c7433a5
      Martin Storsjö authored
      After processing one block, this accidentally jumped to the loop
      for processing two lines at once.
      
      The same bug was replicated in both 32 and 64 bit versions.
      1c7433a5
    • looprestoration: Rewrite the C version of the SGR filter · f32b3146
      Martin Storsjö authored
      This reduces the stack usage of these functions (the C version)
      significantly, and gives them a 15-40% speedup (on an Apple M3,
      with Xcode Clang 16).
      
      The C version of this function does matter; even though we have
      assembly implementations of it on x86 and aarch64, those only
      cover the 8 and 10 bpc cases, while the C version is used as the
      fallback for 12 bpc.
      
      This matches how these functions are implemented in the aarch64
      assembly: they operate over a window of 3 or 5 lines (of 384 pixels
      each), instead of doing a full 384 x 64 block.
      
      The individual functions for filtering a line each end up
      much simpler, and closer to how this can be implemented in
      assembly - but the overall business logic ends up much, much
      more complex.
      
      The main difference from the aarch64 assembly implementation
      is that any buffer which is int16_t sized in the aarch64
      assembly implementation uses the type "coef" here, which
      is 32 bit in the 10/12 bpc cases. (This is required for handling
      the 12 bpc cases.)
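
      Roughly, the bitdepth templating behind this looks as follows
      (paraphrased and simplified, not the exact dav1d headers or buffer
      layout):

      #include <stdint.h>

      #ifndef BITDEPTH
      #define BITDEPTH 8            /* the template is compiled once per bitdepth */
      #endif

      #if BITDEPTH == 8
      typedef uint8_t  pixel;
      typedef int16_t  coef;        /* 16 bit intermediates are enough for 8 bpc */
      #else
      typedef uint16_t pixel;
      typedef int32_t  coef;        /* widened so 12 bpc intermediates can't overflow */
      #endif

      /* A row buffer in the C SGR filter is then declared roughly like this,
       * costing 768 bytes per row at 8 bpc and 1536 bytes at 10/12 bpc: */
      #define MAX_UNIT_W 384        /* illustrative width bound */
      typedef struct {
          coef filtered[MAX_UNIT_W];
      } sgr_row_scratch;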
      
      With this in place, dav1d can run with around 66 KB of stack
      on x86_64 with assembly enabled, with around 74 KB of stack on
      aarch64 with assembly enabled, and with 118 KB of stack with
      assembly disabled.
      
      This increases the binary size by around 14 KB (in the case of
      aarch64 with Xcode Clang 16).
      
      On 32 bit arm, dav1d still requires around 270 KB of stack, as
      that assembly implementation of the SGR filter uses a different
      algorithm.
      f32b3146
    • arm: looprestoration: Give symbols and defines unique names · 01d417c2
      Martin Storsjö authored
      As the machine specific init file is included in the common
      template, give symbols and defines unique names that won't
      clash with similar ones in the main template.
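
      A contrived illustration of the kind of clash this avoids (all names
      below are made up):

      /* The common template ends by textually including the machine
       * specific init header, so everything defined there ends up in the
       * same translation unit as the template itself. */

      /* Defined in the common template: */
      #define ROW_STRIDE 384
      static void pad_rows(void) { /* ... */ }

      /* If the included init header also defined ROW_STRIDE or its own
       * static pad_rows(), the macro would be silently redefined and the
       * function definitions would collide. Prefixing the arch-local
       * names sidesteps both problems: */
      #define LR_NEON_ROW_STRIDE 390
      static void lr_neon_pad_rows(void) { /* ... */ }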
      01d417c2
    • Martin Storsjö · 847eece1
  9. Nov 16, 2024
  10. Nov 15, 2024
  11. Nov 14, 2024
    • arm: Use /proc/cpuinfo on linux if getauxval is unavailable · bed3a343
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      On really old libc versions, getauxval isn't available. Fall back
      to /proc/cpuinfo in those cases, just like we already do on Android.
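
      A minimal sketch of such a fallback (not dav1d's actual parser): scan
      /proc/cpuinfo for a "Features" line that contains the neon flag.

      #include <stdio.h>
      #include <string.h>

      static int have_neon_from_cpuinfo(void)
      {
          FILE *f = fopen("/proc/cpuinfo", "r");
          if (!f) return 0;

          char line[512];
          int found = 0;
          while (!found && fgets(line, sizeof(line), f)) {
              if (strncmp(line, "Features", 8))
                  continue;
              /* Look for the whole word "neon" in the feature list. */
              for (char *p = strstr(line, "neon"); p; p = strstr(p + 1, "neon")) {
                  const char after = p[4];
                  if ((p[-1] == ' ' || p[-1] == '\t') &&
                      (after == ' ' || after == '\n' || after == '\0')) {
                      found = 1;
                      break;
                  }
              }
          }
          fclose(f);
          return found;
      }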
      bed3a343
    • ci: Raise the timeout multipliers for jobs that run in QEMU · 718b62c8
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      For individual tests in dav1d-test-data, the default timeout
      is 30 seconds (which is the Meson default if nothing is
      specified). Previously it ran with a multiplier of 4, resulting
      in a total timeout of 120 seconds.
      
      When running tests in QEMU, exceeding this 120 second timeout
      could happen occasionally. Raise the multiplier to 10, allowing
      each individual job to run for up to 5 minutes.
      
      This should hopefully reduce the amount of stray failures in the
      CI.
      
      For tests that already have a higher default timeout set, such
      as checkasm, which has a 180 second default timeout, this results
      in a much longer timeout period. However, as long as we don't
      frequently see issues where these actually hang, it should be
      beneficial to just let them run to completion, rather than
      aborting early due to a tight timeout.
      718b62c8
    • arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon · 1648c232
      Martin Storsjö authored
      Also fix one case where the 32 bit input parameter w (which was in
      x6, now in x4) was used without zero extension, by referring to
      it as w4 instead.
      1648c232
  12. Nov 13, 2024
    • arm64: looprestoration: Apply simplifications to align with C code · ce80e6da
      Martin Storsjö authored
      This applies the same simplifications that were done for the C
      code and the x86 assembly in 4613d3a5,
      to the arm64 implementation.
      
      This gives a minor speedup of around a couple percent.
      
      Before:            Cortex A53        A55        A72        A73       A76  Apple M3
      sgr_3x3_8bpc_neon:   368583.2   363654.2   279958.1   272065.1  169353.3  354.6
      sgr_5x5_8bpc_neon:   258570.7   255018.5   200410.6   199478.3  117968.3  260.9
      sgr_mix_8bpc_neon:   603698.1   577383.3   482468.3   436540.4  256632.9  541.8
      After:
      sgr_3x3_8bpc_neon:   367873.2   357884.1   275462.4   268363.9  165909.8  346.0
      sgr_5x5_8bpc_neon:   254988.4   248184.2   190875.1   196939.1  120517.2  252.1
      sgr_mix_8bpc_neon:   589204.7   563565.8   414025.6   427702.2  251651.2  533.4
      ce80e6da
    • Martin Storsjö · 8bd31a92
  13. Nov 10, 2024
  14. Nov 05, 2024
    • Brad Smith · 93f12c11
    • riscv64/mc: Only process w*3/4 elements in blend_v · a17c8625
      Nathan E. Egge authored
      Setting VL for this function only impacts the 16bpc performance, and
      only on the SpacemiT K1, which has two vector units of length 128b each.
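
      For reference, a rough paraphrase of the scalar logic in C (simplified;
      the real code is templated over bitdepth and takes its mask from a
      shared table): only the left w*3/4 columns are blended, since the mask
      leaves the remaining quarter of the destination untouched.

      #include <stddef.h>
      #include <stdint.h>

      static void blend_v_8bpc_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                      const uint8_t *tmp, int w, int h,
                                      const uint8_t *mask /* w entries */)
      {
          const int blend_w = (w * 3) >> 2;   /* e.g. 24 of 32 columns */
          do {
              for (int x = 0; x < blend_w; x++) {
                  const int m = mask[x];
                  dst[x] = (uint8_t)((dst[x] * (64 - m) + tmp[x] * m + 32) >> 6);
              }
              dst += dst_stride;
              tmp += w;
          } while (--h);
      }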
      
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_8bpc_c:        220.0 ( 1.00x)    221.3 ( 1.00x)    0.59%
      blend_v_w2_8bpc_rvv:      145.7 ( 1.51x)    148.2 ( 1.49x)    1.72%
      blend_v_w4_8bpc_c:        942.1 ( 1.00x)    943.7 ( 1.00x)    0.17%
      blend_v_w4_8bpc_rvv:      240.4 ( 3.92x)    242.9 ( 3.89x)    1.04%
      blend_v_w8_8bpc_c:       1782.3 ( 1.00x)   1783.8 ( 1.00x)    0.08%
      blend_v_w8_8bpc_rvv:      252.6 ( 7.06x)    254.9 ( 7.00x)    0.91%
      blend_v_w16_8bpc_c:      3650.9 ( 1.00x)   3647.0 ( 1.00x)   -0.11%
      blend_v_w16_8bpc_rvv:     495.5 ( 7.37x)    494.4 ( 7.38x)   -0.22%
      blend_v_w32_8bpc_c:      7013.0 ( 1.00x)   7018.2 ( 1.00x)    0.07%
      blend_v_w32_8bpc_rvv:     807.9 ( 8.68x)    802.0 ( 8.75x)   -0.73%
      
      blend_v_w2_16bpc_c:       226.1 ( 1.00x)    225.5 ( 1.00x)   -0.27%
      blend_v_w2_16bpc_rvv:     148.6 ( 1.52x)    148.9 ( 1.51x)    0.20%
      blend_v_w4_16bpc_c:      1010.7 ( 1.00x)   1006.7 ( 1.00x)   -0.40%
      blend_v_w4_16bpc_rvv:     306.7 ( 3.30x)    307.4 ( 3.27x)    0.23%
      blend_v_w8_16bpc_c:      1990.2 ( 1.00x)   1996.1 ( 1.00x)    0.30%
      blend_v_w8_16bpc_rvv:     519.5 ( 3.83x)    523.4 ( 3.81x)    0.75%
      blend_v_w16_16bpc_c:     3744.5 ( 1.00x)   3742.4 ( 1.00x)   -0.06%
      blend_v_w16_16bpc_rvv:    899.6 ( 4.16x)    906.4 ( 4.13x)    0.76%
      blend_v_w32_16bpc_c:     7047.5 ( 1.00x)   7079.3 ( 1.00x)    0.45%
      blend_v_w32_16bpc_rvv:   1475.5 ( 4.78x)   1483.3 ( 4.77x)    0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_8bpc_c:        216.3 ( 1.00x)    214.4 ( 1.00x)   -0.88%
      blend_v_w2_8bpc_rvv:      144.0 ( 1.50x)    143.6 ( 1.49x)   -0.28%
      blend_v_w4_8bpc_c:        919.8 ( 1.00x)    918.1 ( 1.00x)   -0.18%
      blend_v_w4_8bpc_rvv:      236.6 ( 3.89x)    236.4 ( 3.88x)   -0.08%
      blend_v_w8_8bpc_c:       1739.3 ( 1.00x)   1736.8 ( 1.00x)   -0.14%
      blend_v_w8_8bpc_rvv:      236.8 ( 7.34x)    236.3 ( 7.35x)   -0.21%
      blend_v_w16_8bpc_c:      3374.7 ( 1.00x)   3374.9 ( 1.00x)    0.01%
      blend_v_w16_8bpc_rvv:     297.0 (11.36x)    296.8 (11.37x)   -0.07%
      blend_v_w32_8bpc_c:      6647.5 ( 1.00x)   6645.5 ( 1.00x)   -0.03%
      blend_v_w32_8bpc_rvv:     403.3 (16.48x)    402.4 (16.51x)   -0.22%
      
      blend_v_w2_16bpc_c:       221.4 ( 1.00x)    220.1 ( 1.00x)   -0.59%
      blend_v_w2_16bpc_rvv:     146.3 ( 1.51x)    147.3 ( 1.49x)    0.68%
      blend_v_w4_16bpc_c:       973.3 ( 1.00x)    972.7 ( 1.00x)   -0.06%
      blend_v_w4_16bpc_rvv:     280.3 ( 3.47x)    282.1 ( 3.45x)    0.64%
      blend_v_w8_16bpc_c:      1814.8 ( 1.00x)   1816.2 ( 1.00x)    0.08%
      blend_v_w8_16bpc_rvv:     376.6 ( 4.82x)    376.9 ( 4.82x)    0.08%
      blend_v_w16_16bpc_c:     3485.5 ( 1.00x)   3485.5 ( 1.00x)    0.00%
      blend_v_w16_16bpc_rvv:    531.1 ( 6.56x)    525.6 ( 6.63x)   -1.04%
      blend_v_w32_16bpc_c:     6788.3 ( 1.00x)   6778.8 ( 1.00x)   -0.14%
      blend_v_w32_16bpc_rvv:    904.5 ( 7.51x)    854.6 ( 7.93x)   -5.52%
      a17c8625
  15. Nov 04, 2024
    • riscv64/mc16: Unroll 16bpc RVV blend_v 2x · 907dd871
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       225.8 ( 1.00x)    225.7 ( 1.00x)   -0.04%
      blend_v_w2_16bpc_rvv:     194.7 ( 1.16x)    148.6 ( 1.52x)  -23.68%
      blend_v_w4_16bpc_c:      1011.3 ( 1.00x)   1005.8 ( 1.00x)   -0.54%
      blend_v_w4_16bpc_rvv:     387.2 ( 2.61x)    305.4 ( 3.29x)  -21.13%
      blend_v_w8_16bpc_c:      1878.5 ( 1.00x)   1872.7 ( 1.00x)   -0.31%
      blend_v_w8_16bpc_rvv:     475.3 ( 3.95x)    435.6 ( 4.30x)   -8.35%
      blend_v_w16_16bpc_c:     3601.9 ( 1.00x)   3601.6 ( 1.00x)   -0.01%
      blend_v_w16_16bpc_rvv:    891.2 ( 4.04x)    892.7 ( 4.03x)    0.17%
      blend_v_w32_16bpc_c:     7043.7 ( 1.00x)   7058.8 ( 1.00x)    0.21%
      blend_v_w32_16bpc_rvv:   1384.5 ( 5.09x)   1478.0 ( 4.78x)    6.75%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       222.6 ( 1.00x)    220.5 ( 1.00x)   -0.94%
      blend_v_w2_16bpc_rvv:     195.7 ( 1.14x)    146.6 ( 1.50x)  -25.09%
      blend_v_w4_16bpc_c:       972.3 ( 1.00x)    972.0 ( 1.00x)   -0.03%
      blend_v_w4_16bpc_rvv:     349.1 ( 2.79x)    281.9 ( 3.45x)  -19.25%
      blend_v_w8_16bpc_c:      1812.1 ( 1.00x)   1813.0 ( 1.00x)    0.05%
      blend_v_w8_16bpc_rvv:     481.5 ( 3.76x)    376.0 ( 4.82x)  -21.91%
      blend_v_w16_16bpc_c:     3488.4 ( 1.00x)   3484.6 ( 1.00x)   -0.11%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.73x)    523.4 ( 6.66x)  -14.01%
      blend_v_w32_16bpc_c:     6795.3 ( 1.00x)   6792.4 ( 1.00x)   -0.04%
      blend_v_w32_16bpc_rvv:    934.8 ( 7.27x)    907.3 ( 7.49x)   -2.94%
      907dd871
    • riscv64/mc16: Branchless vsetvl in blend_v function · 9710e7de
      Nathan E. Egge authored
      Kendryte K230                Before             After         Delta
      
      blend_v_w2_16bpc_c:       226.0 ( 1.00x)    226.1 ( 1.00x)    0.04%
      blend_v_w2_16bpc_rvv:     194.0 ( 1.16x)    193.9 ( 1.17x)   -0.05%
      blend_v_w4_16bpc_c:      1011.8 ( 1.00x)   1009.4 ( 1.00x)   -0.24%
      blend_v_w4_16bpc_rvv:     392.7 ( 2.58x)    390.8 ( 2.58x)   -0.48%
      blend_v_w8_16bpc_c:      1987.9 ( 1.00x)   1988.0 ( 1.00x)    0.01%
      blend_v_w8_16bpc_rvv:     561.5 ( 3.54x)    560.2 ( 3.55x)   -0.23%
      blend_v_w16_16bpc_c:     3738.1 ( 1.00x)   3739.1 ( 1.00x)    0.03%
      blend_v_w16_16bpc_rvv:    934.1 ( 4.00x)    932.2 ( 4.01x)   -0.20%
      blend_v_w32_16bpc_c:     7031.0 ( 1.00x)   7030.1 ( 1.00x)   -0.01%
      blend_v_w32_16bpc_rvv:   1403.3 ( 5.01x)   1395.8 ( 5.04x)   -0.53%
      
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.0 ( 1.00x)    221.2 ( 1.00x)    0.09%
      blend_v_w2_16bpc_rvv:     195.2 ( 1.13x)    196.0 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       969.8 ( 1.00x)    971.9 ( 1.00x)    0.22%
      blend_v_w4_16bpc_rvv:     348.8 ( 2.78x)    349.1 ( 2.78x)    0.09%
      blend_v_w8_16bpc_c:      1812.6 ( 1.00x)   1814.9 ( 1.00x)    0.13%
      blend_v_w8_16bpc_rvv:     486.1 ( 3.73x)    484.3 ( 3.75x)   -0.37%
      blend_v_w16_16bpc_c:     3483.0 ( 1.00x)   3485.1 ( 1.00x)    0.06%
      blend_v_w16_16bpc_rvv:    608.7 ( 5.72x)    607.4 ( 5.74x)   -0.21%
      blend_v_w32_16bpc_c:     6791.8 ( 1.00x)   6794.2 ( 1.00x)    0.04%
      blend_v_w32_16bpc_rvv:    940.6 ( 7.22x)    942.1 ( 7.21x)    0.16%
      9710e7de
    • riscv64/mc16: Add VLEN=256 8bpc RVV blend_v function · 28d1c217
      Nathan E. Egge authored
      SpacemiT K1                  Before             After         Delta
      
      blend_v_w2_16bpc_c:       221.5 ( 1.00x)    220.3 ( 1.00x)   -0.54%
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)    194.3 ( 1.13x)    0.41%
      blend_v_w4_16bpc_c:       968.8 ( 1.00x)    967.2 ( 1.00x)   -0.17%
      blend_v_w4_16bpc_rvv:     442.2 ( 2.19x)    347.4 ( 2.78x)  -21.44%
      blend_v_w8_16bpc_c:      1809.4 ( 1.00x)   1811.2 ( 1.00x)    0.10%
      blend_v_w8_16bpc_rvv:     557.4 ( 3.25x)    483.2 ( 3.75x)  -13.31%
      blend_v_w16_16bpc_c:     3481.4 ( 1.00x)   3473.4 ( 1.00x)   -0.23%
      blend_v_w16_16bpc_rvv:    844.3 ( 4.12x)    603.1 ( 5.76x)  -28.57%
      blend_v_w32_16bpc_c:     6783.1 ( 1.00x)   6749.8 ( 1.00x)   -0.49%
      blend_v_w32_16bpc_rvv:   1406.1 ( 4.82x)    919.4 ( 7.34x)  -34.61%
      28d1c217
    • riscv64/mc16: Add 16bpc RVV blend_v function · aa2deb89
      Nathan E. Egge authored
      Kendryte K230
      
      blend_v_w2_16bpc_c:       226.5 ( 1.00x)
      blend_v_w2_16bpc_rvv:     192.2 ( 1.18x)
      blend_v_w4_16bpc_c:      1010.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     390.5 ( 2.59x)
      blend_v_w8_16bpc_c:      1994.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     561.7 ( 3.55x)
      blend_v_w16_16bpc_c:     3737.9 ( 1.00x)
      blend_v_w16_16bpc_rvv:    928.0 ( 4.03x)
      blend_v_w32_16bpc_c:     7064.7 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1428.9 ( 4.94x)
      
      SpacemiT K1
      
      blend_v_w2_16bpc_c:       220.8 ( 1.00x)
      blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)
      blend_v_w4_16bpc_c:       967.3 ( 1.00x)
      blend_v_w4_16bpc_rvv:     439.5 ( 2.20x)
      blend_v_w8_16bpc_c:      1810.2 ( 1.00x)
      blend_v_w8_16bpc_rvv:     555.3 ( 3.26x)
      blend_v_w16_16bpc_c:     3476.4 ( 1.00x)
      blend_v_w16_16bpc_rvv:    830.9 ( 4.18x)
      blend_v_w32_16bpc_c:     6772.9 ( 1.00x)
      blend_v_w32_16bpc_rvv:   1356.3 ( 4.99x)
      aa2deb89