Commits · master · Kacper Michajłow / dav1d

Sep 01, 2024
- Allow getopt fallback to compile on non-Windows platforms · cc6eb3d5
  Cameron Cawley authored 4 months ago
  
  cc6eb3d5
Aug 30, 2024
- picture: copy HDR10+ and T35 metadata only to visible frames · bdef2997
  Cosmin Stejerean authored 6 months ago and James Almer committed 4 months ago
  
  bdef2997
Aug 29, 2024

Check for sys/types.h before using it · 6b3c489a
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

6b3c489a
Only include unistd.h and pthread.h when necessary · 7490d986
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

7490d986
Remove unused sys/stat.h includes · a796f66e
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

a796f66e
Allow compile time CPU detection to be used when trim_dsp is disabled · 41040189
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

41040189

aarch64: Split the jump tables to a separate const section · 41511bf1

Martin Storsjö authored 6 months ago

This should allow executing in environments where the executable
memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support
relocations for a 4 byte symbol difference across sections, which
allows keeping the rest of the table lookup code similar to what
it was before.

Referencing a symbol in an arbitrary location in the executable
requires a two instruction sequence (adrp+add, via the movrel
macro).

Thus, the cost of this rewrite is doubling the size of the jump
tables (which were quite small so far), and adding one instruction
in each jump table setup prologue. On an ELF build, the .text section
shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes,
i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load
(using ldrsw rather than ldr, to avoid using the "sxtw" modifier on
the add instruction), as ALU instructions with an extend modifier
have a higher latency.

MS armasm64 doesn't seem to support calculating symbol differences
across sections (see [1]), so keep the jump tables in the text
section there, to let the assembler calculate it at assembly time
instead. (Keeping the condition as _WIN32 for simplicity, as we don't
interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340
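
As a rough C model of the table layout described in this commit (a sketch only: the function names are hypothetical, and addresses are modelled as plain integers rather than real code pointers), each entry is a signed 32-bit difference from a base address, and the widening conversion plays the role of ldrsw:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: each table entry is a signed 32-bit difference
 * between a branch target and a base address, and the table itself
 * lives outside the code section. */
static int64_t jump_target(int64_t base, const int32_t *table,
                           size_t n, size_t idx)
{
    assert(idx < n);
    /* Converting int32_t to int64_t sign-extends, which is what ldrsw
     * does during the load; the following add needs no extend step. */
    const int64_t offset = table[idx];
    return base + offset;
}

/* Net size change in bytes, from the two section deltas quoted above. */
static int net_size_change(int text_shrink, int rodata_growth)
{
    return rodata_growth - text_shrink;
}
```

With the figures from the message, `net_size_change(1176, 3136)` gives the stated 1960 byte increase.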

41511bf1

Fix the macro parameter name for the CHECK_SIZE macro · 0d8abee5
Martin Storsjö authored 4 months ago

0d8abee5
Ensure that the refmvs_refpair union is packed · 0255c2b2
Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago

0255c2b2
Detect availability of pthread_setname_np and pthread_set_name_np · 033a0909
Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago

033a0909

Aug 26, 2024

aarch64: Enable detection of SVE/SVE2 on Windows · ccb02ddf

Martin Storsjö authored 5 months ago

WinSDK 10.0.26100 added these processor feature constants.

Unfortunately, no constant was added for I8MM, but if SVE_I8MM
is available, we can at least be sure that regular I8MM is
available too.
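
The inference in this message can be sketched as follows. This is an illustration under assumptions: the flag names and the function are hypothetical, and the three booleans stand in for whatever the OS feature query reports; none of this is taken from the dav1d sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only; not dav1d's definitions. */
enum {
    FLAG_I8MM = 1 << 0,
    FLAG_SVE  = 1 << 1,
    FLAG_SVE2 = 1 << 2,
};

/* The three inputs are assumed to come from the OS feature query. */
static unsigned arm_flags_from_os(bool has_sve, bool has_sve2,
                                  bool has_sve_i8mm)
{
    /* SVE2 builds on SVE, so SVE2 without SVE would be a caller bug. */
    assert(!has_sve2 || has_sve);

    unsigned flags = 0;
    if (has_sve)
        flags |= FLAG_SVE;
    if (has_sve2)
        flags |= FLAG_SVE2;
    /* There is no separate query for plain I8MM, but the SVE variant
     * being present implies that plain I8MM is present as well. */
    if (has_sve_i8mm)
        flags |= FLAG_I8MM;
    return flags;
}
```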

ccb02ddf

Aug 24, 2024

aarch64: Fix a label typo · 27491dd9

Martin Storsjö authored 4 months ago

Apparently, this case isn't actually ever executed, at least in most
checkasm runs, but some tools could complain about the relocation
against 160b, which pointed elsewhere than intended.

27491dd9

Aug 23, 2024

aarch64: Avoid looping through the BTI instructions · e560d2ba
Martin Storsjö authored 5 months ago
```
This does the same optimizations as 3329f8d1 and 1790e132 on the rest
of the code.
```
e560d2ba

aarch64: ipred: Use the right fill width loop in ipred_z3_fill_padding_neon · 5a33c5c6

Martin Storsjö authored 5 months ago

This makes the code behave as intended, when filling a rectangle
with arbitrary width (filling with the largest power of two width
until filled); previously, it accidentally fell back on writing 4
pixel wide stripes immediately.

No measurable effect on checkasm benchmarks though.
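
The intended fill strategy can be illustrated for a single row in C. This is a simplified sketch, not a translation of the assembly: the function name, the 8-bit pixel type and the maximum stripe width of 32 are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fills `width` pixels of one row, always using the largest
 * power-of-two stripe that still fits. Returns the stripe count.
 * The maximum stripe width of 32 is an assumption of this sketch. */
static int fill_row(uint8_t *dst, int width, uint8_t value)
{
    assert(width >= 0);
    int stripes = 0;
    int chunk = 32;
    while (width > 0) {
        /* Shrink the stripe only when it no longer fits. */
        while (chunk > width)
            chunk >>= 1;
        memset(dst, value, (size_t)chunk);
        dst += chunk;
        width -= chunk;
        stripes++;
    }
    return stripes;
}
```

Under these assumptions a 44 pixel row takes three stripes (32, 8, 4), whereas dropping to 4 pixel stripes immediately would take eleven.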

5a33c5c6

Aug 22, 2024

AArch64: SVE MS armasm64 fix of HBD subpel filters · 472b31f8

Arpad Panyik authored 5 months ago and Martin Storsjö committed 5 months ago

MS armasm64 cannot compile some SVE instructions with immediate
operands, e.g.:
  sub  z0.h, z0.h, #8192

The proper form is:
  sub  z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
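
The relationship between the two forms is simple arithmetic: SVE add/sub immediates are an unsigned 8-bit value with an optional left shift by 8. A small C sketch of the split (the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Splits a value into the "#imm8" or "#imm8, lsl #8" form.
 * Returns false if the value fits neither form. */
static bool split_sve_imm(uint32_t value, unsigned *imm8, unsigned *lsl)
{
    assert(imm8 && lsl);
    if (value <= 0xff) {
        *imm8 = value;
        *lsl = 0;
        return true;
    }
    if ((value & 0xff) == 0 && (value >> 8) <= 0xff) {
        *imm8 = value >> 8;
        *lsl = 8;
        return true;
    }
    return false;
}
```

For 8192 this yields 32 with a shift of 8, matching the explicit form shown in the message.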

472b31f8

aarch64: mc16: Optimize the BTI landing pads in put/prep_neon · 3329f8d1

Martin Storsjö authored 5 months ago

Don't include the BTI landing pad instruction in the loops.

If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
a no-op instruction that indicates that indirect jumps can land
there. But there's no need for the loops to include that instruction.

3329f8d1

AArch64: Add HBD subpel filters using 128-bit SVE2 · 01558f3f

Arpad Panyik authored 5 months ago and Martin Storsjö committed 5 months ago

Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
2D convolutions have 6-tap specialisations of their vertical passes.
All other convolutions are 4- or 8-tap filters which fit well with
the 4-element 16-bit SDOT instruction of SVE2.
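
To illustrate why 4- and 8-tap filters map well onto that instruction, here is a scalar C model of one accumulator lane. It is a sketch with hypothetical names, not the patch's code: one modelled SDOT covers four taps, so an 8-tap filter needs two.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 64-bit lane of a 16-bit SDOT: four signed
 * 16-bit products are summed into the accumulator. */
static int64_t sdot16_lane(int64_t acc, const int16_t a[4],
                           const int16_t b[4])
{
    for (int k = 0; k < 4; k++)
        acc += (int64_t)a[k] * b[k];
    return acc;
}

/* One filtered sample: one modelled SDOT per group of four taps. */
static int64_t filter_px(const int16_t *src, const int16_t *coef, int taps)
{
    assert(taps == 4 || taps == 8);
    int64_t acc = 0;
    for (int k = 0; k < taps; k += 4)
        acc = sdot16_lane(acc, src + k, coef + k);
    return acc;
}
```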

This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
exports put_16bpc_neon.

Benchmarks show up to 17% FPS increase depending on the input video
and the CPU used.

This patch will increase the .text by around 8 KiB.

Relative performance to the C reference on some Cortex-A/X CPUs:

    regular     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
 w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
 w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
 w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x

 w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
 w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
 w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
 w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x

 w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
 w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
 w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
 w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x

 w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
 w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
 w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
 w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

      sharp     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
 w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
 w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
 w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x

 w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
 w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
 w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
 w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x

 w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
 w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
 w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
 w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x

01558f3f

Aug 21, 2024

AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters · 713c076d

Arpad Panyik authored 5 months ago

Add 6-tap variant of standard bit-depth horizontal subpel filters
using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
also extends the HV filter with 6-tap horizontal pass using USMMLA.
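
A scalar C model may help show why a 6-tap filter suits USMMLA, which multiplies a 2x8 unsigned 8-bit matrix by a 2x8 signed 8-bit matrix (transposed) into a 2x2 block of 32-bit sums. The arrangement below, with two copies of the filter shifted by one position, is an assumption made for illustration and is not taken from the patch; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of USMMLA: acc[i][j] += sum_k a[i][k] * b[j][k],
 * with a as 2x8 unsigned bytes and b as 2x8 signed bytes. */
static void usmmla_model(int32_t acc[4], const uint8_t a[16],
                         const int8_t b[16])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            int32_t sum = 0;
            for (int k = 0; k < 8; k++)
                sum += (int32_t)a[i * 8 + k] * (int32_t)b[j * 8 + k];
            acc[i * 2 + j] += sum;
        }
}

/* Four adjacent 6-tap outputs from one modelled USMMLA.
 * src must have at least 10 readable bytes. */
static void filter6_x4(int32_t out[4], const uint8_t *src,
                       const int8_t taps[6])
{
    assert(out && src && taps);
    uint8_t a[16];
    int8_t b[16] = { 0 };
    for (int k = 0; k < 8; k++) {
        a[k]     = src[k];      /* row 0: pixels 0..7 */
        a[8 + k] = src[k + 2];  /* row 1: pixels 2..9 */
    }
    for (int k = 0; k < 6; k++) {
        b[k]         = taps[k]; /* filter at shift 0 */
        b[8 + 1 + k] = taps[k]; /* filter at shift 1 */
    }
    out[0] = out[1] = out[2] = out[3] = 0;
    usmmla_model(out, a, b);
}
```

Six taps plus a one-position shift still fit in an 8-byte row, which is what lets a single multiply produce four neighbouring outputs in this model.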

Benchmarks show up to 6-7% FPS increase depending on the input video
and the CPU used.

This patch will increase the .text by around 1.2 KiB.

Relative runtime of micro benchmarks after this patch on Neoverse
and Cortex CPU cores:

regular      V2      V1      X3    A720    A715    A520    A510
  w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
 w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
 w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
 w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x

  w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
 w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
 w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
 w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
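
For reading the table: assuming each figure is the new runtime divided by the old one (the message does not spell this out), the corresponding speedup is the reciprocal. A minimal helper with a hypothetical name:

```c
#include <assert.h>

/* Converts a relative runtime (new / old) into a speedup factor. */
static double speedup_from_relative_runtime(double rel)
{
    assert(rel > 0.0);
    return 1.0 / rel;
}
```

Under that assumption, the 0.746x entry for w8 h on V2 corresponds to roughly a 1.34x speedup.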

713c076d

Aug 12, 2024

AArch64: Fix typo in SBD 6-tap 2D/HV subpel filter · 287e90a3

Arpad Panyik authored 5 months ago

The macro parameter \xmy of filter_8tap_fn was used incorrectly as a
pointer instead of \lsrc. They refer to the same register but in
different contexts.

287e90a3

Aug 04, 2024
- decode_coefs: Optimize index offset calculations · 5ef6b241
  Kyle Siefring authored 5 months ago
```
Performance Impact on Sapphire Rapids:

Chimera: 0.46% Faster
```
  5ef6b241
Jun 26, 2024

AArch64: Move constants of DotProd subpel filters to .rodata · 2355eeb8

Arpad Panyik authored 6 months ago

The constants used for the subpel filters were placed in the .text
section for simplicity and peak performance, but this does not work on
systems with execute-only .text sections (e.g. OpenBSD).

The performance cost of moving the constants to the .rodata section
is small and mostly within the measurable noise.

2355eeb8

Jun 25, 2024

aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S · 7fbcdc6d

Martin Storsjö authored 6 months ago

The ldr instruction can only handle offsets that are a multiple
of the element size; most assemblers implicitly produce the ldur
instruction when a non-aligned offset is provided.

Older versions of MS armasm64, however, error out on this. Since
MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7
and earlier require explicitly writing the instruction as ldur.
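
The choice an assembler makes implicitly can be sketched in C. This models only the two offset forms relevant here (the scaled unsigned 12-bit offset of ldr and the unscaled signed 9-bit offset of ldur); the enum and function names are hypothetical.

```c
#include <assert.h>

enum load_form { LOAD_LDR, LOAD_LDUR, LOAD_NONE };

/* Picks the load form for a given byte offset and element size. */
static enum load_form pick_load_form(long offset, int size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8 || size == 16);
    /* ldr: non-negative multiple of the element size, scaled value
     * must fit in 12 bits. */
    if (offset >= 0 && offset % size == 0 && offset / size <= 4095)
        return LOAD_LDR;
    /* ldur: any byte offset that fits in a signed 9-bit field. */
    if (offset >= -256 && offset <= 255)
        return LOAD_LDUR;
    return LOAD_NONE;
}
```

An 8 byte load at offset 3 therefore has to be an ldur, which is the case older armasm64 versions refuse to select on their own.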

Despite this, even older versions still fail to build the mc_dotprod.S
sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
        mov             x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer
accept the negative value expression here.
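
Evaluating the expression from the error message by hand shows that it is negative. The C sketch below reproduces the arithmetic; the helper name is hypothetical, and it multiplies by 128 rather than shifting because left-shifting a negative value is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent to (hi << 7) | lo for 0 <= lo < 128, written without
 * shifting a negative value. */
static int64_t pack_pair(int64_t hi, int64_t lo)
{
    assert(lo >= 0 && lo < 128);
    return hi * 128 + lo;
}
```

Here `pack_pair(0*15-1, 3*15-1)` is `-128 + 44`, i.e. -84.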

In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
script at the moment, as it uses inline assembly to test for external
assembler features.

7fbcdc6d

Add Arm OpenBSD run-time CPU feature detection support · 431f4fb2
Brad Smith authored 7 months ago and Martin Storsjö committed 6 months ago
```
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
```
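
One common way to do this kind of detection is to decode the AArch64 ID register fields. The sketch below shows only that decoding step and makes assumptions: how the register values are obtained on OpenBSD is not stated in this log, so they are plain function arguments here, and the helper names are hypothetical. The field positions (DP in bits 47:44 of ID_AA64ISAR0_EL1, I8MM in bits 55:52 of ID_AA64ISAR1_EL1) follow the Arm architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extracts one 4-bit ID register field starting at bit `shift`. */
static unsigned id_field(uint64_t reg, unsigned shift)
{
    assert(shift <= 60);
    return (unsigned)((reg >> shift) & 0xf);
}

/* ID_AA64ISAR0_EL1.DP, bits [47:44]: non-zero means DotProd. */
static bool has_dotprod(uint64_t isar0)
{
    return id_field(isar0, 44) >= 1;
}

/* ID_AA64ISAR1_EL1.I8MM, bits [55:52]: non-zero means I8MM. */
static bool has_i8mm(uint64_t isar1)
{
    return id_field(isar1, 52) >= 1;
}
```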
431f4fb2
x86: Add 6-tap variants of high bit-depth mc SSSE3 functions · 32bf6cde
Henrik Gramner authored 7 months ago

32bf6cde

Jun 17, 2024
- itx: restrict number of columns iterated over based on EOB · ca83ee6d
  Ronald S. Bultje authored 7 months ago
  
  ca83ee6d
Jun 10, 2024
- cli: Prevent buffer over-read · 01b94cc3
  Nathan E. Egge authored 7 months ago
  
  01b94cc3
Jun 05, 2024

AArch64: Fix potential out of bounds access in DotProd H/HV filters · 92f592ed

Arpad Panyik authored 7 months ago

The DotProd/I8MM horizontal and HV/2D subpel filters use a -4 offset
for sampling instead of -3 to be better aligned in some cases. This
resulted in an out-of-bounds access, which led to crashes.

This patch fixes it.

92f592ed

May 27, 2024
- x86: Eliminate hardcoded struct offsets in refmvs load_tmvs() asm · da2cc781
  Henrik Gramner authored 7 months ago
  
  da2cc781
- refmvs: Consolidate r and rp_proj allocations · 26a2744e
  Henrik Gramner authored 7 months ago
```
The conditions for when to (re)allocate those buffers are identical,
so they can be merged into a single branch.

The allocation of the buffers themselves can also be combined to
reduce the number of allocation calls.
```
  26a2744e
- refmvs: Remove dav1d_refmvs_init() · 54801d07
  Henrik Gramner authored 7 months ago
```
It's only ever called on data which has already been zero-initialized.
```
  54801d07
- refmvs: Simplify 2-pass logic · 89a200c8
  Henrik Gramner authored 7 months ago
```
n_tc is always >= n_fc, so we only need to check the latter.
```
  89a200c8
- x86: Add 6-tap variants of 8bpc mc SSSE3 functions · ca156d90
  Henrik Gramner authored 8 months ago
  
  ca156d90
- x86: Add minor 8bpc mc SSE improvements · 8afbd4f6
  Henrik Gramner authored 8 months ago
  
  8afbd4f6
- x86: Remove 8bpc mc SSE2 asm · 85c16391
  Henrik Gramner authored 8 months ago
```
The number of nested macros caused by having to support SSE2 makes
the code very difficult to maintain and modify. It is also of
questionable value considering most other asm requires SSSE3.
```
  85c16391
- x86: Remove unused macro in mc16_avx512.asm · d3997acb
  Henrik Gramner authored 8 months ago
  
  d3997acb
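
The idea behind 26a2744e above (combining two buffer allocations into one call) can be sketched generically in C. The struct and function names are hypothetical and the element type is simplified to `int`; this is not the refmvs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical holder for two buffers backed by one allocation. */
struct two_bufs {
    int *a; /* start of the single allocation */
    int *b; /* points into the same allocation, right after a */
};

/* Allocates both buffers with a single malloc. Returns 0 on success. */
static int alloc_two(struct two_bufs *bufs, size_t n_a, size_t n_b)
{
    assert(bufs);
    int *base = malloc((n_a + n_b) * sizeof(*base));
    if (!base)
        return -1;
    bufs->a = base;
    bufs->b = base + n_a;
    return 0;
}

/* Only the base pointer is freed; b is not a separate allocation. */
static void free_two(struct two_bufs *bufs)
{
    free(bufs->a);
    bufs->a = bufs->b = NULL;
}
```

Because both buffers share one reallocation condition, a single size check and a single allocation call are enough in this model.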
May 25, 2024
- Update NEWS for 1.4.2 · 805d9e5a
  Jean-Baptiste Kempf authored 8 months ago
  
  1.4.2
  
  805d9e5a
May 20, 2024
- ARM64: Minor improvement to symbol decode · 3623543c
  Kyle Siefring authored 8 months ago and Jean-Baptiste Kempf committed 8 months ago
```
Use a slightly shorter series of instructions to compute the CDF
update rate.
```
  3623543c
- tests: Verify dav1d command line in dav1d_argon.bash · bb948769
  Henrik Gramner authored 8 months ago
```
Error out early instead of producing bogus mismatch errors in case
of an incorrect cpu mask for example.
```
  bb948769
May 19, 2024

arm64: msac: Explicitly use the ldur instruction · 9469e184

Martin Storsjö authored 8 months ago

The ldr instruction can take an immediate offset which is a multiple
of the loaded element size. If the ldr instruction is given an
immediate offset which isn't a multiple of the element size,
most assemblers implicitly generate a "ldur" instruction instead.

Older versions of MS armasm64.exe don't do this, but instead error
out with "error A2518: operand 2: Memory offset must be aligned".
(Current versions don't do this but correctly generate "ldur"
implicitly.)

Switch this instruction to an explicit "ldur", like we do elsewhere,
to fix building with these older tools.

9469e184

May 18, 2024

CI: Update Android image · 37155c11

Matthias Dressel authored 9 months ago and Jean-Baptiste Kempf committed 8 months ago

NDK 26 dropped support for API versions 19 and 20 (KitKat, Android 4.4).
The minimum supported API is now 21 (Lollipop, Android 5.0).

37155c11