Commits · master · Luca Barbato / dav1d

Oct 19, 2023
- x86: Add 8-bit ipred z3 AVX-512 (Ice Lake) asm · fd4ecc2f
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  fd4ecc2f
Oct 05, 2023
- deblock_avx512: convert byte-shifts to gf2p8affineqb · 47107e38
  Ronald S. Bultje authored 1 year ago
  
  47107e38
Oct 04, 2023
- x86: Add 8-bit ipred z1 AVX-512 (Ice Lake) asm · 4c012978
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  4c012978
- x86: Consolidate some pb_0to31 and pb_0to63 constants · 8936bab7
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  8936bab7
Oct 03, 2023
- Prepare for release 1.3.0 · 48035599
  Jean-Baptiste Kempf authored 1 year ago
  
  48035599
Sep 08, 2023

fix: various errors in implementation of BTI · 769bd145

André Kempe authored 1 year ago

Amend call type in refmvs. Because these blocks are reached via
blr x11, they need to be annotated.

Add missing BTI landing pads in ipred.S and ipred16.S. Because the
subroutines are called via a br from register, they need annotation with
'bti j' (AARCH64_VALID_JUMP_TARGET).

769bd145

Aug 18, 2023
- Use the correct free() function on dav1d_mem_pool_init() failure · 97becd73
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  97becd73
Jul 25, 2023

Don't hard-code FGS block size · e58afe4d

Niklas Haas authored 1 year ago

Avoiding this hard-coded round-and-shift allows FGS to continue working
when modifying FG_BLOCK_SIZE (for whatever reason), and is better style
(no magic constants).

e58afe4d
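As an illustration of the idea in this commit (a minimal sketch, not the actual dav1d code), the block count can be derived from the named constant with a rounding-up division instead of a hard-coded add-and-shift. The value 32 for `FG_BLOCK_SIZE` and both helper names are assumptions made for this example.

```c
#include <assert.h>

/* Assumed value for illustration; it should appear in exactly one place. */
#define FG_BLOCK_SIZE 32

/* Hard-coded form: only correct while the block size is exactly 32. */
static int fg_num_rows_hardcoded(int height) {
    return (height + 31) >> 5;
}

/* Generic form: rounds up using the constant, so it keeps working
 * if FG_BLOCK_SIZE is changed. */
static int fg_num_rows(int height) {
    assert(height >= 0);
    return (height + FG_BLOCK_SIZE - 1) / FG_BLOCK_SIZE;
}
```

Both helpers agree while `FG_BLOCK_SIZE` is 32; only the second stays correct for other block sizes.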

Rename BLOCK_SIZE to FG_BLOCK_SIZE · 202f68e4
Niklas Haas authored 1 year ago
```
Makes this (globally available) constant more descriptive.
```
202f68e4

Jul 18, 2023

Account for chroma subsampling when allocating cbi buffers · 43a11ccb

Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

Reduces memory usage (by 3 kB per sb128 for 4:2:0) when decoding
streams with subsampled chroma when frame threading is enabled.

This also simplifies the logic for calculating cbi indices.
Both entropy decoding and reconstruction access the elements in
the same order, so calculating block x/y positions is redundant
and we can instead just store values sequentially and increase
the pointer by one every time it's accessed.

43a11ccb
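The sequential-access idea described above can be sketched roughly as follows. This is a simplified illustration, not the dav1d implementation: the `ExCursor` type, the helper names and the `uint8_t` element type are all hypothetical. The producer and consumer visit elements in the same order, so each side only needs a pointer that advances by one per access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over a buffer; the element type is simplified. */
typedef struct {
    uint8_t *pos; /* next element to access */
    uint8_t *end; /* one past the last element */
} ExCursor;

/* Producer side (e.g. entropy decoding): store and advance. */
static void ex_store_next(ExCursor *c, uint8_t value) {
    assert(c->pos < c->end);
    *c->pos++ = value;
}

/* Consumer side (e.g. reconstruction): load and advance, same order. */
static uint8_t ex_load_next(ExCursor *c) {
    assert(c->pos < c->end);
    return *c->pos++;
}
```

No block x/y position enters the index calculation; correctness relies only on both passes visiting blocks in the same order.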

Jul 12, 2023
- checkasm: Always bench C-only functions as well · 9278a14c
  Matthias Dressel authored 1 year ago
```
Integrates --bench-c into --bench to simplify benchmarks.
```
  9278a14c
Jul 07, 2023

windows: Clarify unicode characters in RC files · a7e12b62

Martin Storsjö authored 1 year ago

Windows RC files can have strings expressed either as narrow
chars in a specific codepage, or as wide unicode strings.
Regardless of which way they are expressed, they are converted into
unicode strings in the compiled resource files.

When using narrow strings, even if using escaped chars like \251,
those chars are interpreted according to a specific codepage. The
codepage can be specified with arguments to the RC/windres tool
(or with a pragma, but not all tools support the pragmas),
but when no codepage is specified, the exact interpretation varies.

llvm-rc uses a hard stance of defaulting to only accepting ASCII
chars unless something else has been specified (and pragmas aren't
supported). llvm-windres defaults to CP 1252 though, for compatibility
with what most people probably intend.

However, GNU windres and MS rc.exe actually default to what the
system's current default codepage is. That means that if the resource
file is built on a machine with e.g. Japanese as the default locale,
the file gets built differently, with a different Unicode character
than what was intended.

By converting the strings to wide strings, it is unambiguous that
\251 refers to the Unicode code point U+00A9 (octal 0251), i.e.
the copyright sign.

This fixes building the RC files with llvm-rc. With GNU windres,
llvm-windres and rc.exe, the files still generate the bitwise exact
same output as before.

a7e12b62
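RC syntax is not C, but C wide string literals treat octal escapes analogously, which makes for a small self-contained illustration: in a wide literal the escape `\251` directly denotes the value 0xA9, with no codepage involved. The helper name below is hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Return the numeric value of the first element of a wide string. */
static unsigned long ex_first_wide_unit(const wchar_t *s) {
    assert(s != NULL && s[0] != L'\0');
    return (unsigned long)s[0];
}
```

Calling `ex_first_wide_unit(L"\251")` yields 0xA9, the code point of the copyright sign. The same byte in a narrow string has no fixed meaning until a codepage is chosen.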

checkasm: document '-t' in --help text · fc40a0db
Matthias Dressel authored 1 year ago

fc40a0db

Jul 06, 2023
- x86: Fix misaligned loads in high bit-depth pal_pred SSSE3 asm · 9eace34c
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Regression introduced in 72e9c7c0.
```
  9eace34c
- x86: Add pal_idx_finish asm · 8dbf789e
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  8dbf789e
- Move palette packing/edge-extension into a DSP function · 852cc340
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  852cc340
- arm: ipred: Update pal_pred to work with packed indices · bc76a220
  Martin Storsjö authored 1 year ago and Henrik Gramner committed 1 year ago
  
  bc76a220
- Pack palette indices · 72e9c7c0
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Pack two indices into each byte instead of storing them separately.

Reduces memory usage by up to 16 kB per sb128 in streams that use
screen content tools when frame-threading is enabled, at the cost
of some additional computational overhead for packing/unpacking.
```
  72e9c7c0
- Use pixel instead of uint16_t for palette buffers · 233a424c
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Reduces memory usage by 6 kB per sb128 in 8bpc streams that
use screen content tools when frame-threading is enabled.
```
  233a424c
- Remove redundant 4:4:4 wedge sign tables · d437510e
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Only one of the sign or no-sign 4:4:4 tables is ever used for
any given wedge index, so there's no point in having both.

Reduces the table size by around 50 kB.
```
  d437510e
- Optimize the size of interintra/wedge index tables · 90a45d89
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Replace pointers with 16-bit relative offsets and remove entries
for unused block sizes (only 8x8..32x32 are relevant).

Reduces the table size by around 17 kB.
```
  90a45d89
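The "Pack palette indices" change above stores two indices per byte. A minimal sketch of that kind of packing, assuming each index fits in 4 bits and that the first index of a pair goes in the low nibble (the nibble order and the function names are assumptions for illustration, not taken from the dav1d source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n indices (each < 16) into (n + 1) / 2 bytes, low nibble first. */
static void ex_pack_indices(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        assert(src[i] < 16 && src[i + 1] < 16);
        dst[i / 2] = (uint8_t)(src[i] | (src[i + 1] << 4));
    }
    if (i < n) { /* odd count: last index occupies the low nibble only */
        assert(src[i] < 16);
        dst[i / 2] = src[i];
    }
}

/* Read back the i-th index from the packed buffer. */
static uint8_t ex_unpack_index(const uint8_t *packed, size_t i) {
    const uint8_t byte = packed[i / 2];
    return (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
}
```

The buffer shrinks to half its size, at the cost of a shift and mask per access, which matches the memory versus compute trade-off the commit message describes.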
Jul 01, 2023

arm32: refmvs: Fix building with MS armasm · 616bfd15

Martin Storsjö authored 1 year ago

Add an explicit align before the jump table; this avoids armasm bugs
in how label differences are calculated. This matches how all other
jump tables are written in our 32 bit arm assembly.

616bfd15

Jun 30, 2023

x86: Add refmvs.load_tmvs asm · a500abb7
Victorien Le Couviour--Tuffet authored 1 year ago

a500abb7

arm32: refmvs: Add NEON implementation of save_tmvs · b33d77f9

Martin Storsjö authored 1 year ago

Relative speedup compared to C:
             Cortex A7     A8     A9    A53    A72    A73
save_tmvs_neon:   1.20   1.42   1.25   1.58   1.26   1.99

b33d77f9

arm64: refmvs: Use addp instead of trn2+add · a1d7763f

Martin Storsjö authored 1 year ago

Also improve scheduling in the prologue and fix a few cases of
inconsistent indentation.

Before:        Cortex A53       A55      A72       A73      A76  Apple M1
save_tmvs_neon:   73657.2   74470.9  72238.1   56095.4  34135.7  207.9
After:
save_tmvs_neon:   72187.2   74434.6  71068.9   56043.9  33237.4  201.0

(The changes to the M1 numbers are mostly measurement noise though.)

a1d7763f

Jun 28, 2023

arm64: refmvs: Fix building with MSVC · 189d47c2

Martin Storsjö authored 1 year ago

Binutils and LLVM assemblers can infer that this str instruction must
be stur (and implicitly assemble it into that instruction), while MS
armasm64 errored out with this message:

src\libdav1d.a.p\refmvs.obj.asm(673) : error A2518: operand 2: Memory offset must be aligned
str q2, [x3, #(8*5-16)]

189d47c2
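The offset in the error message, `8*5-16`, evaluates to 24, which is not a multiple of 16. A rough sketch of the encoding rule involved, covering only the two A64 immediate forms relevant here (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* STR (immediate, unsigned offset) for a 128-bit Q register:
 * the offset must be a multiple of 16 in the range 0..65520. */
static bool ex_fits_str_q(int offset) {
    return offset >= 0 && offset <= 4095 * 16 && offset % 16 == 0;
}

/* STUR: unscaled signed 9-bit offset, -256..255, any alignment. */
static bool ex_fits_stur(int offset) {
    return offset >= -256 && offset <= 255;
}

/* Pick the mnemonic an assembler would need for this offset. */
static const char *ex_store_mnemonic(int offset) {
    if (ex_fits_str_q(offset)) return "str";
    assert(ex_fits_stur(offset)); /* this sketch handles no other forms */
    return "stur";
}
```

For offset 24 only the unscaled form fits, so assemblers that do not substitute it automatically need `stur` written explicitly.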

Jun 26, 2023

arm64: refmvs: Process two blocks at a time in save_tmvs · c39779f4

Martin Storsjö authored 1 year ago

Before:        Cortex A53       A55     A72       A73      A76  Apple M1
save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4
After:
save_tmvs_neon:   73780.0   74339.2  70414.1  59102.0  35028.4  213.9

The benefit from this is marginal on Cortex A53 and A55, and Apple
M1, while this change actually makes the code notably slower on
Cortex A72, A73 and A76.

c39779f4

arm64: refmvs: Add NEON implementation of save_tmvs · 6aa37aec

Martin Storsjö authored 1 year ago

               Cortex A53       A55      A72      A73      A76  Apple M1
save_tmvs_c:     116768.4  122653.1  82587.7  90445.0  45386.8  242.1
save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4

Relative speedup compared with C:
            Cortex A53    A55    A72    A73    A76   Apple M1
save_tmvs_neon:   1.47   1.54   1.51   1.66   1.52   1.12

6aa37aec
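The relative speedup row above is the C timing divided by the NEON timing for each core. A small sketch of that arithmetic (the function name is hypothetical):

```c
#include <assert.h>

/* Relative speedup of an optimized function over a reference:
 * reference time divided by optimized time (higher is better). */
static double ex_relative_speedup(double ref_time, double opt_time) {
    assert(opt_time > 0.0);
    return ref_time / opt_time;
}
```

For the Cortex A53 column, `ex_relative_speedup(116768.4, 79184.7)` rounds to the 1.47 listed in the table.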

Jun 22, 2023

arm64: looprestoration: Rewrite the SGR functions · c121b831

Martin Storsjö authored 1 year ago

Make them operate in a more cache friendly manner, interleaving the
various passes, and merging some of the functions that operate on
data in similar patterns.

This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3,
from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.

This does however increase the size of the binary by about 12 KB. (The
executable code generated from assembly actually shrinks by a little,
but the higher level logic in C is quite nontrivial.)

This is somewhat similar to what was done for x86 in
fe2bb774.

Benchmarks from checkasm:

Before:              Cortex A53       A55       A72       A73       A76  Apple M1
sgr_3x3_8bpc_neon:     493005.0  483133.2  365056.3  345197.9  202819.1     537.3
sgr_5x5_8bpc_neon:     353152.6  349614.3  268962.2  248431.8  142302.4     385.9
sgr_mix_8bpc_neon:     829903.9  815910.9  622858.5  577238.0  333362.9     881.7
sgr_3x3_10bpc_neon:    504778.6  499851.6  379203.1  346695.2  199738.7     537.0
sgr_5x5_10bpc_neon:    363111.9  362489.7  267903.1  247506.5  138417.2     351.3
sgr_mix_10bpc_neon:    853053.7  846768.8  628349.6  584553.8  328399.5     843.6

After:
sgr_3x3_8bpc_neon:     387949.9  384216.4  294423.7  301968.2  184643.1     492.4
sgr_5x5_8bpc_neon:     259854.7  257233.2  193983.7  198388.4  128497.0     341.2
sgr_mix_8bpc_neon:     606401.5  595661.3  457209.7  462721.8  281906.7     738.6
sgr_3x3_10bpc_neon:    392472.7  394100.5  296048.1  304339.4  184271.4     471.3
sgr_5x5_10bpc_neon:    257248.3  257651.1  197552.5  199655.1  130739.7     322.9
sgr_mix_10bpc_neon:    605263.3  611197.4  441789.3  461339.2  286320.1     721.4

Speedup vs before:       27-41%    25-40%    23-42%    13-26%     5-18%     8-19%

c121b831

arm64: looprestoration: Properly use 32 bit registers for 32 bit parameters · 3c2f2087

Martin Storsjö authored 1 year ago

This issue isn't caught by checkasm, since these functions are
internal to the SGR implementation, and checkasm only affects
the parameters on the external DSP function interface.

This could potentially trigger errors with future compilers.

3c2f2087

Jun 12, 2023
- tools/dav1d: use the new version macros · 2373fda3
  James Almer authored 1 year ago
```
Signed-off-by: James Almer <jamrial@gmail.com>
```
  
  2373fda3
- version.h: add macros to extract version components · ccb88afa
  James Almer authored 1 year ago
```
Signed-off-by: James Almer <jamrial@gmail.com>
```
  
  ccb88afa
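Macros of the kind these two commits describe typically unpack a version number stored in a single integer. A minimal sketch, assuming an 8-bit-per-component layout; the `EX_` names, the layout and the compatibility policy are hypothetical and not copied from dav1d's `version.h`:

```c
#include <assert.h>

/* Hypothetical layout: 0x00MMmmpp (major, minor, patch; 8 bits each). */
#define EX_API_MAJOR(v) (((v) >> 16) & 0xFF)
#define EX_API_MINOR(v) (((v) >>  8) & 0xFF)
#define EX_API_PATCH(v) (((v) >>  0) & 0xFF)

/* Pack three components into one integer. */
static unsigned ex_pack_version(unsigned major, unsigned minor, unsigned patch) {
    assert(major < 256 && minor < 256 && patch < 256);
    return (major << 16) | (minor << 8) | patch;
}

/* Assumed policy: same major, and the runtime library is not older
 * than the headers the tool was built against. */
static int ex_api_compatible(unsigned built, unsigned runtime) {
    return EX_API_MAJOR(built) == EX_API_MAJOR(runtime) &&
           EX_API_MINOR(runtime) >= EX_API_MINOR(built);
}
```

A tool can then compare components of the version reported at runtime against the one it was compiled with, rather than requiring an exact build match.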
Jun 09, 2023
- log: replace validate_input() with assert() · fd1a5836
  James Almer authored 1 year ago
```
Missed in 31de9d50.
```
  fd1a5836
Jun 07, 2023
- Replace validate_input() with assert() in internal functions · 31de9d50
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Always-enabled basic sanity checks in API functions are reasonable,
but within internal functions assert() is more appropriate when
it comes to checking for "should never happen" conditions.
```
  31de9d50
- Eliminate validate_input() printf calls in release mode · 47e2e672
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  47e2e672
- Add a SIZE_MAX/2 validation check in dav1d_parse_sequence_header() · 682fb1ba
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  682fb1ba
- Avoid an MSVC warning about conversion to smaller data types · 77d0cbaf
  Martin Storsjö authored 1 year ago
```
After 8f320d59, MSVC started producing this warning:

[63/123] Compiling C object src/libdav1d.a.p/obu.c.obj
../src/obu.c(708): warning C4244: '=': conversion from 'uint16_t' to 'uint8_t', possible loss of data
```
  77d0cbaf
- Add a debug feature for tracking heap memory usage · 51777727
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
  
  51777727
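The split between always-on input validation and `assert()` described in the commits above might look roughly like this. The function names are hypothetical and the sketch does not reproduce dav1d's `validate_input()` macro; it only shows the division of responsibility.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Internal helper: a NULL buffer here is a programming error, so it is
 * checked with assert(), which compiles away when NDEBUG is defined. */
static size_t ex_sum_internal(const unsigned char *buf, size_t n) {
    assert(buf != NULL);
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += buf[i];
    return sum;
}

/* Public entry point: caller input is always validated, even in
 * release builds, and reported through an error code. */
int ex_api_sum(const unsigned char *buf, size_t n, size_t *out) {
    if (!buf || !out) return -EINVAL;
    *out = ex_sum_internal(buf, n);
    return 0;
}
```

External callers get a defined error for bad arguments, while internal invariants cost nothing in release builds.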
Jun 06, 2023

build: Simplify malloc handling · ed22e23d
Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

ed22e23d

tools/dav1d: check for mismatching API version and not build version · 4ce4a50d

James Almer authored 1 year ago


There's no reason to be so strict by ensuring the tool only works with a
library built from the exact same git snapshot, when the only thing that
matters is API availability and ABI compatibility.

Signed-off-by: James Almer <jamrial@gmail.com>

4ce4a50d