Commits · master · Sylvestre Ledru / dav1d

Mar 07, 2020
- headers: add missing doxy to some Dav1dSettings fields · ecc5078c
  James Almer authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
  
  ecc5078c
- headers: split some public fields into separate lines and document them · a7374232
  James Almer authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
```
The description was being added only to the last field of each line by Doxygen.
```
  a7374232
- CLI: Remove additional space · acfbd09b
  Matthias Dressel authored 4 years ago
```
The argument for --input was aligned with the argument for
--output. None of the other arguments were aligned.
For consistency, either align all or none.
This commit removes the alignment.
```
  acfbd09b
- CLI: Remove avx512 from help text · b2f7ba60
  Matthias Dressel authored 4 years ago
```
avx512 was merged with avx512icl.
See 7b208fa8
```
  b2f7ba60
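The Doxygen behaviour described in a7374232 can be sketched as follows. The structs are hypothetical and not taken from the dav1d headers. They only show why a trailing `///<` comment on a multi-field line is ambiguous, and that splitting the line leaves the layout unchanged.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Hypothetical structs for illustration; not the real Dav1dSettings. */

/* Before: two fields share a line, so (per the commit message) Doxygen
 * attaches the trailing description only to the last one, `height`. */
typedef struct ExampleBefore {
    int width, height; ///< dimensions in pixels
} ExampleBefore;

/* After: one field per line, each with its own description. */
typedef struct ExampleAfter {
    int width;  ///< width in pixels
    int height; ///< height in pixels
} ExampleAfter;

/* Splitting the declaration only affects documentation, not layout. */
static_assert(sizeof(ExampleBefore) == sizeof(ExampleAfter), "same size");
static_assert(offsetof(ExampleBefore, height) == offsetof(ExampleAfter, height),
              "same field offsets");
```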
Mar 06, 2020

This currently does not check the vulkan/placebo codepath since needed
packages are not yet in Debian unstable.

55439739

examples: fail when SDL is not found · e36ebb6f

Konstantin Pavlov authored 4 years ago

Now when -Denable_examples=true is requested, meson will fail as
expected if there is no SDL available.

e36ebb6f

CI: Add documentation CI job · b8200c13

Konstantin Pavlov authored 4 years ago

This requires a docker image with doxygen & dot installed, so bump it as
well.

Fixes #334.

b8200c13

CI: Deduplicate and template jobs · bf60f0ab

Konstantin Pavlov authored 4 years ago

This makes it much easier to introduce new jobs without copying walls of
text over and over.  No functional changes.

Changes are:
 - move docker images to common templates to make them easier to bump
 - replace "debian" tag with "docker" to choose runners
 - align meson parameters
 - use variables sections where applicable
 - move test data cache to before_script

bf60f0ab

doc: search for dot as it's needed to build doxygen documentation · c4dea948
Konstantin Pavlov authored 4 years ago

c4dea948

examples: chase · e04227c5

Jan Beich authored 4 years ago

../examples/dav1dplay.c:1030:5: warning: implicit declaration of function 'init_demuxers' is invalid in C99 [-Wimplicit-function-declaration]
    init_demuxers();
    ^
/usr/bin/ld.bfd: examples/c590b3c@@dav1dplay@exe/dav1dplay.c.o: in function `decoder_thread_main':
dav1dplay.c:(.text+0x1243): undefined reference to `init_demuxers'
cc: error: linker command failed with exit code 1 (use -v to see invocation)

e04227c5

Mar 05, 2020

Update NEWS for 0.6.0 · efd9e551
Jean-Baptiste Kempf authored 4 years ago

0.6.0

efd9e551

arm64: mc: NEON implementation of w_mask for 16 bpc · c8aaddea

Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

Checkasm numbers:          Cortex A53       A72       A73
w_mask_420_w4_16bpc_neon:       173.6     123.5     120.3
w_mask_420_w8_16bpc_neon:       484.2     344.1     329.5
w_mask_420_w16_16bpc_neon:     1411.2    1027.4    1035.1
w_mask_420_w32_16bpc_neon:     5561.5    4093.2    3980.1
w_mask_420_w64_16bpc_neon:    13809.6    9856.5    9581.0
w_mask_420_w128_16bpc_neon:   35614.7   25553.8   24284.4
w_mask_422_w4_16bpc_neon:       159.4     112.2     114.2
w_mask_422_w8_16bpc_neon:       453.4     326.1     326.7
w_mask_422_w16_16bpc_neon:     1394.6    1062.3    1050.2
w_mask_422_w32_16bpc_neon:     5485.8    4219.6    4027.3
w_mask_422_w64_16bpc_neon:    13701.2   10079.6    9692.6
w_mask_422_w128_16bpc_neon:   35455.3   25892.5   24625.9
w_mask_444_w4_16bpc_neon:       153.0     112.3     112.7
w_mask_444_w8_16bpc_neon:       437.2     331.8     325.8
w_mask_444_w16_16bpc_neon:     1395.1    1069.1    1041.7
w_mask_444_w32_16bpc_neon:     5370.1    4213.5    4138.1
w_mask_444_w64_16bpc_neon:    13482.6   10190.5   10004.6
w_mask_444_w128_16bpc_neon:   35583.7   26911.2   25638.8

Corresponding numbers for 8 bpc for comparison:

w_mask_420_w4_8bpc_neon:        126.6      79.1      87.7
w_mask_420_w8_8bpc_neon:        343.9     195.0     211.5
w_mask_420_w16_8bpc_neon:       886.3     540.3     577.7
w_mask_420_w32_8bpc_neon:      3558.6    2152.4    2216.7
w_mask_420_w64_8bpc_neon:      8894.9    5161.2    5297.0
w_mask_420_w128_8bpc_neon:    22520.1   13514.5   13887.2
w_mask_422_w4_8bpc_neon:        112.9      68.2      77.0
w_mask_422_w8_8bpc_neon:        314.4     175.5     208.7
w_mask_422_w16_8bpc_neon:       835.5     565.0     608.3
w_mask_422_w32_8bpc_neon:      3381.3    2231.8    2287.6
w_mask_422_w64_8bpc_neon:      8499.4    5343.6    5460.8
w_mask_422_w128_8bpc_neon:    21823.3   14206.5   14249.1
w_mask_444_w4_8bpc_neon:        104.6      65.8      72.7
w_mask_444_w8_8bpc_neon:        290.4     173.7     196.6
w_mask_444_w16_8bpc_neon:       831.4     586.7     591.7
w_mask_444_w32_8bpc_neon:      3320.8    2300.6    2251.0
w_mask_444_w64_8bpc_neon:      8300.0    5480.5    5346.8
w_mask_444_w128_8bpc_neon:    21633.8   15981.3   14384.8

c8aaddea

CI: run a selection of jobs on a node with avx2 · bce8fae9

Janne Grunau authored 4 years ago

Switches build-debian (for avx2 checkasm coverage) as well as test-win64 and
test-debian-unaligned-stack (for testing asm '%if's) to that node.
Refs #330, #333

bce8fae9

Mar 04, 2020

x86: Fix crash in AVX2 cdef_filter with <32-byte stack alignment · 3a6a55d8
Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

3a6a55d8

arm64: mc: NEON implementation of blend for 16bpc · fb348f64

Martin Storsjö authored 4 years ago

Checkasm numbers:     Cortex A53     A72     A73
blend_h_w2_16bpc_neon:     109.3    83.1    56.7
blend_h_w4_16bpc_neon:     114.1    61.4    62.3
blend_h_w8_16bpc_neon:     133.3    80.8    81.1
blend_h_w16_16bpc_neon:    215.6   132.7   149.5
blend_h_w32_16bpc_neon:    390.4   254.2   235.8
blend_h_w64_16bpc_neon:    719.1   456.3   453.8
blend_h_w128_16bpc_neon:  1646.1  1112.3  1065.9
blend_v_w2_16bpc_neon:     185.9   175.9   180.0
blend_v_w4_16bpc_neon:     338.0   183.4   232.1
blend_v_w8_16bpc_neon:     426.5   213.8   250.6
blend_v_w16_16bpc_neon:    678.2   357.8   382.6
blend_v_w32_16bpc_neon:   1098.3   686.2   695.6
blend_w4_16bpc_neon:        75.7    31.5    32.0
blend_w8_16bpc_neon:       134.0    75.0    75.8
blend_w16_16bpc_neon:      467.9   267.3   310.0
blend_w32_16bpc_neon:     1201.9   658.7   779.7

Corresponding numbers for 8bpc for comparison:
blend_h_w2_8bpc_neon:      104.1    55.9    60.8
blend_h_w4_8bpc_neon:      108.9    58.7    48.2
blend_h_w8_8bpc_neon:       99.3    64.4    67.4
blend_h_w16_8bpc_neon:     145.2    93.4    85.1
blend_h_w32_8bpc_neon:     262.2   157.5   148.6
blend_h_w64_8bpc_neon:     466.7   278.9   256.6
blend_h_w128_8bpc_neon:   1054.2   624.7   571.0
blend_v_w2_8bpc_neon:      170.5   106.6   113.4
blend_v_w4_8bpc_neon:      333.0   189.9   225.9
blend_v_w8_8bpc_neon:      314.9   199.0   203.5
blend_v_w16_8bpc_neon:     476.9   300.8   241.1
blend_v_w32_8bpc_neon:     766.9   430.4   415.1
blend_w4_8bpc_neon:         66.7    35.4    26.0
blend_w8_8bpc_neon:        110.7    47.9    48.1
blend_w16_8bpc_neon:       299.4   161.8   162.3
blend_w32_8bpc_neon:       725.8   417.0   432.8

fb348f64

arm: mc: Optimize blend_v · 52e9b435

Martin Storsjö authored 4 years ago

Use a post-increment with a register on the last increment, avoiding
a separate increment. Avoid processing the last 8 pixels in the w32
case when we only output 24 pixels.

Before:
ARM32                Cortex A7      A8      A9     A53     A72     A73
blend_v_w4_8bpc_neon:    450.4   574.7   538.7   374.6   199.3   260.5
blend_v_w8_8bpc_neon:    559.6   351.3   552.5   357.6   214.8   204.3
blend_v_w16_8bpc_neon:   926.3   511.6   787.9   593.0   271.0   246.8
blend_v_w32_8bpc_neon:  1482.5   917.0  1149.5   991.9   354.0   368.9
ARM64
blend_v_w4_8bpc_neon:                            351.1   200.0   224.1
blend_v_w8_8bpc_neon:                            333.0   212.4   203.8
blend_v_w16_8bpc_neon:                           495.2   302.0   247.0
blend_v_w32_8bpc_neon:                           840.0   557.8   514.0

After:
ARM32
blend_v_w4_8bpc_neon:    435.5   575.8   537.6   356.2   198.3   259.5
blend_v_w8_8bpc_neon:    545.2   347.9   553.5   339.1   207.8   204.2
blend_v_w16_8bpc_neon:   913.7   511.0   788.1   573.7   275.4   243.3
blend_v_w32_8bpc_neon:  1445.3   951.2  1079.1   920.4   352.2   361.6
ARM64
blend_v_w4_8bpc_neon:                            333.0   191.3   225.9
blend_v_w8_8bpc_neon:                            314.9   199.3   203.5
blend_v_w16_8bpc_neon:                           476.9   301.3   241.1
blend_v_w32_8bpc_neon:                           766.9   432.8   416.9

52e9b435

arm64: mc: Treat the stride as a full 64 bit (potentially signed) value in blend_8bpc_neon · a7f6fe32
Martin Storsjö authored 4 years ago

a7f6fe32
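As a rough C-level illustration of why a stride has to be handled as a full-width signed value, the sketch below addresses rows with a `ptrdiff_t` stride. The helper `row_ptr` is hypothetical and does not appear in dav1d. A negative stride is assumed to mean a picture that is walked bottom-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return a pointer to row `y`, where `stride` is the
 * signed byte distance between consecutive rows. A negative stride walks
 * the picture bottom-up, so the offset must be computed as a signed,
 * pointer-width value rather than a zero-extended 32-bit one. */
const uint8_t *row_ptr(const uint8_t *base, ptrdiff_t stride, int y) {
    assert(base != NULL && y >= 0);
    return base + (ptrdiff_t)y * stride;
}
```

With a 3-row, 4-byte-wide buffer `img`, `row_ptr(img, 4, 2)` and `row_ptr(img + 8, -4, 0)` refer to the same row.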
arm64: mc: Fix indentation · 48ffb05e
Martin Storsjö authored 4 years ago

48ffb05e

arm64: mc: Use more intuitive lane specifications for loads/stores · 83c62716

Martin Storsjö authored 4 years ago

For loads and stores that cover a full or half register (instead of
a lanewise load/store), the lane specification itself doesn't
matter, only its size.

This doesn't change the generated code, but makes it more readable.

83c62716

Mar 03, 2020
- Update NEWS for 0.6.0 · f4dac1a3
  Jean-Baptiste Kempf authored 4 years ago
  
  f4dac1a3
- CI/armv7: use `linux32 meson ...` to allow running on aarch64 · abaad816
  Janne Grunau authored 4 years ago
  
  abaad816
Mar 02, 2020

arm64: loopfilter: NEON implementation of loopfilter for 16 bpc · 360243c2

Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 4 years ago

Checkasm runtimes:      Cortex A53     A72     A73
lpf_h_sb_uv_w4_16bpc_neon:   919.0   795.0   714.9
lpf_h_sb_uv_w6_16bpc_neon:  1267.7  1116.2  1081.9
lpf_h_sb_y_w4_16bpc_neon:   1500.2  1543.9  1778.5
lpf_h_sb_y_w8_16bpc_neon:   2216.1  2183.0  2568.1
lpf_h_sb_y_w16_16bpc_neon:  2641.8  2630.4  2639.4
lpf_v_sb_uv_w4_16bpc_neon:   836.5   572.7   667.3
lpf_v_sb_uv_w6_16bpc_neon:  1130.8   709.1   955.5
lpf_v_sb_y_w4_16bpc_neon:   1271.6  1434.4  1272.1
lpf_v_sb_y_w8_16bpc_neon:   1818.0  1759.1  1664.6
lpf_v_sb_y_w16_16bpc_neon:  1998.6  2115.8  1586.6

Corresponding numbers for 8 bpc for comparison:
lpf_h_sb_uv_w4_8bpc_neon:    799.4   632.8   695.4
lpf_h_sb_uv_w6_8bpc_neon:   1067.3   613.6   767.5
lpf_h_sb_y_w4_8bpc_neon:    1490.5  1179.1  1018.9
lpf_h_sb_y_w8_8bpc_neon:    1892.9  1382.0  1172.0
lpf_h_sb_y_w16_8bpc_neon:   2117.4  1625.4  1739.0
lpf_v_sb_uv_w4_8bpc_neon:    447.1   447.7   446.0
lpf_v_sb_uv_w6_8bpc_neon:    522.1   529.0   513.1
lpf_v_sb_y_w4_8bpc_neon:    1043.7   785.0   775.9
lpf_v_sb_y_w8_8bpc_neon:    1500.4  1115.9   881.2
lpf_v_sb_y_w16_8bpc_neon:   1493.5  1371.4  1248.5

360243c2

arm: loopfilter: Prepare for 16 bpc · ebbf91f4
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 4 years ago

ebbf91f4
arm: loopfilter: Fix a comment · ac492552
Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

ac492552

Feb 25, 2020
- fuzzing: link the fuzzing binaries as C++ · d398da88
  Janne Grunau authored 5 years ago and Jean-Baptiste Kempf committed 4 years ago
```
Requires meson 0.51 for oss-fuzz and 0.49 for the fuzzing binaries in
general due to the use of the 'kwargs' keyword argument.
```
  d398da88
- fuzzing: split the fuzzing targets to their own meson.build file · 7675eb16
  Janne Grunau authored 5 years ago and Jean-Baptiste Kempf committed 4 years ago
  
  7675eb16
Feb 24, 2020

x86: Add mc w_mask 4:4:4 AVX-512 (Ice Lake) asm · 64f9db55
Henrik Gramner authored 5 years ago

64f9db55
x86: Add mc w_mask 4:2:2 AVX-512 (Ice Lake) asm · d4a7c647
Henrik Gramner authored 5 years ago

d4a7c647
x86: Add mc w_mask 4:2:0 AVX-512 (Ice Lake) asm · 50e9a39a
Henrik Gramner authored 5 years ago

50e9a39a
x86: Add mc avg/w_avg/mask AVX-512 (Ice Lake) asm · d085424c
Henrik Gramner authored 5 years ago

d085424c

x86: optimize cdef_filter_{4x{4,8},8x8}_avx2 · 22080aa3

Victorien Le Couviour--Tuffet authored 5 years ago

Add 2 separate code paths for the cases where the pri or sec strength equals 0.
Having both strengths not equal to 0 is uncommon, so branching to skip
unnecessary computations is beneficial.

------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 93.8
 after: cdef_filter_4x4_8bpc_avx2: 71.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 161.5
 after: cdef_filter_4x8_8bpc_avx2: 116.3
---------------------
before: cdef_filter_8x8_8bpc_avx2: 221.8
 after: cdef_filter_8x8_8bpc_avx2: 156.4
------------------------------------------

22080aa3
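The idea of branching on a zero strength can be sketched in scalar C. This is a simplified stand-in, not the actual CDEF arithmetic or the AVX2 code: the function names, the parameters and the `/ 16` scaling are all assumptions made for illustration.

```c
#include <assert.h>

/* Generic path: always evaluates both the primary and secondary term.
 * Simplified stand-in for a filter kernel, not the real CDEF formula. */
int filter_generic(int px, int pri_diff, int sec_diff,
                   int pri_strength, int sec_strength) {
    return px + (pri_strength * pri_diff + sec_strength * sec_diff) / 16;
}

/* Branched path: skip the term whose strength is 0. The result is the
 * same as filter_generic(), but the common cases do less work. */
int filter_branched(int px, int pri_diff, int sec_diff,
                    int pri_strength, int sec_strength) {
    assert(pri_strength >= 0 && sec_strength >= 0);
    if (!sec_strength)                       /* primary only (or neither) */
        return px + pri_strength * pri_diff / 16;
    if (!pri_strength)                       /* secondary only */
        return px + sec_strength * sec_diff / 16;
    return filter_generic(px, pri_diff, sec_diff, pri_strength, sec_strength);
}
```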

x86: add a separate fully edged case to cdef_filter_avx2 · 1bd078c2

Victorien Le Couviour--Tuffet authored 5 years ago

fully edged blocks perf:
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 91.0
 after: cdef_filter_4x4_8bpc_avx2: 75.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 154.6
 after: cdef_filter_4x8_8bpc_avx2: 131.8
---------------------
before: cdef_filter_8x8_8bpc_avx2: 214.1
 after: cdef_filter_8x8_8bpc_avx2: 195.9
------------------------------------------

1bd078c2

checkasm: Improve the cdef input randomization algorithm · efbdf7a0
Henrik Gramner authored 5 years ago and Victorien Le Couviour--Tuffet committed 5 years ago
```
Change the input buffer randomization algorithm to more readily
trigger issues with both under- and overflows in cdef_filter.
```
efbdf7a0

Feb 21, 2020
- cli: Replace malloc + memset(0) with calloc in input.c · 296d1dc0
  Luc Trudeau authored 5 years ago
  
  296d1dc0
- cli: remove init_[de]muxers() functions · cacc8e35
  Luc Trudeau authored 5 years ago
```
The muxer and demuxer arrays are now statically initialized.
```
  cacc8e35
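A minimal sketch of what static initialization of such a table can look like, as described in cacc8e35. The descriptor type, the entries and `find_demuxer` are hypothetical and not the actual dav1d CLI code. The point is only that a table of constant addresses needs no init function at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical demuxer descriptor; field names are illustrative only. */
typedef struct ExampleDemuxer {
    const char *name;
    const char *extension;
} ExampleDemuxer;

static const ExampleDemuxer ivf_demuxer    = { "ivf",    "ivf" };
static const ExampleDemuxer annexb_demuxer = { "annexb", "obu" };

/* Statically initialized, NULL-terminated table: the addresses are
 * constant expressions, so nothing has to run before the first lookup. */
static const ExampleDemuxer *const demuxers[] = {
    &ivf_demuxer,
    &annexb_demuxer,
    NULL,
};

/* Look up a demuxer by name; returns NULL if there is no match. */
const ExampleDemuxer *find_demuxer(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; demuxers[i]; i++)
        if (!strcmp(demuxers[i]->name, name))
            return demuxers[i];
    return NULL;
}
```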
Feb 20, 2020
- Replace malloc+memset(0) with calloc · 0c885607
  Luc Trudeau authored 5 years ago
  
  0c885607
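The malloc + memset(0) to calloc replacement named in 0c885607 and 296d1dc0 follows a common pattern, sketched below with a hypothetical `ExampleCtx` struct (not a dav1d type). Both functions return a zero-filled object; the calloc form does it in one call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context struct, for illustration only. */
typedef struct ExampleCtx {
    int frames;
    void *priv;
} ExampleCtx;

/* Before: allocate, then clear in a separate step. */
ExampleCtx *ctx_alloc_memset(void) {
    ExampleCtx *c = malloc(sizeof(*c));
    if (!c) return NULL;
    memset(c, 0, sizeof(*c));
    return c;
}

/* After: calloc returns memory that is already zero-filled. */
ExampleCtx *ctx_alloc_calloc(void) {
    return calloc(1, sizeof(ExampleCtx));
}

/* Returns 1 if every byte of the object is zero. */
int ctx_is_zeroed(const ExampleCtx *c) {
    assert(c != NULL);
    const unsigned char *p = (const unsigned char *)c;
    for (size_t i = 0; i < sizeof(*c); i++)
        if (p[i]) return 0;
    return 1;
}
```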
Feb 18, 2020
- CI: update aarch64 docker image to buster with meson 0.49 · bf56afde
  Janne Grunau authored 5 years ago
  
  bf56afde
Feb 17, 2020

arm: cdef: Do an 8 bit implementation for cases with all edges present · b33f46e8

Martin Storsjö authored 5 years ago

This increases the code size by around 3 KB on arm64.

Before:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4

After:
ARM32:
cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully
edged case is one of those three, while it's the most common case
in actual video. The difference is much bigger if only benchmarking
that particular case.)

This apparently makes the code a little slower for the w=4
cases on Cortex A8, while it's a significant speedup on all other cores.

b33f46e8

arm32: cdef: Fix a typo for consistency · aff9a210
Martin Storsjö authored 5 years ago
```
The signedness of elements doesn't matter for vsub; match the vsub.i16
next to it.
```
aff9a210

Feb 16, 2020

cli: Implement line buffering in print_stats() · 09d90658

Henrik Gramner authored 5 years ago

Console output is incredibly slow on Windows, which is aggravated by
the lack of line buffering. As a result, a significant percentage of
overall runtime is actually spent displaying the decoding progress.

Doing the line buffering manually alleviates most of the issue.

09d90658
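Manual line buffering of this kind can be sketched as below: the whole progress line is assembled in a local buffer and handed to the stream in a single write. The function names, the message format and the buffer size are assumptions for illustration and are not taken from the actual print_stats() implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Format one progress line into `buf`; returns the number of characters
 * stored (excluding the terminating NUL). Output is truncated to fit. */
size_t format_stats(char *buf, size_t size,
                    unsigned done, unsigned total, double fps) {
    assert(buf != NULL && size > 0);
    size_t len = 0;
    buf[0] = '\0';
    int r = snprintf(buf, size, "Decoded %u/%u frames", done, total);
    if (r > 0)
        len = (size_t)r < size ? (size_t)r : size - 1;
    if (fps > 0.0 && len < size - 1) {
        r = snprintf(buf + len, size - len, " - %.2f fps", fps);
        if (r > 0)
            len += (size_t)r < size - len ? (size_t)r : size - len - 1;
    }
    return len;
}

/* Emit the line with a single write instead of several small ones. */
void print_stats_line(FILE *out, unsigned done, unsigned total, double fps) {
    char buf[80];
    size_t len = format_stats(buf, sizeof(buf), done, total, fps);
    fwrite(buf, 1, len, out);
}
```

Calling `print_stats_line(stderr, 10, 100, 24.0)` would issue one `fwrite` for the whole line, which is the property the commit message relies on.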