Commits · master · Vibhoothi / dav1d

Sep 01, 2020

cli: Use proper integer math in Y4M PAR calculations · 3bfe8c7c

Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

The previous floating-point implementation produced results that were
sometimes slightly off due to rounding errors.

For example, a frame size of 432x240 with a render size of 176x240
previously resulted in a PAR of 98:240 instead of the correct 11:27.

Also reduce fractions to produce more readable numbers.

3bfe8c7c
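
The integer approach described in 3bfe8c7c can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names `gcd_u64` and `y4m_par` are hypothetical, and the formula assumes the PAR is the ratio of the horizontal and vertical scaling factors between frame size and render size.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor (Euclid), used to reduce the fraction. */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b) {
        const uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: pixel aspect ratio as a reduced integer fraction.
 * PAR = (render_w / frame_w) / (render_h / frame_h)
 *     = (render_w * frame_h) / (render_h * frame_w)
 * All arithmetic stays in integers, so no rounding error can occur. */
static void y4m_par(unsigned frame_w, unsigned frame_h,
                    unsigned render_w, unsigned render_h,
                    uint64_t *par_num, uint64_t *par_den)
{
    assert(frame_w && frame_h && render_w && render_h);
    const uint64_t num = (uint64_t)render_w * frame_h;
    const uint64_t den = (uint64_t)render_h * frame_w;
    const uint64_t g = gcd_u64(num, den);
    *par_num = num / g;
    *par_den = den / g;
}
```

For the example in the commit message (frame 432x240, render 176x240), the unreduced fraction is 42240:103680, which reduces to 11:27.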

Aug 30, 2020
- Output render size to Y4M · 484d6595
  Raphaël Zumer authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
```
This adds A<W>:<H> to the Y4M header, to
preserve the intended aspect ratio for
anamorphic video.
```
  484d6595
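
A rough sketch of what a Y4M stream header with the `A<W>:<H>` field from 484d6595 could look like. The function `y4m_format_header` and its parameters are hypothetical, and the interlacing tag is hard-coded to progressive (`Ip`) for brevity; only the `A` field is taken from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a Y4M stream header that includes the
 * "A<num>:<den>" pixel aspect ratio field. Returns the header length,
 * or -1 if the buffer is too small. The chroma tag is passed as text. */
static int y4m_format_header(char *buf, size_t size,
                             int w, int h, int fps_num, int fps_den,
                             int par_num, int par_den, const char *chroma)
{
    const int n = snprintf(buf, size,
                           "YUV4MPEG2 W%d H%d F%d:%d Ip A%d:%d C%s\n",
                           w, h, fps_num, fps_den, par_num, par_den, chroma);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

A player that honors the `A` field can then scale anamorphic video back to its intended display shape.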
Aug 29, 2020

arm32: mc: NEON implementation of avg/mask/w_avg for 16 bpc · 80aa7823

Martin Storsjö authored 4 years ago

                      Cortex A7       A8       A9      A53      A72      A73
avg_w4_16bpc_neon:        131.4     81.8    117.3    111.0     50.9     58.8
avg_w8_16bpc_neon:        291.9    173.1    293.1    230.9    114.7    128.8
avg_w16_16bpc_neon:       803.3    480.1    821.4    645.8    345.7    384.9
avg_w32_16bpc_neon:      3350.0   1833.1   3188.1   2343.5   1343.9   1500.6
avg_w64_16bpc_neon:      8185.9   4390.6  10448.2   6078.8   3303.6   3466.7
avg_w128_16bpc_neon:    22384.3  10901.2  33721.9  16782.7   8165.1   8416.5
w_avg_w4_16bpc_neon:      251.3    165.8    203.9    158.3     99.6    106.9
w_avg_w8_16bpc_neon:      638.4    427.8    555.7    365.1    283.2    277.4
w_avg_w16_16bpc_neon:    1912.3   1257.5   1623.4   1056.5    879.5    841.8
w_avg_w32_16bpc_neon:    7461.3   4889.6   6383.8   3966.3   3286.8   3296.8
w_avg_w64_16bpc_neon:   18689.3  11698.1  18487.3  10134.1   8156.2   7939.5
w_avg_w128_16bpc_neon:  48776.6  28989.0  53203.3  26004.1  20055.2  20049.4
mask_w4_16bpc_neon:       298.6    189.2    242.3    191.6    115.2    129.6
mask_w8_16bpc_neon:       768.6    501.5    646.1    432.4    302.9    326.8
mask_w16_16bpc_neon:     2320.5   1480.9   1873.0   1270.2    932.2    976.1
mask_w32_16bpc_neon:     9412.0   5791.9   7348.5   4875.1   3896.4   3821.1
mask_w64_16bpc_neon:    23385.9  13875.6  21383.8  12235.9   9469.2   9160.2
mask_w128_16bpc_neon:   60466.4  34762.6  61055.9  31214.0  23299.0  23324.5

For comparison, the corresponding numbers for the existing arm64
implementation:

avg_w4_16bpc_neon:         78.0     38.5     50.0
avg_w8_16bpc_neon:        198.3    105.4    117.8
avg_w16_16bpc_neon:       614.9    339.9    376.7
avg_w32_16bpc_neon:      2313.8   1391.1   1487.7
avg_w64_16bpc_neon:      5733.3   3269.1   3648.4
avg_w128_16bpc_neon:    15105.9   8143.5   8970.4
w_avg_w4_16bpc_neon:      119.2     87.7     92.9
w_avg_w8_16bpc_neon:      322.9    252.3    263.5
w_avg_w16_16bpc_neon:    1016.8    794.0    828.6
w_avg_w32_16bpc_neon:    3910.9   3159.6   3308.3
w_avg_w64_16bpc_neon:    9499.6   7933.9   8026.5
w_avg_w128_16bpc_neon:  24508.3  19502.0  20389.8
mask_w4_16bpc_neon:       138.9     98.7    106.7
mask_w8_16bpc_neon:       375.5    301.1    302.7
mask_w16_16bpc_neon:     1217.2   1064.6    954.4
mask_w32_16bpc_neon:     4821.0   4018.4   3825.7
mask_w64_16bpc_neon:    12262.7   9471.3   9169.7
mask_w128_16bpc_neon:   31356.6  22657.6  23324.5

80aa7823

Aug 28, 2020
- cli: Print the decoding fps even if the file lacks a nominal framerate · f57189e3
  Martin Storsjö authored 4 years ago
```
We can't compare the decoding speed with the intended decoding rate,
but the frame rate alone is still useful.
```
  f57189e3
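
The behavior described in f57189e3 might look roughly like the sketch below. The function `format_fps`, its parameters, and the exact message layout are assumptions made for illustration; the point is only that the measured rate is still printed when no nominal rate exists to compare against.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical progress formatter. nominal_fps <= 0 means the file
 * carries no nominal frame rate. Returns the snprintf() result. */
static int format_fps(char *buf, size_t size, unsigned n_frames,
                      double elapsed_sec, double nominal_fps)
{
    if (elapsed_sec <= 0.0)
        return snprintf(buf, size, "Decoded %u frames", n_frames);
    const double fps = n_frames / elapsed_sec;
    if (nominal_fps > 0.0) /* rate and speed relative to real time */
        return snprintf(buf, size, "Decoded %u frames - %.2f/%.2f fps (%.2fx)",
                        n_frames, fps, nominal_fps, fps / nominal_fps);
    /* No nominal rate: the decoding rate alone is still useful. */
    return snprintf(buf, size, "Decoded %u frames - %.2f fps", n_frames, fps);
}
```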
Aug 22, 2020
- tests: test stand alone API header compilation · 1bcc5ecd
  Janne Grunau authored 4 years ago
```
Errors on C11 features like anonymous structs/unions.
```
  1bcc5ecd
- dav1d/headers.h: add missing stdint.h include · 791c4697
  Janne Grunau authored 4 years ago
  
  791c4697
- contributing: document the allowed internal use of anonymous structs/unions · 6f3a8fb9
  Janne Grunau authored 4 years ago
  
  6f3a8fb9
- bump soname for API changes · e2d22c01
  Janne Grunau authored 4 years ago
  
  e2d22c01
- API: move reserved space in Dav1dSettings to the end · 89c57ce3
  Janne Grunau authored 4 years ago
```
Also changes the type to intptr_t to make adding variable size members more
convenient.
```
  89c57ce3
Aug 21, 2020
- API: remove anonymous struct and union from Dav1dWarpedMotionParams · acc92406
  Janne Grunau authored 4 years ago
  
  acc92406
- CI: compare x86inc.asm with upstream · d0e50cac
  Janne Grunau authored 4 years ago and Henrik Gramner committed 4 years ago
  
  d0e50cac
- x86inc.asm: remove private_prefix define and config.asm include · 9a2d1658
  Janne Grunau authored 4 years ago and Henrik Gramner committed 4 years ago
```
Makes using unmodified upstream x86inc.asm possible.
```
  9a2d1658
- x86inc.asm: use standalone x86inc.asm as upstream · 4cd2f82d
  Janne Grunau authored 4 years ago and Henrik Gramner committed 4 years ago
  
  4cd2f82d
- x86inc.asm: Properly sort instructions in alphabetical order · 9435be18
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
  
  9435be18
Aug 07, 2020

checkasm: Add ifdefs around the readtime check · 5bbd9632

Martin Storsjö authored 4 years ago

This fixes building in configurations where no readtime implementation
is available at all, such as MSVC targeting 32 bit ARM.

This was missed when the check was added in
95a19254.

5bbd9632

checkasm: Enforce declare_func to be outside of check_func · 0b824944

Martin Storsjö authored 4 years ago

Move the declaration of func_ref/func_new into declare_func. This
enforces that declare_func is a scope outside of/before check_func.

This ensures that if the signal handler is triggered, we rewind
to a scope outside of check_func, where check_func makes sure we
don't rerun the test that just triggered the signal handler.

0b824944

Aug 06, 2020

obu: remove a few unnecessary calls to memset() · e86ddd56
James Almer authored 4 years ago
```
The relevant structs are filled immediately after them.
```

e86ddd56
obu: reduce scope of some variables · a579cb8f
James Almer authored 4 years ago
```
Cosmetic change.
```

a579cb8f

checkasm: Use mach_absolute_time() as timer on darwin on ARM · 7c4cbbf8

Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

The cycle counter instructions aren't accessible on iOS/macOS on
ARM. The mach_absolute_time() function has much coarser precision,
but is the least bad option available.

7c4cbbf8
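
A minimal sketch of using `mach_absolute_time()` as a benchmark timer, as 7c4cbbf8 describes. The names `read_timer`, `time_call` and `busy_work` are hypothetical, and the non-Apple branch is only an assumed stand-in (standard C `clock()`) so that the sketch builds on other platforms; it is not what checkasm uses there.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <mach/mach_time.h>
/* mach_absolute_time() returns ticks of a monotonic clock; the tick
 * length is platform dependent and coarser than a cycle counter. */
static uint64_t read_timer(void) { return mach_absolute_time(); }
#else
#include <time.h>
/* Assumed stand-in for other platforms: standard C clock(). */
static uint64_t read_timer(void) { return (uint64_t)clock(); }
#endif

/* Example workload to be timed. */
static void busy_work(void) {
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000; i++)
        sink += i;
}

/* Time a single call of fn, in timer ticks. */
static uint64_t time_call(void (*fn)(void)) {
    const uint64_t start = read_timer();
    fn();
    return read_timer() - start;
}
```

With a coarse timer, a single short call may well measure as zero ticks, so averaging over many runs matters more than with a cycle counter.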

checkasm: msac: Fix signal handler recovery in msac_decode_bool* · c3a12884

Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

The signal handler does a longjmp back to the location of declare_func
when there's a signal. If declare_func is located within the check_func
block, it will just end up in an endless loop, retrying running the
failing tests again.

On linux, after resuming from the signal handler, the second signal
wouldn't trigger the signal handler but forcibly exit the process,
while on darwin, it would get stuck in an endless loop.

msac_decode_bool seems to be the only checkasm test with declare_func
within the check_func block.

c3a12884
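
The recovery problem described in c3a12884 can be reproduced in miniature with plain `setjmp`/`longjmp`. This is a simplified model, not checkasm code: the crash is simulated by a direct `longjmp` instead of a real signal, all names are hypothetical, and the retry cap exists only so the buggy layout terminates.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf recover_point; /* stands in for the point saved by declare_func */
static int attempts;          /* how many times the failing test was run */

/* Simulates a crashing test; a real signal handler would do this jump. */
static void failing_test(void) {
    attempts++;
    longjmp(recover_point, 1);
}

/* Recovery point outside the check: after the jump, the guard is already
 * set, so the failing test is skipped instead of rerun. */
static int run_with_recovery(void) {
    volatile int already_checked = 0;
    attempts = 0;
    setjmp(recover_point);
    if (!already_checked) {
        already_checked = 1;
        failing_test();
    }
    return attempts;
}

/* Recovery point inside the check: the jump lands after the guard, so
 * the test is retried every time (capped here to keep the demo finite). */
static int run_buggy(int cap) {
    volatile int already_checked = 0;
    attempts = 0;
    if (!already_checked) {
        already_checked = 1;
        setjmp(recover_point);
        if (attempts < cap)
            failing_test();
    }
    return attempts;
}
```

In this model `run_with_recovery()` runs the failing test once, while `run_buggy(5)` runs it five times, which corresponds to the endless retry loop when no cap is present.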

checkasm: Explicitly test whether the readtime() function works · 95a19254

Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

This gives a clearer indication about what is wrong, instead of
running into illegal instruction errors in the individual tests.

On ARM and AArch64, access to the cycle counter register is forbidden
in user mode code by default on Linux and Darwin.

95a19254
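
One way to probe a timer up front, in the spirit of 95a19254, is to call it once under a temporary SIGILL handler. The sketch below assumes a POSIX system; `timer_is_usable`, `good_timer` and `bad_timer` are hypothetical names, and the faulting timer is simulated with `raise(SIGILL)` rather than a forbidden instruction.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

static sigjmp_buf probe_env;

/* Jump back to the probe when the timer triggers SIGILL. */
static void on_sigill(int sig) {
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Returns 1 if calling timer() completes, 0 if it raises SIGILL. */
static int timer_is_usable(uint64_t (*timer)(void)) {
    struct sigaction sa, old;
    volatile int ok = 0;
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGILL, &sa, &old);
    if (sigsetjmp(probe_env, 1) == 0) { /* 1: also restore the signal mask */
        (void)timer();
        ok = 1;
    }
    sigaction(SIGILL, &old, NULL);      /* put the previous handler back */
    return ok;
}

/* Stand-ins for illustration: one timer that works, one that faults. */
static uint64_t good_timer(void) { return 42; }
static uint64_t bad_timer(void) { raise(SIGILL); return 0; }
```

A caller can then print one clear message and skip benchmarking when the probe fails, rather than crashing inside each individual test.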

x86: Add {put/prep}_{8tap/bilin} SSSE3 asm (64-bit) · 06f12a89
Victorien Le Couviour--Tuffet authored 4 years ago

06f12a89

Aug 05, 2020
- x86: Minor changes to MC scaled AVX2 asm · 652e5b38
  Victorien Le Couviour--Tuffet authored 4 years ago
  
  652e5b38
Jul 20, 2020
- x86: Add cdef_filter SSE optimizations · 6cf58c8e
  Henrik Gramner authored 4 years ago
  
  6cf58c8e
- dav1dplay: Fix type mismatch warning · f55cd4c6
  Marvin Scholz authored 4 years ago
  
  f55cd4c6
Jul 13, 2020
- README: Update roadmap · 1317e619
  Matthias Dressel authored 4 years ago
  
  1317e619
- Add enum entries for the maximum valid metadata values · d69fc655
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
```
A bitstream may contain values larger than the currently defined
entries, but it's technically UB to put such values into an enum.

Discovered in Firefox through fuzzing with UBSan.
```
  d69fc655
- Update README.md · 1b9792f3
  Matthias Dressel authored 4 years ago
```
- Fix small typos
- Add link to doxygen documentation
- Add high bit-depth asm goals
```
  1b9792f3
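
The idea behind d69fc655 (enum entries for the maximum valid metadata values) can be illustrated with a small sketch. The enum and function names below are hypothetical, and the 8-bit field width is an assumption chosen for the example; the actual entries added to dav1d are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical metadata enum; names and values are for illustration only.
 * The last entry covers the largest value an 8-bit field can carry, so
 * every raw value lies within the range spanned by the enumerators. */
enum ExampleColorPrimaries {
    EXAMPLE_PRI_BT709    = 1,
    EXAMPLE_PRI_UNKNOWN  = 2,
    EXAMPLE_PRI_RESERVED = 255, /* maximum valid value of the field */
};

/* Store a raw 8-bit bitstream value in the enum type. */
static enum ExampleColorPrimaries parse_primaries(uint8_t raw) {
    return (enum ExampleColorPrimaries)raw;
}
```

Without the maximum entry, a bitstream value such as 200 would fall outside the declared enumerators, which is the situation the sanitizer flagged.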
Jul 10, 2020

Hide const symbol names too when building for mac/arm64 · dfb22e57
Nico Weber authored 4 years ago
```
This is a follow-up to ebc8e4d9. dav1d doesn't currently use
this `const` macro, but rav1e does.
```
dfb22e57

Hide symbol names when building for mac/arm64 · ebc8e4d9

Nico Weber authored 4 years ago

This matches the `.hidden` already used for ELF outputs.

This is needed for Chromium's mac/arm64 build. Chromium has a build step
that verifies that Chromium Framework only exports a small, fixed set of symbols.
The dav1d symbols showed up unexpectedly. This fixes that.

ebc8e4d9

Jul 09, 2020
- meson: disable asm for x32 ABI · 725f3768
  Janne Grunau authored 4 years ago
```
Fixes #345.
```
  725f3768
Jul 04, 2020
- Move logo to doc/ folder · f116e076
  Jean-Baptiste Kempf authored 4 years ago
```
Removes files from top-level
```
  f116e076
Jul 02, 2020

arm32: ipred: Port 8 bpc NEON implementations of remaining arm64 functions · 8dd9c651

Martin Storsjö authored 4 years ago

This matches what is implemented for arm64 so far.

Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built
with Clang):

                                Cortex A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83

8dd9c651
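
The table alignment mentioned in 8dd9c651 can be illustrated in C11 as below. The table `example_weights` and the helper `is_aligned` are hypothetical and the values are made up; dav1d's own alignment macro and the contents of `dav1d_sm_weights` are not reproduced here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical weights table, aligned to 16 bytes so that SIMD code may
 * use aligned 128-bit loads when reading from its start. */
static alignas(16) const uint8_t example_weights[16] = {
    255, 240, 225, 210, 196, 182, 169, 157,
    145, 133, 122, 111, 101,  92,  83,  74,
};

/* Returns 1 if pointer p is a multiple of n bytes. */
static int is_aligned(const void *p, uintptr_t n) {
    return ((uintptr_t)p % n) == 0;
}
```

Only offsets that are themselves multiples of the alignment stay aligned, so the assembly has to index the table accordingly.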

arm32: ipred: Optimize ipred_dc_w32 · b4291523

Martin Storsjö authored 4 years ago

Do the horizontal summing in the same way as for other cases of
32 pixel summing.

This doesn't seem to affect the runtime significantly though (checkasm
benchmarks vary by a couple cycles), but it's 5 instructions shorter
at least.

b4291523

arm32: ipred: Use narrower vdups where possible · 8fd0bc90
Martin Storsjö authored 4 years ago

8fd0bc90

arm32: ipred: Fix comment formatting · f4a0127a

Martin Storsjö authored 4 years ago

This matches the arm64 original. The comment isn't about the condition,
but about the state after the conditional branch.

f4a0127a

arm32: ipred: Remove unnecessary operations in ipred_dc_w4 · d00a0227

Martin Storsjö authored 4 years ago

These came from matching some parts too closely to the arm64 version
(where the summation can be done efficiently with uaddlv by zeroing
the upper half of the register).

Before:                  Cortex A7     A8     A9    A53   A72    A73
intra_pred_dc_w4_8bpc_neon:  124.5   65.1   90.2  100.4  48.1   50.4
After:
intra_pred_dc_w4_8bpc_neon:  120.3   60.7   83.6   94.0  44.1   47.9

d00a0227

arm32: ipred: Mark a few more loads as aligned · 74d5cf57

Martin Storsjö authored 4 years ago

This speeds things up a bit on older cores.

Also do a load that duplicates the input over the whole register
instead of just loading a single lane in ipred_v_w4. This can be a
bit faster on Cortex A8.

Before:                        Cortex A7     A8     A9    A53    A72    A73
intra_pred_v_w4_8bpc_neon:          54.0   38.4   46.4   47.7   20.4   18.1
intra_pred_h_w4_8bpc_neon:          66.3   43.1   55.0   57.0   27.9   22.2
intra_pred_h_w8_8bpc_neon:          81.0   60.2   76.7   66.5   31.1   30.1
intra_pred_dc_left_w4_8bpc_neon:    91.0   49.0   72.8   77.7   35.4   38.5
intra_pred_dc_left_w8_8bpc_neon:   103.8   73.5   90.2   84.7   42.8   47.1
intra_pred_dc_left_w16_8bpc_neon:  156.1  101.8  186.1  119.4   77.7   92.6
intra_pred_dc_left_w32_8bpc_neon:  270.5  200.5  381.6  191.7  152.6  170.3
intra_pred_dc_left_w64_8bpc_neon:  560.7  439.1  877.0  375.4  333.5  343.6

After:
intra_pred_v_w4_8bpc_neon:          53.9   38.0   46.4   47.7   19.8   19.2
intra_pred_h_w4_8bpc_neon:          66.5   39.2   52.6   57.0   27.7   22.2
intra_pred_h_w8_8bpc_neon:          80.5   55.8   72.9   66.5   31.4   30.1
intra_pred_dc_left_w4_8bpc_neon:    91.0   48.2   71.8   77.7   34.9   38.6
intra_pred_dc_left_w8_8bpc_neon:   103.8   69.6   89.2   84.7   43.2   47.3
intra_pred_dc_left_w16_8bpc_neon:  182.3   99.9  184.9  118.8   77.7   85.8
intra_pred_dc_left_w32_8bpc_neon:  355.4  198.9  380.1  190.6  152.9  161.0
intra_pred_dc_left_w64_8bpc_neon:  517.5  437.4  876.9  375.7  333.3  347.7

74d5cf57

arm64: ipred: 16 bpc NEON implementation of the cfl_ac 444 function · 72db6607

Martin Storsjö authored 4 years ago

Relative speedup over C code:
                       Cortex A53    A72    A73
cfl_ac_444_w4_16bpc_neon:    8.03   9.41  10.48
cfl_ac_444_w8_16bpc_neon:   10.17  10.54  10.38
cfl_ac_444_w16_16bpc_neon:  10.73  10.38   9.73
cfl_ac_444_w32_16bpc_neon:  10.18   9.43   9.77

72db6607

arm64: ipred: 8 bpc NEON implementation of the cfl_ac 444 function · 9b40bb95

Martin Storsjö authored 4 years ago

Relative speedup over C code:
                      Cortex A53    A72    A73
cfl_ac_444_w4_8bpc_neon:    8.72   8.75  10.50
cfl_ac_444_w8_8bpc_neon:   13.10  10.77  11.23
cfl_ac_444_w16_8bpc_neon:  13.08   9.95  10.49
cfl_ac_444_w32_8bpc_neon:  12.58   9.43  10.63

9b40bb95