- Oct 11, 2019
Jean-Baptiste Kempf authored
Martin Storsjö authored
- Oct 10, 2019
Luc Trudeau authored
Relative speedup over the C code:

                            Cortex A53    A72    A73
cfl_ac_420_w4_8bpc_neon:          7.73   6.48   9.22
cfl_ac_420_w8_8bpc_neon:          6.70   5.56   6.95
cfl_ac_420_w16_8bpc_neon:         6.51   6.93   6.67
cfl_ac_422_w4_8bpc_neon:          9.25   7.70   9.75
cfl_ac_422_w8_8bpc_neon:          8.53   5.95   7.13
cfl_ac_422_w16_8bpc_neon:         7.08   6.87   6.06

Relative speedup over the C code:

                                 Cortex A53    A72    A73
cfl_pred_cfl_128_w4_8bpc_neon:        10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:        18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:       16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:        3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:        9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:       17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:      16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:       3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:         9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:        17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:       16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:        3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:             8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:            15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:           15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:            3.23   3.58   3.71

The C code gets autovectorized for w >= 32, which is why the relative speedup looks strange (but the performance of the NEON functions is completely as expected).
Use a different layout of the filter_intra_taps depending on architecture; the current one is optimized for the x86 SIMD implementation.

Relative speedups over the C code:

                                 Cortex A53   A72   A73
intra_pred_filter_w4_8bpc_neon:        6.38  2.81  4.43
intra_pred_filter_w8_8bpc_neon:        9.30  3.62  5.71
intra_pred_filter_w16_8bpc_neon:       9.85  3.98  6.42
intra_pred_filter_w32_8bpc_neon:      10.77  4.08  7.09
Relative speedups over the C code:

                       Cortex A53    A72    A73
pal_pred_w4_8bpc_neon:       8.75   6.15   7.60
pal_pred_w8_8bpc_neon:      19.93  11.79  10.98
pal_pred_w16_8bpc_neon:     24.68  13.28  16.06
pal_pred_w32_8bpc_neon:     23.56  11.81  16.74
pal_pred_w64_8bpc_neon:     23.16  12.19  17.60
Relative speedups over the C code:

                                   Cortex A53    A72    A73
intra_pred_smooth_h_w4_8bpc_neon:        8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:       16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon:      18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:       5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:       5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:        9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:       17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon:      17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon:      11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:       8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:          9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:         16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:        14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:         9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:         6.53   3.94   5.26
Relative speedups over the C code:

                                  Cortex A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:          8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:         15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:        16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:        10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:         8.37   7.07   7.45
James Almer authored
The uv argument is normally in a gpr, but in checkasm it's forcefully loaded from the stack.
- Oct 09, 2019
Jean-Baptiste Kempf authored
- Oct 08, 2019
Jean-Baptiste Kempf authored
Relative speedup over C code:

                      Cortex A7    A8    A9   A53   A72   A73
warp_8x8_8bpc_neon:        2.79  5.45  4.18  3.96  4.16  4.51
warp_8x8t_8bpc_neon:       2.79  5.33  4.18  3.98  4.22  4.25

Comparison to original ARM64 assembly:

ARM64:                Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1854.6  1072.5  1102.5
warp_8x8t_8bpc_neon:      1839.6  1069.4  1089.5
ARM32:
warp_8x8_8bpc_neon:       2132.5  1160.3  1218.0
warp_8x8t_8bpc_neon:      2113.7  1148.0  1209.1
Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1952.8  1161.3  1151.1
warp_8x8t_8bpc_neon:      1937.1  1147.5  1139.0
After:
warp_8x8_8bpc_neon:       1860.8  1068.6  1105.8
warp_8x8t_8bpc_neon:      1846.9  1056.4  1099.8
The relative speedup ranges from 2.5x to 3.8x for find_dir and around 5x to 10x for filter. The find_dir function is somewhat restricted by barely having enough registers, leaving very few for temporaries, so fewer things can be done in parallel and many instructions end up depending on the result of the preceding one. The ported functions end up slightly slower than the corresponding ARM64 ones, but only marginally:

ARM64:                       Cortex A53     A72     A73
cdef_dir_8bpc_neon:               400.0   268.8   282.2
cdef_filter_4x4_8bpc_neon:        596.3   359.9   379.7
cdef_filter_4x8_8bpc_neon:       1091.0   670.4   698.5
cdef_filter_8x8_8bpc_neon:       1998.7  1207.2  1218.4
ARM32:
cdef_dir_8bpc_neon:               528.5   329.1   337.4
cdef_filter_4x4_8bpc_neon:        632.5   482.5   432.2
cdef_filter_4x8_8bpc_neon:       1107.2   854.8   782.3
cdef_filter_8x8_8bpc_neon:       1984.8  1381.0  1414.4

Relative speedup over C code:

                            Cortex A7     A8    A9   A53   A72   A73
cdef_dir_8bpc_neon:              2.92   2.54  2.67  3.87  3.37  3.83
cdef_filter_4x4_8bpc_neon:       5.09   7.61  6.10  6.85  4.94  7.41
cdef_filter_4x8_8bpc_neon:       5.53   8.23  6.77  7.67  5.60  8.01
cdef_filter_8x8_8bpc_neon:       6.26  10.14  8.49  8.54  6.94  4.27
Only add .4h elements to the upper half of sum_alt, as only 11 elements are needed, and .8h + .4h gives 12 in total. Fuse two consecutive ext #8 + ext #2 into one ext #10. Move a few stores further away from where they are calculated.

Before:              Cortex A53    A72    A73
cdef_dir_8bpc_neon:       404.0  278.2  302.4
After:
cdef_dir_8bpc_neon:       400.0  269.3  282.5
As there are only two individual parameters, we can insert them into the same vector, reducing the number of actual calculation instructions, at the cost of a few extra instructions to dup the results into the final vectors.
Instead of apply_sign(imin(abs(diff), clip), diff), do imax(imin(diff, clip), -clip).

Before:                      Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:        592.7   374.5   384.5
cdef_filter_4x8_8bpc_neon:       1093.0   704.4   706.6
cdef_filter_8x8_8bpc_neon:       1962.6  1239.4  1252.1
After:
cdef_filter_4x4_8bpc_neon:        593.7   355.5   373.2
cdef_filter_4x8_8bpc_neon:       1091.6   663.2   685.3
cdef_filter_8x8_8bpc_neon:       1964.2  1182.5  1210.8
- Oct 07, 2019
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
Martin Storsjö authored
Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1997.3  1170.1  1199.9
warp_8x8t_8bpc_neon:      1982.4  1171.5  1192.6
After:
warp_8x8_8bpc_neon:       1954.6  1159.2  1153.3
warp_8x8t_8bpc_neon:      1938.5  1146.2  1136.7
- Oct 03, 2019
Prior checks were done at the sbrow level. This now allows calling dav1d_lr_sbrow and dav1d_lr_copy_lpf only when there's something for them to do.
- Oct 02, 2019
Martin Storsjö authored
Henrik Gramner authored
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations which is insufficient for some esoteric edge cases.
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
- Oct 01, 2019
Ronald S. Bultje authored
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72   A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47  2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49  4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86  3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79  2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97  2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70  2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41  3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54  3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60  2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00  2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66  2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69  3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60  3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54  2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98  2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42  2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83  3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37  3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41  2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01  2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74  5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33  6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31  5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18  2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85  1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12  7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58  7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28  5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33  3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24  1.97
- Sep 30, 2019
Victorien Le Couviour--Tuffet authored
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
x86_64: warp_8x8_8bpc_avx2:    224.9

x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
x86_64: warp_8x8t_8bpc_avx2:   228.5
- Sep 29, 2019
Martin Storsjö authored
Don't add two 16-bit coefficients in 16 bit if the result isn't supposed to be clipped. This fixes mismatches for some samples; see issue #299.

Before:                                   Cortex A53      A72      A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:           93.0     52.8     49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:          260.0    186.0    196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:       1371.0    953.4   1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:       7363.2   4887.5   5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:      25029.0  17492.3  18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          105.0     58.7     55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:          294.0    211.5    209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:       1495.8   1050.4   1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:       7866.7   5197.8   5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:      25807.2  18619.3  18526.9
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                     Cortex A53     A72     A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          356.0   279.2   278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1785.0  1329.5  1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          360.0   253.2   269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1793.1  1300.9  1254.0

(In this particular case it seems like a minor regression on A53, probably mostly because some instructions had to be reordered: smull+smlal+smull2+smlal2 overwrites the second output register sooner than an addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on A53 as well.)