- Oct 09, 2019
Jean-Baptiste Kempf authored
- Oct 08, 2019
Jean-Baptiste Kempf authored
Relative speedup over C code:

                      Cortex A7     A8     A9    A53    A72    A73
warp_8x8_8bpc_neon:        2.79   5.45   4.18   3.96   4.16   4.51
warp_8x8t_8bpc_neon:       2.79   5.33   4.18   3.98   4.22   4.25

Comparison to the original ARM64 assembly:

ARM64:                Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1854.6 1072.5 1102.5
warp_8x8t_8bpc_neon:      1839.6 1069.4 1089.5
ARM32:
warp_8x8_8bpc_neon:       2132.5 1160.3 1218.0
warp_8x8t_8bpc_neon:      2113.7 1148.0 1209.1
Before:               Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1952.8 1161.3 1151.1
warp_8x8t_8bpc_neon:      1937.1 1147.5 1139.0
After:
warp_8x8_8bpc_neon:       1860.8 1068.6 1105.8
warp_8x8t_8bpc_neon:      1846.9 1056.4 1099.8
The relative speedup over the C code ranges from 2.5x to 3.8x for find_dir, and from around 5x to 10x for filter. The find_dir function is somewhat constrained by barely having enough registers, leaving very few for temporaries; fewer operations can therefore run in parallel, and many instructions end up depending on the result of the preceding instruction.

The ported functions end up slightly slower than the corresponding ARM64 ones, but only marginally:

ARM64:                     Cortex A53    A72    A73
cdef_dir_8bpc_neon:             400.0  268.8  282.2
cdef_filter_4x4_8bpc_neon:      596.3  359.9  379.7
cdef_filter_4x8_8bpc_neon:     1091.0  670.4  698.5
cdef_filter_8x8_8bpc_neon:     1998.7 1207.2 1218.4
ARM32:
cdef_dir_8bpc_neon:             528.5  329.1  337.4
cdef_filter_4x4_8bpc_neon:      632.5  482.5  432.2
cdef_filter_4x8_8bpc_neon:     1107.2  854.8  782.3
cdef_filter_8x8_8bpc_neon:     1984.8 1381.0 1414.4

Relative speedup over C code:

                          Cortex A7     A8     A9    A53    A72    A73
cdef_dir_8bpc_neon:            2.92   2.54   2.67   3.87   3.37   3.83
cdef_filter_4x4_8bpc_neon:     5.09   7.61   6.10   6.85   4.94   7.41
cdef_filter_4x8_8bpc_neon:     5.53   8.23   6.77   7.67   5.60   8.01
cdef_filter_8x8_8bpc_neon:     6.26  10.14   8.49   8.54   6.94   4.27
Only add .4h elements to the upper half of sum_alt, as only 11 elements are needed and .8h + .4h gives 12 in total. Fuse two consecutive ext #8 + ext #2 into a single ext #10. Move a few stores further away from where their values are calculated.

Before:              Cortex A53    A72    A73
cdef_dir_8bpc_neon:       404.0  278.2  302.4
After:
cdef_dir_8bpc_neon:       400.0  269.3  282.5
As there are only two individual parameters, we can insert them into the same vector, reducing the number of actual calculation instructions, at the cost of a few extra instructions to dup the results into the final vectors.
Instead of apply_sign(imin(abs(diff), clip), diff), do imax(imin(diff, clip), -clip).

Before:                    Cortex A53    A72    A73
cdef_filter_4x4_8bpc_neon:      592.7  374.5  384.5
cdef_filter_4x8_8bpc_neon:     1093.0  704.4  706.6
cdef_filter_8x8_8bpc_neon:     1962.6 1239.4 1252.1
After:
cdef_filter_4x4_8bpc_neon:      593.7  355.5  373.2
cdef_filter_4x8_8bpc_neon:     1091.6  663.2  685.3
cdef_filter_8x8_8bpc_neon:     1964.2 1182.5 1210.8
- Oct 07, 2019
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
Martin Storsjö authored
Before:               Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1997.3 1170.1 1199.9
warp_8x8t_8bpc_neon:      1982.4 1171.5 1192.6
After:
warp_8x8_8bpc_neon:       1954.6 1159.2 1153.3
warp_8x8t_8bpc_neon:      1938.5 1146.2 1136.7
- Oct 03, 2019
Prior checks were done at the sbrow level. This now allows calling dav1d_lr_sbrow and dav1d_lr_copy_lpf only when there is something for them to do.
- Oct 02, 2019
Martin Storsjö authored
Henrik Gramner authored
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations, which is insufficient for some esoteric edge cases.
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
- Oct 01, 2019
Ronald S. Bultje authored
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72    A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47   2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49   4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86   3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79   2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97   2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70   2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41   3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54   3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60   2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00   2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66   2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69   3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60   3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54   2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98   2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42   2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83   3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37   3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41   2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01   2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74   5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33   6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31   5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18   2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85   1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12   7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58   7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28   5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33   3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24   1.97
- Sep 30, 2019
Victorien Le Couviour--Tuffet authored
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
x86_64: warp_8x8_8bpc_avx2:    224.9

x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
x86_64: warp_8x8t_8bpc_avx2:   228.5
- Sep 29, 2019
Martin Storsjö authored
Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed to be clipped. This fixes mismatches for some samples, see issue #299.

Before:                                  Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          93.0    52.8    49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         260.0   186.0   196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:      1371.0   953.4  1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:      7363.2  4887.5  5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:     25029.0 17492.3 18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:         105.0    58.7    55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         294.0   211.5   209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:      1495.8  1050.4  1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:      7866.7  5197.8  5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:     25807.2 18619.3 18526.9
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                    Cortex A53    A72    A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:         356.0  279.2  278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:      1785.0 1329.5 1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:         360.0  253.2  269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:      1793.1 1300.9 1254.0

(In this particular case there seems to be a minor regression on the A53, probably because some instructions had to be reordered: smull+smlal+smull2+smlal2 overwrites the second output register sooner than addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on the A53 as well.)
- Sep 27, 2019
Right now this just allocates a new buffer for every frame, uses it, then discards it immediately. This is not optimal: either dav1d should start reusing buffers internally, or we need to pool them in dav1dplay. As it stands, this is not really a performance gain. I'll have to investigate why, but my suspicion is that seeing any gains might require reusing buffers somewhere.

Note: thrashing buffers is not as bad as it initially seems. Not only does libplacebo pool and reuse GPU memory and buffer state objects internally, this also absolves us from having to do any manual polling to figure out when a buffer is reusable again. So creating, using and immediately destroying buffers isn't as bad an approach as it might otherwise seem; it's entirely possible that the lack of gains is only down to lock contention. As said, I'll have to investigate further...
Useful to test the effects of performance changes to the decoding/rendering loop as a whole.
Only meaningful with libplacebo. The defaults are higher quality than SDL so it's an unfair comparison and definitely too much for slow iGPUs at 4K res. Make the defaults fast/dumb processing only, and guard the debanding/dithering/upscaling/etc. behind a new --highquality flag.
- Sep 19, 2019
Victorien Le Couviour--Tuffet authored
x86_64: lpf_h_sb_uv_w4_8bpc_c:        430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c:        788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3:    322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3:    302.4

x86_64: lpf_h_sb_uv_w6_8bpc_c:        981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c:       1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3:    421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3:    431.6

x86_64: lpf_h_sb_y_w4_8bpc_c:        3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c:        7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3:     466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3:     564.7

x86_64: lpf_h_sb_y_w8_8bpc_c:        4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c:        3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3:     818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3:     927.9

x86_64: lpf_h_sb_y_w16_8bpc_c:       1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c:       3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3:   1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3:   1975.0

x86_64: lpf_v_sb_uv_w4_8bpc_c:        369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c:        793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3:    110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3:    133.0

x86_64: lpf_v_sb_uv_w6_8bpc_c:        769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c:       1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3:    222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3:    232.2

x86_64: lpf_v_sb_y_w4_8bpc_c:         772.4
x86_32: lpf_v_sb_y_w4_8bpc_c:        2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3:     179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3:     234.7

x86_64: lpf_v_sb_y_w8_8bpc_c:        1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c:        3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3:     468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3:     580.9

x86_64: lpf_v_sb_y_w16_8bpc_c:       1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c:       4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3:   1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3:   1174.8
x86_64:

lpf_h_sb_uv_w4_8bpc_c:        430.6
lpf_h_sb_uv_w4_8bpc_ssse3:    322.0
lpf_h_sb_uv_w4_8bpc_avx2:     200.4

lpf_h_sb_uv_w6_8bpc_c:        981.9
lpf_h_sb_uv_w6_8bpc_ssse3:    421.5
lpf_h_sb_uv_w6_8bpc_avx2:     270.0

lpf_h_sb_y_w4_8bpc_c:        3001.7
lpf_h_sb_y_w4_8bpc_ssse3:     466.3
lpf_h_sb_y_w4_8bpc_avx2:      383.1

lpf_h_sb_y_w8_8bpc_c:        4457.7
lpf_h_sb_y_w8_8bpc_ssse3:     818.9
lpf_h_sb_y_w8_8bpc_avx2:      537.0

lpf_h_sb_y_w16_8bpc_c:       1967.9
lpf_h_sb_y_w16_8bpc_ssse3:   1836.7
lpf_h_sb_y_w16_8bpc_avx2:    1078.2

lpf_v_sb_uv_w4_8bpc_c:        369.4
lpf_v_sb_uv_w4_8bpc_ssse3:    110.9
lpf_v_sb_uv_w4_8bpc_avx2:      58.1

lpf_v_sb_uv_w6_8bpc_c:        769.6
lpf_v_sb_uv_w6_8bpc_ssse3:    222.2
lpf_v_sb_uv_w6_8bpc_avx2:     117.8

lpf_v_sb_y_w4_8bpc_c:         772.4
lpf_v_sb_y_w4_8bpc_ssse3:     179.8
lpf_v_sb_y_w4_8bpc_avx2:      173.6

lpf_v_sb_y_w8_8bpc_c:        1660.2
lpf_v_sb_y_w8_8bpc_ssse3:     468.3
lpf_v_sb_y_w8_8bpc_avx2:      345.8

lpf_v_sb_y_w16_8bpc_c:       1889.6
lpf_v_sb_y_w16_8bpc_ssse3:   1142.0
lpf_v_sb_y_w16_8bpc_avx2:     568.1
- Sep 10, 2019
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:     8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2:  1001.6
fguv_32x32xn_8bpc_420_csfl1_c:     6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2:  1299.5
Ronald S. Bultje authored
This would affect the output in samples with an odd width and horizontal chroma subsampling. The check does not exist in libaom, so it might cause mismatches. This causes issues in the sample from #210, which uses super-resolution and has an odd width. To work around this, make super-resolution's resize() always write an even number of pixels. This should not interfere with SIMD in the future.
Ronald S. Bultje authored
fgy_32x32xn_8bpc_c:        16181.8
fgy_32x32xn_8bpc_avx2:      3231.4
gen_grain_y_ar0_8bpc_c:   108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c:   168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c:   266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c:   448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
Ronald S. Bultje authored
Ronald S. Bultje authored
- Sep 06, 2019
James Almer authored
Both values can be independently coded in the bitstream, and are not always equal to frame_width and frame_height.
- Sep 05, 2019
Henrik Gramner authored
For some reason the MSVC CRT _wassert() function is not flagged as __declspec(noreturn), so when using those headers the compiler will expect execution to continue after an assertion has been triggered and will therefore complain about the use of uninitialized variables when compiled in debug mode in certain code paths. Reorder some case statements as a workaround.
For w <= 32 we can't process more than two rows per loop iteration. Credit to OSS-Fuzz.