- Oct 09, 2019
Jean-Baptiste Kempf authored
- Oct 08, 2019
Jean-Baptiste Kempf authored
Relative speedup over C code:

                      Cortex A7     A8     A9    A53    A72    A73
warp_8x8_8bpc_neon:        2.79   5.45   4.18   3.96   4.16   4.51
warp_8x8t_8bpc_neon:       2.79   5.33   4.18   3.98   4.22   4.25

Comparison to the original ARM64 assembly:

ARM64:                Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1854.6 1072.5 1102.5
warp_8x8t_8bpc_neon:      1839.6 1069.4 1089.5
ARM32:
warp_8x8_8bpc_neon:       2132.5 1160.3 1218.0
warp_8x8t_8bpc_neon:      2113.7 1148.0 1209.1
Before:               Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1952.8 1161.3 1151.1
warp_8x8t_8bpc_neon:      1937.1 1147.5 1139.0
After:
warp_8x8_8bpc_neon:       1860.8 1068.6 1105.8
warp_8x8t_8bpc_neon:      1846.9 1056.4 1099.8
The relative speedup over the C code ranges from 2.5x to 3.8x for find_dir, and from around 5x to 10x for filter. The find_dir function is somewhat constrained by barely having enough registers, leaving very few for temporaries; fewer operations can therefore run in parallel, and many instructions end up depending on the result of the preceding instruction.

The ported functions end up slightly slower than the corresponding ARM64 ones, but only marginally:

ARM64:                     Cortex A53    A72    A73
cdef_dir_8bpc_neon:             400.0  268.8  282.2
cdef_filter_4x4_8bpc_neon:      596.3  359.9  379.7
cdef_filter_4x8_8bpc_neon:     1091.0  670.4  698.5
cdef_filter_8x8_8bpc_neon:     1998.7 1207.2 1218.4
ARM32:
cdef_dir_8bpc_neon:             528.5  329.1  337.4
cdef_filter_4x4_8bpc_neon:      632.5  482.5  432.2
cdef_filter_4x8_8bpc_neon:     1107.2  854.8  782.3
cdef_filter_8x8_8bpc_neon:     1984.8 1381.0 1414.4

Relative speedup over C code:

                          Cortex A7     A8     A9    A53    A72    A73
cdef_dir_8bpc_neon:            2.92   2.54   2.67   3.87   3.37   3.83
cdef_filter_4x4_8bpc_neon:     5.09   7.61   6.10   6.85   4.94   7.41
cdef_filter_4x8_8bpc_neon:     5.53   8.23   6.77   7.67   5.60   8.01
cdef_filter_8x8_8bpc_neon:     6.26  10.14   8.49   8.54   6.94   4.27
Only add .4h elements to the upper half of sum_alt, as only 11 elements are needed and .8h + .4h gives 12 in total. Fuse two consecutive ext #8 + ext #2 into a single ext #10. Move a few stores further away from where their values are calculated.

Before:              Cortex A53    A72    A73
cdef_dir_8bpc_neon:       404.0  278.2  302.4
After:
cdef_dir_8bpc_neon:       400.0  269.3  282.5
As there are only two individual parameters, we can insert them into the same vector, reducing the number of actual calculation instructions, at the cost of a few extra instructions to dup the results into the final vectors.
Instead of apply_sign(imin(abs(diff), clip), diff), do imax(imin(diff, clip), -clip).

Before:                    Cortex A53    A72    A73
cdef_filter_4x4_8bpc_neon:      592.7  374.5  384.5
cdef_filter_4x8_8bpc_neon:     1093.0  704.4  706.6
cdef_filter_8x8_8bpc_neon:     1962.6 1239.4 1252.1
After:
cdef_filter_4x4_8bpc_neon:      593.7  355.5  373.2
cdef_filter_4x8_8bpc_neon:     1091.6  663.2  685.3
cdef_filter_8x8_8bpc_neon:     1964.2 1182.5 1210.8
- Oct 07, 2019
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
Martin Storsjö authored
Before:               Cortex A53    A72    A73
warp_8x8_8bpc_neon:       1997.3 1170.1 1199.9
warp_8x8t_8bpc_neon:      1982.4 1171.5 1192.6
After:
warp_8x8_8bpc_neon:       1954.6 1159.2 1153.3
warp_8x8t_8bpc_neon:      1938.5 1146.2 1136.7
- Oct 03, 2019
Prior checks were done at the sbrow level. This now allows calling dav1d_lr_sbrow and dav1d_lr_copy_lpf only when there is something for them to do.
- Oct 02, 2019
Martin Storsjö authored
Henrik Gramner authored
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations, which is insufficient for some esoteric edge cases.
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
- Oct 01, 2019
Ronald S. Bultje authored
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72    A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47   2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49   4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86   3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79   2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97   2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70   2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41   3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54   3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60   2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00   2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66   2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69   3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60   3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54   2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98   2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42   2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83   3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37   3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41   2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01   2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74   5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33   6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31   5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18   2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85   1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12   7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58   7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28   5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33   3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24   1.97
- Sep 30, 2019
Victorien Le Couviour--Tuffet authored
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
x86_64: warp_8x8_8bpc_avx2:    224.9

x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
x86_64: warp_8x8t_8bpc_avx2:   228.5
- Sep 29, 2019
Martin Storsjö authored
Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed to be clipped. This fixes mismatches for some samples, see issue #299.

Before:                                  Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          93.0    52.8    49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         260.0   186.0   196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:      1371.0   953.4  1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:      7363.2  4887.5  5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:     25029.0 17492.3 18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:         105.0    58.7    55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         294.0   211.5   209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:      1495.8  1050.4  1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:      7866.7  5197.8  5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:     25807.2 18619.3 18526.9
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                    Cortex A53    A72    A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:         356.0  279.2  278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:      1785.0 1329.5 1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:         360.0  253.2  269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:      1793.1 1300.9 1254.0

(In this particular case there seems to be a minor regression on the A53, probably because some instructions had to be reordered: smull+smlal+smull2+smlal2 overwrites the second output register sooner than addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on the A53 as well.)
- Sep 27, 2019
Right now this just allocates a new buffer for every frame, uses it, then discards it immediately. This is not optimal: either dav1d should start reusing buffers internally, or we need to pool them in dav1dplay. As it stands, this is not really a performance gain. I'll have to investigate why, but my suspicion is that seeing any gains might require reusing buffers somewhere.

Note: thrashing buffers is not as bad as it initially seems. Not only does libplacebo pool and reuse GPU memory and buffer state objects internally, this also absolves us from having to do any manual polling to figure out when a buffer is reusable again. So creating, using and immediately destroying buffers isn't as bad an approach as it might otherwise seem; it's entirely possible that the lack of gains is only down to lock contention. As said, I'll have to investigate further...
Useful to test the effects of performance changes to the decoding/rendering loop as a whole.
Only meaningful with libplacebo. The defaults are higher quality than SDL so it's an unfair comparison and definitely too much for slow iGPUs at 4K res. Make the defaults fast/dumb processing only, and guard the debanding/dithering/upscaling/etc. behind a new --highquality flag.
- Sep 19, 2019
Victorien Le Couviour--Tuffet authored
x86_64: lpf_h_sb_uv_w4_8bpc_c:        430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c:        788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3:    322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3:    302.4

x86_64: lpf_h_sb_uv_w6_8bpc_c:        981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c:       1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3:    421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3:    431.6

x86_64: lpf_h_sb_y_w4_8bpc_c:        3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c:        7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3:     466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3:     564.7

x86_64: lpf_h_sb_y_w8_8bpc_c:        4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c:        3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3:     818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3:     927.9

x86_64: lpf_h_sb_y_w16_8bpc_c:       1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c:       3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3:   1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3:   1975.0

x86_64: lpf_v_sb_uv_w4_8bpc_c:        369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c:        793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3:    110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3:    133.0

x86_64: lpf_v_sb_uv_w6_8bpc_c:        769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c:       1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3:    222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3:    232.2

x86_64: lpf_v_sb_y_w4_8bpc_c:         772.4
x86_32: lpf_v_sb_y_w4_8bpc_c:        2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3:     179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3:     234.7

x86_64: lpf_v_sb_y_w8_8bpc_c:        1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c:        3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3:     468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3:     580.9

x86_64: lpf_v_sb_y_w16_8bpc_c:       1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c:       4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3:   1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3:   1174.8
x86_64:

lpf_h_sb_uv_w4_8bpc_c:        430.6
lpf_h_sb_uv_w4_8bpc_ssse3:    322.0
lpf_h_sb_uv_w4_8bpc_avx2:     200.4

lpf_h_sb_uv_w6_8bpc_c:        981.9
lpf_h_sb_uv_w6_8bpc_ssse3:    421.5
lpf_h_sb_uv_w6_8bpc_avx2:     270.0

lpf_h_sb_y_w4_8bpc_c:        3001.7
lpf_h_sb_y_w4_8bpc_ssse3:     466.3
lpf_h_sb_y_w4_8bpc_avx2:      383.1

lpf_h_sb_y_w8_8bpc_c:        4457.7
lpf_h_sb_y_w8_8bpc_ssse3:     818.9
lpf_h_sb_y_w8_8bpc_avx2:      537.0

lpf_h_sb_y_w16_8bpc_c:       1967.9
lpf_h_sb_y_w16_8bpc_ssse3:   1836.7
lpf_h_sb_y_w16_8bpc_avx2:    1078.2

lpf_v_sb_uv_w4_8bpc_c:        369.4
lpf_v_sb_uv_w4_8bpc_ssse3:    110.9
lpf_v_sb_uv_w4_8bpc_avx2:      58.1

lpf_v_sb_uv_w6_8bpc_c:        769.6
lpf_v_sb_uv_w6_8bpc_ssse3:    222.2
lpf_v_sb_uv_w6_8bpc_avx2:     117.8

lpf_v_sb_y_w4_8bpc_c:         772.4
lpf_v_sb_y_w4_8bpc_ssse3:     179.8
lpf_v_sb_y_w4_8bpc_avx2:      173.6

lpf_v_sb_y_w8_8bpc_c:        1660.2
lpf_v_sb_y_w8_8bpc_ssse3:     468.3
lpf_v_sb_y_w8_8bpc_avx2:      345.8

lpf_v_sb_y_w16_8bpc_c:       1889.6
lpf_v_sb_y_w16_8bpc_ssse3:   1142.0
lpf_v_sb_y_w16_8bpc_avx2:     568.1
- Sep 10, 2019
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:     8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2:  1001.6
fguv_32x32xn_8bpc_420_csfl1_c:     6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2:  1299.5
Ronald S. Bultje authored
This would affect the output in samples with an odd width and horizontal chroma subsampling. The check does not exist in libaom, so it might cause mismatches. This causes issues in the sample from #210, which uses super-resolution and has an odd width. To work around this, make super-resolution's resize() always write an even number of pixels. This should not interfere with SIMD in the future.
Ronald S. Bultje authored
fgy_32x32xn_8bpc_c:        16181.8
fgy_32x32xn_8bpc_avx2:      3231.4
gen_grain_y_ar0_8bpc_c:   108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c:   168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c:   266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c:   448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
Ronald S. Bultje authored
Ronald S. Bultje authored
- Sep 06, 2019
James Almer authored
Both values can be independently coded in the bitstream, and are not always equal to frame_width and frame_height.
- Sep 05, 2019
Henrik Gramner authored
For some reason the MSVC CRT _wassert() function is not flagged as __declspec(noreturn), so when using those headers the compiler will expect execution to continue after an assertion has been triggered and will therefore complain about the use of uninitialized variables when compiled in debug mode in certain code paths. Reorder some case statements as a workaround.
For w <= 32 we can't process more than two rows per loop iteration. Credit to OSS-Fuzz.