- Oct 11, 2019
Jean-Baptiste Kempf authored
Martin Storsjö authored
- Oct 10, 2019
Luc Trudeau authored
Relative speedup over the C code:

                            Cortex A53    A72    A73
cfl_ac_420_w4_8bpc_neon:          7.73   6.48   9.22
cfl_ac_420_w8_8bpc_neon:          6.70   5.56   6.95
cfl_ac_420_w16_8bpc_neon:         6.51   6.93   6.67
cfl_ac_422_w4_8bpc_neon:          9.25   7.70   9.75
cfl_ac_422_w8_8bpc_neon:          8.53   5.95   7.13
cfl_ac_422_w16_8bpc_neon:         7.08   6.87   6.06

Relative speedup over the C code:

                                 Cortex A53    A72    A73
cfl_pred_cfl_128_w4_8bpc_neon:        10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:        18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:       16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:        3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:        9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:       17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:      16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:       3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:         9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:        17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:       16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:        3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:             8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:            15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:           15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:            3.23   3.58   3.71

The C code gets autovectorized for w >= 32, which is why the relative speedup looks strange (but the performance of the NEON functions is completely as expected).
Use a different layout of the filter_intra_taps depending on architecture; the current one is optimized for the x86 SIMD implementation.

Relative speedups over the C code:

                                 Cortex A53   A72   A73
intra_pred_filter_w4_8bpc_neon:        6.38  2.81  4.43
intra_pred_filter_w8_8bpc_neon:        9.30  3.62  5.71
intra_pred_filter_w16_8bpc_neon:       9.85  3.98  6.42
intra_pred_filter_w32_8bpc_neon:      10.77  4.08  7.09
Relative speedups over the C code:

                       Cortex A53    A72    A73
pal_pred_w4_8bpc_neon:       8.75   6.15   7.60
pal_pred_w8_8bpc_neon:      19.93  11.79  10.98
pal_pred_w16_8bpc_neon:     24.68  13.28  16.06
pal_pred_w32_8bpc_neon:     23.56  11.81  16.74
pal_pred_w64_8bpc_neon:     23.16  12.19  17.60
Relative speedups over the C code:

                                   Cortex A53    A72    A73
intra_pred_smooth_h_w4_8bpc_neon:        8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:       16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon:      18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:       5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:       5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:        9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:       17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon:      17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon:      11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:       8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:          9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:         16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:        14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:         9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:         6.53   3.94   5.26
Relative speedups over the C code:

                                  Cortex A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:          8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:         15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:        16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:        10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:         8.37   7.07   7.45
James Almer authored
The uv argument is normally in a gpr, but in checkasm it's forcefully loaded from the stack.
- Oct 09, 2019
Jean-Baptiste Kempf authored
- Oct 08, 2019
Jean-Baptiste Kempf authored
Relative speedup over C code:

                      Cortex A7    A8    A9   A53   A72   A73
warp_8x8_8bpc_neon:        2.79  5.45  4.18  3.96  4.16  4.51
warp_8x8t_8bpc_neon:       2.79  5.33  4.18  3.98  4.22  4.25

Comparison to original ARM64 assembly:

ARM64:                Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1854.6  1072.5  1102.5
warp_8x8t_8bpc_neon:      1839.6  1069.4  1089.5
ARM32:
warp_8x8_8bpc_neon:       2132.5  1160.3  1218.0
warp_8x8t_8bpc_neon:      2113.7  1148.0  1209.1
Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1952.8  1161.3  1151.1
warp_8x8t_8bpc_neon:      1937.1  1147.5  1139.0
After:
warp_8x8_8bpc_neon:       1860.8  1068.6  1105.8
warp_8x8t_8bpc_neon:      1846.9  1056.4  1099.8
The relative speedup ranges from 2.5x to 3.8x for find_dir and around 5x to 10x for filter. The find_dir function is somewhat restricted by barely having enough registers, leaving very few for temporaries, so fewer things can be done in parallel and many instructions end up depending on the result of the preceding one. The ported functions end up slightly slower than the corresponding ARM64 ones, but only marginally:

ARM64:                       Cortex A53     A72     A73
cdef_dir_8bpc_neon:               400.0   268.8   282.2
cdef_filter_4x4_8bpc_neon:        596.3   359.9   379.7
cdef_filter_4x8_8bpc_neon:       1091.0   670.4   698.5
cdef_filter_8x8_8bpc_neon:       1998.7  1207.2  1218.4
ARM32:
cdef_dir_8bpc_neon:               528.5   329.1   337.4
cdef_filter_4x4_8bpc_neon:        632.5   482.5   432.2
cdef_filter_4x8_8bpc_neon:       1107.2   854.8   782.3
cdef_filter_8x8_8bpc_neon:       1984.8  1381.0  1414.4

Relative speedup over C code:

                            Cortex A7     A8    A9   A53   A72   A73
cdef_dir_8bpc_neon:              2.92   2.54  2.67  3.87  3.37  3.83
cdef_filter_4x4_8bpc_neon:       5.09   7.61  6.10  6.85  4.94  7.41
cdef_filter_4x8_8bpc_neon:       5.53   8.23  6.77  7.67  5.60  8.01
cdef_filter_8x8_8bpc_neon:       6.26  10.14  8.49  8.54  6.94  4.27
Only add .4h elements to the upper half of sum_alt, as only 11 elements are needed, and .8h + .4h gives 12 in total. Fuse two consecutive ext #8 + ext #2 into one ext #10. Move a few stores further away from where they are calculated.

Before:              Cortex A53    A72    A73
cdef_dir_8bpc_neon:       404.0  278.2  302.4
After:
cdef_dir_8bpc_neon:       400.0  269.3  282.5
As there are only two individual parameters, we can insert them into the same vector, reducing the number of actual calculation instructions, at the cost of a few extra instructions to dup the results into the final vectors.
Instead of apply_sign(imin(abs(diff), clip), diff), do imax(imin(diff, clip), -clip).

Before:                      Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:        592.7   374.5   384.5
cdef_filter_4x8_8bpc_neon:       1093.0   704.4   706.6
cdef_filter_8x8_8bpc_neon:       1962.6  1239.4  1252.1
After:
cdef_filter_4x4_8bpc_neon:        593.7   355.5   373.2
cdef_filter_4x8_8bpc_neon:       1091.6   663.2   685.3
cdef_filter_8x8_8bpc_neon:       1964.2  1182.5  1210.8
- Oct 07, 2019
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
Martin Storsjö authored
Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1997.3  1170.1  1199.9
warp_8x8t_8bpc_neon:      1982.4  1171.5  1192.6
After:
warp_8x8_8bpc_neon:       1954.6  1159.2  1153.3
warp_8x8t_8bpc_neon:      1938.5  1146.2  1136.7
- Oct 03, 2019
Prior checks were done at the sbrow level. This now allows calling dav1d_lr_sbrow and dav1d_lr_copy_lpf only when there's something for them to do.
- Oct 02, 2019
Martin Storsjö authored
Henrik Gramner authored
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations which is insufficient for some esoteric edge cases.
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
- Oct 01, 2019
Ronald S. Bultje authored
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72   A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47  2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49  4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86  3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79  2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97  2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70  2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41  3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54  3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60  2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00  2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66  2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69  3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60  3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54  2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98  2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42  2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83  3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37  3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41  2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01  2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74  5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33  6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31  5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18  2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85  1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12  7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58  7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28  5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33  3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24  1.97
- Sep 30, 2019
Victorien Le Couviour--Tuffet authored
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
x86_64: warp_8x8_8bpc_avx2:    224.9

x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
x86_64: warp_8x8t_8bpc_avx2:   228.5
- Sep 29, 2019
Martin Storsjö authored
Don't add two 16-bit coefficients in 16 bit if the result isn't supposed to be clipped. This fixes mismatches for some samples; see issue #299.

Before:                                   Cortex A53      A72      A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:           93.0     52.8     49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:          260.0    186.0    196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:       1371.0    953.4   1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:       7363.2   4887.5   5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:      25029.0  17492.3  18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          105.0     58.7     55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:          294.0    211.5    209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:       1495.8   1050.4   1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:       7866.7   5197.8   5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:      25807.2  18619.3  18526.9
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                     Cortex A53     A72     A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          356.0   279.2   278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1785.0  1329.5  1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          360.0   253.2   269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1793.1  1300.9  1254.0

(In this particular case it seems like a minor regression on A53, probably mostly because some instructions had to be reordered: smull+smlal+smull2+smlal2 overwrites the second output register sooner than an addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on A53 as well.)