- Apr 19, 2019
-
-
Jean-Baptiste Kempf authored
-
- Apr 18, 2019
-
-
Liwei Wang authored
Cycle times:
inv_txfm_add_16x64_dct_dct_0_8bpc_c: 3973.5
inv_txfm_add_16x64_dct_dct_0_8bpc_ssse3: 185.7
inv_txfm_add_16x64_dct_dct_1_8bpc_c: 37869.1
inv_txfm_add_16x64_dct_dct_1_8bpc_ssse3: 2103.1
inv_txfm_add_16x64_dct_dct_2_8bpc_c: 37822.9
inv_txfm_add_16x64_dct_dct_2_8bpc_ssse3: 2099.1
inv_txfm_add_16x64_dct_dct_3_8bpc_c: 37871.7
inv_txfm_add_16x64_dct_dct_3_8bpc_ssse3: 2663.5
inv_txfm_add_16x64_dct_dct_4_8bpc_c: 38002.9
inv_txfm_add_16x64_dct_dct_4_8bpc_ssse3: 2589.7
inv_txfm_add_32x64_dct_dct_0_8bpc_c: 8319.2
inv_txfm_add_32x64_dct_dct_0_8bpc_ssse3: 376.9
inv_txfm_add_32x64_dct_dct_1_8bpc_c: 85956.8
inv_txfm_add_32x64_dct_dct_1_8bpc_ssse3: 4298.1
inv_txfm_add_32x64_dct_dct_2_8bpc_c: 89906.2
inv_txfm_add_32x64_dct_dct_2_8bpc_ssse3: 4291.3
inv_txfm_add_32x64_dct_dct_3_8bpc_c: 83710.9
inv_txfm_add_32x64_dct_dct_3_8bpc_ssse3: 5589.5
inv_txfm_add_32x64_dct_dct_4_8bpc_c: 87733.5
inv_txfm_add_32x64_dct_dct_4_8bpc_ssse3: 5658.4
i...
-
- Apr 17, 2019
-
-
Ronald S. Bultje authored
This is a workaround so that the AVX2 implementation of deblock can index the levels array starting from the level type, which causes it to over-read by up to 3 bytes. This is intended to fix #269.
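The over-read described above can be illustrated with a minimal sketch. It assumes a hypothetical layout of four level bytes per block and models the vector load as a 4-byte `memcpy`; the names `alloc_levels`, `load4_from_type`, `LEVELS_PER_BLOCK` and `OVERREAD_PAD` are invented for illustration and are not taken from the dav1d sources. Starting a 4-byte load at level type 3 of the last block touches 3 bytes past the end of the array, so the allocation is padded by that amount.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: four level bytes per block, plus 3 bytes of padding
 * so that a 4-byte load starting at the last level type stays in bounds. */
enum { LEVELS_PER_BLOCK = 4, OVERREAD_PAD = 3 };

/* Allocate the levels array with trailing padding. calloc zero-fills, so
 * the padding bytes have defined contents when they are read. */
static uint8_t *alloc_levels(size_t n_blocks) {
    return calloc(n_blocks * LEVELS_PER_BLOCK + OVERREAD_PAD, 1);
}

/* Model of a 4-byte load that starts at level type `type` of block `blk`.
 * With type == 3 it reads 3 bytes beyond that block's own entries. */
static uint32_t load4_from_type(const uint8_t *lvl, size_t blk, unsigned type) {
    assert(type < LEVELS_PER_BLOCK);
    uint32_t v;
    memcpy(&v, lvl + blk * LEVELS_PER_BLOCK + type, sizeof(v));
    return v;
}
```

Without the padding, calling `load4_from_type(lvl, n_blocks - 1, 3)` would read past the end of the allocation; with it, the extra bytes are read but can simply be ignored by the caller.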
-
- Apr 16, 2019
-
-
Martin Storsjö authored
The exact relative speedup compared to C code is a bit vague and hard to measure, depending on exactly how many filtered blocks are skipped, as the NEON version always filters 16 pixels at a time, while the C code can skip processing individual 4 pixel blocks.

Additionally, the checkasm benchmarking code runs the same function repeatedly on the same buffer, which can make the filter take different codepaths on each run, as the function updates the buffer which will be used as input for the next run.

If tweaking the checkasm test data to try to avoid skipped blocks, the relative speedup compared to C is between 2x and 5x, while it is around 1x to 4x with the current checkasm test as such.

Benchmark numbers from a tweaked checkasm that avoids skipped blocks:

                          Cortex A53      A72      A73
lpf_h_sb_uv_w4_8bpc_c:        2954.7   1399.3   1655.3
lpf_h_sb_uv_w4_8bpc_neon:      895.5    650.8    692.0
lpf_h_sb_uv_w6_8bpc_c:        3879.2   1917.2   2257.7
lpf_h_sb_uv_w6_8bpc_neon:     1125.6    759.5    838.4
lpf_h_sb_y_w4_8bpc_c:         6711.0   3275.5   3913.7
lpf_h_sb_y_w4_8bpc_neon:      1744.0   1342.1   1351.5
lpf_h_sb_y_w8_8bpc_c:        10695.7   6155.8   6638.9
lpf_h_sb_y_w8_8bpc_neon:      2146.5   1560.4   1609.1
lpf_h_sb_y_w16_8bpc_c:       11355.8   6292.0   6995.9
lpf_h_sb_y_w16_8bpc_neon:     2475.4   1949.6   1968.4
lpf_v_sb_uv_w4_8bpc_c:        2639.7   1204.8   1425.9
lpf_v_sb_uv_w4_8bpc_neon:      510.7    351.4    334.7
lpf_v_sb_uv_w6_8bpc_c:        3468.3   1757.1   2021.5
lpf_v_sb_uv_w6_8bpc_neon:      625.0    415.0    397.8
lpf_v_sb_y_w4_8bpc_c:         5428.7   2731.7   3068.5
lpf_v_sb_y_w4_8bpc_neon:      1172.6    792.1    768.0
lpf_v_sb_y_w8_8bpc_c:         8946.1   4412.8   5121.0
lpf_v_sb_y_w8_8bpc_neon:      1565.5   1063.6   1062.7
lpf_v_sb_y_w16_8bpc_c:        8978.9   4411.7   5112.0
lpf_v_sb_y_w16_8bpc_neon:     1775.0   1288.1   1236.7
-
Relative speedup vs (autovectorized) C code:

                          Cortex A53    A72    A73
selfguided_3x3_8bpc_neon:       2.91   2.12   2.68
selfguided_5x5_8bpc_neon:       3.18   2.65   3.39
selfguided_mix_8bpc_neon:       3.04   2.29   2.98

The relative speedup vs non-vectorized C code is around 2.6-4.6x.
-
Martin Storsjö authored
This fixes this compiler warning with MSVC: ../src/msac.c(148): warning C4267: '+=': conversion from 'size_t' to 'unsigned int', possible loss of data
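For context, MSVC emits C4267 when a `size_t` value is implicitly narrowed to a 32-bit type on 64-bit targets. The sketch below shows one common way to silence it, an explicit cast guarded by a range check. It is not the code from `src/msac.c`; the function `advance` and its parameters are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: advance an unsigned position by the number of bytes
 * between two pointers. A pointer difference converted to size_t is 64 bits
 * wide on Win64, so `pos += consumed` without a cast triggers C4267. */
static unsigned advance(unsigned pos, const unsigned char *start,
                        const unsigned char *end)
{
    const size_t consumed = (size_t)(end - start);
    /* Make the narrowing explicit and check that it is lossless. */
    assert(consumed <= UINT_MAX - pos);
    pos += (unsigned)consumed;
    return pos;
}
```

The cast documents that the narrowing is intentional, while the assertion catches, in debug builds, any case where the value would not actually fit.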
-
- Apr 15, 2019
-
-
Also make various minor optimizations/style fixes to the MSAC C functions.
-
- Apr 10, 2019
-
-
intra_pred_paeth_w4_8bpc_c: 561.6
intra_pred_paeth_w4_8bpc_ssse3: 49.2
intra_pred_paeth_w8_8bpc_c: 1475.8
intra_pred_paeth_w8_8bpc_ssse3: 103.0
intra_pred_paeth_w16_8bpc_c: 4697.8
intra_pred_paeth_w16_8bpc_ssse3: 279.0
intra_pred_paeth_w32_8bpc_c: 13245.1
intra_pred_paeth_w32_8bpc_ssse3: 614.7
intra_pred_paeth_w64_8bpc_c: 32638.9
intra_pred_paeth_w64_8bpc_ssse3: 1477.6
-
- Apr 08, 2019
-
-
Martin Storsjö authored
This eases disambiguating these functions when looking at perf profiles.
-
- Apr 07, 2019
-
-
Martin Storsjö authored
The width register has been set to clz(w)-24, not the other way around. And the 32 bit prep function has got the h parameter in r4, not in r5.
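To make the corrected comment concrete, the sketch below evaluates `clz(w) - 24` for power-of-two widths using a portable 32-bit count-leading-zeros loop. The helper names `clz32` and `width_index` are hypothetical, and the idea that the result serves as a small index (0 for w = 128 up to 5 for w = 4) is an assumption for illustration, not something the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-leading-zeros for a non-zero 32-bit value. */
static int clz32(uint32_t x) {
    assert(x != 0);
    int n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Value the comment describes: clz(w) - 24.
 * For w = 128, clz is 24, giving 0; for w = 4, clz is 29, giving 5. */
static int width_index(int w) {
    return clz32((uint32_t)w) - 24;
}
```

Note the ordering: `clz(w) - 24` is non-negative for widths up to 128, whereas `24 - clz(w)` would be zero or negative over the same range, which is why the direction of the subtraction matters in the comment.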
-
- Apr 04, 2019
-
-
For cases with indented, nested .if/.macro in asm.S, indent those by 4 chars. Some initial assembly files were indented to 4/16 columns, while all the actual implementation files, starting with src/arm/64/mc.S, have used 8/24 for indentation.
-
cfl_ac_444_w4_8bpc_c: 978.2
cfl_ac_444_w4_8bpc_ssse3: 110.4
cfl_ac_444_w8_8bpc_c: 2312.3
cfl_ac_444_w8_8bpc_ssse3: 197.5
cfl_ac_444_w16_8bpc_c: 4081.1
cfl_ac_444_w16_8bpc_ssse3: 274.1
cfl_ac_444_w32_8bpc_c: 9544.3
cfl_ac_444_w32_8bpc_ssse3: 617.1
-
- Mar 28, 2019
-
-
-
Victorien Le Couviour--Tuffet authored
Port of 65ee1233 for AVX-2 from Kyle Siefring to SSE41, and optimize SSSE3.

x86_64:
before: cdef_dir_8bpc_ssse3: 110.3
after:  cdef_dir_8bpc_ssse3: 105.9
new:    cdef_dir_8bpc_sse4: 96.4

x86_32:
before: cdef_dir_8bpc_ssse3: 120.6
after:  cdef_dir_8bpc_ssse3: 110.7
new:    cdef_dir_8bpc_sse4: 106.5
-
Victorien Le Couviour--Tuffet authored
Port of c204da0f for AVX-2 from Kyle Siefring.

x86_64:
before: cdef_filter_4x4_8bpc_ssse3: 141.7
after:  cdef_filter_4x4_8bpc_ssse3: 131.6
before: cdef_filter_4x4_8bpc_sse4: 128.3
after:  cdef_filter_4x4_8bpc_sse4: 119.0

before: cdef_filter_4x8_8bpc_ssse3: 253.4
after:  cdef_filter_4x8_8bpc_ssse3: 236.1
before: cdef_filter_4x8_8bpc_sse4: 228.5
after:  cdef_filter_4x8_8bpc_sse4: 213.2

before: cdef_filter_8x8_8bpc_ssse3: 429.6
after:  cdef_filter_8x8_8bpc_ssse3: 386.9
before: cdef_filter_8x8_8bpc_sse4: 379.9
after:  cdef_filter_8x8_8bpc_sse4: 335.9

x86_32:
before: cdef_filter_4x4_8bpc_ssse3: 184.3
after:  cdef_filter_4x4_8bpc_ssse3: 163.3
before: cdef_filter_4x4_8bpc_sse4: 168.9
after:  cdef_filter_4x4_8bpc_sse4: 146.1

before: cdef_filter_4x8_8bpc_ssse3: 335.3
after:  cdef_filter_4x8_8bpc_ssse3: 280.7
before: cdef_filter_4x8_8bpc_sse4: 305.1
after:  cdef_filter_4x8_8bpc_sse4: 257.9

before: cdef_filter_8x8_8bpc_ssse3: 579.1
after:  cdef_filter_8x8_8bpc_ssse3: 500.5
before: cdef_filter_8x8_8bpc_sse4: 517.0
after:  cdef_filter_8x8_8bpc_sse4: 455.8
-
Victorien Le Couviour--Tuffet authored
Port of dc2ae517 for AVX-2 from Kyle Siefring.

x86_64:
cdef_filter_4x4_8bpc_ssse3: 141.7
cdef_filter_4x4_8bpc_sse4: 128.3

cdef_filter_4x8_8bpc_ssse3: 253.4
cdef_filter_4x8_8bpc_sse4: 228.5

cdef_filter_8x8_8bpc_ssse3: 429.6
cdef_filter_8x8_8bpc_sse4: 379.9

x86_32:
cdef_filter_4x4_8bpc_ssse3: 184.3
cdef_filter_4x4_8bpc_sse4: 168.9

cdef_filter_4x8_8bpc_ssse3: 335.3
cdef_filter_4x8_8bpc_sse4: 305.1

cdef_filter_8x8_8bpc_ssse3: 579.1
cdef_filter_8x8_8bpc_sse4: 517.0
-
Victorien Le Couviour--Tuffet authored
-
- Mar 27, 2019
-
-
Liwei Wang authored
Cycle times:
inv_txfm_add_16x32_dct_dct_0_8bpc_c: 2464.6
inv_txfm_add_16x32_dct_dct_0_8bpc_ssse3: 121.6
inv_txfm_add_16x32_dct_dct_1_8bpc_c: 24751.6
inv_txfm_add_16x32_dct_dct_1_8bpc_ssse3: 1101.9
inv_txfm_add_16x32_dct_dct_2_8bpc_c: 24377.0
inv_txfm_add_16x32_dct_dct_2_8bpc_ssse3: 1117.2
inv_txfm_add_16x32_dct_dct_3_8bpc_c: 24155.6
inv_txfm_add_16x32_dct_dct_3_8bpc_ssse3: 2349.3
inv_txfm_add_16x32_dct_dct_4_8bpc_c: 24175.6
inv_txfm_add_16x32_dct_dct_4_8bpc_ssse3: 1642.0
inv_txfm_add_16x32_identity_identity_0_8bpc_c: 10304.7
inv_txfm_add_16x32_identity_identity_0_8bpc_ssse3: 137.7
inv_txfm_add_16x32_identity_identity_1_8bpc_c: 10341.6
inv_txfm_add_16x32_identity_identity_1_8bpc_ssse3: 137.9
inv_txfm_add_16x32_identity_identity_2_8bpc_c: 10299.9
inv_txfm_add_16x32_identity_identity_2_8bpc_ssse3: 253.9
inv_txfm_add_16x32_identity_identity_3_8bpc_c: 10331.4
inv_txfm_add_16x32_identity_identity_3_8bpc_ssse3: 369.7
inv_txfm_add_16x32_identity_identity_4_8bpc_c: 10360.4
inv_txfm_add_16x32_identity_identity_4_8bpc_ssse3: 484.0
inv_txfm_add_32x16_dct_dct_0_8bpc_c: 2288.4
inv_txfm_add_32x16_dct_dct_0_8bpc_ssse3: 142.3
inv_txfm_add_32x16_dct_dct_1_8bpc_c: 23819.9
inv_txfm_add_32x16_dct_dct_1_8bpc_ssse3: 1740.1
inv_txfm_add_32x16_dct_dct_2_8bpc_c: 23755.8
inv_txfm_add_32x16_dct_dct_2_8bpc_ssse3: 1641.4
inv_txfm_add_32x16_dct_dct_3_8bpc_c: 23839.9
inv_txfm_add_32x16_dct_dct_3_8bpc_ssse3: 1559.0
inv_txfm_add_32x16_dct_dct_4_8bpc_c: 23757.7
inv_txfm_add_32x16_dct_dct_4_8bpc_ssse3: 1579.0
inv_txfm_add_32x16_identity_identity_0_8bpc_c: 10381.7
inv_txfm_add_32x16_identity_identity_0_8bpc_ssse3: 126.3
inv_txfm_add_32x16_identity_identity_1_8bpc_c: 10402.5
inv_txfm_add_32x16_identity_identity_1_8bpc_ssse3: 126.5
inv_txfm_add_32x16_identity_identity_2_8bpc_c: 10429.2
inv_txfm_add_32x16_identity_identity_2_8bpc_ssse3: 244.9
inv_txfm_add_32x16_identity_identity_3_8bpc_c: 10382.0
inv_txfm_add_32x16_identity_identity_3_8bpc_ssse3: 491.0
inv_txfm_add_32x16_identity_identity_4_8bpc_c: 10381.0
inv_txfm_add_32x16_identity_identity_4_8bpc_ssse3: 468.0
inv_txfm_add_32x32_dct_dct_0_8bpc_c: 4168.2
inv_txfm_add_32x32_dct_dct_0_8bpc_ssse3: 204.0
inv_txfm_add_32x32_dct_dct_1_8bpc_c: 46306.2
inv_txfm_add_32x32_dct_dct_1_8bpc_ssse3: 2216.0
inv_txfm_add_32x32_dct_dct_2_8bpc_c: 46300.2
inv_txfm_add_32x32_dct_dct_2_8bpc_ssse3: 2194.2
inv_txfm_add_32x32_dct_dct_3_8bpc_c: 46350.1
inv_txfm_add_32x32_dct_dct_3_8bpc_ssse3: 3484.4
inv_txfm_add_32x32_dct_dct_4_8bpc_c: 46318.1
inv_txfm_add_32x32_dct_dct_4_8bpc_ssse3: 3440.9
inv_txfm_add_32x32_identity_identity_0_8bpc_c: 14663.1
inv_txfm_add_32x32_identity_identity_0_8bpc_ssse3: 179.0
inv_txfm_add_32x32_identity_identity_1_8bpc_c: 14737.0
inv_txfm_add_32x32_identity_identity_1_8bpc_ssse3: 179.2
inv_txfm_add_32x32_identity_identity_2_8bpc_c: 14640.4
inv_txfm_add_32x32_identity_identity_2_8bpc_ssse3: 179.1
inv_txfm_add_32x32_identity_identity_3_8bpc_c: 14638.5
inv_txfm_add_32x32_identity_identity_3_8bpc_ssse3: 663.8
inv_txfm_add_32x32_identity_identity_4_8bpc_c: 14635.6
inv_txfm_add_32x32_identity_identity_4_8bpc_ssse3: 663.9
-
- Mar 26, 2019
-
-
Henrik Gramner authored
-
- Mar 24, 2019
-
-
Martin Storsjö authored
As meson still doesn't allow specifying different cflags between static and dynamic libraries, this still includes the dllexport in the static library when built with default_library=both, but it at least is avoided in static-only builds, and avoids defining these symbols as dllexport in the callers' translation units.
-
The second shift is constant.
-
- Mar 20, 2019
-
-
- Mar 19, 2019
-
-
Liwei Wang authored
Cycle times:
inv_txfm_add_8x32_dct_dct_0_8bpc_c: 1164.7
inv_txfm_add_8x32_dct_dct_0_8bpc_ssse3: 79.5
inv_txfm_add_8x32_dct_dct_1_8bpc_c: 11291.6
inv_txfm_add_8x32_dct_dct_1_8bpc_ssse3: 508.5
inv_txfm_add_8x32_dct_dct_2_8bpc_c: 10720.4
inv_txfm_add_8x32_dct_dct_2_8bpc_ssse3: 507.9
inv_txfm_add_8x32_dct_dct_3_8bpc_c: 12351.5
inv_txfm_add_8x32_dct_dct_3_8bpc_ssse3: 687.2
inv_txfm_add_8x32_dct_dct_4_8bpc_c: 10402.3
inv_txfm_add_8x32_dct_dct_4_8bpc_ssse3: 687.9
inv_txfm_add_8x32_identity_identity_0_8bpc_c: 3485.0
inv_txfm_add_8x32_identity_identity_0_8bpc_ssse3: 97.7
inv_txfm_add_8x32_identity_identity_1_8bpc_c: 3495.7
inv_txfm_add_8x32_identity_identity_1_8bpc_ssse3: 97.7
inv_txfm_add_8x32_identity_identity_2_8bpc_c: 3503.7
inv_txfm_add_8x32_identity_identity_2_8bpc_ssse3: 97.8
inv_txfm_add_8x32_identity_identity_3_8bpc_c: 3489.5
inv_txfm_add_8x32_identity_identity_3_8bpc_ssse3: 184.4
inv_txfm_add_8x32_identity_identity_4_8bpc_c: 3498.1
inv_txfm_add_8x32_identity_identity_4_8bpc_ssse3: 182.8
inv_txfm_add_32x8_dct_dct_0_8bpc_c: 1220.4
inv_txfm_add_32x8_dct_dct_0_8bpc_ssse3: 65.6
inv_txfm_add_32x8_dct_dct_1_8bpc_c: 11120.7
inv_txfm_add_32x8_dct_dct_1_8bpc_ssse3: 623.8
inv_txfm_add_32x8_dct_dct_2_8bpc_c: 12236.3
inv_txfm_add_32x8_dct_dct_2_8bpc_ssse3: 624.7
inv_txfm_add_32x8_dct_dct_3_8bpc_c: 10866.3
inv_txfm_add_32x8_dct_dct_3_8bpc_ssse3: 694.1
inv_txfm_add_32x8_dct_dct_4_8bpc_c: 10322.8
inv_txfm_add_32x8_dct_dct_4_8bpc_ssse3: 692.5
inv_txfm_add_32x8_identity_identity_0_8bpc_c: 3368.1
inv_txfm_add_32x8_identity_identity_0_8bpc_ssse3: 98.6
inv_txfm_add_32x8_identity_identity_1_8bpc_c: 3381.1
inv_txfm_add_32x8_identity_identity_1_8bpc_ssse3: 98.3
inv_txfm_add_32x8_identity_identity_2_8bpc_c: 3376.6
inv_txfm_add_32x8_identity_identity_2_8bpc_ssse3: 98.3
inv_txfm_add_32x8_identity_identity_3_8bpc_c: 3364.3
inv_txfm_add_32x8_identity_identity_3_8bpc_ssse3: 182.2
inv_txfm_add_32x8_identity_identity_4_8bpc_c: 3390.0
inv_txfm_add_32x8_identity_identity_4_8bpc_ssse3: 182.2
-
- Mar 18, 2019
-
-
cfl_ac_420_w4_8bpc_c: 1621.0
cfl_ac_420_w4_8bpc_ssse3: 92.5
cfl_ac_420_w8_8bpc_c: 3344.1
cfl_ac_420_w8_8bpc_ssse3: 115.4
cfl_ac_420_w16_8bpc_c: 6024.9
cfl_ac_420_w16_8bpc_ssse3: 187.8
cfl_ac_422_w4_8bpc_c: 1762.5
cfl_ac_422_w4_8bpc_ssse3: 81.4
cfl_ac_422_w8_8bpc_c: 4941.2
cfl_ac_422_w8_8bpc_ssse3: 166.5
cfl_ac_422_w16_8bpc_c: 8261.8
cfl_ac_422_w16_8bpc_ssse3: 272.3
-
- Mar 16, 2019
-
-
James Almer authored
This check was already done in dav1d_parse_obus(), so it's added as an assert here for extra precaution.
-
James Almer authored
Its previous contents don't need to be preserved.
-
- Mar 14, 2019
-
-
-
Fixes tests on big endian architectures.
-
- Mar 13, 2019
-
-
Jean-Baptiste Kempf authored
-
- Mar 12, 2019
-
-
James Almer authored
And the API version as the file version.
-
- Mar 11, 2019
-
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
This optimization is so small that 10 runs with a fixed seed were needed to get relevant numbers. This has been done for the 3x3 case only.
before: mean=113265.42 stddev=954.392
after:  mean=112654.71 stddev=884.833
-
Victorien Le Couviour--Tuffet authored
This optimization is so tiny we can't even see it in checkasm. The only actual difference being the removal of a memory load, it has to be better.
-
Jean-Baptiste Kempf authored
-
- Mar 09, 2019
-
-
Janne Grunau authored
Refs #241, Closes #255.
-
Jean-Baptiste Kempf authored
-
- Mar 08, 2019
-
-
Increments the soname revision number for this behavior change. Removes the DAV1D_VERSION and DAV1D_VERSION_INT defines and dav1d_version_vcs() and dav1d_version_int(). Also cleans up the version usage in dav1d CLI. Refs #241, #255.
-
Victorien Le Couviour--Tuffet authored
x86_64:
cdef_dir_8bpc_c: 1023.1
cdef_dir_8bpc_ssse3: 110.3
cdef_dir_8bpc_avx2: 71.1

x86_32:
cdef_dir_8bpc_c: 1074.8
cdef_dir_8bpc_ssse3: 120.6

Thanks to Ronald for the AVX2 XMM version which was a very good starting point.
-
- Mar 06, 2019
-
-
Martin Storsjö authored
-