  1. Apr 19, 2019
  2. Apr 18, 2019
    • Add SSSE3 implementation for the {16, 32, 64}x64 and 64x{16, 32} blocks in itx · 589e96a1
      Liwei Wang authored
      Cycle times:
      inv_txfm_add_16x64_dct_dct_0_8bpc_c: 3973.5
      inv_txfm_add_16x64_dct_dct_0_8bpc_ssse3: 185.7
      inv_txfm_add_16x64_dct_dct_1_8bpc_c: 37869.1
      inv_txfm_add_16x64_dct_dct_1_8bpc_ssse3: 2103.1
      inv_txfm_add_16x64_dct_dct_2_8bpc_c: 37822.9
      inv_txfm_add_16x64_dct_dct_2_8bpc_ssse3: 2099.1
      inv_txfm_add_16x64_dct_dct_3_8bpc_c: 37871.7
      inv_txfm_add_16x64_dct_dct_3_8bpc_ssse3: 2663.5
      inv_txfm_add_16x64_dct_dct_4_8bpc_c: 38002.9
      inv_txfm_add_16x64_dct_dct_4_8bpc_ssse3: 2589.7
      inv_txfm_add_32x64_dct_dct_0_8bpc_c: 8319.2
      inv_txfm_add_32x64_dct_dct_0_8bpc_ssse3: 376.9
      inv_txfm_add_32x64_dct_dct_1_8bpc_c: 85956.8
      inv_txfm_add_32x64_dct_dct_1_8bpc_ssse3: 4298.1
      inv_txfm_add_32x64_dct_dct_2_8bpc_c: 89906.2
      inv_txfm_add_32x64_dct_dct_2_8bpc_ssse3: 4291.3
      inv_txfm_add_32x64_dct_dct_3_8bpc_c: 83710.9
      inv_txfm_add_32x64_dct_dct_3_8bpc_ssse3: 5589.5
      inv_txfm_add_32x64_dct_dct_4_8bpc_c: 87733.5
      inv_txfm_add_32x64_dct_dct_4_8bpc_ssse3: 5658.4
      i...
      589e96a1
  3. Apr 17, 2019
    • Over-allocate level array by 3 bytes · 36e1490b
      Ronald S. Bultje authored
      This is a workaround so that the AVX2 implementation of deblock can
      index the levels array starting from the level type, which causes it
      to over-read by up to 3 bytes. This is intended to fix #269.
      36e1490b
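The over-read described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each entry of the level array holds four one-byte level types and that the SIMD code performs a 4-byte load starting at the byte of the chosen level type; the names `alloc_levels` and `load4` are hypothetical and do not come from the source.

```python
import struct

ENTRY_SIZE = 4  # assumption: four one-byte level types per entry
PAD = 3         # the 3-byte over-allocation named in the commit title


def alloc_levels(n_entries, pad=PAD):
    """Allocate a zeroed level array with `pad` spare bytes at the end."""
    return bytearray(n_entries * ENTRY_SIZE + pad)


def load4(levels, entry, level_type):
    """Emulate a 4-byte load that starts at the level type's byte.

    For level_type > 0 the load runs past the end of the entry, and for
    the last entry it runs past the end of the unpadded array.
    """
    offset = entry * ENTRY_SIZE + level_type
    return struct.unpack_from("<I", levels, offset)[0]
```

With the padding in place, a load from the last entry at level type 3 stays inside the buffer; without it, `struct.unpack_from` raises `struct.error`, which stands in for the out-of-bounds read the workaround avoids.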
  4. Apr 16, 2019
    • arm64: loopfilter: Implement NEON loop filters · 0282f6f3
      Martin Storsjö authored
      The exact relative speedup compared to C code is a bit vague and hard
      to measure, depending on exactly how many filtered blocks are skipped,
      as the NEON version always filters 16 pixels at a time, while the
      C code can skip processing individual 4 pixel blocks.
      
      Additionally, the checkasm benchmarking code runs the same function
      repeatedly on the same buffer, which can make the filter take
      different codepaths on each run, as the function updates the buffer
      which will be used as input for the next run.
      
      When the checkasm test data is tweaked to avoid skipped blocks,
      the relative speedup compared to C is between 2x and 5x, while
      it is around 1x to 4x with the current checkasm test as is.
      
      Benchmark numbers from a tweaked checkasm that avoids skipped
      blocks:
      
                              Cortex A53     A72     A73
      lpf_h_sb_uv_w4_8bpc_c:      2954.7  1399.3  1655.3
      lpf_h_sb_uv_w4_8bpc_neon:    895.5   650.8   692.0
      lpf_h_sb_uv_w6_8bpc_c:      3879.2  1917.2  2257.7
      lpf_h_sb_uv_w6_8bpc_neon:   1125.6   759.5   838.4
      lpf_h_sb_y_w4_8bpc_c:       6711.0  3275.5  3913.7
      lpf_h_sb_y_w4_8bpc_neon:    1744.0  1342.1  1351.5
      lpf_h_sb_y_w8_8bpc_c:      10695.7  6155.8  6638.9
      lpf_h_sb_y_w8_8bpc_neon:    2146.5  1560.4  1609.1
      lpf_h_sb_y_w16_8bpc_c:     11355.8  6292.0  6995.9
      lpf_h_sb_y_w16_8bpc_neon:   2475.4  1949.6  1968.4
      lpf_v_sb_uv_w4_8bpc_c:      2639.7  1204.8  1425.9
      lpf_v_sb_uv_w4_8bpc_neon:    510.7   351.4   334.7
      lpf_v_sb_uv_w6_8bpc_c:      3468.3  1757.1  2021.5
      lpf_v_sb_uv_w6_8bpc_neon:    625.0   415.0   397.8
      lpf_v_sb_y_w4_8bpc_c:       5428.7  2731.7  3068.5
      lpf_v_sb_y_w4_8bpc_neon:    1172.6   792.1   768.0
      lpf_v_sb_y_w8_8bpc_c:       8946.1  4412.8  5121.0
      lpf_v_sb_y_w8_8bpc_neon:    1565.5  1063.6  1062.7
      lpf_v_sb_y_w16_8bpc_c:      8978.9  4411.7  5112.0
      lpf_v_sb_y_w16_8bpc_neon:   1775.0  1288.1  1236.7
      0282f6f3
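The relative speedups can be recomputed from the cycle counts in the table above. The following is a minimal sketch using two rows copied from that table; the helper names are hypothetical, and the ratio is simply C cycles divided by NEON cycles.

```python
# Cycle counts copied from the table above, as (C, NEON) pairs per core.
CYCLES = {
    "lpf_h_sb_uv_w4_8bpc": {
        "A53": (2954.7, 895.5),
        "A72": (1399.3, 650.8),
        "A73": (1655.3, 692.0),
    },
    "lpf_v_sb_y_w8_8bpc": {
        "A53": (8946.1, 1565.5),
        "A72": (4412.8, 1063.6),
        "A73": (5121.0, 1062.7),
    },
}


def speedup(c_cycles, simd_cycles):
    """Relative speedup of the SIMD version over the C version."""
    return c_cycles / simd_cycles


def speedup_table(cycles):
    """Map each function and core to its speedup, rounded to two decimals."""
    return {
        name: {core: round(speedup(c, simd), 2)
               for core, (c, simd) in per_core.items()}
        for name, per_core in cycles.items()
    }


if __name__ == "__main__":
    for name, per_core in speedup_table(CYCLES).items():
        print(name, per_core)
```

For these two rows the ratios run from a little over 2x on the A72 to somewhat above 5x on the A53, so the quoted range is best read as approximate.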
    • arm64: looprestoration: Add a NEON implementation of SGR · 204bf211
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      Relative speedup vs (autovectorized) C code:
                            Cortex A53    A72    A73
      selfguided_3x3_8bpc_neon:   2.91   2.12   2.68
      selfguided_5x5_8bpc_neon:   3.18   2.65   3.39
      selfguided_mix_8bpc_neon:   3.04   2.29   2.98
      
      The relative speedup vs non-vectorized C code is around 2.6-4.6x.
      204bf211
    • msac: Add a cast to indicate intended narrowing from size_t to unsigned · 003fa104
      Martin Storsjö authored
      This fixes this compiler warning with MSVC:
      ../src/msac.c(148): warning C4267: '+=': conversion from 'size_t' to 'unsigned int', possible loss of data
      003fa104
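The warning concerns a wider `size_t` value being stored in a narrower `unsigned int`. As a rough sketch of why the compiler flags this, the following Python function mimics the C conversion, which keeps only the low bits of the value. The 32-bit width is an assumption (it holds for `unsigned int` on 64-bit MSVC targets), and the function name is hypothetical.

```python
UINT_BITS = 32  # assumed width of `unsigned int`


def narrow_to_unsigned(value, bits=UINT_BITS):
    """Keep only the low `bits` bits, like a C cast from size_t to unsigned."""
    if value < 0:
        raise ValueError("size_t values are non-negative")
    return value & ((1 << bits) - 1)
```

Small values pass through unchanged, while anything at or above 2**32 wraps around. An explicit cast in the C source does not change this behavior; it only records that the narrowing is intended, which is what silences the warning.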
  5. Apr 15, 2019
  6. Apr 10, 2019
    • Add SSSE3 implementation for ipred_paeth · 44d0de41
      Xuefeng Jiang authored and Henrik Gramner committed
      intra_pred_paeth_w4_8bpc_c: 561.6
      intra_pred_paeth_w4_8bpc_ssse3: 49.2
      intra_pred_paeth_w8_8bpc_c: 1475.8
      intra_pred_paeth_w8_8bpc_ssse3: 103.0
      intra_pred_paeth_w16_8bpc_c: 4697.8
      intra_pred_paeth_w16_8bpc_ssse3: 279.0
      intra_pred_paeth_w32_8bpc_c: 13245.1
      intra_pred_paeth_w32_8bpc_ssse3: 614.7
      intra_pred_paeth_w64_8bpc_c: 32638.9
      intra_pred_paeth_w64_8bpc_ssse3: 1477.6
      44d0de41
  7. Apr 08, 2019
  8. Apr 07, 2019
    • arm: Fix typos in comments · 556780b7
      Martin Storsjö authored
      The width register has been set to clz(w)-24, not the other way
      around. And the 32 bit prep function has got the h parameter in
      r4, not in r5.
      556780b7
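The expression `clz(w)-24` maps power-of-two widths onto small consecutive integers. A minimal sketch of that arithmetic, assuming a 32-bit count-leading-zeros and block widths from 4 to 128 (both assumptions, as are the helper names):

```python
def clz32(x):
    """Count leading zero bits of x viewed as a 32-bit unsigned integer."""
    if not 0 < x < 2**32:
        raise ValueError("expected a non-zero 32-bit value")
    return 32 - x.bit_length()


def width_index(w):
    """Value of clz(w) - 24 for a block width w."""
    return clz32(w) - 24


if __name__ == "__main__":
    for w in (128, 64, 32, 16, 8, 4):
        print(w, width_index(w))
```

Under these assumptions a width of 128 gives 0 and a width of 4 gives 5, so the result is suitable as a compact table index, with wider blocks first.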
  9. Apr 04, 2019
  10. Mar 28, 2019
    • CI: Check for newline at end of file · abb972a5
      Henrik Gramner authored and Henrik Gramner committed
      abb972a5
    • x86: cdef_dir: optimize best cost finding for SSE · 91568b2a
      Victorien Le Couviour--Tuffet authored
      Port of 65ee1233 (for AVX2, from Kyle Siefring) to SSE4.1,
      and optimize SSSE3.
      
      ---------------------
      x86_64:
      ------------------------------------------
      before: cdef_dir_8bpc_ssse3: 110.3
       after: cdef_dir_8bpc_ssse3: 105.9
         new: cdef_dir_8bpc_sse4:   96.4
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      before: cdef_dir_8bpc_ssse3: 120.6
       after: cdef_dir_8bpc_ssse3: 110.7
         new: cdef_dir_8bpc_sse4:  106.5
      ------------------------------------------
      91568b2a
    • x86: cdef_filter: use 8-bit arithmetic for SSE · 75e88fab
      Victorien Le Couviour--Tuffet authored
      Port of c204da0f for AVX-2
      from Kyle Siefring.
      
      ---------------------
      x86_64:
      ------------------------------------------
      before: cdef_filter_4x4_8bpc_ssse3: 141.7
       after: cdef_filter_4x4_8bpc_ssse3: 131.6
      before: cdef_filter_4x4_8bpc_sse4: 128.3
       after: cdef_filter_4x4_8bpc_sse4: 119.0
      ------------------------------------------
      before: cdef_filter_4x8_8bpc_ssse3: 253.4
       after: cdef_filter_4x8_8bpc_ssse3: 236.1
      before: cdef_filter_4x8_8bpc_sse4: 228.5
       after: cdef_filter_4x8_8bpc_sse4: 213.2
      ------------------------------------------
      before: cdef_filter_8x8_8bpc_ssse3: 429.6
       after: cdef_filter_8x8_8bpc_ssse3: 386.9
      before: cdef_filter_8x8_8bpc_sse4: 379.9
       after: cdef_filter_8x8_8bpc_sse4: 335.9
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      before: cdef_filter_4x4_8bpc_ssse3: 184.3
       after: cdef_filter_4x4_8bpc_ssse3: 163.3
      before: cdef_filter_4x4_8bpc_sse4: 168.9
       after: cdef_filter_4x4_8bpc_sse4: 146.1
      ------------------------------------------
      before: cdef_filter_4x8_8bpc_ssse3: 335.3
       after: cdef_filter_4x8_8bpc_ssse3: 280.7
      before: cdef_filter_4x8_8bpc_sse4: 305.1
       after: cdef_filter_4x8_8bpc_sse4: 257.9
      ------------------------------------------
      before: cdef_filter_8x8_8bpc_ssse3: 579.1
       after: cdef_filter_8x8_8bpc_ssse3: 500.5
      before: cdef_filter_8x8_8bpc_sse4: 517.0
       after: cdef_filter_8x8_8bpc_sse4: 455.8
      ------------------------------------------
      75e88fab
    • x86: cdef_filter: use a better constant for SSE4 · 22c3594d
      Victorien Le Couviour--Tuffet authored
      Port of dc2ae517 for AVX-2
      from Kyle Siefring.
      
      ---------------------
      x86_64:
      ------------------------------------------
      cdef_filter_4x4_8bpc_ssse3: 141.7
      cdef_filter_4x4_8bpc_sse4: 128.3
      ------------------------------------------
      cdef_filter_4x8_8bpc_ssse3: 253.4
      cdef_filter_4x8_8bpc_sse4: 228.5
      ------------------------------------------
      cdef_filter_8x8_8bpc_ssse3: 429.6
      cdef_filter_8x8_8bpc_sse4: 379.9
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      cdef_filter_4x4_8bpc_ssse3: 184.3
      cdef_filter_4x4_8bpc_sse4: 168.9
      ------------------------------------------
      cdef_filter_4x8_8bpc_ssse3: 335.3
      cdef_filter_4x8_8bpc_sse4: 305.1
      ------------------------------------------
      cdef_filter_8x8_8bpc_ssse3: 579.1
      cdef_filter_8x8_8bpc_sse4: 517.0
      ------------------------------------------
      22c3594d
  11. Mar 27, 2019
    • Add SSSE3 implementation for the 16x32, 32x16 and 32x32 blocks in itx · bd12b1ec
      Liwei Wang authored
      Cycle times:
      inv_txfm_add_16x32_dct_dct_0_8bpc_c: 2464.6
      inv_txfm_add_16x32_dct_dct_0_8bpc_ssse3: 121.6
      inv_txfm_add_16x32_dct_dct_1_8bpc_c: 24751.6
      inv_txfm_add_16x32_dct_dct_1_8bpc_ssse3: 1101.9
      inv_txfm_add_16x32_dct_dct_2_8bpc_c: 24377.0
      inv_txfm_add_16x32_dct_dct_2_8bpc_ssse3: 1117.2
      inv_txfm_add_16x32_dct_dct_3_8bpc_c: 24155.6
      inv_txfm_add_16x32_dct_dct_3_8bpc_ssse3: 2349.3
      inv_txfm_add_16x32_dct_dct_4_8bpc_c: 24175.6
      inv_txfm_add_16x32_dct_dct_4_8bpc_ssse3: 1642.0
      inv_txfm_add_16x32_identity_identity_0_8bpc_c: 10304.7
      inv_txfm_add_16x32_identity_identity_0_8bpc_ssse3: 137.7
      inv_txfm_add_16x32_identity_identity_1_8bpc_c: 10341.6
      inv_txfm_add_16x32_identity_identity_1_8bpc_ssse3: 137.9
      inv_txfm_add_16x32_identity_identity_2_8bpc_c: 10299.9
      inv_txfm_add_16x32_identity_identity_2_8bpc_ssse3: 253.9
      inv_txfm_add_16x32_identity_identity_3_8bpc_c: 10331.4
      inv_txfm_add_16x32_identity_identity_3_8bpc_ssse3: 369.7
      inv_txfm_add_16x32_identity_identity_4_8bpc_c: 10360.4
      inv_txfm_add_16x32_identity_identity_4_8bpc_ssse3: 484.0
      inv_txfm_add_32x16_dct_dct_0_8bpc_c: 2288.4
      inv_txfm_add_32x16_dct_dct_0_8bpc_ssse3: 142.3
      inv_txfm_add_32x16_dct_dct_1_8bpc_c: 23819.9
      inv_txfm_add_32x16_dct_dct_1_8bpc_ssse3: 1740.1
      inv_txfm_add_32x16_dct_dct_2_8bpc_c: 23755.8
      inv_txfm_add_32x16_dct_dct_2_8bpc_ssse3: 1641.4
      inv_txfm_add_32x16_dct_dct_3_8bpc_c: 23839.9
      inv_txfm_add_32x16_dct_dct_3_8bpc_ssse3: 1559.0
      inv_txfm_add_32x16_dct_dct_4_8bpc_c: 23757.7
      inv_txfm_add_32x16_dct_dct_4_8bpc_ssse3: 1579.0
      inv_txfm_add_32x16_identity_identity_0_8bpc_c: 10381.7
      inv_txfm_add_32x16_identity_identity_0_8bpc_ssse3: 126.3
      inv_txfm_add_32x16_identity_identity_1_8bpc_c: 10402.5
      inv_txfm_add_32x16_identity_identity_1_8bpc_ssse3: 126.5
      inv_txfm_add_32x16_identity_identity_2_8bpc_c: 10429.2
      inv_txfm_add_32x16_identity_identity_2_8bpc_ssse3: 244.9
      inv_txfm_add_32x16_identity_identity_3_8bpc_c: 10382.0
      inv_txfm_add_32x16_identity_identity_3_8bpc_ssse3: 491.0
      inv_txfm_add_32x16_identity_identity_4_8bpc_c: 10381.0
      inv_txfm_add_32x16_identity_identity_4_8bpc_ssse3: 468.0
      inv_txfm_add_32x32_dct_dct_0_8bpc_c: 4168.2
      inv_txfm_add_32x32_dct_dct_0_8bpc_ssse3: 204.0
      inv_txfm_add_32x32_dct_dct_1_8bpc_c: 46306.2
      inv_txfm_add_32x32_dct_dct_1_8bpc_ssse3: 2216.0
      inv_txfm_add_32x32_dct_dct_2_8bpc_c: 46300.2
      inv_txfm_add_32x32_dct_dct_2_8bpc_ssse3: 2194.2
      inv_txfm_add_32x32_dct_dct_3_8bpc_c: 46350.1
      inv_txfm_add_32x32_dct_dct_3_8bpc_ssse3: 3484.4
      inv_txfm_add_32x32_dct_dct_4_8bpc_c: 46318.1
      inv_txfm_add_32x32_dct_dct_4_8bpc_ssse3: 3440.9
      inv_txfm_add_32x32_identity_identity_0_8bpc_c: 14663.1
      inv_txfm_add_32x32_identity_identity_0_8bpc_ssse3: 179.0
      inv_txfm_add_32x32_identity_identity_1_8bpc_c: 14737.0
      inv_txfm_add_32x32_identity_identity_1_8bpc_ssse3: 179.2
      inv_txfm_add_32x32_identity_identity_2_8bpc_c: 14640.4
      inv_txfm_add_32x32_identity_identity_2_8bpc_ssse3: 179.1
      inv_txfm_add_32x32_identity_identity_3_8bpc_c: 14638.5
      inv_txfm_add_32x32_identity_identity_3_8bpc_ssse3: 663.8
      inv_txfm_add_32x32_identity_identity_4_8bpc_c: 14635.6
      inv_txfm_add_32x32_identity_identity_4_8bpc_ssse3: 663.9
      bd12b1ec
  12. Mar 26, 2019
  13. Mar 24, 2019
  14. Mar 20, 2019
  15. Mar 19, 2019
    • Add SSSE3 implementation for the 8x32 and 32x8 blocks in itx · 585ac462
      Liwei Wang authored
      Cycle times:
      inv_txfm_add_8x32_dct_dct_0_8bpc_c: 1164.7
      inv_txfm_add_8x32_dct_dct_0_8bpc_ssse3: 79.5
      inv_txfm_add_8x32_dct_dct_1_8bpc_c: 11291.6
      inv_txfm_add_8x32_dct_dct_1_8bpc_ssse3: 508.5
      inv_txfm_add_8x32_dct_dct_2_8bpc_c: 10720.4
      inv_txfm_add_8x32_dct_dct_2_8bpc_ssse3: 507.9
      inv_txfm_add_8x32_dct_dct_3_8bpc_c: 12351.5
      inv_txfm_add_8x32_dct_dct_3_8bpc_ssse3: 687.2
      inv_txfm_add_8x32_dct_dct_4_8bpc_c: 10402.3
      inv_txfm_add_8x32_dct_dct_4_8bpc_ssse3: 687.9
      inv_txfm_add_8x32_identity_identity_0_8bpc_c: 3485.0
      inv_txfm_add_8x32_identity_identity_0_8bpc_ssse3: 97.7
      inv_txfm_add_8x32_identity_identity_1_8bpc_c: 3495.7
      inv_txfm_add_8x32_identity_identity_1_8bpc_ssse3: 97.7
      inv_txfm_add_8x32_identity_identity_2_8bpc_c: 3503.7
      inv_txfm_add_8x32_identity_identity_2_8bpc_ssse3: 97.8
      inv_txfm_add_8x32_identity_identity_3_8bpc_c: 3489.5
      inv_txfm_add_8x32_identity_identity_3_8bpc_ssse3: 184.4
      inv_txfm_add_8x32_identity_identity_4_8bpc_c: 3498.1
      inv_txfm_add_8x32_identity_identity_4_8bpc_ssse3: 182.8
      inv_txfm_add_32x8_dct_dct_0_8bpc_c: 1220.4
      inv_txfm_add_32x8_dct_dct_0_8bpc_ssse3: 65.6
      inv_txfm_add_32x8_dct_dct_1_8bpc_c: 11120.7
      inv_txfm_add_32x8_dct_dct_1_8bpc_ssse3: 623.8
      inv_txfm_add_32x8_dct_dct_2_8bpc_c: 12236.3
      inv_txfm_add_32x8_dct_dct_2_8bpc_ssse3: 624.7
      inv_txfm_add_32x8_dct_dct_3_8bpc_c: 10866.3
      inv_txfm_add_32x8_dct_dct_3_8bpc_ssse3: 694.1
      inv_txfm_add_32x8_dct_dct_4_8bpc_c: 10322.8
      inv_txfm_add_32x8_dct_dct_4_8bpc_ssse3: 692.5
      inv_txfm_add_32x8_identity_identity_0_8bpc_c: 3368.1
      inv_txfm_add_32x8_identity_identity_0_8bpc_ssse3: 98.6
      inv_txfm_add_32x8_identity_identity_1_8bpc_c: 3381.1
      inv_txfm_add_32x8_identity_identity_1_8bpc_ssse3: 98.3
      inv_txfm_add_32x8_identity_identity_2_8bpc_c: 3376.6
      inv_txfm_add_32x8_identity_identity_2_8bpc_ssse3: 98.3
      inv_txfm_add_32x8_identity_identity_3_8bpc_c: 3364.3
      inv_txfm_add_32x8_identity_identity_3_8bpc_ssse3: 182.2
      inv_txfm_add_32x8_identity_identity_4_8bpc_c: 3390.0
      inv_txfm_add_32x8_identity_identity_4_8bpc_ssse3: 182.2
      585ac462
  16. Mar 18, 2019
    • Add SSSE3 implementation for ipred_cfl_ac_420 and ipred_cfl_ac_422 · 5d944dc6
      Xuefeng Jiang authored and Henrik Gramner committed
      cfl_ac_420_w4_8bpc_c: 1621.0
      cfl_ac_420_w4_8bpc_ssse3: 92.5
      cfl_ac_420_w8_8bpc_c: 3344.1
      cfl_ac_420_w8_8bpc_ssse3: 115.4
      cfl_ac_420_w16_8bpc_c: 6024.9
      cfl_ac_420_w16_8bpc_ssse3: 187.8
      cfl_ac_422_w4_8bpc_c: 1762.5
      cfl_ac_422_w4_8bpc_ssse3: 81.4
      cfl_ac_422_w8_8bpc_c: 4941.2
      cfl_ac_422_w8_8bpc_ssse3: 166.5
      cfl_ac_422_w16_8bpc_c: 8261.8
      cfl_ac_422_w16_8bpc_ssse3: 272.3
      5d944dc6
  17. Mar 16, 2019
  18. Mar 14, 2019
  19. Mar 13, 2019
  20. Mar 12, 2019
  21. Mar 11, 2019
  22. Mar 09, 2019
  23. Mar 08, 2019
    • let dav1d_version() return the project version · 754487c0
      Janne Grunau authored and Jean-Baptiste Kempf committed
      Increments the soname revision number for this behavior change.
      Removes the DAV1D_VERSION and DAV1D_VERSION_INT defines and
      dav1d_version_vcs() and dav1d_version_int().
      Also cleans up the version usage in dav1d CLI.
      Refs #241, #255.
      754487c0
    • x86: add SSSE3 cdef dir implementation · d67e3476
      Victorien Le Couviour--Tuffet authored
      ---------------------
      x86_64:
      ------------------------------------------
      cdef_dir_8bpc_c: 1023.1
      cdef_dir_8bpc_ssse3: 110.3
      cdef_dir_8bpc_avx2: 71.1
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      cdef_dir_8bpc_c: 1074.8
      cdef_dir_8bpc_ssse3: 120.6
      ------------------------------------------
      
      Thanks to Ronald for the AVX2 XMM version which was a very good starting
      point.
      d67e3476
  24. Mar 06, 2019