1. 09 Oct, 2019 2 commits
  2. 08 Oct, 2019 10 commits
    • Jean-Baptiste Kempf's avatar
      Move snap to package/ subfolder · 3e0f1508
      Jean-Baptiste Kempf authored
      3e0f1508
    • Martin Storsjö's avatar
      arm: mc: Port the ARM64 warp filter to arm32 · 61442bee
      Martin Storsjö authored
      Relative speedup over C code:
                        Cortex A7     A8     A9    A53    A72    A73
      warp_8x8_8bpc_neon:    2.79   5.45   4.18   3.96   4.16   4.51
      warp_8x8t_8bpc_neon:   2.79   5.33   4.18   3.98   4.22   4.25
      
      Comparison to original ARM64 assembly:
      
      ARM64:            Cortex A53     A72     A73
      warp_8x8_8bpc_neon:   1854.6  1072.5  1102.5
      warp_8x8t_8bpc_neon:  1839.6  1069.4  1089.5
      ARM32:
      warp_8x8_8bpc_neon:   2132.5  1160.3  1218.0
      warp_8x8t_8bpc_neon:  2113.7  1148.0  1209.1
      61442bee
    • Martin Storsjö's avatar
      arm64: mc: Use addp instead of addv+trn1 in warp · 5647a57e
      Martin Storsjö authored
      Before:           Cortex A53     A72     A73
      warp_8x8_8bpc_neon:   1952.8  1161.3  1151.1
      warp_8x8t_8bpc_neon:  1937.1  1147.5  1139.0
      After:
      warp_8x8_8bpc_neon:   1860.8  1068.6  1105.8
      warp_8x8t_8bpc_neon:  1846.9  1056.4  1099.8
      5647a57e
    • Martin Storsjö's avatar
      arm: cdef: Port the ARM64 CDEF NEON assembly to 32 bit arm · 3489a9c1
      Martin Storsjö authored
      The relative speedup ranges from 2.5 to 3.8x for find_dir and
      around 5 to 10x for filter.
      
      The find_dir function is a bit restricted by barely having enough
      registers, leaving very few for temporaries, so fewer things can
      be done in parallel and many instructions end up depending on the
      result of the preceding instruction.
      
      The ported functions end up slightly slower than the corresponding
      ARM64 ones, but only marginally:
      
      ARM64:                   Cortex A53     A72     A73
      cdef_dir_8bpc_neon:           400.0   268.8   282.2
      cdef_filter_4x4_8bpc_neon:    596.3   359.9   379.7
      cdef_filter_4x8_8bpc_neon:   1091.0   670.4   698.5
      cdef_filter_8x8_8bpc_neon:   1998.7  1207.2  1218.4
      ARM32:
      cdef_dir_8bpc_neon:           528.5   329.1   337.4
      cdef_filter_4x4_8bpc_neon:    632.5   482.5   432.2
      cdef_filter_4x8_8bpc_neon:   1107.2   854.8   782.3
      cdef_filter_8x8_8bpc_neon:   1984.8  1381.0  1414.4
      
      Relative speedup over C code:
                              Cortex A7     A8     A9    A53    A72    A73
      cdef_dir_8bpc_neon:          2.92   2.54   2.67   3.87   3.37   3.83
      cdef_filter_4x4_8bpc_neon:   5.09   7.61   6.10   6.85   4.94   7.41
      cdef_filter_4x8_8bpc_neon:   5.53   8.23   6.77   7.67   5.60   8.01
      cdef_filter_8x8_8bpc_neon:   6.26  10.14   8.49   8.54   6.94   4.27
      3489a9c1
    • Luc Trudeau's avatar
      7bbc5e3d
    • Martin Storsjö's avatar
      arm64: cdef: Improve find_dir · dfaa2a10
      Martin Storsjö authored
      Only add .4h elements to the upper half of sum_alt, as only 11
      elements are needed, and .8h + .4h gives 12 in total.
      
      Fuse two consecutive ext #8 + ext #2 into ext #10.
      
      Move a few stores further away from where they are calculated.
      
      Before:         Cortex A53     A72     A73
      cdef_dir_8bpc_neon:  404.0   278.2   302.4
      After:
      cdef_dir_8bpc_neon:  400.0   269.3   282.5
      dfaa2a10
    • Martin Storsjö's avatar
      arm64: cdef: Calculate two initial parameters in the same vector · fa6a0924
      Martin Storsjö authored
      As there are only two individual parameters, we can insert them into
      the same vector, reducing the number of actual calculation instructions,
      but adding a few more instructions to dup the results to the final
      vectors instead.
      fa6a0924
    • Martin Storsjö's avatar
      arm64: cdef: Rewrite an expression slightly · bc26e300
      Martin Storsjö authored
      Instead of apply_sign(imin(abs(diff), clip), diff), do
      imax(imin(diff, clip), -clip).
      
      Before:                  Cortex A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    592.7   374.5   384.5
      cdef_filter_4x8_8bpc_neon:   1093.0   704.4   706.6
      cdef_filter_8x8_8bpc_neon:   1962.6  1239.4  1252.1
      After:
      cdef_filter_4x4_8bpc_neon:    593.7   355.5   373.2
      cdef_filter_4x8_8bpc_neon:   1091.6   663.2   685.3
      cdef_filter_8x8_8bpc_neon:   1964.2  1182.5  1210.8
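The rewrite described in this commit message can be checked with a scalar sketch. The helper names `imin`, `imax` and `apply_sign` come from the message itself, but their bodies below, the wrappers `constrain_old` and `constrain_new`, and the checker `forms_agree` are assumptions written for illustration, not the project's code. The sketch also assumes `clip` is non-negative and does not depend on the sign of `diff`.

```c
#include <assert.h>
#include <stdlib.h>

/* Helper names follow the commit message; the bodies are illustrative. */
static inline int imin(const int a, const int b) { return a < b ? a : b; }
static inline int imax(const int a, const int b) { return a > b ? a : b; }
/* Give the non-negative value v the sign of s. */
static inline int apply_sign(const int v, const int s) { return s < 0 ? -v : v; }

/* Original form: clamp the magnitude, then restore the sign. */
static int constrain_old(const int diff, const int clip) {
    assert(clip >= 0);
    return apply_sign(imin(abs(diff), clip), diff);
}

/* Rewritten form: clamp diff directly to the range [-clip, clip]. */
static int constrain_new(const int diff, const int clip) {
    assert(clip >= 0);
    return imax(imin(diff, clip), -clip);
}

/* Compare both forms exhaustively over a small range.
 * Returns 1 if they agree everywhere, 0 otherwise. */
static int forms_agree(const int max_abs_diff, const int max_clip) {
    for (int clip = 0; clip <= max_clip; clip++)
        for (int diff = -max_abs_diff; diff <= max_abs_diff; diff++)
            if (constrain_old(diff, clip) != constrain_new(diff, clip))
                return 0;
    return 1;
}
```

The second form needs no absolute value or sign transfer, only a min, a max and a negated `clip`, which is presumably what makes it cheaper in the NEON code.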
      bc26e300
  3. 07 Oct, 2019 3 commits
  4. 03 Oct, 2019 1 commit
  5. 02 Oct, 2019 4 commits
  6. 01 Oct, 2019 3 commits
    • Henrik Gramner's avatar
      Simplify README build instructions · 16e0741a
      Henrik Gramner authored
      16e0741a
    • Ronald S. Bultje's avatar
      Minor cleanup · f6a8cc0c
      Ronald S. Bultje authored
      f6a8cc0c
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of dc/h/v prediction modes · f7743da1
      Martin Storsjö authored
      Relative speedups over the C code:
                                    Cortex A53    A72    A73
      intra_pred_dc_128_w4_8bpc_neon:     2.08   1.47   2.17
      intra_pred_dc_128_w8_8bpc_neon:     3.33   2.49   4.03
      intra_pred_dc_128_w16_8bpc_neon:    3.93   3.86   3.75
      intra_pred_dc_128_w32_8bpc_neon:    3.14   3.79   2.90
      intra_pred_dc_128_w64_8bpc_neon:    3.68   1.97   2.42
      intra_pred_dc_left_w4_8bpc_neon:    2.41   1.70   2.23
      intra_pred_dc_left_w8_8bpc_neon:    3.53   2.41   3.32
      intra_pred_dc_left_w16_8bpc_neon:   3.87   3.54   3.34
      intra_pred_dc_left_w32_8bpc_neon:   4.10   3.60   2.76
      intra_pred_dc_left_w64_8bpc_neon:   3.72   2.00   2.39
      intra_pred_dc_top_w4_8bpc_neon:     2.27   1.66   2.07
      intra_pred_dc_top_w8_8bpc_neon:     3.83   2.69   3.43
      intra_pred_dc_top_w16_8bpc_neon:    3.66   3.60   3.20
      intra_pred_dc_top_w32_8bpc_neon:    3.92   3.54   2.66
      intra_pred_dc_top_w64_8bpc_neon:    3.60   1.98   2.30
      intra_pred_dc_w4_8bpc_neon:         2.29   1.42   2.16
      intra_pred_dc_w8_8bpc_neon:         3.56   2.83   3.05
      intra_pred_dc_w16_8bpc_neon:        3.46   3.37   3.15
      intra_pred_dc_w32_8bpc_neon:        3.79   3.41   2.74
      intra_pred_dc_w64_8bpc_neon:        3.52   2.01   2.41
      intra_pred_h_w4_8bpc_neon:         10.34   5.74   5.94
      intra_pred_h_w8_8bpc_neon:         12.13   6.33   6.43
      intra_pred_h_w16_8bpc_neon:        10.66   7.31   5.85
      intra_pred_h_w32_8bpc_neon:         6.28   4.18   2.88
      intra_pred_h_w64_8bpc_neon:         3.96   1.85   1.75
      intra_pred_v_w4_8bpc_neon:         11.44   6.12   7.57
      intra_pred_v_w8_8bpc_neon:         14.76   7.58   7.95
      intra_pred_v_w16_8bpc_neon:        11.34   6.28   5.88
      intra_pred_v_w32_8bpc_neon:         6.56   3.33   3.34
      intra_pred_v_w64_8bpc_neon:         4.57   1.24   1.97
      f7743da1
  7. 30 Sep, 2019 1 commit
    • Victorien Le Couviour--Tuffet's avatar
      x86: add warp_affine SSE4 and SSSE3 asm · a91a03b0
      Victorien Le Couviour--Tuffet authored
      ------------------------------------------
      x86_64: warp_8x8_8bpc_c: 1773.4
      x86_32: warp_8x8_8bpc_c: 1740.4
      ----------
      x86_64: warp_8x8_8bpc_ssse3: 317.5
      x86_32: warp_8x8_8bpc_ssse3: 378.4
      ----------
      x86_64: warp_8x8_8bpc_sse4: 303.7
      x86_32: warp_8x8_8bpc_sse4: 367.7
      ----------
      x86_64: warp_8x8_8bpc_avx2: 224.9
      ---------------------
      ---------------------
      x86_64: warp_8x8t_8bpc_c: 1664.6
      x86_32: warp_8x8t_8bpc_c: 1674.0
      ----------
      x86_64: warp_8x8t_8bpc_ssse3: 320.7
      x86_32: warp_8x8t_8bpc_ssse3: 379.5
      ----------
      x86_64: warp_8x8t_8bpc_sse4: 304.8
      x86_32: warp_8x8t_8bpc_sse4: 369.8
      ----------
      x86_64: warp_8x8t_8bpc_avx2: 228.5
      ------------------------------------------
      a91a03b0
  8. 29 Sep, 2019 3 commits
    • Martin Storsjö's avatar
      arm64: itx: Fix overflows in idct · 713aa34c
      Martin Storsjö authored
      Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed
      to be clipped.
      
      This fixes mismatches for some samples, see issue #299.
      
      Before:                                Cortex A53       A72       A73
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:        93.0      52.8      49.5
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       260.0     186.0     196.4
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1371.0     953.4    1028.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7363.2    4887.5    5135.8
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25029.0   17492.3   18404.5
      After:
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:       105.0      58.7      55.2
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       294.0     211.5     209.9
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1495.8    1050.4    1070.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7866.7    5197.8    5321.4
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25807.2   18619.3   18526.9
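A scalar sketch of the overflow this commit avoids: adding two 16-bit coefficients in 16 bits wraps around when the true sum does not fit, while widening to 32 bits first keeps the exact value. The function names `add_wrap16` and `add_widen32` are hypothetical and the code only models the arithmetic. It is not the NEON implementation.

```c
#include <stdint.h>

/* Add two 16-bit values and keep only the low 16 bits of the result,
 * as a non-saturating 16-bit vector add would. The wraparound is
 * modelled explicitly with unsigned arithmetic. */
static int16_t add_wrap16(const int16_t a, const int16_t b) {
    const uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    /* Reinterpret the low 16 bits as a signed value. */
    return (int16_t)(sum >= 0x8000 ? (int)sum - 0x10000 : (int)sum);
}

/* Widen both inputs to 32 bits before adding, so the sum is exact. */
static int32_t add_widen32(const int16_t a, const int16_t b) {
    return (int32_t)a + (int32_t)b;
}
```

For example, 30000 + 10000 is 40000 when widened, but wraps to -25536 in 16 bits. The extra widening work is consistent with the slightly higher cycle counts in the "After" table.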
      713aa34c
    • Martin Storsjö's avatar
      arm64: itx: Consistently use the factor 2896 in adst · 0ed3ad19
      Martin Storsjö authored
      The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
      0ed3ad19
    • Martin Storsjö's avatar
      arm64: itx: Use smull+smlal instead of addl+mul · a4950bce
      Martin Storsjö authored
      Even though smull+smlal does two multiplications instead of one,
      the combination seems to be better handled by actual cores.
      
      Before:                                 Cortex A53      A72      A73
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      356.0    279.2    278.0
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1785.0   1329.5   1308.8
      After:
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      360.0    253.2    269.3
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1793.1   1300.9   1254.0
      
      (In this particular case, it seems like it is a minor regression
      on A53, which is probably more due to having to change the ordering
      of some instructions, due to how smull+smlal+smull2+smlal2 overwrites
      the second output register sooner than an addl+addl2 would have, but
      in general, smull+smlal seems to be equally good or better than
      addl+mul on A53 as well.)
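The two instruction sequences compared in this commit compute the same value, since (a + b) * c equals a * c + b * c. A scalar sketch of both forms follows. The function names are hypothetical, each C statement only approximates what one vector instruction does per lane, and `coef` is assumed to be a small constant such as 2896 so that the 32-bit products cannot overflow.

```c
#include <stdint.h>

/* addl + mul: widen and add the two inputs, then one 32-bit multiply. */
static int32_t scale_sum_addl_mul(const int16_t a, const int16_t b,
                                  const int16_t coef) {
    const int32_t sum = (int32_t)a + (int32_t)b; /* addl */
    return sum * (int32_t)coef;                  /* mul  */
}

/* smull + smlal: widening multiply, then widening multiply-accumulate. */
static int32_t scale_sum_smull_smlal(const int16_t a, const int16_t b,
                                     const int16_t coef) {
    int32_t acc = (int32_t)a * (int32_t)coef;    /* smull */
    acc += (int32_t)b * (int32_t)coef;           /* smlal */
    return acc;
}
```

The second form performs two multiplications instead of one, which matches the trade-off the message describes. Which one is faster depends on the core, as the benchmark numbers show.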
      a4950bce
  9. 27 Sep, 2019 3 commits
    • Niklas Haas's avatar
      dav1dplay: initial support for --zerocopy · 490a1420
      Niklas Haas authored
      Right now this just allocates a new buffer for every frame, uses it,
      then discards it immediately. This is not optimal: either dav1d should
      start reusing buffers internally or we need to pool them in dav1dplay.
      
      As it stands, this is not really a performance gain. I'll have to
      investigate why, but my suspicion is that seeing any gains might require
      reusing buffers somewhere.
      
      Note: Thrashing buffers is not as bad as it initially seems. Not only
      does libplacebo pool and reuse GPU memory and buffer state objects
      internally, but this also absolves us from having to do any manual
      polling to figure out when the buffer is reusable again. Creating, using
      and immediately destroying buffers actually isn't as bad an approach as
      it might otherwise seem.
      
      It's entirely possible that this is only bad because of lock contention.
      As said, I'll have to investigate further...
      490a1420
    • Niklas Haas's avatar
      dav1dplay: add --untimed for benchmarking purposes · 3f35ef1f
      Niklas Haas authored
      Useful to test the effects of performance changes to the
      decoding/rendering loop as a whole.
      3f35ef1f
    • Niklas Haas's avatar
      dav1dplay: add --highquality to toggle render quality · f6ae8c9c
      Niklas Haas authored
      Only meaningful with libplacebo. The defaults are higher quality than
      SDL so it's an unfair comparison and definitely too much for slow iGPUs
      at 4K res. Make the defaults fast/dumb processing only, and guard the
      debanding/dithering/upscaling/etc. behind a new --highquality flag.
      f6ae8c9c
  10. 19 Sep, 2019 2 commits
    • Victorien Le Couviour--Tuffet's avatar
      x86: add 32-bit support to SSSE3 deblock lpf · c0865f35
      Victorien Le Couviour--Tuffet authored
      ------------------------------------------
      x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
      x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
      x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
      x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
      ---------------------
      x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
      x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
      x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
      x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
      ---------------------
      x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
      x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
      x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
      x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
      ---------------------
      x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
      x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
      x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
      x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
      ---------------------
      x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
      x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
      x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
      x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
      ---------------------
      x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
      x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
      x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
      x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
      ---------------------
      x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
      x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
      x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
      x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
      ---------------------
      x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
      x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
      x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
      x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
      ---------------------
      x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
      x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
      x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
      x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
      ---------------------
      x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
      x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
      x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
      x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
      ------------------------------------------
      c0865f35
    • Ronald S. Bultje's avatar
      x86: add deblocking loopfilters SSSE3 asm (64-bit) · 1e4e6c7a
      Ronald S. Bultje authored
      ---------------------
      x86_64:
      ------------------------------------------
      lpf_h_sb_uv_w4_8bpc_c: 430.6
      lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
      lpf_h_sb_uv_w4_8bpc_avx2: 200.4
      ---------------------
      lpf_h_sb_uv_w6_8bpc_c: 981.9
      lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
      lpf_h_sb_uv_w6_8bpc_avx2: 270.0
      ---------------------
      lpf_h_sb_y_w4_8bpc_c: 3001.7
      lpf_h_sb_y_w4_8bpc_ssse3: 466.3
      lpf_h_sb_y_w4_8bpc_avx2: 383.1
      ---------------------
      lpf_h_sb_y_w8_8bpc_c: 4457.7
      lpf_h_sb_y_w8_8bpc_ssse3: 818.9
      lpf_h_sb_y_w8_8bpc_avx2: 537.0
      ---------------------
      lpf_h_sb_y_w16_8bpc_c: 1967.9
      lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
      lpf_h_sb_y_w16_8bpc_avx2: 1078.2
      ---------------------
      lpf_v_sb_uv_w4_8bpc_c: 369.4
      lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
      lpf_v_sb_uv_w4_8bpc_avx2: 58.1
      ---------------------
      lpf_v_sb_uv_w6_8bpc_c: 769.6
      lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
      lpf_v_sb_uv_w6_8bpc_avx2: 117.8
      ---------------------
      lpf_v_sb_y_w4_8bpc_c: 772.4
      lpf_v_sb_y_w4_8bpc_ssse3: 179.8
      lpf_v_sb_y_w4_8bpc_avx2: 173.6
      ---------------------
      lpf_v_sb_y_w8_8bpc_c: 1660.2
      lpf_v_sb_y_w8_8bpc_ssse3: 468.3
      lpf_v_sb_y_w8_8bpc_avx2: 345.8
      ---------------------
      lpf_v_sb_y_w16_8bpc_c: 1889.6
      lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
      lpf_v_sb_y_w16_8bpc_avx2: 568.1
      ------------------------------------------
      1e4e6c7a
  11. 10 Sep, 2019 5 commits
  12. 06 Sep, 2019 1 commit
  13. 05 Sep, 2019 2 commits
    • Henrik Gramner's avatar
      Silence some clang-cl warnings · acad1a99
      Henrik Gramner authored
      For some reason the MSVC CRT _wassert() function is not flagged as
      __declspec(noreturn), so when using those headers the compiler will
      expect execution to continue after an assertion has been triggered
      and will therefore complain about the use of uninitialized variables
      when compiled in debug mode in certain code paths.
      
      Reorder some case statements as a workaround.
      acad1a99
    • Henrik Gramner's avatar
      x86: Fix buffer overread in mc put · 69dae683
      Henrik Gramner authored
      For w <= 32 we can't process more than two rows per loop iteration.
      
      Credit to OSS-Fuzz.
      69dae683