  1. 11 Oct, 2019 4 commits
  2. 10 Oct, 2019 11 commits
    • Luc Trudeau's avatar
      b7d7c8ce
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of the cfl_ac functions · 57dd0aae
      Martin Storsjö authored
      Relative speedup over the C code:
                            Cortex A53    A72    A73
      cfl_ac_420_w4_8bpc_neon:    7.73   6.48   9.22
      cfl_ac_420_w8_8bpc_neon:    6.70   5.56   6.95
      cfl_ac_420_w16_8bpc_neon:   6.51   6.93   6.67
      cfl_ac_422_w4_8bpc_neon:    9.25   7.70   9.75
      cfl_ac_422_w8_8bpc_neon:    8.53   5.95   7.13
      cfl_ac_422_w16_8bpc_neon:   7.08   6.87   6.06
      57dd0aae
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of the cfl_pred functions · c7693386
      Martin Storsjö authored
      Relative speedup over the C code:
                                   Cortex A53    A72    A73
      cfl_pred_cfl_128_w4_8bpc_neon:    10.81   7.90   9.80
      cfl_pred_cfl_128_w8_8bpc_neon:    18.38  11.15  13.24
      cfl_pred_cfl_128_w16_8bpc_neon:   16.52  10.83  16.00
      cfl_pred_cfl_128_w32_8bpc_neon:    3.27   3.60   3.70
      cfl_pred_cfl_left_w4_8bpc_neon:    9.82   7.38   8.76
      cfl_pred_cfl_left_w8_8bpc_neon:   17.22  10.63  11.97
      cfl_pred_cfl_left_w16_8bpc_neon:  16.03  10.49  15.66
      cfl_pred_cfl_left_w32_8bpc_neon:   3.28   3.61   3.72
      cfl_pred_cfl_top_w4_8bpc_neon:     9.74   7.39   9.29
      cfl_pred_cfl_top_w8_8bpc_neon:    17.48  10.89  12.58
      cfl_pred_cfl_top_w16_8bpc_neon:   16.01  10.62  15.31
      cfl_pred_cfl_top_w32_8bpc_neon:    3.25   3.62   3.75
      cfl_pred_cfl_w4_8bpc_neon:         8.39   6.34   8.04
      cfl_pred_cfl_w8_8bpc_neon:        15.99  10.12  12.42
      cfl_pred_cfl_w16_8bpc_neon:       15.25  10.40  15.12
      cfl_pred_cfl_w32_8bpc_neon:        3.23   3.58   3.71
      
      The C code gets autovectorized for w >= 32, which is why the
      relative speedup is much lower for w32 (the NEON functions
      themselves perform as expected).
      c7693386
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of the filter function · d322d451
      Martin Storsjö authored
      Use a different layout of the filter_intra_taps depending on
      architecture; the current one is optimized for the x86 SIMD
      implementation.
      
      Relative speedups over the C code:
                                   Cortex A53    A72    A73
      intra_pred_filter_w4_8bpc_neon:    6.38   2.81   4.43
      intra_pred_filter_w8_8bpc_neon:    9.30   3.62   5.71
      intra_pred_filter_w16_8bpc_neon:   9.85   3.98   6.42
      intra_pred_filter_w32_8bpc_neon:  10.77   4.08   7.09
      d322d451
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of palette prediction · 4f14573c
      Martin Storsjö authored
      Relative speedups over the C code:
                          Cortex A53    A72    A73
      pal_pred_w4_8bpc_neon:    8.75   6.15   7.60
      pal_pred_w8_8bpc_neon:   19.93  11.79  10.98
      pal_pred_w16_8bpc_neon:  24.68  13.28  16.06
      pal_pred_w32_8bpc_neon:  23.56  11.81  16.74
      pal_pred_w64_8bpc_neon:  23.16  12.19  17.60
      4f14573c
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of smooth prediction · 4318600e
      Martin Storsjö authored
      Relative speedups over the C code:
                                     Cortex A53    A72    A73
      intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
      intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
      intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
      intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
      intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
      intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
      intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
      intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
      intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
      intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
      intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
      intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
      intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
      intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
      intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26
      4318600e
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of paeth prediction · 8ab69afb
      Martin Storsjö authored
      Relative speedups over the C code:
                                  Cortex A53    A72    A73
      intra_pred_paeth_w4_8bpc_neon:    8.36   6.55   7.27
      intra_pred_paeth_w8_8bpc_neon:   15.24  11.36  11.34
      intra_pred_paeth_w16_8bpc_neon:  16.63  13.20  14.17
      intra_pred_paeth_w32_8bpc_neon:  10.83   9.21   9.87
      intra_pred_paeth_w64_8bpc_neon:   8.37   7.07   7.45
      8ab69afb
    • Henrik Gramner's avatar
      x86: Add ipred_z2 AVX2 asm · ea9fc9d9
      Henrik Gramner authored
      ea9fc9d9
    • Henrik Gramner's avatar
      Simplify ipred_z C code · afe901a6
      Henrik Gramner authored
      afe901a6
    • Henrik Gramner's avatar
      checkasm: Improve ipred_z tests · dfadb6df
      Henrik Gramner authored
      dfadb6df
    • James Almer's avatar
      x86: fix generate_grain_uv checkasm crashes on Windows x64 · a7c024ce
      James Almer authored
      The uv argument is normally in a GPR, but in checkasm it is forcibly
      loaded from the stack.
      a7c024ce
  3. 09 Oct, 2019 2 commits
  4. 08 Oct, 2019 10 commits
    • Jean-Baptiste Kempf's avatar
      Move snap to package/ subfolder · 3e0f1508
      Jean-Baptiste Kempf authored
      3e0f1508
    • Martin Storsjö's avatar
      arm: mc: Port the ARM64 warp filter to arm32 · 61442bee
      Martin Storsjö authored
      Relative speedup over C code:
                        Cortex A7     A8     A9    A53    A72    A73
      warp_8x8_8bpc_neon:    2.79   5.45   4.18   3.96   4.16   4.51
      warp_8x8t_8bpc_neon:   2.79   5.33   4.18   3.98   4.22   4.25
      
      Comparison to original ARM64 assembly:
      
      ARM64:            Cortex A53     A72     A73
      warp_8x8_8bpc_neon:   1854.6  1072.5  1102.5
      warp_8x8t_8bpc_neon:  1839.6  1069.4  1089.5
      ARM32:
      warp_8x8_8bpc_neon:   2132.5  1160.3  1218.0
      warp_8x8t_8bpc_neon:  2113.7  1148.0  1209.1
      61442bee
    • Martin Storsjö's avatar
      arm64: mc: Use addp instead of addv+trn1 in warp · 5647a57e
      Martin Storsjö authored
      Before:           Cortex A53     A72     A73
      warp_8x8_8bpc_neon:   1952.8  1161.3  1151.1
      warp_8x8t_8bpc_neon:  1937.1  1147.5  1139.0
      After:
      warp_8x8_8bpc_neon:   1860.8  1068.6  1105.8
      warp_8x8t_8bpc_neon:  1846.9  1056.4  1099.8
      5647a57e
    • Martin Storsjö's avatar
      arm: cdef: Port the ARM64 CDEF NEON assembly to 32 bit arm · 3489a9c1
      Martin Storsjö authored
      The relative speedup ranges from 2.5 to 3.8x for find_dir and
      around 5 to 10x for filter.
      
      The find_dir function is restricted by barely having enough
      registers, leaving very few for temporaries, so fewer things can
      be done in parallel and many instructions end up depending on the
      result of the preceding instruction.
      
      The ported functions end up slightly slower than the corresponding
      ARM64 ones, but only marginally:
      
      ARM64:                   Cortex A53     A72     A73
      cdef_dir_8bpc_neon:           400.0   268.8   282.2
      cdef_filter_4x4_8bpc_neon:    596.3   359.9   379.7
      cdef_filter_4x8_8bpc_neon:   1091.0   670.4   698.5
      cdef_filter_8x8_8bpc_neon:   1998.7  1207.2  1218.4
      ARM32:
      cdef_dir_8bpc_neon:           528.5   329.1   337.4
      cdef_filter_4x4_8bpc_neon:    632.5   482.5   432.2
      cdef_filter_4x8_8bpc_neon:   1107.2   854.8   782.3
      cdef_filter_8x8_8bpc_neon:   1984.8  1381.0  1414.4
      
      Relative speedup over C code:
                              Cortex A7     A8     A9    A53    A72    A73
      cdef_dir_8bpc_neon:          2.92   2.54   2.67   3.87   3.37   3.83
      cdef_filter_4x4_8bpc_neon:   5.09   7.61   6.10   6.85   4.94   7.41
      cdef_filter_4x8_8bpc_neon:   5.53   8.23   6.77   7.67   5.60   8.01
      cdef_filter_8x8_8bpc_neon:   6.26  10.14   8.49   8.54   6.94   4.27
      3489a9c1
    • Martin Storsjö's avatar
    • Luc Trudeau's avatar
      7bbc5e3d
    • Martin Storsjö's avatar
      arm64: cdef: Improve find_dir · dfaa2a10
      Martin Storsjö authored
      Only add .4h elements to the upper half of sum_alt, as only 11
      elements are needed, and .8h + .4h gives 12 in total.
      
      Fuse two consecutive ext #8 + ext #2 into ext #10.
      
      Move a few stores further away from where they are calculated.
      
      Before:         Cortex A53     A72     A73
      cdef_dir_8bpc_neon:  404.0   278.2   302.4
      After:
      cdef_dir_8bpc_neon:  400.0   269.3   282.5
      dfaa2a10
    • Martin Storsjö's avatar
      arm64: cdef: Calculate two initial parameters in the same vector · fa6a0924
      Martin Storsjö authored
      As there are only two individual parameters, we can insert them into
      the same vector, reducing the number of actual calculation instructions,
      but adding a few more instructions to dup the results to the final
      vectors instead.
      fa6a0924
    • Martin Storsjö's avatar
    • Martin Storsjö's avatar
      arm64: cdef: Rewrite an expression slightly · bc26e300
      Martin Storsjö authored
      Instead of apply_sign(imin(abs(diff), clip), diff), do
      imax(imin(diff, clip), -clip).
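      
      A minimal scalar sketch of why the two forms agree, assuming clip is
      non-negative. The helper names mirror the ones used in the message
      above, but their bodies are illustrative assumptions, and
      constrain_old/constrain_new are hypothetical names; this is not the
      NEON code itself:

      ```c
      #include <assert.h>
      #include <stdlib.h>

      static int imin(int a, int b) { return a < b ? a : b; }
      static int imax(int a, int b) { return a > b ? a : b; }

      /* Assumed semantics: give v the sign of s. */
      static int apply_sign(int v, int s) { return s < 0 ? -v : v; }

      /* Old form: clamp the magnitude, then restore the sign. */
      static int constrain_old(int diff, int clip) {
          assert(clip >= 0);
          return apply_sign(imin(abs(diff), clip), diff);
      }

      /* New form: clamp diff directly to the range [-clip, clip]. */
      static int constrain_new(int diff, int clip) {
          assert(clip >= 0);
          return imax(imin(diff, clip), -clip);
      }
      ```

      With diff = -5 and clip = 3, both forms return -3. The second form
      needs only a min and a max against clip and its negation, with no
      absolute value or sign restoration step.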
      
      Before:                  Cortex A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    592.7   374.5   384.5
      cdef_filter_4x8_8bpc_neon:   1093.0   704.4   706.6
      cdef_filter_8x8_8bpc_neon:   1962.6  1239.4  1252.1
      After:
      cdef_filter_4x4_8bpc_neon:    593.7   355.5   373.2
      cdef_filter_4x8_8bpc_neon:   1091.6   663.2   685.3
      cdef_filter_8x8_8bpc_neon:   1964.2  1182.5  1210.8
      bc26e300
  5. 07 Oct, 2019 3 commits
  6. 03 Oct, 2019 1 commit
  7. 02 Oct, 2019 4 commits
  8. 01 Oct, 2019 3 commits
    • Henrik Gramner's avatar
      Simplify README build instructions · 16e0741a
      Henrik Gramner authored
      16e0741a
    • Ronald S. Bultje's avatar
      Minor cleanup · f6a8cc0c
      Ronald S. Bultje authored
      f6a8cc0c
    • Martin Storsjö's avatar
      arm64: ipred: NEON implementation of dc/h/v prediction modes · f7743da1
      Martin Storsjö authored
      Relative speedups over the C code:
                                    Cortex A53    A72    A73
      intra_pred_dc_128_w4_8bpc_neon:     2.08   1.47   2.17
      intra_pred_dc_128_w8_8bpc_neon:     3.33   2.49   4.03
      intra_pred_dc_128_w16_8bpc_neon:    3.93   3.86   3.75
      intra_pred_dc_128_w32_8bpc_neon:    3.14   3.79   2.90
      intra_pred_dc_128_w64_8bpc_neon:    3.68   1.97   2.42
      intra_pred_dc_left_w4_8bpc_neon:    2.41   1.70   2.23
      intra_pred_dc_left_w8_8bpc_neon:    3.53   2.41   3.32
      intra_pred_dc_left_w16_8bpc_neon:   3.87   3.54   3.34
      intra_pred_dc_left_w32_8bpc_neon:   4.10   3.60   2.76
      intra_pred_dc_left_w64_8bpc_neon:   3.72   2.00   2.39
      intra_pred_dc_top_w4_8bpc_neon:     2.27   1.66   2.07
      intra_pred_dc_top_w8_8bpc_neon:     3.83   2.69   3.43
      intra_pred_dc_top_w16_8bpc_neon:    3.66   3.60   3.20
      intra_pred_dc_top_w32_8bpc_neon:    3.92   3.54   2.66
      intra_pred_dc_top_w64_8bpc_neon:    3.60   1.98   2.30
      intra_pred_dc_w4_8bpc_neon:         2.29   1.42   2.16
      intra_pred_dc_w8_8bpc_neon:         3.56   2.83   3.05
      intra_pred_dc_w16_8bpc_neon:        3.46   3.37   3.15
      intra_pred_dc_w32_8bpc_neon:        3.79   3.41   2.74
      intra_pred_dc_w64_8bpc_neon:        3.52   2.01   2.41
      intra_pred_h_w4_8bpc_neon:         10.34   5.74   5.94
      intra_pred_h_w8_8bpc_neon:         12.13   6.33   6.43
      intra_pred_h_w16_8bpc_neon:        10.66   7.31   5.85
      intra_pred_h_w32_8bpc_neon:         6.28   4.18   2.88
      intra_pred_h_w64_8bpc_neon:         3.96   1.85   1.75
      intra_pred_v_w4_8bpc_neon:         11.44   6.12   7.57
      intra_pred_v_w8_8bpc_neon:         14.76   7.58   7.95
      intra_pred_v_w16_8bpc_neon:        11.34   6.28   5.88
      intra_pred_v_w32_8bpc_neon:         6.56   3.33   3.34
      intra_pred_v_w64_8bpc_neon:         4.57   1.24   1.97
      f7743da1
  9. 30 Sep, 2019 1 commit
    • Victorien Le Couviour--Tuffet's avatar
      x86: add warp_affine SSE4 and SSSE3 asm · a91a03b0
      Victorien Le Couviour--Tuffet authored
      ```
      ---------------------
      x86_64: warp_8x8_8bpc_c: 1773.4
      x86_32: warp_8x8_8bpc_c: 1740.4
      ----------
      x86_64: warp_8x8_8bpc_ssse3: 317.5
      x86_32: warp_8x8_8bpc_ssse3: 378.4
      ----------
      x86_64: warp_8x8_8bpc_sse4: 303.7
      x86_32: warp_8x8_8bpc_sse4: 367.7
      ----------
      x86_64: warp_8x8_8bpc_avx2: 224.9
      ---------------------
      ---------------------
      x86_64: warp_8x8t_8bpc_c: 1664.6
      x86_32: warp_8x8t_8bpc_c: 1674.0
      ----------
      x86_64: warp_8x8t_8bpc_ssse3: 320.7
      x86_32: warp_8x8t_8bpc_ssse3: 379.5
      ----------
      x86_64: warp_8x8t_8bpc_sse4: 304.8
      x86_32: warp_8x8t_8bpc_sse4: 369.8
      ----------
      x86_64: warp_8x8t_8bpc_avx2: 228.5
      ---------------------
      ```
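      
      The timings above can be turned into the relative speedups that the
      other commit messages report. This is a small sketch, assuming the
      figures are checkasm timings where lower is faster;
      relative_speedup is a hypothetical helper, not part of the project:

      ```c
      #include <assert.h>

      /* Ratio of the C reference timing to the optimized timing.
       * Assumes both timings are positive and that lower means faster. */
      static double relative_speedup(double c_time, double asm_time) {
          assert(c_time > 0.0 && asm_time > 0.0);
          return c_time / asm_time;
      }
      ```

      For warp_8x8 on x86_64, relative_speedup(1773.4, 317.5) is roughly
      5.59 for SSSE3, and relative_speedup(1773.4, 224.9) is roughly 7.89
      for AVX2.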
      a91a03b0
  10. 29 Sep, 2019 1 commit
    • Martin Storsjö's avatar
      arm64: itx: Fix overflows in idct · 713aa34c
      Martin Storsjö authored
      Don't add two 16-bit coefficients in 16-bit precision if the result
      isn't supposed to be clipped.
      
      This fixes mismatches for some samples, see issue #299.
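      
      A scalar sketch of the failure mode, not the NEON code itself: it
      models what happens when two 16-bit values are summed in 16 bits
      versus in a wider type. The function names are hypothetical and
      chosen only for this illustration:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Full-precision sum: widen to 32 bits before adding. */
      static int32_t add_wide(int16_t a, int16_t b) {
          return (int32_t)a + (int32_t)b;
      }

      /* Models a non-saturating 16-bit add: the sum wraps modulo 2^16. */
      static int16_t add_wrap16(int16_t a, int16_t b) {
          /* Bias into [0, 65535], keep the low 16 bits, then undo the bias. */
          uint32_t biased = (uint32_t)(add_wide(a, b) + 32768) & 0xFFFFu;
          int32_t result = (int32_t)biased - 32768;
          assert(result >= INT16_MIN && result <= INT16_MAX);
          return (int16_t)result;
      }

      /* Models a saturating 16-bit add: the sum is clipped to the int16_t range. */
      static int16_t add_sat16(int16_t a, int16_t b) {
          int32_t sum = add_wide(a, b);
          if (sum > INT16_MAX) return INT16_MAX;
          if (sum < INT16_MIN) return INT16_MIN;
          return (int16_t)sum;
      }
      ```

      With a = 30000 and b = 10000, add_wide gives 40000, add_wrap16 gives
      -25536 and add_sat16 gives 32767. When the intermediate is not meant
      to be clipped, only the widened sum preserves the value, at the cost
      of extra work, which is consistent with the slightly higher timings
      below.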
      
      Before:                                Cortex A53       A72       A73
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:        93.0      52.8      49.5
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       260.0     186.0     196.4
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1371.0     953.4    1028.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7363.2    4887.5    5135.8
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25029.0   17492.3   18404.5
      After:
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:       105.0      58.7      55.2
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       294.0     211.5     209.9
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1495.8    1050.4    1070.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7866.7    5197.8    5321.4
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25807.2   18619.3   18526.9
      713aa34c