  1. Oct 11, 2019
  2. Oct 10, 2019
    • Luc Trudeau's avatar
      b7d7c8ce
    • arm64: ipred: NEON implementation of the cfl_ac functions · 57dd0aae
      Martin Storsjö authored and Janne Grunau committed
      Relative speedup over the C code:
                            Cortex A53    A72    A73
      cfl_ac_420_w4_8bpc_neon:    7.73   6.48   9.22
      cfl_ac_420_w8_8bpc_neon:    6.70   5.56   6.95
      cfl_ac_420_w16_8bpc_neon:   6.51   6.93   6.67
      cfl_ac_422_w4_8bpc_neon:    9.25   7.70   9.75
      cfl_ac_422_w8_8bpc_neon:    8.53   5.95   7.13
      cfl_ac_422_w16_8bpc_neon:   7.08   6.87   6.06
      57dd0aae
    • arm64: ipred: NEON implementation of the cfl_pred functions · c7693386
      Martin Storsjö authored and Janne Grunau committed
      Relative speedup over the C code:
                                   Cortex A53    A72    A73
      cfl_pred_cfl_128_w4_8bpc_neon:    10.81   7.90   9.80
      cfl_pred_cfl_128_w8_8bpc_neon:    18.38  11.15  13.24
      cfl_pred_cfl_128_w16_8bpc_neon:   16.52  10.83  16.00
      cfl_pred_cfl_128_w32_8bpc_neon:    3.27   3.60   3.70
      cfl_pred_cfl_left_w4_8bpc_neon:    9.82   7.38   8.76
      cfl_pred_cfl_left_w8_8bpc_neon:   17.22  10.63  11.97
      cfl_pred_cfl_left_w16_8bpc_neon:  16.03  10.49  15.66
      cfl_pred_cfl_left_w32_8bpc_neon:   3.28   3.61   3.72
      cfl_pred_cfl_top_w4_8bpc_neon:     9.74   7.39   9.29
      cfl_pred_cfl_top_w8_8bpc_neon:    17.48  10.89  12.58
      cfl_pred_cfl_top_w16_8bpc_neon:   16.01  10.62  15.31
      cfl_pred_cfl_top_w32_8bpc_neon:    3.25   3.62   3.75
      cfl_pred_cfl_w4_8bpc_neon:         8.39   6.34   8.04
      cfl_pred_cfl_w8_8bpc_neon:        15.99  10.12  12.42
      cfl_pred_cfl_w16_8bpc_neon:       15.25  10.40  15.12
      cfl_pred_cfl_w32_8bpc_neon:        3.23   3.58   3.71
      
      The C code gets autovectorized for w >= 32, which is why the
      relative speedup looks strange (but the performance of the NEON
      functions is completely as expected).
      c7693386
    • arm64: ipred: NEON implementation of the filter function · d322d451
      Martin Storsjö authored and Janne Grunau committed
      Use a different layout of the filter_intra_taps depending on
      architecture; the current one is optimized for the x86 SIMD
      implementation.
      
      Relative speedups over the C code:
                                   Cortex A53    A72    A73
      intra_pred_filter_w4_8bpc_neon:    6.38   2.81   4.43
      intra_pred_filter_w8_8bpc_neon:    9.30   3.62   5.71
      intra_pred_filter_w16_8bpc_neon:   9.85   3.98   6.42
      intra_pred_filter_w32_8bpc_neon:  10.77   4.08   7.09
      d322d451
    • arm64: ipred: NEON implementation of palette prediction · 4f14573c
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                          Cortex A53    A72    A73
      pal_pred_w4_8bpc_neon:    8.75   6.15   7.60
      pal_pred_w8_8bpc_neon:   19.93  11.79  10.98
      pal_pred_w16_8bpc_neon:  24.68  13.28  16.06
      pal_pred_w32_8bpc_neon:  23.56  11.81  16.74
      pal_pred_w64_8bpc_neon:  23.16  12.19  17.60
      4f14573c
    • arm64: ipred: NEON implementation of smooth prediction · 4318600e
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                                     Cortex A53    A72    A73
      intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
      intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
      intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
      intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
      intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
      intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
      intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
      intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
      intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
      intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
      intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
      intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
      intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
      intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
      intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26
      4318600e
    • arm64: ipred: NEON implementation of paeth prediction · 8ab69afb
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                                  Cortex A53    A72    A73
      intra_pred_paeth_w4_8bpc_neon:    8.36   6.55   7.27
      intra_pred_paeth_w8_8bpc_neon:   15.24  11.36  11.34
      intra_pred_paeth_w16_8bpc_neon:  16.63  13.20  14.17
      intra_pred_paeth_w32_8bpc_neon:  10.83   9.21   9.87
      intra_pred_paeth_w64_8bpc_neon:   8.37   7.07   7.45
      8ab69afb
    • x86: Add ipred_z2 AVX2 asm · ea9fc9d9
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      ea9fc9d9
    • Simplify ipred_z C code · afe901a6
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      afe901a6
    • checkasm: Improve ipred_z tests · dfadb6df
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      dfadb6df
    • x86: fix generate_grain_uv checkasm crashes on Windows x64 · a7c024ce
      James Almer authored
      The uv argument is normally in a gpr, but in checkasm it's forcefully
      loaded from stack.
      a7c024ce
  3. Oct 09, 2019
  4. Oct 08, 2019
  5. Oct 07, 2019
  6. Oct 03, 2019
  7. Oct 02, 2019
  8. Oct 01, 2019
    • Simplify README build instructions · 16e0741a
      Henrik Gramner authored and Henrik Gramner committed
      16e0741a
    • Minor cleanup · f6a8cc0c
      Ronald S. Bultje authored
      f6a8cc0c
    • arm64: ipred: NEON implementation of dc/h/v prediction modes · f7743da1
      Martin Storsjö authored
      Relative speedups over the C code:
                                    Cortex A53    A72    A73
      intra_pred_dc_128_w4_8bpc_neon:     2.08   1.47   2.17
      intra_pred_dc_128_w8_8bpc_neon:     3.33   2.49   4.03
      intra_pred_dc_128_w16_8bpc_neon:    3.93   3.86   3.75
      intra_pred_dc_128_w32_8bpc_neon:    3.14   3.79   2.90
      intra_pred_dc_128_w64_8bpc_neon:    3.68   1.97   2.42
      intra_pred_dc_left_w4_8bpc_neon:    2.41   1.70   2.23
      intra_pred_dc_left_w8_8bpc_neon:    3.53   2.41   3.32
      intra_pred_dc_left_w16_8bpc_neon:   3.87   3.54   3.34
      intra_pred_dc_left_w32_8bpc_neon:   4.10   3.60   2.76
      intra_pred_dc_left_w64_8bpc_neon:   3.72   2.00   2.39
      intra_pred_dc_top_w4_8bpc_neon:     2.27   1.66   2.07
      intra_pred_dc_top_w8_8bpc_neon:     3.83   2.69   3.43
      intra_pred_dc_top_w16_8bpc_neon:    3.66   3.60   3.20
      intra_pred_dc_top_w32_8bpc_neon:    3.92   3.54   2.66
      intra_pred_dc_top_w64_8bpc_neon:    3.60   1.98   2.30
      intra_pred_dc_w4_8bpc_neon:         2.29   1.42   2.16
      intra_pred_dc_w8_8bpc_neon:         3.56   2.83   3.05
      intra_pred_dc_w16_8bpc_neon:        3.46   3.37   3.15
      intra_pred_dc_w32_8bpc_neon:        3.79   3.41   2.74
      intra_pred_dc_w64_8bpc_neon:        3.52   2.01   2.41
      intra_pred_h_w4_8bpc_neon:         10.34   5.74   5.94
      intra_pred_h_w8_8bpc_neon:         12.13   6.33   6.43
      intra_pred_h_w16_8bpc_neon:        10.66   7.31   5.85
      intra_pred_h_w32_8bpc_neon:         6.28   4.18   2.88
      intra_pred_h_w64_8bpc_neon:         3.96   1.85   1.75
      intra_pred_v_w4_8bpc_neon:         11.44   6.12   7.57
      intra_pred_v_w8_8bpc_neon:         14.76   7.58   7.95
      intra_pred_v_w16_8bpc_neon:        11.34   6.28   5.88
      intra_pred_v_w32_8bpc_neon:         6.56   3.33   3.34
      intra_pred_v_w64_8bpc_neon:         4.57   1.24   1.97
      f7743da1
  9. Sep 30, 2019
    • x86: add warp_affine SSE4 and SSSE3 asm · a91a03b0
      Victorien Le Couviour--Tuffet authored
      ------------------------------------------
      x86_64: warp_8x8_8bpc_c: 1773.4
      x86_32: warp_8x8_8bpc_c: 1740.4
      ----------
      x86_64: warp_8x8_8bpc_ssse3: 317.5
      x86_32: warp_8x8_8bpc_ssse3: 378.4
      ----------
      x86_64: warp_8x8_8bpc_sse4: 303.7
      x86_32: warp_8x8_8bpc_sse4: 367.7
      ----------
      x86_64: warp_8x8_8bpc_avx2: 224.9
      ---------------------
      ---------------------
      x86_64: warp_8x8t_8bpc_c: 1664.6
      x86_32: warp_8x8t_8bpc_c: 1674.0
      ----------
      x86_64: warp_8x8t_8bpc_ssse3: 320.7
      x86_32: warp_8x8t_8bpc_ssse3: 379.5
      ----------
      x86_64: warp_8x8t_8bpc_sse4: 304.8
      x86_32: warp_8x8t_8bpc_sse4: 369.8
      ----------
      x86_64: warp_8x8t_8bpc_avx2: 228.5
      ------------------------------------------
      a91a03b0
  10. Sep 29, 2019
    • arm64: itx: Fix overflows in idct · 713aa34c
      Martin Storsjö authored
      Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed
      to be clipped.
      
      This fixes mismatches for some samples, see issue #299.
      
      Before:                                Cortex A53       A72       A73
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:        93.0      52.8      49.5
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       260.0     186.0     196.4
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1371.0     953.4    1028.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7363.2    4887.5    5135.8
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25029.0   17492.3   18404.5
      After:
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:       105.0      58.7      55.2
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       294.0     211.5     209.9
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1495.8    1050.4    1070.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7866.7    5197.8    5321.4
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25807.2   18619.3   18526.9
      713aa34c
    • arm64: itx: Consistently use the factor 2896 in adst · 0ed3ad19
      Martin Storsjö authored
      The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
      0ed3ad19
    • arm64: itx: Use smull+smlal instead of addl+mul · a4950bce
      Martin Storsjö authored
      Even though smull+smlal does two multiplications instead of one,
      the combination seems to be better handled by actual cores.
      
      Before:                                 Cortex A53      A72      A73
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      356.0    279.2    278.0
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1785.0   1329.5   1308.8
      After:
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      360.0    253.2    269.3
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1793.1   1300.9   1254.0
      
      (In these particular cases, there seems to be a minor regression
      on A53. That is probably due to having to reorder some
      instructions, since smull+smlal+smull2+smlal2 overwrites the
      second output register sooner than addl+addl2 would have. In
      general, smull+smlal seems to be equally good or better than
      addl+mul on A53 as well.)
      a4950bce