1. 12 Nov, 2019 4 commits
    • arm64: loopfilter: Fix a typo in a macro parameter condition · 564482b6
      Martin Storsjö authored
      This removes one redundant instruction for loop filters smaller
      than 16.
      564482b6
    • arm64: loopfilter: Reorder instructions and tweak register use to match the arm32 port · 3069ab94
      Martin Storsjö authored
      This doesn't change performance measurably, but eases potential
      future maintenance of the code.
      3069ab94
    • abd07c67
    • arm32: Port the arm64 NEON loopfilter to arm32 · 9a100261
      Martin Storsjö authored
      The code is a fairly exact 1:1 port of the ARM64 code, but operating
      on 8 pixels at a time, instead of 16.
      
      Relative speedup over C code according to checkasm:
                             Cortex A7     A8     A9    A53    A72    A73
      lpf_h_sb_uv_w4_8bpc_neon:   1.36   1.40   1.25   1.71   1.55   1.59
      lpf_h_sb_uv_w6_8bpc_neon:   2.18   2.11   1.74   2.65   2.32   2.34
      lpf_h_sb_y_w4_8bpc_neon:    1.48   1.43   1.20   1.91   1.49   1.64
      lpf_h_sb_y_w8_8bpc_neon:    2.34   2.05   1.78   2.84   2.35   2.69
      lpf_h_sb_y_w16_8bpc_neon:   2.13   1.83   1.63   2.51   2.10   2.35
      lpf_v_sb_uv_w4_8bpc_neon:   1.69   1.66   1.60   2.16   2.24   2.24
      lpf_v_sb_uv_w6_8bpc_neon:   2.68   2.43   2.22   3.53   3.44   3.35
      lpf_v_sb_y_w4_8bpc_neon:    1.74   1.74   1.43   2.34   2.14   2.18
      lpf_v_sb_y_w8_8bpc_neon:    2.92   2.47   2.19   3.55   3.22   3.54
      lpf_v_sb_y_w16_8bpc_neon:   2.62   2.19   1.98   3.25   2.80   3.10
      
      Comparison to the original ARM64 assembly:
      ARM64:                        A53     A72     A73
      lpf_h_sb_uv_w4_8bpc_neon:   702.5   518.2   529.1
      lpf_h_sb_uv_w6_8bpc_neon:  1007.3   672.6   736.6
      lpf_h_sb_y_w4_8bpc_neon:   1652.8  1261.2  1276.5
      lpf_h_sb_y_w8_8bpc_neon:   2144.7  1559.8  1638.7
      lpf_h_sb_y_w16_8bpc_neon:  2318.3  1757.2  1792.8
      lpf_v_sb_uv_w4_8bpc_neon:   447.1   302.0   292.4
      lpf_v_sb_uv_w6_8bpc_neon:   600.0   397.7   406.9
      lpf_v_sb_y_w4_8bpc_neon:   1212.6   840.1   818.4
      lpf_v_sb_y_w8_8bpc_neon:   1623.3  1167.4  1156.7
      lpf_v_sb_y_w16_8bpc_neon:  1694.9  1237.9  1182.3
      ARM32:
      lpf_h_sb_uv_w4_8bpc_neon:   821.2   501.1   500.8
      lpf_h_sb_uv_w6_8bpc_neon:  1232.0   715.7   746.6
      lpf_h_sb_y_w4_8bpc_neon:   2208.1  1373.2  1414.7
      lpf_h_sb_y_w8_8bpc_neon:   3138.3  1843.1  1915.2
      lpf_h_sb_y_w16_8bpc_neon:  3293.1  1842.5  1975.9
      lpf_v_sb_uv_w4_8bpc_neon:   619.9   326.7   324.9
      lpf_v_sb_uv_w6_8bpc_neon:   855.9   446.7   468.2
      lpf_v_sb_y_w4_8bpc_neon:   1737.6   935.5  1007.0
      lpf_v_sb_y_w8_8bpc_neon:   2346.7  1232.8  1298.3
      lpf_v_sb_y_w16_8bpc_neon:  2353.4  1283.4  1379.9
      9a100261
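The two tables in the commit above can be compared row by row. The following is a minimal sketch that computes the ARM32/ARM64 ratio for three of the listed functions, using only the Cortex-A53 column; lower checkasm figures are faster, so a ratio above 1 means the arm32 port takes longer per call. The dictionary names and the `slowdown` helper are illustrative and not part of dav1d or checkasm.

```python
# checkasm figures copied from the commit message above (Cortex-A53 column).
arm64_a53 = {
    "lpf_h_sb_y_w16_8bpc_neon": 2318.3,
    "lpf_v_sb_y_w16_8bpc_neon": 1694.9,
    "lpf_v_sb_uv_w4_8bpc_neon": 447.1,
}
arm32_a53 = {
    "lpf_h_sb_y_w16_8bpc_neon": 3293.1,
    "lpf_v_sb_y_w16_8bpc_neon": 2353.4,
    "lpf_v_sb_uv_w4_8bpc_neon": 619.9,
}


def slowdown(arm32, arm64):
    """Return the arm32/arm64 ratio per function (hypothetical helper)."""
    return {name: arm32[name] / arm64[name] for name in arm64}


ratios = slowdown(arm32_a53, arm64_a53)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these three rows the arm32 figures come out roughly 1.4 times the arm64 ones on the A53, which fits the commit's note that the port handles 8 pixels at a time instead of 16.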
  2. 10 Nov, 2019 1 commit
  3. 01 Nov, 2019 1 commit
  4. 28 Oct, 2019 1 commit
  5. 25 Oct, 2019 2 commits
  6. 24 Oct, 2019 12 commits
  7. 22 Oct, 2019 2 commits
  8. 21 Oct, 2019 1 commit
    • x86inc: fix LOAD_MM_PERMUTATION for AVX512 · 47790541
      Victorien Le Couviour--Tuffet authored and Henrik Gramner committed
      Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
      is redundant. It causes the register mapping to be the same as without
      the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.
      
      For example...
      
      INIT_YMM avx512
      SWAP m0, m16
      SAVE_MM_PERMUTATION
      ; do whatever
      LOAD_MM_PERMUTATION
      
      ... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
      instead of ymm17.
      47790541
  9. 18 Oct, 2019 1 commit
    • x86: adapt SSSE3 cdef_filter_{4x4,4x8,8x8} to SSE2 · 3e9f9676
      Victorien Le Couviour--Tuffet authored
      ---------------------
      x86_64:
      ------------------------------------------
      cdef_filter_4x4_8bpc_c: 1376.0
      cdef_filter_4x4_8bpc_sse2: 177.6
      cdef_filter_4x4_8bpc_ssse3: 132.5
      ---------------------
      cdef_filter_4x8_8bpc_c: 2725.0
      cdef_filter_4x8_8bpc_sse2: 327.6
      cdef_filter_4x8_8bpc_ssse3: 234.9
      ---------------------
      cdef_filter_8x8_8bpc_c: 5938.8
      cdef_filter_8x8_8bpc_sse2: 556.8
      cdef_filter_8x8_8bpc_ssse3: 388.1
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      cdef_filter_4x4_8bpc_c: 1569.5
      cdef_filter_4x4_8bpc_sse2: 201.9
      cdef_filter_4x4_8bpc_ssse3: 162.3
      ---------------------
      cdef_filter_4x8_8bpc_c: 3141.6
      cdef_filter_4x8_8bpc_sse2: 368.3
      cdef_filter_4x8_8bpc_ssse3: 283.4
      ---------------------
      cdef_filter_8x8_8bpc_c: 6534.5
      cdef_filter_8x8_8bpc_sse2: 666.7
      cdef_filter_8x8_8bpc_ssse3: 503.5
      ------------------------------------------
      3e9f9676
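The commit above lists raw checkasm figures rather than speedups. As a rough sketch, the relative speedup of each SIMD version over the C code can be derived by dividing the C figure by the SIMD figure; the x86_64 numbers are copied from the commit message, while the `timings` layout and the `speedup_over_c` helper are illustrative assumptions.

```python
# x86_64 checkasm figures copied from the commit message above.
timings = {
    "4x4": {"c": 1376.0, "sse2": 177.6, "ssse3": 132.5},
    "4x8": {"c": 2725.0, "sse2": 327.6, "ssse3": 234.9},
    "8x8": {"c": 5938.8, "sse2": 556.8, "ssse3": 388.1},
}


def speedup_over_c(row):
    """Return C time divided by each SIMD time (hypothetical helper)."""
    return {isa: row["c"] / t for isa, t in row.items() if isa != "c"}


speedups = {size: speedup_over_c(row) for size, row in timings.items()}
for size, row in speedups.items():
    print(size, {isa: round(s, 2) for isa, s in row.items()})
```

By this measure the SSE2 versions are several times faster than C on every block size, while still trailing the SSSE3 versions they were adapted from.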
  10. 16 Oct, 2019 1 commit
  11. 11 Oct, 2019 4 commits
  12. 10 Oct, 2019 10 commits
    • b7d7c8ce
    • arm64: ipred: NEON implementation of the cfl_ac functions · 57dd0aae
      Martin Storsjö authored and Janne Grunau committed
      Relative speedup over the C code:
                            Cortex A53    A72    A73
      cfl_ac_420_w4_8bpc_neon:    7.73   6.48   9.22
      cfl_ac_420_w8_8bpc_neon:    6.70   5.56   6.95
      cfl_ac_420_w16_8bpc_neon:   6.51   6.93   6.67
      cfl_ac_422_w4_8bpc_neon:    9.25   7.70   9.75
      cfl_ac_422_w8_8bpc_neon:    8.53   5.95   7.13
      cfl_ac_422_w16_8bpc_neon:   7.08   6.87   6.06
      57dd0aae
    • arm64: ipred: NEON implementation of the cfl_pred functions · c7693386
      Martin Storsjö authored and Janne Grunau committed
      Relative speedup over the C code:
                                   Cortex A53    A72    A73
      cfl_pred_cfl_128_w4_8bpc_neon:    10.81   7.90   9.80
      cfl_pred_cfl_128_w8_8bpc_neon:    18.38  11.15  13.24
      cfl_pred_cfl_128_w16_8bpc_neon:   16.52  10.83  16.00
      cfl_pred_cfl_128_w32_8bpc_neon:    3.27   3.60   3.70
      cfl_pred_cfl_left_w4_8bpc_neon:    9.82   7.38   8.76
      cfl_pred_cfl_left_w8_8bpc_neon:   17.22  10.63  11.97
      cfl_pred_cfl_left_w16_8bpc_neon:  16.03  10.49  15.66
      cfl_pred_cfl_left_w32_8bpc_neon:   3.28   3.61   3.72
      cfl_pred_cfl_top_w4_8bpc_neon:     9.74   7.39   9.29
      cfl_pred_cfl_top_w8_8bpc_neon:    17.48  10.89  12.58
      cfl_pred_cfl_top_w16_8bpc_neon:   16.01  10.62  15.31
      cfl_pred_cfl_top_w32_8bpc_neon:    3.25   3.62   3.75
      cfl_pred_cfl_w4_8bpc_neon:         8.39   6.34   8.04
      cfl_pred_cfl_w8_8bpc_neon:        15.99  10.12  12.42
      cfl_pred_cfl_w16_8bpc_neon:       15.25  10.40  15.12
      cfl_pred_cfl_w32_8bpc_neon:        3.23   3.58   3.71
      
      The C code gets autovectorized for w >= 32, which is why the
      relative speedup looks strange (but the performance of the NEON
      functions is completely as expected).
      c7693386
    • arm64: ipred: NEON implementation of the filter function · d322d451
      Martin Storsjö authored and Janne Grunau committed
      Use a different layout of the filter_intra_taps depending on
      architecture; the current one is optimized for the x86 SIMD
      implementation.
      
      Relative speedups over the C code:
                                   Cortex A53    A72    A73
      intra_pred_filter_w4_8bpc_neon:    6.38   2.81   4.43
      intra_pred_filter_w8_8bpc_neon:    9.30   3.62   5.71
      intra_pred_filter_w16_8bpc_neon:   9.85   3.98   6.42
      intra_pred_filter_w32_8bpc_neon:  10.77   4.08   7.09
      d322d451
    • arm64: ipred: NEON implementation of palette prediction · 4f14573c
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                          Cortex A53    A72    A73
      pal_pred_w4_8bpc_neon:    8.75   6.15   7.60
      pal_pred_w8_8bpc_neon:   19.93  11.79  10.98
      pal_pred_w16_8bpc_neon:  24.68  13.28  16.06
      pal_pred_w32_8bpc_neon:  23.56  11.81  16.74
      pal_pred_w64_8bpc_neon:  23.16  12.19  17.60
      4f14573c
    • arm64: ipred: NEON implementation of smooth prediction · 4318600e
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                                     Cortex A53    A72    A73
      intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
      intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
      intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
      intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
      intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
      intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
      intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
      intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
      intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
      intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
      intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
      intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
      intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
      intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
      intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26
      4318600e
    • arm64: ipred: NEON implementation of paeth prediction · 8ab69afb
      Martin Storsjö authored and Janne Grunau committed
      Relative speedups over the C code:
                                  Cortex A53    A72    A73
      intra_pred_paeth_w4_8bpc_neon:    8.36   6.55   7.27
      intra_pred_paeth_w8_8bpc_neon:   15.24  11.36  11.34
      intra_pred_paeth_w16_8bpc_neon:  16.63  13.20  14.17
      intra_pred_paeth_w32_8bpc_neon:  10.83   9.21   9.87
      intra_pred_paeth_w64_8bpc_neon:   8.37   7.07   7.45
      8ab69afb
    • x86: Add ipred_z2 AVX2 asm · ea9fc9d9
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      ea9fc9d9
    • Simplify ipred_z C code · afe901a6
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      afe901a6
    • checkasm: Improve ipred_z tests · dfadb6df
      Henrik Gramner authored and Jean-Baptiste Kempf committed
      dfadb6df