  1. Dec 02, 2019
  2. Nov 30, 2019
  3. Nov 27, 2019
    • Avoid excessive L2 collisions with certain frame widths · 82eda83a
      Henrik Gramner authored and Henrik Gramner committed
      Memory addresses with certain power-of-two offsets will map to the
      same set of cache lines. Using such offsets as strides will cause
      excessive cache evictions resulting in more cache misses.
      
      Avoid this by adding a small padding when the stride is a multiple
      of 1024 (somewhat arbitrarily chosen as the specific number depends
      on the hardware implementation) when allocating picture buffers.
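
The padding rule described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the commit: the function name `pad_stride` and both constants are hypothetical, and the 64-byte padding amount is an assumption (the message only says "a small padding").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants, for illustration only. */
#define STRIDE_CONFLICT_MULTIPLE 1024 /* the multiple named in the commit message */
#define STRIDE_PADDING 64             /* assumed padding, not taken from the commit */

/* Return a stride that is not a multiple of STRIDE_CONFLICT_MULTIPLE,
 * so that consecutive rows are less likely to map to the same cache sets. */
ptrdiff_t pad_stride(ptrdiff_t stride)
{
    assert(stride > 0);
    /* 1024 is a power of two, so this mask test is equivalent to
     * checking stride % 1024 == 0. */
    if (!(stride & (STRIDE_CONFLICT_MULTIPLE - 1)))
        stride += STRIDE_PADDING;
    return stride;
}
```

Under these assumptions a 2048-byte stride would be padded, while a 1920-byte stride would be left unchanged because it is not a multiple of 1024.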
  4. Nov 26, 2019
  5. Nov 24, 2019
  6. Nov 23, 2019
  7. Nov 22, 2019
  8. Nov 21, 2019
  9. Nov 17, 2019
  10. Nov 16, 2019
  11. Nov 15, 2019
  12. Nov 12, 2019
    • arm: 64: loopfilter: Avoid nested ifdefs where easily possible · dcbbf775
      Martin Storsjö authored
      This was requested in the review of the arm32 version of the same.
    • arm: 64: loopfilter: Fix a typo in a macro parameter condition · 564482b6
      Martin Storsjö authored
      This removes one redundant instruction for loop filters smaller
      than 16.
    • arm64: loopfilter: Reorder instructions and tweak register use to match the arm32 port · 3069ab94
      Martin Storsjö authored
      This doesn't change performance measurably, but eases potential
      future maintenance of the code.
    • Martin Storsjö's avatar
      abd07c67
    • arm: 32: Port the arm64 NEON loopfilter to arm32 · 9a100261
      Martin Storsjö authored
      The code is a fairly exact 1:1 port of the ARM64 code, but operating
      on 8 pixels at a time, instead of 16.
      
      Relative speedup over C code according to checkasm:
                             Cortex A7     A8     A9    A53    A72    A73
      lpf_h_sb_uv_w4_8bpc_neon:   1.36   1.40   1.25   1.71   1.55   1.59
      lpf_h_sb_uv_w6_8bpc_neon:   2.18   2.11   1.74   2.65   2.32   2.34
      lpf_h_sb_y_w4_8bpc_neon:    1.48   1.43   1.20   1.91   1.49   1.64
      lpf_h_sb_y_w8_8bpc_neon:    2.34   2.05   1.78   2.84   2.35   2.69
      lpf_h_sb_y_w16_8bpc_neon:   2.13   1.83   1.63   2.51   2.10   2.35
      lpf_v_sb_uv_w4_8bpc_neon:   1.69   1.66   1.60   2.16   2.24   2.24
      lpf_v_sb_uv_w6_8bpc_neon:   2.68   2.43   2.22   3.53   3.44   3.35
      lpf_v_sb_y_w4_8bpc_neon:    1.74   1.74   1.43   2.34   2.14   2.18
      lpf_v_sb_y_w8_8bpc_neon:    2.92   2.47   2.19   3.55   3.22   3.54
      lpf_v_sb_y_w16_8bpc_neon:   2.62   2.19   1.98   3.25   2.80   3.10
      
      Comparison to the original ARM64 assembly:
      ARM64:                        A53     A72     A73
      lpf_h_sb_uv_w4_8bpc_neon:   702.5   518.2   529.1
      lpf_h_sb_uv_w6_8bpc_neon:  1007.3   672.6   736.6
      lpf_h_sb_y_w4_8bpc_neon:   1652.8  1261.2  1276.5
      lpf_h_sb_y_w8_8bpc_neon:   2144.7  1559.8  1638.7
      lpf_h_sb_y_w16_8bpc_neon:  2318.3  1757.2  1792.8
      lpf_v_sb_uv_w4_8bpc_neon:   447.1   302.0   292.4
      lpf_v_sb_uv_w6_8bpc_neon:   600.0   397.7   406.9
      lpf_v_sb_y_w4_8bpc_neon:   1212.6   840.1   818.4
      lpf_v_sb_y_w8_8bpc_neon:   1623.3  1167.4  1156.7
      lpf_v_sb_y_w16_8bpc_neon:  1694.9  1237.9  1182.3
      ARM32:
      lpf_h_sb_uv_w4_8bpc_neon:   821.2   501.1   500.8
      lpf_h_sb_uv_w6_8bpc_neon:  1232.0   715.7   746.6
      lpf_h_sb_y_w4_8bpc_neon:   2208.1  1373.2  1414.7
      lpf_h_sb_y_w8_8bpc_neon:   3138.3  1843.1  1915.2
      lpf_h_sb_y_w16_8bpc_neon:  3293.1  1842.5  1975.9
      lpf_v_sb_uv_w4_8bpc_neon:   619.9   326.7   324.9
      lpf_v_sb_uv_w6_8bpc_neon:   855.9   446.7   468.2
      lpf_v_sb_y_w4_8bpc_neon:   1737.6   935.5  1007.0
      lpf_v_sb_y_w8_8bpc_neon:   2346.7  1232.8  1298.3
      lpf_v_sb_y_w16_8bpc_neon:  2353.4  1283.4  1379.9
  13. Nov 10, 2019
  14. Nov 01, 2019
  15. Oct 28, 2019
  16. Oct 25, 2019
  17. Oct 24, 2019
    • x86: Fix overflows in inverse identity SSSE3 transforms · 103cd220
      Henrik Gramner authored and Henrik Gramner committed
    • x86: Fix overflows in inverse identity AVX2 transforms · a20b5757
      Henrik Gramner authored and Henrik Gramner committed
    • x86: adapt SSSE3 wiener filter to SSE2 · 36d615d1
      Victorien Le Couviour--Tuffet authored
      Also slightly optimized the 32-bit SSSE3 version, mainly by removing
      an XMM store/load.
      
      ---------------------
      x86_64:
      ------------------------------------------
      wiener_chroma_8bpc_c: 193155.1
      wiener_chroma_8bpc_sse2: 48973.4
      wiener_chroma_8bpc_ssse3: 31486.3
      ---------------------
      wiener_luma_8bpc_c: 192787.5
      wiener_luma_8bpc_sse2: 48674.9
      wiener_luma_8bpc_ssse3: 30446.3
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      wiener_chroma_8bpc_c: 309861.0
      wiener_chroma_8bpc_sse2: 52345.9
      wiener_chroma_8bpc_ssse3: 32983.2
      ---------------------
      wiener_luma_8bpc_c: 317909.1
      wiener_luma_8bpc_sse2: 52522.1
      wiener_luma_8bpc_ssse3: 33323.1
      ------------------------------------------
    • x86: adapt SSSE3 warp_affine_8x8{,t} to SSE2 · 4866abab
      Victorien Le Couviour--Tuffet authored
      ---------------------
      x86_64:
      ------------------------------------------
      warp_8x8_8bpc_c: 1761.5
      warp_8x8_8bpc_sse2: 583.0
      warp_8x8_8bpc_ssse3: 329.3
      ---------------------
      warp_8x8t_8bpc_c: 1694.3
      warp_8x8t_8bpc_sse2: 577.6
      warp_8x8t_8bpc_ssse3: 334.1
      ------------------------------------------
      
      ---------------------
      x86_32:
      ------------------------------------------
      warp_8x8_8bpc_c: 1842.6
      warp_8x8_8bpc_sse2: 677.1
      warp_8x8_8bpc_ssse3: 394.9
      ---------------------
      warp_8x8t_8bpc_c: 1741.1
      warp_8x8t_8bpc_sse2: 648.5
      warp_8x8t_8bpc_ssse3: 372.6
      ------------------------------------------
    • arm: looprestoration: Fix register names in a comment · 0526e1ea
      Martin Storsjö authored and Jean-Baptiste Kempf committed