  1. 09 Mar, 2019 1 commit
    • arm64: looprestoration: Use individual ldrb for loading from the table · e7db58c9
      Martin Storsjö authored
      Before:                 Cortex A53     A72      A73
      selfguided_3x3_8bpc_neon:   3260.6  2175.4   2284.6
      selfguided_5x5_8bpc_neon:   2553.2  1694.4   1809.2
      selfguided_mix_8bpc_neon:   5720.0  3776.8   4000.5
      After:
      selfguided_3x3_8bpc_neon:   3514.1  2388.5   2335.9
      selfguided_5x5_8bpc_neon:   2692.2  1789.5   1835.1
      selfguided_mix_8bpc_neon:   6091.1  4089.1   4083.2
  2. 07 Mar, 2019 1 commit
    • arm64: looprestoration: Add a NEON implementation of SGR · 313717da
      Martin Storsjö authored
      Relative speedup vs (autovectorized) C code:
                            Cortex A53    A72    A73
      selfguided_3x3_8bpc_neon:   2.91   2.12   2.68
      selfguided_5x5_8bpc_neon:   3.18   2.65   3.39
      selfguided_mix_8bpc_neon:   3.04   2.29   2.98
      
      The relative speedup vs non-vectorized C code is around 2.6-4.6x.
  3. 06 Mar, 2019 1 commit
  4. 04 Feb, 2019 1 commit
  5. 31 Jan, 2019 4 commits
  6. 25 Nov, 2018 1 commit
    • arm64: looprestoration: NEON optimized wiener filter · 513dfa99
      Martin Storsjö authored
      The relative speedup compared to C code is around 4.2x for a Cortex A53
      and 5.1x for a Snapdragon 835 (compared to GCC's autovectorized code),
      6-7x compared to GCC's output without autovectorization, and ~8x
      compared to clang's output (which doesn't seem to try to vectorize
      this function).
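The Before/After figures in the e7db58c9 entry are easier to compare as per-core ratios. The following is a minimal sketch that copies the numbers from that commit message and divides After by Before. The commit message does not state the unit of the figures, so the sketch only reports the ratio and makes no claim about which direction is faster. The names `CORES`, `BEFORE`, `AFTER`, and `after_to_before_ratios` are illustrative and not part of the project.

```python
# Figures copied from the e7db58c9 commit message.
# Column order in every tuple: Cortex A53, A72, A73.
# The unit is not stated in the commit message; only ratios are computed here.
CORES = ("Cortex A53", "A72", "A73")

BEFORE = {
    "selfguided_3x3_8bpc_neon": (3260.6, 2175.4, 2284.6),
    "selfguided_5x5_8bpc_neon": (2553.2, 1694.4, 1809.2),
    "selfguided_mix_8bpc_neon": (5720.0, 3776.8, 4000.5),
}

AFTER = {
    "selfguided_3x3_8bpc_neon": (3514.1, 2388.5, 2335.9),
    "selfguided_5x5_8bpc_neon": (2692.2, 1789.5, 1835.1),
    "selfguided_mix_8bpc_neon": (6091.1, 4089.1, 4083.2),
}


def after_to_before_ratios(before, after):
    """Return {function name: tuple of After/Before ratios, one per core}."""
    return {
        name: tuple(a / b for a, b in zip(after[name], before[name]))
        for name in before
    }


if __name__ == "__main__":
    for name, row in after_to_before_ratios(BEFORE, AFTER).items():
        cells = "  ".join(f"{core}: {r:.3f}" for core, r in zip(CORES, row))
        print(f"{name:<26}{cells}")
```

Running the script prints one line per function with the three ratios, which can then be read against whatever unit the benchmark tool reports.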