1. 08 Apr, 2019 1 commit
  2. 07 Apr, 2019 1 commit
    • Martin Storsjö's avatar
      arm: Fix typos in comments · 556780b7
      Martin Storsjö authored
      The width register has been set to clz(w)-24, not the other way
      around. And the 32 bit prep function has got the h parameter in
      r4, not in r5.
      556780b7
  3. 13 Feb, 2019 2 commits
  4. 24 Jan, 2019 6 commits
  5. 18 Nov, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: mc: Implement 8tap and bilin functions · 4aa0363a
      Martin Storsjö authored
      These functions have been tuned against Cortex A53 and Snapdragon
      835. The bilin functions have mainly been written with code size
      in mind, as they aren't used much in practice.
      
      Relative speedups for the actual filtering fuctions (that don't
      just do a plain copy) are around 4-15x, some over 20x. This is
      in comparison with GCC 5.4 with autovectorization disabled; the
      actual real-world speedup against autovectorized C code is around
      4-10x.
      
      Relative speedups measured with checkasm:
                                      Cortex A53   Snapdragon 835
      mc_8tap_regular_w2_0_8bpc_neon:       6.96   5.28
      mc_8tap_regular_w2_h_8bpc_neon:       5.16   4.35
      mc_8tap_regular_w2_hv_8bpc_neon:      5.37   4.98
      mc_8tap_regular_w2_v_8bpc_neon:       6.35   4.85
      mc_8tap_regular_w4_0_8bpc_neon:       6.78   5.73
      mc_8tap_regular_w4_h_8bpc_neon:       8.40   6.60
      mc_8tap_regular_w4_hv_8bpc_neon:      7.23   7.10
      mc_8tap_regular_w4_v_8bpc_neon:       9.06   7.76
      mc_8tap_regular_w8_0_8bpc_neon:       6.96   5.55
      mc_8tap_regular_w8_h_8bpc_neon:      10.36   6.88
      mc_8tap_regular_w8_hv_8bpc_neon:      9.49   6.86
      mc_8tap_regular_w8_v_8bpc_neon:      12.06   9.61
      mc_8tap_regular_w16_0_8bpc_neon:      6.68   4.51
      mc_8tap_regular_w16_h_8bpc_neon:     12.30   7.77
      mc_8tap_regular_w16_hv_8bpc_neon:     9.50   6.68
      mc_8tap_regular_w16_v_8bpc_neon:     12.93   9.68
      mc_8tap_regular_w32_0_8bpc_neon:      3.91   2.93
      mc_8tap_regular_w32_h_8bpc_neon:     13.06   7.89
      mc_8tap_regular_w32_hv_8bpc_neon:     9.37   6.70
      mc_8tap_regular_w32_v_8bpc_neon:     12.88   9.49
      mc_8tap_regular_w64_0_8bpc_neon:      2.89   1.68
      mc_8tap_regular_w64_h_8bpc_neon:     13.48   8.00
      mc_8tap_regular_w64_hv_8bpc_neon:     9.23   6.53
      mc_8tap_regular_w64_v_8bpc_neon:     13.11   9.68
      mc_8tap_regular_w128_0_8bpc_neon:     1.89   1.24
      mc_8tap_regular_w128_h_8bpc_neon:    13.58   7.98
      mc_8tap_regular_w128_hv_8bpc_neon:    8.86   6.53
      mc_8tap_regular_w128_v_8bpc_neon:    12.46   9.63
      mc_bilinear_w2_0_8bpc_neon:           7.02   5.40
      mc_bilinear_w2_h_8bpc_neon:           3.65   3.14
      mc_bilinear_w2_hv_8bpc_neon:          4.36   4.84
      mc_bilinear_w2_v_8bpc_neon:           5.22   4.28
      mc_bilinear_w4_0_8bpc_neon:           6.87   5.99
      mc_bilinear_w4_h_8bpc_neon:           6.50   8.61
      mc_bilinear_w4_hv_8bpc_neon:          7.70   7.99
      mc_bilinear_w4_v_8bpc_neon:           7.04   9.10
      mc_bilinear_w8_0_8bpc_neon:           7.03   5.70
      mc_bilinear_w8_h_8bpc_neon:          11.30  15.14
      mc_bilinear_w8_hv_8bpc_neon:         15.74  13.50
      mc_bilinear_w8_v_8bpc_neon:          13.40  17.54
      mc_bilinear_w16_0_8bpc_neon:          6.75   4.48
      mc_bilinear_w16_h_8bpc_neon:         17.02  13.95
      mc_bilinear_w16_hv_8bpc_neon:        17.37  13.78
      mc_bilinear_w16_v_8bpc_neon:         23.69  22.98
      mc_bilinear_w32_0_8bpc_neon:          3.88   3.18
      mc_bilinear_w32_h_8bpc_neon:         18.80  14.97
      mc_bilinear_w32_hv_8bpc_neon:        17.74  14.02
      mc_bilinear_w32_v_8bpc_neon:         24.46  23.04
      mc_bilinear_w64_0_8bpc_neon:          2.87   1.66
      mc_bilinear_w64_h_8bpc_neon:         19.54  16.02
      mc_bilinear_w64_hv_8bpc_neon:        17.80  14.32
      mc_bilinear_w64_v_8bpc_neon:         24.79  23.63
      mc_bilinear_w128_0_8bpc_neon:         2.13   1.23
      mc_bilinear_w128_h_8bpc_neon:        19.89  16.24
      mc_bilinear_w128_hv_8bpc_neon:       17.55  14.15
      mc_bilinear_w128_v_8bpc_neon:        24.45  23.54
      mct_8tap_regular_w4_0_8bpc_neon:      5.56   5.51
      mct_8tap_regular_w4_h_8bpc_neon:      7.48   5.80
      mct_8tap_regular_w4_hv_8bpc_neon:     7.27   7.09
      mct_8tap_regular_w4_v_8bpc_neon:      7.80   6.84
      mct_8tap_regular_w8_0_8bpc_neon:      9.54   9.25
      mct_8tap_regular_w8_h_8bpc_neon:      9.08   6.55
      mct_8tap_regular_w8_hv_8bpc_neon:     9.16   6.30
      mct_8tap_regular_w8_v_8bpc_neon:     10.79   8.66
      mct_8tap_regular_w16_0_8bpc_neon:    15.35  10.50
      mct_8tap_regular_w16_h_8bpc_neon:    10.18   6.76
      mct_8tap_regular_w16_hv_8bpc_neon:    9.17   6.11
      mct_8tap_regular_w16_v_8bpc_neon:    11.52   8.72
      mct_8tap_regular_w32_0_8bpc_neon:    15.82  10.09
      mct_8tap_regular_w32_h_8bpc_neon:    10.75   6.85
      mct_8tap_regular_w32_hv_8bpc_neon:    9.00   6.22
      mct_8tap_regular_w32_v_8bpc_neon:    11.58   8.67
      mct_8tap_regular_w64_0_8bpc_neon:    15.28   9.68
      mct_8tap_regular_w64_h_8bpc_neon:    10.93   6.96
      mct_8tap_regular_w64_hv_8bpc_neon:    8.81   6.53
      mct_8tap_regular_w64_v_8bpc_neon:    11.42   8.73
      mct_8tap_regular_w128_0_8bpc_neon:   14.41   7.67
      mct_8tap_regular_w128_h_8bpc_neon:   10.92   6.96
      mct_8tap_regular_w128_hv_8bpc_neon:   8.56   6.51
      mct_8tap_regular_w128_v_8bpc_neon:   11.16   8.70
      mct_bilinear_w4_0_8bpc_neon:          5.66   5.77
      mct_bilinear_w4_h_8bpc_neon:          5.16   6.40
      mct_bilinear_w4_hv_8bpc_neon:         6.86   6.82
      mct_bilinear_w4_v_8bpc_neon:          4.75   6.09
      mct_bilinear_w8_0_8bpc_neon:          9.78  10.00
      mct_bilinear_w8_h_8bpc_neon:          8.98  11.37
      mct_bilinear_w8_hv_8bpc_neon:        14.42  10.83
      mct_bilinear_w8_v_8bpc_neon:          9.12  11.62
      mct_bilinear_w16_0_8bpc_neon:        15.59  10.76
      mct_bilinear_w16_h_8bpc_neon:        11.98   8.77
      mct_bilinear_w16_hv_8bpc_neon:       15.83  10.73
      mct_bilinear_w16_v_8bpc_neon:        14.70  14.60
      mct_bilinear_w32_0_8bpc_neon:        15.89  10.32
      mct_bilinear_w32_h_8bpc_neon:        13.47   9.07
      mct_bilinear_w32_hv_8bpc_neon:       16.01  10.95
      mct_bilinear_w32_v_8bpc_neon:        14.85  14.16
      mct_bilinear_w64_0_8bpc_neon:        15.36  10.51
      mct_bilinear_w64_h_8bpc_neon:        14.00   9.61
      mct_bilinear_w64_hv_8bpc_neon:       15.82  11.27
      mct_bilinear_w64_v_8bpc_neon:        14.61  14.76
      mct_bilinear_w128_0_8bpc_neon:       14.41   7.92
      mct_bilinear_w128_h_8bpc_neon:       13.31   9.58
      mct_bilinear_w128_hv_8bpc_neon:      14.07  11.18
      mct_bilinear_w128_v_8bpc_neon:       11.57  14.42
      4aa0363a
  6. 25 Oct, 2018 2 commits
  7. 23 Oct, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: mc: Make the jump tables local symbols · 0bb53898
      Martin Storsjö authored
      For MachO, this makes sure that the label difference actually is
      evaluated at assembly time (as it already was for ELF and COFF);
      evaluating it at link time failed when the difference is stored in
      a .hword.
      
      This fixes linking errors like these:
      ld: in section __TEXT,__text reloc 0: ARM64_RELOC_SUBTRACTOR must have r_length of 2 or 3 file 'src/src@@dav1d_bitdepth_8@sta/arm_64_mc.S.o' for architecture arm64
      
      This adds an asm.S macro for decorating a symbol for making a
      local symbol. For armasm64 with gas-preprocessor, this doesn't
      actually create a local label (but neither do the local numbered
      labels either currently), which might be slightly inconsistent
      in it would be necessary to make the distinction for that assembler
      as well.
      
      Alternatively, the table symbol could be made into a plain local
      numbered label as all the other labels.
      0bb53898
  8. 21 Oct, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: Don't use uxth for extending a register · 91e0b478
      Martin Storsjö authored
      armasm64 fails to assemble this:
      error A2173: syntax error in expression
              sub             x7,  x7,  w4, uxth
      
      This clearly is a bug in armasm64, and will be reported. For now,
      this workaround should be harmless though, as we've just loaded
      the register with ldrh, so the upper parts of the register should
      be zeroed.
      91e0b478
  9. 20 Oct, 2018 1 commit
    • Janne Grunau's avatar
      arm64/mc: add 8-bit neon asm for avg, w_avg and mask · 80e47425
      Janne Grunau authored
      checkasm --bench on a Qualcomm Kryo (Sanpdragon 820):
      nop: 33.0
      avg_w4_8bpc_c: 450.5
      avg_w4_8bpc_neon: 20.1
      avg_w8_8bpc_c: 438.6
      avg_w8_8bpc_neon: 45.2
      avg_w16_8bpc_c: 1003.7
      avg_w16_8bpc_neon: 112.8
      avg_w32_8bpc_c: 3249.6
      avg_w32_8bpc_neon: 429.9
      avg_w64_8bpc_c: 7213.3
      avg_w64_8bpc_neon: 1299.4
      avg_w128_8bpc_c: 16791.3
      avg_w128_8bpc_neon: 2978.4
      w_avg_w4_8bpc_c: 605.7
      w_avg_w4_8bpc_neon: 30.9
      w_avg_w8_8bpc_c: 545.8
      w_avg_w8_8bpc_neon: 72.9
      w_avg_w16_8bpc_c: 1430.1
      w_avg_w16_8bpc_neon: 193.5
      w_avg_w32_8bpc_c: 4876.3
      w_avg_w32_8bpc_neon: 715.3
      w_avg_w64_8bpc_c: 11338.0
      w_avg_w64_8bpc_neon: 2147.0
      w_avg_w128_8bpc_c: 26822.0
      w_avg_w128_8bpc_neon: 4596.3
      mask_w4_8bpc_c: 604.6
      mask_w4_8bpc_neon: 37.2
      mask_w8_8bpc_c: 654.8
      mask_w8_8bpc_neon: 96.0
      mask_w16_8bpc_c: 1663.0
      mask_w16_8bpc_neon: 272.4
      mask_w32_8bpc_c: 5707.6
      mask_w32_8bpc_neon: 1028.9
      mask_w64_8bpc_c: 12735.3
      mask_w64_8bpc_neon: 2533.2
      mask_w128_8bpc_c: 31027.6
      mask_w128_8bpc_neon: 6247.2
      80e47425