1. 07 Oct, 2019 1 commit
  2. 02 Oct, 2019 1 commit
  3. 15 Aug, 2019 1 commit
    • arm64: mc: NEON implementation of w_mask_444/422/420 function · 3d94fb9a
      B Krishnan Iyer authored
                                  A73         A53

      w_mask_420_w4_8bpc_c:       818         1082.9
      w_mask_420_w4_8bpc_neon:    79          126.6
      w_mask_420_w8_8bpc_c:       2486        3399.8
      w_mask_420_w8_8bpc_neon:    200.2       343.7
      w_mask_420_w16_8bpc_c:      8022.3      10989.6
      w_mask_420_w16_8bpc_neon:   528.1       889
      w_mask_420_w32_8bpc_c:      31851.8     42808.6
      w_mask_420_w32_8bpc_neon:   2062.5      3380.8
      w_mask_420_w64_8bpc_c:      79268.5     102683.9
      w_mask_420_w64_8bpc_neon:   5252.9      8575.4
      w_mask_420_w128_8bpc_c:     193704.1    255586.5
      w_mask_420_w128_8bpc_neon:  14602.3     22167.7

      w_mask_422_w4_8bpc_c:       777.3       1038.5
      w_mask_422_w4_8bpc_neon:    72.1        112.9
      w_mask_422_w8_8bpc_c:       2405.7      3168
      w_mask_422_w8_8bpc_neon:    191.9       314.1
      w_mask_422_w16_8bpc_c:      7783.7      10543.9
      w_mask_422_w16_8bpc_neon:   559.8       835.5
      w_mask_422_w32_8bpc_c:      30895.7     41141.2
      w_mask_422_w32_8bpc_neon:   2089.7      3187.2
      w_mask_422_w64_8bpc_c:      75500.2     98766.3
      w_mask_422_w64_8bpc_neon:   5379        8208.2
      w_mask_422_w128_8bpc_c:     186967.1    245809.1
      w_mask_422_w128_8bpc_neon:  15159.9     21474.5

      w_mask_444_w4_8bpc_c:       850.1       1136.6
      w_mask_444_w4_8bpc_neon:    66.5        104.7
      w_mask_444_w8_8bpc_c:       2373.5      3262.9
      w_mask_444_w8_8bpc_neon:    180.5       290.2
      w_mask_444_w16_8bpc_c:      7291.6      10590.7
      w_mask_444_w16_8bpc_neon:   550.9       809.7
      w_mask_444_w32_8bpc_c:      8048.3      10140.8
      w_mask_444_w32_8bpc_neon:   2136.2      3095
      w_mask_444_w64_8bpc_c:      18055.3     23060
      w_mask_444_w64_8bpc_neon:   5522.5      8124.8
      w_mask_444_w128_8bpc_c:     42754.3     56072
      w_mask_444_w128_8bpc_neon:  15569.5     21531.5
  4. 14 Aug, 2019 1 commit
    • arm64: mc: NEON implementation of blend, blend_h and blend_v function · 1dc2dc7d
      B Krishnan Iyer authored
                         	A73	A53
      blend_h_w2_8bpc_c:	184.7	301.5
      blend_h_w2_8bpc_neon:	58.8	104.1
      blend_h_w4_8bpc_c:	291.4	507.3
      blend_h_w4_8bpc_neon:	48.7	108.9
      blend_h_w8_8bpc_c:	510.1	992.7
      blend_h_w8_8bpc_neon:	66.5	99.3
      blend_h_w16_8bpc_c:	972	1835.3
      blend_h_w16_8bpc_neon:	82.7	145.2
      blend_h_w32_8bpc_c:	776.7	912.9
      blend_h_w32_8bpc_neon:	155.1	266.9
      blend_h_w64_8bpc_c:	1424.3	1635.4
      blend_h_w64_8bpc_neon:	273.4	480.9
      blend_h_w128_8bpc_c:	3318.1	3774
      blend_h_w128_8bpc_neon:	614.1	1097.9
      blend_v_w2_8bpc_c:	278.8	427.5
      blend_v_w2_8bpc_neon:	113.7	170.4
      blend_v_w4_8bpc_c:	960.2	1597.7
      blend_v_w4_8bpc_neon:	222.9	351.4
      blend_v_w8_8bpc_c:	1694.2	3333.5
      blend_v_w8_8bpc_neon:	200.9	333.6
      blend_v_w16_8bpc_c:	3115.2	5971.6
      blend_v_w16_8bpc_neon:	233.2	494.8
      blend_v_w32_8bpc_c:	3949.7	6070.6
      blend_v_w32_8bpc_neon:	460.4	841.6
      blend_w4_8bpc_c:	244.2	388.3
      blend_w4_8bpc_neon:	25.5	66.7
      blend_w8_8bpc_c:	616.3	1120.8
      blend_w8_8bpc_neon:	46	110.7
      blend_w16_8bpc_c:	2193.1	4056.4
      blend_w16_8bpc_neon:	140.7	299.3
      blend_w32_8bpc_c:	2502.8	2998.5
      blend_w32_8bpc_neon:	381.4	725.3
  5. 19 May, 2019 1 commit
    • arm: mc: Fix 8tap_v w8 with OBMC 3/4 heights · bf920fba
      Martin Storsjö authored
      Also make sure that the w4 case can exit after processing 12 pixels,
      where it is convenient.
      
      This gives a small slowdown for in-order cores like A7, A8, A53, but
      actually seems to give a small speedup for out-of-order cores like
      A9, A72 and A73.
      
      AArch64:
      Before:                      Cortex A53     A72     A73
      mc_8tap_regular_w8_v_8bpc_neon:   223.8   247.3   228.5
      After:
      mc_8tap_regular_w8_v_8bpc_neon:   232.5   243.9   223.4
      
      AArch32:
      Before:                       Cortex A7      A8      A9     A53     A72     A73
      mc_8tap_regular_w8_v_8bpc_neon:   550.2   470.7   520.5   257.0   256.4   248.2
      After:
      mc_8tap_regular_w8_v_8bpc_neon:   554.3   474.2   511.6   267.5   252.6   246.8
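The "small slowdown" and "small speedup" above can be put into numbers. The following is a minimal sketch that computes the relative change per core from the cycle counts in the two tables; the dictionary layout and the helper name `percent_change` are illustrative choices, not part of the commit.

```python
# Cycle counts copied from the AArch64 and AArch32 tables above
# (mc_8tap_regular_w8_v_8bpc_neon, lower is better).
before = {
    "aarch64": {"A53": 223.8, "A72": 247.3, "A73": 228.5},
    "aarch32": {"A7": 550.2, "A8": 470.7, "A9": 520.5,
                "A53": 257.0, "A72": 256.4, "A73": 248.2},
}
after = {
    "aarch64": {"A53": 232.5, "A72": 243.9, "A73": 223.4},
    "aarch32": {"A7": 554.3, "A8": 474.2, "A9": 511.6,
                "A53": 267.5, "A72": 252.6, "A73": 246.8},
}


def percent_change(old, new):
    """Relative change in cycles; a positive result means slower."""
    return (new - old) / old * 100.0


changes = {
    arch: {core: percent_change(before[arch][core], after[arch][core])
           for core in before[arch]}
    for arch in before
}

for arch, cores in changes.items():
    for core, pct in cores.items():
        print(f"{arch} {core}: {pct:+.1f}%")
```

With these figures, the in-order cores (A7, A8, A53) come out positive and the out-of-order cores (A9, A72, A73) come out negative, which matches the description in the message.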
  6. 09 May, 2019 1 commit
  7. 08 May, 2019 1 commit
  8. 08 Apr, 2019 1 commit
  9. 07 Apr, 2019 1 commit
    • arm: Fix typos in comments · 556780b7
      Martin Storsjö authored
      The width register has been set to clz(w)-24, not the other way
      around. And the 32 bit prep function has got the h parameter in
      r4, not in r5.
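As an illustration of what clz(w)-24 yields, here is a minimal sketch that evaluates it for the power-of-two block widths 2 to 128, assuming a 32-bit count-leading-zeros. The helper names `clz32` and `width_index` are hypothetical and only model the arithmetic, not the assembly itself.

```python
def clz32(x):
    """Count leading zeros of a non-zero 32-bit value."""
    assert 0 < x < (1 << 32)
    return 32 - x.bit_length()


def width_index(w):
    """Index derived from a power-of-two block width: clz(w) - 24."""
    return clz32(w) - 24


indices = {w: width_index(w) for w in (128, 64, 32, 16, 8, 4, 2)}
print(indices)
```

The widest block (128) maps to 0 and the narrowest (2) maps to 6, so the value can serve directly as a small table index.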
  10. 13 Feb, 2019 2 commits
  11. 24 Jan, 2019 6 commits
  12. 18 Nov, 2018 1 commit
    • arm64: mc: Implement 8tap and bilin functions · 4aa0363a
      Martin Storsjö authored
      These functions have been tuned against Cortex A53 and Snapdragon
      835. The bilin functions have mainly been written with code size
      in mind, as they aren't used much in practice.
      
      Relative speedups for the actual filtering functions (that don't
      just do a plain copy) are around 4-15x, some over 20x. This is
      in comparison with GCC 5.4 with autovectorization disabled; the
      actual real-world speedup against autovectorized C code is around
      4-10x.
      
      Relative speedups measured with checkasm:
                                      Cortex A53   Snapdragon 835
      mc_8tap_regular_w2_0_8bpc_neon:       6.96   5.28
      mc_8tap_regular_w2_h_8bpc_neon:       5.16   4.35
      mc_8tap_regular_w2_hv_8bpc_neon:      5.37   4.98
      mc_8tap_regular_w2_v_8bpc_neon:       6.35   4.85
      mc_8tap_regular_w4_0_8bpc_neon:       6.78   5.73
      mc_8tap_regular_w4_h_8bpc_neon:       8.40   6.60
      mc_8tap_regular_w4_hv_8bpc_neon:      7.23   7.10
      mc_8tap_regular_w4_v_8bpc_neon:       9.06   7.76
      mc_8tap_regular_w8_0_8bpc_neon:       6.96   5.55
      mc_8tap_regular_w8_h_8bpc_neon:      10.36   6.88
      mc_8tap_regular_w8_hv_8bpc_neon:      9.49   6.86
      mc_8tap_regular_w8_v_8bpc_neon:      12.06   9.61
      mc_8tap_regular_w16_0_8bpc_neon:      6.68   4.51
      mc_8tap_regular_w16_h_8bpc_neon:     12.30   7.77
      mc_8tap_regular_w16_hv_8bpc_neon:     9.50   6.68
      mc_8tap_regular_w16_v_8bpc_neon:     12.93   9.68
      mc_8tap_regular_w32_0_8bpc_neon:      3.91   2.93
      mc_8tap_regular_w32_h_8bpc_neon:     13.06   7.89
      mc_8tap_regular_w32_hv_8bpc_neon:     9.37   6.70
      mc_8tap_regular_w32_v_8bpc_neon:     12.88   9.49
      mc_8tap_regular_w64_0_8bpc_neon:      2.89   1.68
      mc_8tap_regular_w64_h_8bpc_neon:     13.48   8.00
      mc_8tap_regular_w64_hv_8bpc_neon:     9.23   6.53
      mc_8tap_regular_w64_v_8bpc_neon:     13.11   9.68
      mc_8tap_regular_w128_0_8bpc_neon:     1.89   1.24
      mc_8tap_regular_w128_h_8bpc_neon:    13.58   7.98
      mc_8tap_regular_w128_hv_8bpc_neon:    8.86   6.53
      mc_8tap_regular_w128_v_8bpc_neon:    12.46   9.63
      mc_bilinear_w2_0_8bpc_neon:           7.02   5.40
      mc_bilinear_w2_h_8bpc_neon:           3.65   3.14
      mc_bilinear_w2_hv_8bpc_neon:          4.36   4.84
      mc_bilinear_w2_v_8bpc_neon:           5.22   4.28
      mc_bilinear_w4_0_8bpc_neon:           6.87   5.99
      mc_bilinear_w4_h_8bpc_neon:           6.50   8.61
      mc_bilinear_w4_hv_8bpc_neon:          7.70   7.99
      mc_bilinear_w4_v_8bpc_neon:           7.04   9.10
      mc_bilinear_w8_0_8bpc_neon:           7.03   5.70
      mc_bilinear_w8_h_8bpc_neon:          11.30  15.14
      mc_bilinear_w8_hv_8bpc_neon:         15.74  13.50
      mc_bilinear_w8_v_8bpc_neon:          13.40  17.54
      mc_bilinear_w16_0_8bpc_neon:          6.75   4.48
      mc_bilinear_w16_h_8bpc_neon:         17.02  13.95
      mc_bilinear_w16_hv_8bpc_neon:        17.37  13.78
      mc_bilinear_w16_v_8bpc_neon:         23.69  22.98
      mc_bilinear_w32_0_8bpc_neon:          3.88   3.18
      mc_bilinear_w32_h_8bpc_neon:         18.80  14.97
      mc_bilinear_w32_hv_8bpc_neon:        17.74  14.02
      mc_bilinear_w32_v_8bpc_neon:         24.46  23.04
      mc_bilinear_w64_0_8bpc_neon:          2.87   1.66
      mc_bilinear_w64_h_8bpc_neon:         19.54  16.02
      mc_bilinear_w64_hv_8bpc_neon:        17.80  14.32
      mc_bilinear_w64_v_8bpc_neon:         24.79  23.63
      mc_bilinear_w128_0_8bpc_neon:         2.13   1.23
      mc_bilinear_w128_h_8bpc_neon:        19.89  16.24
      mc_bilinear_w128_hv_8bpc_neon:       17.55  14.15
      mc_bilinear_w128_v_8bpc_neon:        24.45  23.54
      mct_8tap_regular_w4_0_8bpc_neon:      5.56   5.51
      mct_8tap_regular_w4_h_8bpc_neon:      7.48   5.80
      mct_8tap_regular_w4_hv_8bpc_neon:     7.27   7.09
      mct_8tap_regular_w4_v_8bpc_neon:      7.80   6.84
      mct_8tap_regular_w8_0_8bpc_neon:      9.54   9.25
      mct_8tap_regular_w8_h_8bpc_neon:      9.08   6.55
      mct_8tap_regular_w8_hv_8bpc_neon:     9.16   6.30
      mct_8tap_regular_w8_v_8bpc_neon:     10.79   8.66
      mct_8tap_regular_w16_0_8bpc_neon:    15.35  10.50
      mct_8tap_regular_w16_h_8bpc_neon:    10.18   6.76
      mct_8tap_regular_w16_hv_8bpc_neon:    9.17   6.11
      mct_8tap_regular_w16_v_8bpc_neon:    11.52   8.72
      mct_8tap_regular_w32_0_8bpc_neon:    15.82  10.09
      mct_8tap_regular_w32_h_8bpc_neon:    10.75   6.85
      mct_8tap_regular_w32_hv_8bpc_neon:    9.00   6.22
      mct_8tap_regular_w32_v_8bpc_neon:    11.58   8.67
      mct_8tap_regular_w64_0_8bpc_neon:    15.28   9.68
      mct_8tap_regular_w64_h_8bpc_neon:    10.93   6.96
      mct_8tap_regular_w64_hv_8bpc_neon:    8.81   6.53
      mct_8tap_regular_w64_v_8bpc_neon:    11.42   8.73
      mct_8tap_regular_w128_0_8bpc_neon:   14.41   7.67
      mct_8tap_regular_w128_h_8bpc_neon:   10.92   6.96
      mct_8tap_regular_w128_hv_8bpc_neon:   8.56   6.51
      mct_8tap_regular_w128_v_8bpc_neon:   11.16   8.70
      mct_bilinear_w4_0_8bpc_neon:          5.66   5.77
      mct_bilinear_w4_h_8bpc_neon:          5.16   6.40
      mct_bilinear_w4_hv_8bpc_neon:         6.86   6.82
      mct_bilinear_w4_v_8bpc_neon:          4.75   6.09
      mct_bilinear_w8_0_8bpc_neon:          9.78  10.00
      mct_bilinear_w8_h_8bpc_neon:          8.98  11.37
      mct_bilinear_w8_hv_8bpc_neon:        14.42  10.83
      mct_bilinear_w8_v_8bpc_neon:          9.12  11.62
      mct_bilinear_w16_0_8bpc_neon:        15.59  10.76
      mct_bilinear_w16_h_8bpc_neon:        11.98   8.77
      mct_bilinear_w16_hv_8bpc_neon:       15.83  10.73
      mct_bilinear_w16_v_8bpc_neon:        14.70  14.60
      mct_bilinear_w32_0_8bpc_neon:        15.89  10.32
      mct_bilinear_w32_h_8bpc_neon:        13.47   9.07
      mct_bilinear_w32_hv_8bpc_neon:       16.01  10.95
      mct_bilinear_w32_v_8bpc_neon:        14.85  14.16
      mct_bilinear_w64_0_8bpc_neon:        15.36  10.51
      mct_bilinear_w64_h_8bpc_neon:        14.00   9.61
      mct_bilinear_w64_hv_8bpc_neon:       15.82  11.27
      mct_bilinear_w64_v_8bpc_neon:        14.61  14.76
      mct_bilinear_w128_0_8bpc_neon:       14.41   7.92
      mct_bilinear_w128_h_8bpc_neon:       13.31   9.58
      mct_bilinear_w128_hv_8bpc_neon:      14.07  11.18
      mct_bilinear_w128_v_8bpc_neon:       11.57  14.42
  13. 25 Oct, 2018 2 commits
  14. 23 Oct, 2018 1 commit
    • arm64: mc: Make the jump tables local symbols · 0bb53898
      Martin Storsjö authored
      For MachO, this makes sure that the label difference actually is
      evaluated at assembly time (as it already was for ELF and COFF);
      evaluating it at link time failed when the difference is stored in
      a .hword.
      
      This fixes linking errors like these:
      ld: in section __TEXT,__text reloc 0: ARM64_RELOC_SUBTRACTOR must have r_length of 2 or 3 file 'src/src@@dav1d_bitdepth_8@sta/arm_64_mc.S.o' for architecture arm64
      
      This adds an asm.S macro that decorates a symbol to make it a
      local symbol. For armasm64 with gas-preprocessor, this doesn't
      actually create a local label (but neither do the local numbered
      labels currently), which might be slightly inconsistent
      if it would be necessary to make the distinction for that assembler
      as well.
      
      Alternatively, the table symbol could be made into a plain local
      numbered label like all the other labels.
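To make the role of the 16-bit label differences concrete, the following is a rough model of a jump table whose entries are stored as .hword-sized values. The addresses, the sign convention (table address minus target address) and the helper names are assumptions made for this sketch and are not taken from the commit.

```python
import struct

TABLE_ADDR = 0x1000  # hypothetical address of the jump table label
# Hypothetical addresses of the per-width code paths, placed before the table.
TARGETS = [0x0C00, 0x0D00, 0x0E00, 0x0E80, 0x0F00, 0x0F40, 0x0F80]


def build_table(table_addr, targets):
    """Pack each (table - target) difference as a little-endian 16-bit value.

    struct.pack raises struct.error if a difference does not fit in 16 bits,
    which mirrors why the value has to be a known constant at assembly time.
    """
    return b"".join(struct.pack("<H", table_addr - t) for t in targets)


def dispatch(table, table_addr, index):
    """Read the 16-bit entry at `index` and recover the absolute target."""
    (offset,) = struct.unpack_from("<H", table, 2 * index)
    return table_addr - offset


table = build_table(TABLE_ADDR, TARGETS)
resolved = [dispatch(table, TABLE_ADDR, i) for i in range(len(TARGETS))]
print([hex(a) for a in resolved])
```

Each entry occupies two bytes, so the difference must already be a plain number when the table is emitted; a relocation resolved at link time cannot be expressed in that width on MachO, as the linker error above shows.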
  15. 21 Oct, 2018 1 commit
    • arm64: Don't use uxth for extending a register · 91e0b478
      Martin Storsjö authored
      armasm64 fails to assemble this:
      error A2173: syntax error in expression
              sub             x7,  x7,  w4, uxth
      
      This clearly is a bug in armasm64, and will be reported. For now,
      this workaround should be harmless though, as we've just loaded
      the register with ldrh, so the upper parts of the register should
      be zeroed.
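The reasoning for why the workaround is harmless can be checked with a small model. This sketch assumes the replacement extends the full 32-bit register (modelled here as uxtw); the commit message does not name the replacement, so that choice and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def ldrh(memory, addr):
    """Model of a 16-bit load: the result is zero-extended into the register."""
    return int.from_bytes(memory[addr:addr + 2], "little")


def sub_uxth(xn, wm):
    """Model of `sub xd, xn, wm, uxth`: subtract the low 16 bits of wm."""
    return (xn - (wm & 0xFFFF)) & MASK64


def sub_uxtw(xn, wm):
    """Model of `sub xd, xn, wm, uxtw`: subtract the low 32 bits of wm."""
    return (xn - (wm & 0xFFFFFFFF)) & MASK64


memory = bytes([0x34, 0x12, 0xFF, 0xFF])
x7 = 0x10000
w4 = ldrh(memory, 0)  # 0x1234, upper bits are zero after the load

same_after_ldrh = sub_uxth(x7, w4) == sub_uxtw(x7, w4)
differs_otherwise = sub_uxth(x7, 0x12345) != sub_uxtw(x7, 0x12345)
print(same_after_ldrh, differs_otherwise)
```

The two forms only diverge when the register holds bits above bit 15, which cannot happen immediately after a 16-bit zero-extending load.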
  16. 20 Oct, 2018 1 commit
    • arm64/mc: add 8-bit neon asm for avg, w_avg and mask · 80e47425
      Janne Grunau authored
      checkasm --bench on a Qualcomm Kryo (Snapdragon 820):
      nop: 33.0
      avg_w4_8bpc_c: 450.5
      avg_w4_8bpc_neon: 20.1
      avg_w8_8bpc_c: 438.6
      avg_w8_8bpc_neon: 45.2
      avg_w16_8bpc_c: 1003.7
      avg_w16_8bpc_neon: 112.8
      avg_w32_8bpc_c: 3249.6
      avg_w32_8bpc_neon: 429.9
      avg_w64_8bpc_c: 7213.3
      avg_w64_8bpc_neon: 1299.4
      avg_w128_8bpc_c: 16791.3
      avg_w128_8bpc_neon: 2978.4
      w_avg_w4_8bpc_c: 605.7
      w_avg_w4_8bpc_neon: 30.9
      w_avg_w8_8bpc_c: 545.8
      w_avg_w8_8bpc_neon: 72.9
      w_avg_w16_8bpc_c: 1430.1
      w_avg_w16_8bpc_neon: 193.5
      w_avg_w32_8bpc_c: 4876.3
      w_avg_w32_8bpc_neon: 715.3
      w_avg_w64_8bpc_c: 11338.0
      w_avg_w64_8bpc_neon: 2147.0
      w_avg_w128_8bpc_c: 26822.0
      w_avg_w128_8bpc_neon: 4596.3
      mask_w4_8bpc_c: 604.6
      mask_w4_8bpc_neon: 37.2
      mask_w8_8bpc_c: 654.8
      mask_w8_8bpc_neon: 96.0
      mask_w16_8bpc_c: 1663.0
      mask_w16_8bpc_neon: 272.4
      mask_w32_8bpc_c: 5707.6
      mask_w32_8bpc_neon: 1028.9
      mask_w64_8bpc_c: 12735.3
      mask_w64_8bpc_neon: 2533.2
      mask_w128_8bpc_c: 31027.6
      mask_w128_8bpc_neon: 6247.2
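Raw cycle counts like the ones above are easier to compare as C-to-NEON ratios. Below is a minimal sketch that parses `name: value` lines and pairs each `_neon` entry with its `_c` counterpart; the parser and the function names are illustrative and not part of checkasm, and the sample contains only a few of the lines listed above.

```python
def parse_bench(text):
    """Parse 'name: value' lines into a dict of floats, skipping other lines."""
    results = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        results[name] = float(value)
    return results


def speedups(results):
    """Return C cycles divided by NEON cycles for every function with both."""
    out = {}
    for name, cycles in results.items():
        if name.endswith("_neon"):
            base = name[:-len("_neon")]
            c_cycles = results.get(base + "_c")
            if c_cycles is not None:
                out[base] = c_cycles / cycles
    return out


SAMPLE = """\
nop: 33.0
avg_w4_8bpc_c: 450.5
avg_w4_8bpc_neon: 20.1
mask_w128_8bpc_c: 31027.6
mask_w128_8bpc_neon: 6247.2
"""

ratios = speedups(parse_bench(SAMPLE))
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Entries without a matching pair, such as `nop`, are left out of the result.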