1. 09 Feb, 2021 1 commit
  2. 05 Feb, 2021 1 commit
    • Kyle Siefring's avatar
      arm64: warped motion: Various optimizations · a3b8157e
      Kyle Siefring authored
      - Reorder loads of filters to benefit in-order cores.
      - Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the
         first stage which will hurt performance on some older big cores.
      - Rework horz stage for 8 bit mode:
          * Use smull instead of mul
          * Replace existing narrow and long instructions
          * Replace mov after calling with right shift
      
      Before:            Cortex A55    A53     A72     A73
      warp_8x8_8bpc_neon:    1683.2  1860.6  1065.0  1102.6
      warp_8x8t_8bpc_neon:   1673.2  1846.4  1057.0  1098.4
      warp_8x8_16bpc_neon:   1870.7  2031.7  1147.3  1220.7
      warp_8x8t_16bpc_neon:  1848.0  2006.2  1121.6  1188.0
      After:
      warp_8x8_8bpc_neon:    1267.2  1446.2   807.0   871.5
      warp_8x8t_8bpc_neon:   1245.4  1422.0   810.2   868.4
      warp_8x8_16bpc_neon:   1769.8  1929.3  1132.0  1238.2
      warp_8x8t_16bpc_neon:  1747.3  1904.1  1101.5  1207.9
      
      Cortex-A55
      Before:
      warp_8x8_8bpc_neon:   1683.2
      warp_8x8t_8bpc_neon:  1673.2
      warp_8x8_16bpc_neon:  1870.7
      warp_8x8t_16bpc_neon: 1848.0
      After:
      warp_8x8_8bpc_neon:   1267.2
      warp_8x8t_8bpc_neon:  1245.4
      warp_8x8_16bpc_neon:  1769.8
      warp_8x8t_16bpc_neon: 1747.3
      a3b8157e
  3. 03 Sep, 2020 1 commit
  4. 04 Apr, 2020 1 commit
    • Martin Storsjö's avatar
      arm64: mc: NEON implementation of emu_edge for 8bpc · ea54dbe2
      Martin Storsjö authored
      Relative speedups over C code:
                           Cortex A53    A72    A73
      emu_edge_w4_8bpc_neon:     3.82   2.93   2.41
      emu_edge_w8_8bpc_neon:     3.28   2.86   2.51
      emu_edge_w16_8bpc_neon:    3.58   3.27   2.63
      emu_edge_w32_8bpc_neon:    3.04   1.68   2.12
      emu_edge_w64_8bpc_neon:    2.58   1.45   1.48
      emu_edge_w128_8bpc_neon:   1.79   1.02   1.57
      
      The benchmark numbers for the larger size on A72 fluctuate a
      whole lot and thus seem very unreliable.
      ea54dbe2
  5. 04 Mar, 2020 4 commits
    • Martin Storsjö's avatar
      arm: mc: Optimize blend_v · 52e9b435
      Martin Storsjö authored
      Use a post-increment with a register on the last increment, avoiding
      a separate increment. Avoid processing the last 8 pixels in the w32
      case when we only output 24 pixels.
      
      Before:
      ARM32                Cortex A7      A8      A9     A53     A72     A73
      blend_v_w4_8bpc_neon:    450.4   574.7   538.7   374.6   199.3   260.5
      blend_v_w8_8bpc_neon:    559.6   351.3   552.5   357.6   214.8   204.3
      blend_v_w16_8bpc_neon:   926.3   511.6   787.9   593.0   271.0   246.8
      blend_v_w32_8bpc_neon:  1482.5   917.0  1149.5   991.9   354.0   368.9
      ARM64
      blend_v_w4_8bpc_neon:                            351.1   200.0   224.1
      blend_v_w8_8bpc_neon:                            333.0   212.4   203.8
      blend_v_w16_8bpc_neon:                           495.2   302.0   247.0
      blend_v_w32_8bpc_neon:                           840.0   557.8   514.0
      
      After:
      ARM32
      blend_v_w4_8bpc_neon:    435.5   575.8   537.6   356.2   198.3   259.5
      blend_v_w8_8bpc_neon:    545.2   347.9   553.5   339.1   207.8   204.2
      blend_v_w16_8bpc_neon:   913.7   511.0   788.1   573.7   275.4   243.3
      blend_v_w32_8bpc_neon:  1445.3   951.2  1079.1   920.4   352.2   361.6
      ARM64
      blend_v_w4_8bpc_neon:                            333.0   191.3   225.9
      blend_v_w8_8bpc_neon:                            314.9   199.3   203.5
      blend_v_w16_8bpc_neon:                           476.9   301.3   241.1
      blend_v_w32_8bpc_neon:                           766.9   432.8   416.9
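
The w32 shortcut described in this commit (only 24 of 32 pixels are output) can be illustrated with a small scalar model. This is a minimal sketch, not the NEON code: the 3/4-width rule is inferred from the "w32 outputs 24 pixels" remark, and the 6-bit weighted blend formula, the function names and the sample values are assumptions made for illustration.

```python
def blend_px(a: int, b: int, m: int) -> int:
    """Blend two 8-bit pixels with a weight m in [0, 64] (assumed formula)."""
    return (a * (64 - m) + b * m + 32) >> 6


def blend_v_row(dst: list, tmp: list, mask: list) -> int:
    """Blend only the first 3/4 of a row in place; return the pixels written."""
    w = len(dst)
    out_w = (w * 3) >> 2  # assumed rule: w32 -> 24 output pixels
    for x in range(out_w):
        dst[x] = blend_px(dst[x], tmp[x], mask[x])
    return out_w  # pixels at index >= out_w are left untouched


# Hypothetical sample row: blend black with value 64 at half weight.
row = [0] * 32
written = blend_v_row(row, [64] * 32, [32] * 32)
```

Since the last quarter of the row is never written, loading and blending it is wasted work, which is what the w32 change avoids.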
      52e9b435
    • Martin Storsjö's avatar
      arm64: mc: Fix indentation · 48ffb05e
      Martin Storsjö authored
      48ffb05e
    • Martin Storsjö's avatar
      arm64: mc: Use more intuitive lane specifications for loads/stores · 83c62716
      Martin Storsjö authored
      For loads where we load/store a full or half register (instead of
      a lanewise load/store), the lane specification in itself doesn't
      matter, only its size.
      
      This doesn't change the generated code, but makes it more readable.
      83c62716
  6. 10 Feb, 2020 3 commits
    • Martin Storsjö's avatar
      arm64: mc: Reduce the width of a register copy · d4c5ad49
      Martin Storsjö authored
      Only copy as much as really is needed/used.
      d4c5ad49
    • Martin Storsjö's avatar
      arm64: mc: Use two regs for alternating output rows for w4/8 in avg/w_avg/mask · b1167ce1
      Martin Storsjö authored
      It was already done this way for w32/64. Not doing it for w16 as it
      didn't help there (and instead gave a small slowdown due to the two
      setup instructions).
      
      This gives a small speedup on in-order cores like A53.
      
      Before:         Cortex A53     A72     A73
      avg_w4_8bpc_neon:     60.9    25.6    29.0
      avg_w8_8bpc_neon:    143.6    52.8    64.0
      After:
      avg_w4_8bpc_neon:     56.7    26.7    28.5
      avg_w8_8bpc_neon:    137.2    54.5    64.4
      b1167ce1
    • Martin Storsjö's avatar
      arm64: mc: Simplify avg/w_avg/mask by always using the w16 macro · 0bad117e
      Martin Storsjö authored
      This shortens the source by 40 lines, and gives a significant
      speedup on A53, a small speedup on A72 and a very minor slowdown
      for avg/w_avg on A73.
      
      Before:           Cortex A53     A72     A73
      avg_w4_8bpc_neon:       67.4    26.1    25.4
      avg_w8_8bpc_neon:      158.7    56.3    59.1
      avg_w16_8bpc_neon:     382.9   154.1   160.7
      w_avg_w4_8bpc_neon:     99.9    43.6    39.4
      w_avg_w8_8bpc_neon:    253.2    98.3    99.0
      w_avg_w16_8bpc_neon:   543.1   285.0   301.8
      mask_w4_8bpc_neon:     110.6    51.4    45.1
      mask_w8_8bpc_neon:     295.0   129.9   114.0
      mask_w16_8bpc_neon:    654.6   365.8   369.7
      After:
      avg_w4_8bpc_neon:       60.8    26.3    29.0
      avg_w8_8bpc_neon:      142.8    52.9    64.1
      avg_w16_8bpc_neon:     378.2   153.4   160.8
      w_avg_w4_8bpc_neon:     78.7    41.0    40.9
      w_avg_w8_8bpc_neon:    190.6    90.1   105.1
      w_avg_w16_8bpc_neon:   531.1   279.3   301.4
      mask_w4_8bpc_neon:      86.6    47.2    44.9
      mask_w8_8bpc_neon:     222.0   114.3   114.9
      mask_w16_8bpc_neon:    639.5   356.0   369.8
      0bad117e
  7. 08 Oct, 2019 1 commit
  8. 07 Oct, 2019 1 commit
  9. 02 Oct, 2019 1 commit
  10. 15 Aug, 2019 1 commit
    • B Krishnan Iyer's avatar
      arm64: mc: NEON implementation of w_mask_444/422/420 function · 3d94fb9a
      B Krishnan Iyer authored
                                       A73        A53

      w_mask_420_w4_8bpc_c:            818     1082.9
      w_mask_420_w4_8bpc_neon:          79      126.6
      w_mask_420_w8_8bpc_c:           2486     3399.8
      w_mask_420_w8_8bpc_neon:       200.2      343.7
      w_mask_420_w16_8bpc_c:        8022.3    10989.6
      w_mask_420_w16_8bpc_neon:      528.1        889
      w_mask_420_w32_8bpc_c:       31851.8    42808.6
      w_mask_420_w32_8bpc_neon:     2062.5     3380.8
      w_mask_420_w64_8bpc_c:       79268.5   102683.9
      w_mask_420_w64_8bpc_neon:     5252.9     8575.4
      w_mask_420_w128_8bpc_c:     193704.1   255586.5
      w_mask_420_w128_8bpc_neon:   14602.3    22167.7

      w_mask_422_w4_8bpc_c:          777.3     1038.5
      w_mask_422_w4_8bpc_neon:        72.1      112.9
      w_mask_422_w8_8bpc_c:         2405.7       3168
      w_mask_422_w8_8bpc_neon:       191.9      314.1
      w_mask_422_w16_8bpc_c:        7783.7    10543.9
      w_mask_422_w16_8bpc_neon:      559.8      835.5
      w_mask_422_w32_8bpc_c:       30895.7    41141.2
      w_mask_422_w32_8bpc_neon:     2089.7     3187.2
      w_mask_422_w64_8bpc_c:       75500.2    98766.3
      w_mask_422_w64_8bpc_neon:       5379     8208.2
      w_mask_422_w128_8bpc_c:     186967.1   245809.1
      w_mask_422_w128_8bpc_neon:   15159.9    21474.5

      w_mask_444_w4_8bpc_c:          850.1     1136.6
      w_mask_444_w4_8bpc_neon:        66.5      104.7
      w_mask_444_w8_8bpc_c:         2373.5     3262.9
      w_mask_444_w8_8bpc_neon:       180.5      290.2
      w_mask_444_w16_8bpc_c:        7291.6    10590.7
      w_mask_444_w16_8bpc_neon:      550.9      809.7
      w_mask_444_w32_8bpc_c:        8048.3    10140.8
      w_mask_444_w32_8bpc_neon:     2136.2       3095
      w_mask_444_w64_8bpc_c:       18055.3      23060
      w_mask_444_w64_8bpc_neon:     5522.5     8124.8
      w_mask_444_w128_8bpc_c:      42754.3      56072
      w_mask_444_w128_8bpc_neon:   15569.5    21531.5
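
The figures in the table above are paired C and NEON results for the same function, so a relative speedup can be read off as the C figure divided by the NEON figure. A minimal sketch, using the two w_mask_420_w4_8bpc rows from the table; the helper name `speedup` is hypothetical.

```python
def speedup(c_cycles: float, neon_cycles: float) -> float:
    """Relative speedup of the NEON version over the C version."""
    return c_cycles / neon_cycles


# (C, NEON) figures for w_mask_420_w4_8bpc, copied from the table.
a73 = speedup(818, 79)        # a bit over 10x
a53 = speedup(1082.9, 126.6)  # roughly 8.5x

print(f"A73: {a73:.2f}x, A53: {a53:.2f}x")
```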
      3d94fb9a
  11. 14 Aug, 2019 1 commit
    • B Krishnan Iyer's avatar
      arm64: mc: NEON implementation of blend, blend_h and blend_v function · 1dc2dc7d
      B Krishnan Iyer authored
                                   A73      A53
      blend_h_w2_8bpc_c:         184.7    301.5
      blend_h_w2_8bpc_neon:       58.8    104.1
      blend_h_w4_8bpc_c:         291.4    507.3
      blend_h_w4_8bpc_neon:       48.7    108.9
      blend_h_w8_8bpc_c:         510.1    992.7
      blend_h_w8_8bpc_neon:       66.5     99.3
      blend_h_w16_8bpc_c:          972   1835.3
      blend_h_w16_8bpc_neon:      82.7    145.2
      blend_h_w32_8bpc_c:        776.7    912.9
      blend_h_w32_8bpc_neon:     155.1    266.9
      blend_h_w64_8bpc_c:       1424.3   1635.4
      blend_h_w64_8bpc_neon:     273.4    480.9
      blend_h_w128_8bpc_c:      3318.1     3774
      blend_h_w128_8bpc_neon:    614.1   1097.9
      blend_v_w2_8bpc_c:         278.8    427.5
      blend_v_w2_8bpc_neon:      113.7    170.4
      blend_v_w4_8bpc_c:         960.2   1597.7
      blend_v_w4_8bpc_neon:      222.9    351.4
      blend_v_w8_8bpc_c:        1694.2   3333.5
      blend_v_w8_8bpc_neon:      200.9    333.6
      blend_v_w16_8bpc_c:       3115.2   5971.6
      blend_v_w16_8bpc_neon:     233.2    494.8
      blend_v_w32_8bpc_c:       3949.7   6070.6
      blend_v_w32_8bpc_neon:     460.4    841.6
      blend_w4_8bpc_c:           244.2    388.3
      blend_w4_8bpc_neon:         25.5     66.7
      blend_w8_8bpc_c:           616.3   1120.8
      blend_w8_8bpc_neon:           46    110.7
      blend_w16_8bpc_c:         2193.1   4056.4
      blend_w16_8bpc_neon:       140.7    299.3
      blend_w32_8bpc_c:         2502.8   2998.5
      blend_w32_8bpc_neon:       381.4    725.3
      1dc2dc7d
  12. 19 May, 2019 1 commit
    • Martin Storsjö's avatar
      arm: mc: Fix 8tap_v w8 with OBMC 3/4 heights · bf920fba
      Martin Storsjö authored
      Also make sure that the w4 case can exit after processing 12 pixels,
      where it is convenient.
      
      This gives a small slowdown for in-order cores like A7, A8, A53, but
      actually seems to give a small speedup for out-of-order cores like
      A9, A72 and A73.
      
      AArch64:
      Before:                      Cortex A53     A72     A73
      mc_8tap_regular_w8_v_8bpc_neon:   223.8   247.3   228.5
      After:
      mc_8tap_regular_w8_v_8bpc_neon:   232.5   243.9   223.4
      
      AArch32:
      Before:                       Cortex A7      A8      A9     A53     A72     A73
      mc_8tap_regular_w8_v_8bpc_neon:   550.2   470.7   520.5   257.0   256.4   248.2
      After:
      mc_8tap_regular_w8_v_8bpc_neon:   554.3   474.2   511.6   267.5   252.6   246.8
      bf920fba
  13. 09 May, 2019 1 commit
  14. 08 May, 2019 1 commit
  15. 08 Apr, 2019 1 commit
  16. 07 Apr, 2019 1 commit
    • Martin Storsjö's avatar
      arm: Fix typos in comments · 556780b7
      Martin Storsjö authored
      The width register has been set to clz(w)-24, not the other way
      around. And the 32 bit prep function has got the h parameter in
      r4, not in r5.
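
The clz(w)-24 expression named in this commit maps each power-of-two width to a small index. A minimal sketch of that arithmetic, assuming a 32-bit count-leading-zeros and widths from 2 to 128; the function names are hypothetical, and the idea that the index then selects a per-width code path is an assumption, not something this message states.

```python
def clz32(x: int) -> int:
    """Count leading zero bits of a non-zero 32-bit unsigned value."""
    assert 0 < x < (1 << 32)
    return 32 - x.bit_length()


def width_index(w: int) -> int:
    """Index derived from a power-of-two width as clz(w) - 24."""
    return clz32(w) - 24


# The widest block gets the smallest index: 128 maps to 0, 2 maps to 6.
indices = {w: width_index(w) for w in (128, 64, 32, 16, 8, 4, 2)}
```

The constant 24 is clz(128) for a 32-bit value, so the largest width lands on index 0 and each halving of the width adds one.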
      556780b7
  17. 13 Feb, 2019 2 commits
  18. 24 Jan, 2019 6 commits
  19. 18 Nov, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: mc: Implement 8tap and bilin functions · 4aa0363a
      Martin Storsjö authored
      These functions have been tuned against Cortex A53 and Snapdragon
      835. The bilin functions have mainly been written with code size
      in mind, as they aren't used much in practice.
      
      Relative speedups for the actual filtering functions (that don't
      just do a plain copy) are around 4-15x, some over 20x. This is
      in comparison with GCC 5.4 with autovectorization disabled; the
      actual real-world speedup against autovectorized C code is around
      4-10x.
      
      Relative speedups measured with checkasm:
                                      Cortex A53   Snapdragon 835
      mc_8tap_regular_w2_0_8bpc_neon:       6.96   5.28
      mc_8tap_regular_w2_h_8bpc_neon:       5.16   4.35
      mc_8tap_regular_w2_hv_8bpc_neon:      5.37   4.98
      mc_8tap_regular_w2_v_8bpc_neon:       6.35   4.85
      mc_8tap_regular_w4_0_8bpc_neon:       6.78   5.73
      mc_8tap_regular_w4_h_8bpc_neon:       8.40   6.60
      mc_8tap_regular_w4_hv_8bpc_neon:      7.23   7.10
      mc_8tap_regular_w4_v_8bpc_neon:       9.06   7.76
      mc_8tap_regular_w8_0_8bpc_neon:       6.96   5.55
      mc_8tap_regular_w8_h_8bpc_neon:      10.36   6.88
      mc_8tap_regular_w8_hv_8bpc_neon:      9.49   6.86
      mc_8tap_regular_w8_v_8bpc_neon:      12.06   9.61
      mc_8tap_regular_w16_0_8bpc_neon:      6.68   4.51
      mc_8tap_regular_w16_h_8bpc_neon:     12.30   7.77
      mc_8tap_regular_w16_hv_8bpc_neon:     9.50   6.68
      mc_8tap_regular_w16_v_8bpc_neon:     12.93   9.68
      mc_8tap_regular_w32_0_8bpc_neon:      3.91   2.93
      mc_8tap_regular_w32_h_8bpc_neon:     13.06   7.89
      mc_8tap_regular_w32_hv_8bpc_neon:     9.37   6.70
      mc_8tap_regular_w32_v_8bpc_neon:     12.88   9.49
      mc_8tap_regular_w64_0_8bpc_neon:      2.89   1.68
      mc_8tap_regular_w64_h_8bpc_neon:     13.48   8.00
      mc_8tap_regular_w64_hv_8bpc_neon:     9.23   6.53
      mc_8tap_regular_w64_v_8bpc_neon:     13.11   9.68
      mc_8tap_regular_w128_0_8bpc_neon:     1.89   1.24
      mc_8tap_regular_w128_h_8bpc_neon:    13.58   7.98
      mc_8tap_regular_w128_hv_8bpc_neon:    8.86   6.53
      mc_8tap_regular_w128_v_8bpc_neon:    12.46   9.63
      mc_bilinear_w2_0_8bpc_neon:           7.02   5.40
      mc_bilinear_w2_h_8bpc_neon:           3.65   3.14
      mc_bilinear_w2_hv_8bpc_neon:          4.36   4.84
      mc_bilinear_w2_v_8bpc_neon:           5.22   4.28
      mc_bilinear_w4_0_8bpc_neon:           6.87   5.99
      mc_bilinear_w4_h_8bpc_neon:           6.50   8.61
      mc_bilinear_w4_hv_8bpc_neon:          7.70   7.99
      mc_bilinear_w4_v_8bpc_neon:           7.04   9.10
      mc_bilinear_w8_0_8bpc_neon:           7.03   5.70
      mc_bilinear_w8_h_8bpc_neon:          11.30  15.14
      mc_bilinear_w8_hv_8bpc_neon:         15.74  13.50
      mc_bilinear_w8_v_8bpc_neon:          13.40  17.54
      mc_bilinear_w16_0_8bpc_neon:          6.75   4.48
      mc_bilinear_w16_h_8bpc_neon:         17.02  13.95
      mc_bilinear_w16_hv_8bpc_neon:        17.37  13.78
      mc_bilinear_w16_v_8bpc_neon:         23.69  22.98
      mc_bilinear_w32_0_8bpc_neon:          3.88   3.18
      mc_bilinear_w32_h_8bpc_neon:         18.80  14.97
      mc_bilinear_w32_hv_8bpc_neon:        17.74  14.02
      mc_bilinear_w32_v_8bpc_neon:         24.46  23.04
      mc_bilinear_w64_0_8bpc_neon:          2.87   1.66
      mc_bilinear_w64_h_8bpc_neon:         19.54  16.02
      mc_bilinear_w64_hv_8bpc_neon:        17.80  14.32
      mc_bilinear_w64_v_8bpc_neon:         24.79  23.63
      mc_bilinear_w128_0_8bpc_neon:         2.13   1.23
      mc_bilinear_w128_h_8bpc_neon:        19.89  16.24
      mc_bilinear_w128_hv_8bpc_neon:       17.55  14.15
      mc_bilinear_w128_v_8bpc_neon:        24.45  23.54
      mct_8tap_regular_w4_0_8bpc_neon:      5.56   5.51
      mct_8tap_regular_w4_h_8bpc_neon:      7.48   5.80
      mct_8tap_regular_w4_hv_8bpc_neon:     7.27   7.09
      mct_8tap_regular_w4_v_8bpc_neon:      7.80   6.84
      mct_8tap_regular_w8_0_8bpc_neon:      9.54   9.25
      mct_8tap_regular_w8_h_8bpc_neon:      9.08   6.55
      mct_8tap_regular_w8_hv_8bpc_neon:     9.16   6.30
      mct_8tap_regular_w8_v_8bpc_neon:     10.79   8.66
      mct_8tap_regular_w16_0_8bpc_neon:    15.35  10.50
      mct_8tap_regular_w16_h_8bpc_neon:    10.18   6.76
      mct_8tap_regular_w16_hv_8bpc_neon:    9.17   6.11
      mct_8tap_regular_w16_v_8bpc_neon:    11.52   8.72
      mct_8tap_regular_w32_0_8bpc_neon:    15.82  10.09
      mct_8tap_regular_w32_h_8bpc_neon:    10.75   6.85
      mct_8tap_regular_w32_hv_8bpc_neon:    9.00   6.22
      mct_8tap_regular_w32_v_8bpc_neon:    11.58   8.67
      mct_8tap_regular_w64_0_8bpc_neon:    15.28   9.68
      mct_8tap_regular_w64_h_8bpc_neon:    10.93   6.96
      mct_8tap_regular_w64_hv_8bpc_neon:    8.81   6.53
      mct_8tap_regular_w64_v_8bpc_neon:    11.42   8.73
      mct_8tap_regular_w128_0_8bpc_neon:   14.41   7.67
      mct_8tap_regular_w128_h_8bpc_neon:   10.92   6.96
      mct_8tap_regular_w128_hv_8bpc_neon:   8.56   6.51
      mct_8tap_regular_w128_v_8bpc_neon:   11.16   8.70
      mct_bilinear_w4_0_8bpc_neon:          5.66   5.77
      mct_bilinear_w4_h_8bpc_neon:          5.16   6.40
      mct_bilinear_w4_hv_8bpc_neon:         6.86   6.82
      mct_bilinear_w4_v_8bpc_neon:          4.75   6.09
      mct_bilinear_w8_0_8bpc_neon:          9.78  10.00
      mct_bilinear_w8_h_8bpc_neon:          8.98  11.37
      mct_bilinear_w8_hv_8bpc_neon:        14.42  10.83
      mct_bilinear_w8_v_8bpc_neon:          9.12  11.62
      mct_bilinear_w16_0_8bpc_neon:        15.59  10.76
      mct_bilinear_w16_h_8bpc_neon:        11.98   8.77
      mct_bilinear_w16_hv_8bpc_neon:       15.83  10.73
      mct_bilinear_w16_v_8bpc_neon:        14.70  14.60
      mct_bilinear_w32_0_8bpc_neon:        15.89  10.32
      mct_bilinear_w32_h_8bpc_neon:        13.47   9.07
      mct_bilinear_w32_hv_8bpc_neon:       16.01  10.95
      mct_bilinear_w32_v_8bpc_neon:        14.85  14.16
      mct_bilinear_w64_0_8bpc_neon:        15.36  10.51
      mct_bilinear_w64_h_8bpc_neon:        14.00   9.61
      mct_bilinear_w64_hv_8bpc_neon:       15.82  11.27
      mct_bilinear_w64_v_8bpc_neon:        14.61  14.76
      mct_bilinear_w128_0_8bpc_neon:       14.41   7.92
      mct_bilinear_w128_h_8bpc_neon:       13.31   9.58
      mct_bilinear_w128_hv_8bpc_neon:      14.07  11.18
      mct_bilinear_w128_v_8bpc_neon:       11.57  14.42
      4aa0363a
  20. 25 Oct, 2018 2 commits
  21. 23 Oct, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: mc: Make the jump tables local symbols · 0bb53898
      Martin Storsjö authored
      For MachO, this makes sure that the label difference actually is
      evaluated at assembly time (as it already was for ELF and COFF);
      evaluating it at link time failed when the difference is stored in
      a .hword.
      
      This fixes linking errors like these:
      ld: in section __TEXT,__text reloc 0: ARM64_RELOC_SUBTRACTOR must have r_length of 2 or 3 file 'src/src@@dav1d_bitdepth_8@sta/arm_64_mc.S.o' for architecture arm64
      
      This adds an asm.S macro that decorates a symbol to make it a
      local symbol. For armasm64 with gas-preprocessor, this doesn't
      actually create a local label (but neither do the local numbered
      labels currently), which might be slightly inconsistent
      if it would be necessary to make the distinction for that assembler
      as well.
      
      Alternatively, the table symbol could be made into a plain local
      numbered label as all the other labels.
      0bb53898
  22. 21 Oct, 2018 1 commit
    • Martin Storsjö's avatar
      arm64: Don't use uxth for extending a register · 91e0b478
      Martin Storsjö authored
      armasm64 fails to assemble this:
      error A2173: syntax error in expression
              sub             x7,  x7,  w4, uxth
      
      This clearly is a bug in armasm64, and will be reported. For now,
      this workaround should be harmless though, as we've just loaded
      the register with ldrh, so the upper parts of the register should
      be zeroed.
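
The reasoning above (a value loaded with ldrh already has its upper bits zeroed, so the uxth extension adds nothing) can be checked with a small register model. This is a sketch under assumptions: the message does not show the replacement instruction, so the plain full-register subtraction below is only one plausible form, and the table entry and base value are hypothetical.

```python
MASK64 = (1 << 64) - 1


def ldrh(memory: bytes, offset: int) -> int:
    """Model a 16-bit load: the halfword is zero-extended into the register."""
    return int.from_bytes(memory[offset:offset + 2], "little")


def sub_uxth(xn: int, wm: int) -> int:
    """Model `sub xd, xn, wm, uxth`: subtract the zero-extended low 16 bits."""
    return (xn - (wm & 0xFFFF)) & MASK64


def sub_plain(xn: int, xm: int) -> int:
    """Model a plain 64-bit register subtraction (assumed replacement)."""
    return (xn - xm) & MASK64


table = (0x0123).to_bytes(2, "little")  # hypothetical 16-bit table entry
x4 = ldrh(table, 0)                     # upper bits are zero after the load
x7 = 0x4000                             # hypothetical base value

same_after_ldrh = sub_uxth(x7, x4) == sub_plain(x7, x4)
differs_with_garbage = sub_uxth(x7, 0x10123) != sub_plain(x7, 0x10123)
```

The two forms only diverge when the source register carries non-zero bits above bit 15, which cannot happen directly after a halfword load.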
      91e0b478
  23. 20 Oct, 2018 1 commit
    • Janne Grunau's avatar
      arm64/mc: add 8-bit neon asm for avg, w_avg and mask · 80e47425
      Janne Grunau authored
      checkasm --bench on a Qualcomm Kryo (Snapdragon 820):
      nop: 33.0
      avg_w4_8bpc_c: 450.5
      avg_w4_8bpc_neon: 20.1
      avg_w8_8bpc_c: 438.6
      avg_w8_8bpc_neon: 45.2
      avg_w16_8bpc_c: 1003.7
      avg_w16_8bpc_neon: 112.8
      avg_w32_8bpc_c: 3249.6
      avg_w32_8bpc_neon: 429.9
      avg_w64_8bpc_c: 7213.3
      avg_w64_8bpc_neon: 1299.4
      avg_w128_8bpc_c: 16791.3
      avg_w128_8bpc_neon: 2978.4
      w_avg_w4_8bpc_c: 605.7
      w_avg_w4_8bpc_neon: 30.9
      w_avg_w8_8bpc_c: 545.8
      w_avg_w8_8bpc_neon: 72.9
      w_avg_w16_8bpc_c: 1430.1
      w_avg_w16_8bpc_neon: 193.5
      w_avg_w32_8bpc_c: 4876.3
      w_avg_w32_8bpc_neon: 715.3
      w_avg_w64_8bpc_c: 11338.0
      w_avg_w64_8bpc_neon: 2147.0
      w_avg_w128_8bpc_c: 26822.0
      w_avg_w128_8bpc_neon: 4596.3
      mask_w4_8bpc_c: 604.6
      mask_w4_8bpc_neon: 37.2
      mask_w8_8bpc_c: 654.8
      mask_w8_8bpc_neon: 96.0
      mask_w16_8bpc_c: 1663.0
      mask_w16_8bpc_neon: 272.4
      mask_w32_8bpc_c: 5707.6
      mask_w32_8bpc_neon: 1028.9
      mask_w64_8bpc_c: 12735.3
      mask_w64_8bpc_neon: 2533.2
      mask_w128_8bpc_c: 31027.6
      mask_w128_8bpc_neon: 6247.2
      80e47425