arm32: mc: 16 bpc blend, w_mask, emu_edge

Merged Martin Storsjö requested to merge mstorsjo/dav1d:arm32-mc16 into master
  1. Dec 16, 2020
    • arm32: mc: Add NEON implementation of emu_edge for 16 bpc · 38df0efa
      Martin Storsjö authored
      Checkasm benchmarks:    Cortex  A7       A8      A53      A72     A73
      emu_edge_w4_16bpc_neon:      375.0    312.6    268.3    159.3   170.0
      emu_edge_w8_16bpc_neon:      619.3    425.5    435.5    249.5   291.1
      emu_edge_w16_16bpc_neon:     719.1    568.3    506.9    324.2   314.4
      emu_edge_w32_16bpc_neon:    2112.2   1677.7   1396.2   1050.5  1009.6
      emu_edge_w64_16bpc_neon:    5046.8   4322.5   3693.7   3953.8  2682.8
      emu_edge_w128_16bpc_neon:  16311.1  14341.3  12877.8  26183.5  8924.9
      
      Corresponding numbers for arm64, for comparison:
                                               Cortex A53      A72      A73
      emu_edge_w4_16bpc_neon:                       302.5    174.9    159.2
      emu_edge_w8_16bpc_neon:                       344.6    292.3    273.2
      emu_edge_w16_16bpc_neon:                      601.0    461.2    316.8
      emu_edge_w32_16bpc_neon:                      974.2   1274.7    960.5
      emu_edge_w64_16bpc_neon:                     2853.1   3527.6   2633.5
      emu_edge_w128_16bpc_neon:                   14633.5  26776.6   7236.0
    • arm32: mc: Add NEON implementations of the w_mask functions for 16 bpc · cf74bdec
      Martin Storsjö authored
      Checkasm numbers:           Cortex A7        A8       A53       A72       A73
      w_mask_420_w4_16bpc_neon:       350.3     216.4     215.4     141.7     134.5
      w_mask_420_w8_16bpc_neon:       926.7     590.9     529.1     373.8     354.5
      w_mask_420_w16_16bpc_neon:     2956.7    1880.4    1654.8    1186.1    1134.1
      w_mask_420_w32_16bpc_neon:    11489.3    7426.4    6314.1    4599.8    4398.6
      w_mask_420_w64_16bpc_neon:    28175.9   17898.1   16002.8   11079.0   10551.8
      w_mask_420_w128_16bpc_neon:   71599.4   44630.9   40696.9   28057.3   27836.5
      w_mask_422_w4_16bpc_neon:       339.0     210.1     206.7     137.3     134.7
      w_mask_422_w8_16bpc_neon:       887.2     573.3     499.6     361.6     353.5
      w_mask_422_w16_16bpc_neon:     2918.0    1841.6    1593.0    1194.0    1157.9
      w_mask_422_w32_16bpc_neon:    11313.8    7238.7    6043.4    4577.1    4469.6
      w_mask_422_w64_16bpc_neon:    27746.5   17427.2   15386.9   11082.6   10693.8
      w_mask_422_w128_16bpc_neon:   70521.4   43864.9   39209.3   29045.7   28305.5
      w_mask_444_w4_16bpc_neon:       325.6     202.9     198.4     135.2     129.3
      w_mask_444_w8_16bpc_neon:       860.7     534.9     474.8     358.0     352.2
      w_mask_444_w16_16bpc_neon:     2764.3    1714.4    1517.8    1160.6    1133.1
      w_mask_444_w32_16bpc_neon:    10719.8    6738.3    5746.7    4458.6    4347.1
      w_mask_444_w64_16bpc_neon:    26407.9   16224.1   14783.9   10784.3   10371.4
      w_mask_444_w128_16bpc_neon:   67226.1   41060.1   37823.1   41696.1   27722.2
      
      Corresponding numbers for arm64, for comparison:
                                                     Cortex A53       A72       A73
      w_mask_420_w4_16bpc_neon:                           173.6     123.6     120.3
      w_mask_420_w8_16bpc_neon:                           484.0     344.0     329.4
      w_mask_420_w16_16bpc_neon:                         1436.3    1025.7    1028.7
      w_mask_420_w32_16bpc_neon:                         5597.0    3994.8    3981.2
      w_mask_420_w64_16bpc_neon:                        13953.4    9700.8    9579.9
      w_mask_420_w128_16bpc_neon:                       35833.7   25519.3   24277.8
      w_mask_422_w4_16bpc_neon:                           159.4     111.7     114.2
      w_mask_422_w8_16bpc_neon:                           453.4     326.2     326.7
      w_mask_422_w16_16bpc_neon:                         1398.2    1063.3    1052.6
      w_mask_422_w32_16bpc_neon:                         5532.7    4143.0    4026.3
      w_mask_422_w64_16bpc_neon:                        13885.3    9978.0    9689.8
      w_mask_422_w128_16bpc_neon:                       35763.3   25822.4   24610.9
      w_mask_444_w4_16bpc_neon:                           152.9     110.0     112.8
      w_mask_444_w8_16bpc_neon:                           437.2     332.0     325.8
      w_mask_444_w16_16bpc_neon:                         1399.3    1068.9    1041.7
      w_mask_444_w32_16bpc_neon:                         5410.9    4139.7    4136.9
      w_mask_444_w64_16bpc_neon:                        13648.7   10011.8   10004.6
      w_mask_444_w128_16bpc_neon:                       35639.6   26910.8   25631.0
    • arm32: mc: Add NEON implementation of the blend functions for 16 bpc · f809edb4
      Martin Storsjö authored
      Checkasm numbers:      Cortex A7      A8     A53     A72     A73
      blend_h_w2_16bpc_neon:     190.0   163.0   135.5    67.4    71.2
      blend_h_w4_16bpc_neon:     204.4   119.1   140.3    61.2    74.9
      blend_h_w8_16bpc_neon:     247.6   126.2   159.5    86.1    88.4
      blend_h_w16_16bpc_neon:    391.6   186.5   230.7   134.9   149.4
      blend_h_w32_16bpc_neon:    734.9   354.2   454.1   248.1   270.9
      blend_h_w64_16bpc_neon:   1290.8   611.7   801.1   456.6   491.3
      blend_h_w128_16bpc_neon:  2876.4  1354.2  1788.6  1083.4  1092.0
      blend_v_w2_16bpc_neon:     264.4   325.2   206.8   107.6   123.0
      blend_v_w4_16bpc_neon:     471.8   358.7   356.9   187.0   229.9
      blend_v_w8_16bpc_neon:     616.9   365.3   445.4   218.2   248.5
      blend_v_w16_16bpc_neon:    928.3   517.1   629.1   325.0   358.0
      blend_v_w32_16bpc_neon:   1771.6   790.1  1106.1   631.2   584.7
      blend_w4_16bpc_neon:       128.8    66.6    95.5    33.5    42.0
      blend_w8_16bpc_neon:       238.7   118.0   156.8    76.5    84.5
      blend_w16_16bpc_neon:      809.7   360.9   482.3   268.5   298.3
      blend_w32_16bpc_neon:     2015.7   916.6  1177.0   682.1   730.9
      
      Corresponding numbers for arm64, for comparison:
                                            Cortex A53     A72     A73
      blend_h_w2_16bpc_neon:                     109.3    83.1    56.8
      blend_h_w4_16bpc_neon:                     114.1    61.1    62.3
      blend_h_w8_16bpc_neon:                     133.3    80.8    81.0
      blend_h_w16_16bpc_neon:                    215.6   132.7   149.5
      blend_h_w32_16bpc_neon:                    390.4   253.9   235.8
      blend_h_w64_16bpc_neon:                    715.8   455.8   454.0
      blend_h_w128_16bpc_neon:                  1649.7  1034.7  1066.2
      blend_v_w2_16bpc_neon:                     185.9   176.3   178.3
      blend_v_w4_16bpc_neon:                     338.3   184.4   234.3
      blend_v_w8_16bpc_neon:                     427.0   214.5   252.7
      blend_v_w16_16bpc_neon:                    680.4   358.1   389.2
      blend_v_w32_16bpc_neon:                   1100.7   615.5   690.1
      blend_w4_16bpc_neon:                        76.0    32.3    32.1
      blend_w8_16bpc_neon:                       134.4    76.3    71.5
      blend_w16_16bpc_neon:                      476.3   268.8   301.5
      blend_w32_16bpc_neon:                     1226.8   659.9   782.8
    • 9257a961
    • arm32: mc: Use a replicating vld1 to all lanes in one place · 85de1c3b
      Martin Storsjö authored
      On some (old) cores, this is one cycle faster when the other lanes
      don't need to be preserved.
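
      To illustrate the difference between the two load forms, here is a
      minimal scalar sketch in C. It is a model, not code from this merge
      request: the `d_reg` type and both function names are hypothetical, and
      it assumes the instruction being replaced was a single-lane `vld1`.
      A single-lane load has to merge the new element into the register's
      previous contents, while the replicating form overwrites every lane.
      That dependency on the old value is plausibly what costs the extra
      cycle on older cores.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit NEON d register as four 16-bit lanes. */
typedef struct { uint16_t lane[4]; } d_reg;

/* Models a single-lane load such as "vld1.16 {d0[n]}, [r0]":
 * one lane is written, the other three keep their old values,
 * so the result depends on the previous register contents. */
static d_reg load_one_lane(d_reg old, const uint16_t *src, int n)
{
    assert(n >= 0 && n < 4);
    old.lane[n] = *src;
    return old;
}

/* Models a replicating load such as "vld1.16 {d0[]}, [r0]":
 * the loaded element is copied to every lane, so the previous
 * register contents are not needed at all. */
static d_reg load_all_lanes(const uint16_t *src)
{
    d_reg r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = *src;
    return r;
}
```

      When only lane 0 is consumed afterwards, both forms give the same
      usable result, which is the case where the replicating form can be
      swapped in.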