1. 01 Jan, 2021 1 commit
  2. 16 Dec, 2020 11 commits
    • arm32: mc: Add NEON implementation of emu_edge for 16 bpc · 38df0efa
      Martin Storsjö authored
      Checkasm benchmarks:    Cortex  A7       A8      A53      A72     A73
      emu_edge_w4_16bpc_neon:      375.0    312.6    268.3    159.3   170.0
      emu_edge_w8_16bpc_neon:      619.3    425.5    435.5    249.5   291.1
      emu_edge_w16_16bpc_neon:     719.1    568.3    506.9    324.2   314.4
      emu_edge_w32_16bpc_neon:    2112.2   1677.7   1396.2   1050.5  1009.6
      emu_edge_w64_16bpc_neon:    5046.8   4322.5   3693.7   3953.8  2682.8
      emu_edge_w128_16bpc_neon:  16311.1  14341.3  12877.8  26183.5  8924.9
      
      Corresponding numbers for arm64, for comparison:
                                               Cortex A53      A72      A73
      emu_edge_w4_16bpc_neon:                       302.5    174.9    159.2
      emu_edge_w8_16bpc_neon:                       344.6    292.3    273.2
      emu_edge_w16_16bpc_neon:                      601.0    461.2    316.8
      emu_edge_w32_16bpc_neon:                      974.2   1274.7    960.5
      emu_edge_w64_16bpc_neon:                     2853.1   3527.6   2633.5
      emu_edge_w128_16bpc_neon:                   14633.5  26776.6   7236.0
    • arm32: mc: Add NEON implementations of the w_mask functions for 16 bpc · cf74bdec
      Martin Storsjö authored
      Checkasm numbers:           Cortex A7        A8       A53       A72       A73
      w_mask_420_w4_16bpc_neon:       350.3     216.4     215.4     141.7     134.5
      w_mask_420_w8_16bpc_neon:       926.7     590.9     529.1     373.8     354.5
      w_mask_420_w16_16bpc_neon:     2956.7    1880.4    1654.8    1186.1    1134.1
      w_mask_420_w32_16bpc_neon:    11489.3    7426.4    6314.1    4599.8    4398.6
      w_mask_420_w64_16bpc_neon:    28175.9   17898.1   16002.8   11079.0   10551.8
      w_mask_420_w128_16bpc_neon:   71599.4   44630.9   40696.9   28057.3   27836.5
      w_mask_422_w4_16bpc_neon:       339.0     210.1     206.7     137.3     134.7
      w_mask_422_w8_16bpc_neon:       887.2     573.3     499.6     361.6     353.5
      w_mask_422_w16_16bpc_neon:     2918.0    1841.6    1593.0    1194.0    1157.9
      w_mask_422_w32_16bpc_neon:    11313.8    7238.7    6043.4    4577.1    4469.6
      w_mask_422_w64_16bpc_neon:    27746.5   17427.2   15386.9   11082.6   10693.8
      w_mask_422_w128_16bpc_neon:   70521.4   43864.9   39209.3   29045.7   28305.5
      w_mask_444_w4_16bpc_neon:       325.6     202.9     198.4     135.2     129.3
      w_mask_444_w8_16bpc_neon:       860.7     534.9     474.8     358.0     352.2
      w_mask_444_w16_16bpc_neon:     2764.3    1714.4    1517.8    1160.6    1133.1
      w_mask_444_w32_16bpc_neon:    10719.8    6738.3    5746.7    4458.6    4347.1
      w_mask_444_w64_16bpc_neon:    26407.9   16224.1   14783.9   10784.3   10371.4
      w_mask_444_w128_16bpc_neon:   67226.1   41060.1   37823.1   41696.1   27722.2
      
      Corresponding numbers for arm64, for comparison:
                                                     Cortex A53       A72       A73
      w_mask_420_w4_16bpc_neon:                           173.6     123.6     120.3
      w_mask_420_w8_16bpc_neon:                           484.0     344.0     329.4
      w_mask_420_w16_16bpc_neon:                         1436.3    1025.7    1028.7
      w_mask_420_w32_16bpc_neon:                         5597.0    3994.8    3981.2
      w_mask_420_w64_16bpc_neon:                        13953.4    9700.8    9579.9
      w_mask_420_w128_16bpc_neon:                       35833.7   25519.3   24277.8
      w_mask_422_w4_16bpc_neon:                           159.4     111.7     114.2
      w_mask_422_w8_16bpc_neon:                           453.4     326.2     326.7
      w_mask_422_w16_16bpc_neon:                         1398.2    1063.3    1052.6
      w_mask_422_w32_16bpc_neon:                         5532.7    4143.0    4026.3
      w_mask_422_w64_16bpc_neon:                        13885.3    9978.0    9689.8
      w_mask_422_w128_16bpc_neon:                       35763.3   25822.4   24610.9
      w_mask_444_w4_16bpc_neon:                           152.9     110.0     112.8
      w_mask_444_w8_16bpc_neon:                           437.2     332.0     325.8
      w_mask_444_w16_16bpc_neon:                         1399.3    1068.9    1041.7
      w_mask_444_w32_16bpc_neon:                         5410.9    4139.7    4136.9
      w_mask_444_w64_16bpc_neon:                        13648.7   10011.8   10004.6
      w_mask_444_w128_16bpc_neon:                       35639.6   26910.8   25631.0
    • arm32: mc: Add NEON implementation of the blend functions for 16 bpc · f809edb4
      Martin Storsjö authored
      Checkasm numbers:      Cortex A7      A8     A53     A72     A73
      blend_h_w2_16bpc_neon:     190.0   163.0   135.5    67.4    71.2
      blend_h_w4_16bpc_neon:     204.4   119.1   140.3    61.2    74.9
      blend_h_w8_16bpc_neon:     247.6   126.2   159.5    86.1    88.4
      blend_h_w16_16bpc_neon:    391.6   186.5   230.7   134.9   149.4
      blend_h_w32_16bpc_neon:    734.9   354.2   454.1   248.1   270.9
      blend_h_w64_16bpc_neon:   1290.8   611.7   801.1   456.6   491.3
      blend_h_w128_16bpc_neon:  2876.4  1354.2  1788.6  1083.4  1092.0
      blend_v_w2_16bpc_neon:     264.4   325.2   206.8   107.6   123.0
      blend_v_w4_16bpc_neon:     471.8   358.7   356.9   187.0   229.9
      blend_v_w8_16bpc_neon:     616.9   365.3   445.4   218.2   248.5
      blend_v_w16_16bpc_neon:    928.3   517.1   629.1   325.0   358.0
      blend_v_w32_16bpc_neon:   1771.6   790.1  1106.1   631.2   584.7
      blend_w4_16bpc_neon:       128.8    66.6    95.5    33.5    42.0
      blend_w8_16bpc_neon:       238.7   118.0   156.8    76.5    84.5
      blend_w16_16bpc_neon:      809.7   360.9   482.3   268.5   298.3
      blend_w32_16bpc_neon:     2015.7   916.6  1177.0   682.1   730.9
      
      Corresponding numbers for arm64, for comparison:
                                            Cortex A53     A72     A73
      blend_h_w2_16bpc_neon:                     109.3    83.1    56.8
      blend_h_w4_16bpc_neon:                     114.1    61.1    62.3
      blend_h_w8_16bpc_neon:                     133.3    80.8    81.0
      blend_h_w16_16bpc_neon:                    215.6   132.7   149.5
      blend_h_w32_16bpc_neon:                    390.4   253.9   235.8
      blend_h_w64_16bpc_neon:                    715.8   455.8   454.0
      blend_h_w128_16bpc_neon:                  1649.7  1034.7  1066.2
      blend_v_w2_16bpc_neon:                     185.9   176.3   178.3
      blend_v_w4_16bpc_neon:                     338.3   184.4   234.3
      blend_v_w8_16bpc_neon:                     427.0   214.5   252.7
      blend_v_w16_16bpc_neon:                    680.4   358.1   389.2
      blend_v_w32_16bpc_neon:                   1100.7   615.5   690.1
      blend_w4_16bpc_neon:                        76.0    32.3    32.1
      blend_w8_16bpc_neon:                       134.4    76.3    71.5
      blend_w16_16bpc_neon:                      476.3   268.8   301.5
      blend_w32_16bpc_neon:                     1226.8   659.9   782.8
    • eeb03a73
    • f3197c1a
    • arm32: mc: Improve scheduling in blend_h · 9257a961
      Martin Storsjö authored
    • arm32: mc: Use a replicating vld1 to all lanes in one place · 85de1c3b
      Martin Storsjö authored
      On some (old) cores, this is one cycle faster when the other lanes
      don't need to be preserved.
    • 9381637a
  3. 15 Dec, 2020 3 commits
  4. 12 Dec, 2020 9 commits
  5. 11 Dec, 2020 1 commit
  6. 10 Dec, 2020 1 commit
  7. 01 Dec, 2020 7 commits
    • arm32: looprestoration: NEON implementation of SGR for 10 bpc · e705519d
      Martin Storsjö authored
      Checkasm numbers:           Cortex A7         A8       A53       A72       A73
      selfguided_3x3_10bpc_neon:   919127.6   717942.8  565717.8  404748.0  372179.8
      selfguided_5x5_10bpc_neon:   640310.8   511873.4  370653.3  273593.7  256403.2
      selfguided_mix_10bpc_neon:  1533887.0  1252389.5  922111.1  659033.4  613410.6
      
      Corresponding numbers for arm64, for comparison:
      
                                                      Cortex A53       A72       A73
      selfguided_3x3_10bpc_neon:                        500706.0  367199.2  345261.2
      selfguided_5x5_10bpc_neon:                        361403.3  270550.0  249955.3
      selfguided_mix_10bpc_neon:                        846172.4  623590.3  578404.8
    • arm32: looprestoration: Prepare for 16 bpc by splitting code to separate files · e1be33b9
      Martin Storsjö authored
      looprestoration_common.S contains functions that can be used as is
      with one single instantiation of the functions for both 8 and 16 bpc.
      This file will be built once, regardless of which bitdepths are enabled.
      
      looprestoration_tmpl.S contains functions where the source can be shared
      and templated between 8 and 16 bpc. This will be included by the separate
      8/16bpc implementation files.
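
The common/template split described above can be illustrated in plain C. This is only a sketch of the general idea, not the assembly from the commit: the helper `clip_i32`, the macro `DEFINE_ADD_OFFSET`, and the functions `add_offset_8bpc` and `add_offset_16bpc` are hypothetical names invented for this example, and the 16 bpc instance assumes a 10-bit pixel range.

```c
#include <stddef.h>
#include <stdint.h>

/* "Common" part: independent of bitdepth, so a single instance is enough. */
static int32_t clip_i32(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* "Template" part: the same source, instantiated once per pixel type.
 * All names here are hypothetical and exist only for this sketch. */
#define DEFINE_ADD_OFFSET(name, pixel, maxval)                        \
    static void name(pixel *dst, size_t n, int32_t offset)            \
    {                                                                 \
        for (size_t i = 0; i < n; i++)                                \
            dst[i] = (pixel)clip_i32((int32_t)dst[i] + offset, 0,     \
                                     (maxval));                       \
    }

/* Two instantiations of the shared template. */
DEFINE_ADD_OFFSET(add_offset_8bpc, uint8_t, 255)
DEFINE_ADD_OFFSET(add_offset_16bpc, uint16_t, 1023) /* assumed 10-bit range */
```

Both instantiations call the same `clip_i32`, which mirrors how a common file is built once while a template is expanded per bitdepth.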
    • arm: looprestoration16: Fix comments referring to pixels as bytes · c58e9d57
      Martin Storsjö authored
      A number of other similar comments were updated to say pixels when
      the 16 bpc code was written originally, but these were missed.
    • arm64: looprestoration: Add a missed parameter in a comment · 25877c3b
      Martin Storsjö authored
      Make it consistent with the weighted1 function.
    • arm32: looprestoration: Remove an unnecessary stack arg load in SGR · ca9cd497
      Martin Storsjö authored
      For the existing 8 bpc support, there's no stack argument to load
      into r8.
    • arm64: looprestoration16: Don't keep precalculated squares in box3/5_h · cbd4827f
      Martin Storsjö authored
      Instead of calculating squares of pixels once, and shifting and
      adding the precalculated squares, just do multiply-accumulate of
      the pixels that are shifted anyway for the non-squared sum. This
      results in more multiplications in total, but fewer instructions,
      and multiplications aren't that much more expensive than regular
      arithmetic operations anyway.
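
The two strategies can be compared in a scalar C sketch. This is not the NEON code from the commit: the function names, the `sq_tmp` scratch buffer, and the 32-bit output types are assumptions made for illustration, and the box is simplified to a 3-pixel horizontal window.

```c
#include <stddef.h>
#include <stdint.h>

/* Before (hypothetical scalar analogue): square every pixel once, then
 * add up the neighbouring precalculated squares.
 * src must hold w + 2 pixels; sq_tmp must hold w + 2 entries. */
static void box3_h_presquared(const uint16_t *src, size_t w,
                              uint32_t *sq_tmp,
                              uint32_t *sumsq, uint32_t *sum)
{
    for (size_t x = 0; x < w + 2; x++)
        sq_tmp[x] = (uint32_t)src[x] * src[x];
    for (size_t x = 0; x < w; x++) {
        sum[x]   = (uint32_t)src[x] + src[x + 1] + src[x + 2];
        sumsq[x] = sq_tmp[x] + sq_tmp[x + 1] + sq_tmp[x + 2];
    }
}

/* After (hypothetical scalar analogue): the shifted pixels are needed for
 * the plain sum anyway, so multiply-accumulate them directly. This does
 * three multiplications per output instead of one per input pixel, but
 * needs no intermediate buffer of squares. */
static void box3_h_mac(const uint16_t *src, size_t w,
                       uint32_t *sumsq, uint32_t *sum)
{
    for (size_t x = 0; x < w; x++) {
        const uint32_t a = src[x], b = src[x + 1], c = src[x + 2];
        sum[x]   = a + b + c;
        sumsq[x] = a * a + b * b + c * c;
    }
}
```

Both variants produce identical sums. Whether the second is actually faster depends on the target core, as the benchmark numbers in this commit show.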
      
      On Cortex A53 and A72, this is a fairly substantial gain; on A73
      it's a very marginal gain.
      
      The runtimes for the box3/5_h functions themselves are reduced
      by around 16-20%, and the overall runtime for SGR is reduced
      by around 2-8%.
      
      Before:                   Cortex A53       A72       A73
      selfguided_3x3_10bpc_neon:  513086.5  385767.7  348774.3
      selfguided_5x5_10bpc_neon:  378108.6  291133.5  253251.4
      selfguided_mix_10bpc_neon:  876833.1  662801.0  586387.4
      
      After:                    Cortex A53       A72       A73
      selfguided_3x3_10bpc_neon:  502734.0  363754.5  343199.8
      selfguided_5x5_10bpc_neon:  361696.4  265848.2  249476.8
      selfguided_mix_10bpc_neon:  852683.8  615848.6  577615.0
  8. 28 Nov, 2020 2 commits
  9. 23 Nov, 2020 1 commit
  10. 22 Nov, 2020 1 commit
    • Add more buffer pools · 236e1122
      Henrik Gramner authored
      Add buffer pools for miscellaneous smaller buffers that are
      repeatedly being freed and reallocated.
      
      Also improve dav1d_ref_create() by consolidating two separate
      memory allocations into a single one.
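
Consolidating a header allocation and a payload allocation into one can be sketched as follows. This is a generic illustration, not the actual `dav1d_ref_create()` code: the `ExampleRef` struct, its fields, and the `example_ref_*` functions are hypothetical, and the sketch assumes C11 for `max_align_t` and `_Alignof`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference-counted buffer header (not dav1d's real struct). */
typedef struct ExampleRef {
    void *data;    /* points into the same allocation as the header */
    size_t size;   /* payload size in bytes */
    int ref_cnt;
} ExampleRef;

/* One malloc() covers both the header and the payload. The payload starts
 * at the header size rounded up to the strictest fundamental alignment. */
static ExampleRef *example_ref_create(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    const size_t hdr = (sizeof(ExampleRef) + align - 1) & ~(align - 1);
    if (size > SIZE_MAX - hdr)
        return NULL; /* hdr + size would overflow */

    uint8_t *mem = malloc(hdr + size);
    if (!mem)
        return NULL;

    ExampleRef *ref = (ExampleRef *)mem;
    ref->data = mem + hdr;
    ref->size = size;
    ref->ref_cnt = 1;
    return ref;
}

/* A single free() releases header and payload together. */
static void example_ref_free(ExampleRef *ref)
{
    free(ref);
}
```

A single allocation halves the number of allocator calls per buffer and keeps the header next to its data. A real implementation would likely need a stricter payload alignment for SIMD use than this sketch provides.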
  11. 20 Nov, 2020 3 commits
    • arm32: mc: NEON implementation of warp8x8 for 16 bpc · dc98fff8
      Martin Storsjö authored
      Checkasm benchmarks:
                          Cortex A7      A8     A53     A72     A73
      warp_8x8_16bpc_neon:   4062.6  2109.4  2462.0  1338.9  1391.1
      warp_8x8t_16bpc_neon:  3996.3  2102.4  2412.0  1273.8  1368.9
      
      Corresponding numbers for arm64, for comparison:
                                         Cortex A53     A72     A73
      warp_8x8_16bpc_neon:                   2037.0  1148.8  1222.0
      warp_8x8t_16bpc_neon:                  2008.0  1120.4  1200.9
    • arm32: cdef: Add NEON implementations of CDEF for 16 bpc · 018e64e7
      Martin Storsjö authored
      Use a shared template file for assembly functions that can be
      templated into 8 and 16 bpc forms, just like in the arm64 version.
      
      Checkasm benchmarks:
                                Cortex A7      A8     A53     A72     A73
      cdef_dir_16bpc_neon:          975.9   853.2   555.2   378.7   386.9
      cdef_filter_4x4_16bpc_neon:   746.9   521.7   481.2   333.0   340.8
      cdef_filter_4x8_16bpc_neon:  1300.0   885.5   816.3   582.7   599.5
      cdef_filter_8x8_16bpc_neon:  2282.5  1415.0  1417.6  1059.0  1076.3
      
      Corresponding numbers for arm64, for comparison:
                                               Cortex A53     A72     A73
      cdef_dir_16bpc_neon:                          418.0   306.7   310.7
      cdef_filter_4x4_16bpc_neon:                   453.4   282.9   297.4
      cdef_filter_4x8_16bpc_neon:                   807.5   514.2   533.8
      cdef_filter_8x8_16bpc_neon:                  1425.2   924.4   942.0