1. 14 Apr, 2021 2 commits
    • arm64: filmgrain: Share the prologue of the fgy function · 54ad561d
      Martin Storsjö authored
      This mirrors what was done for the fguv function, reducing the
      amount of space used for it and simplifying the calling code.
      
      This gives no significant slowdown for the case currently benchmarked
      by checkasm, while shrinking the code produced by film_grain.S by
      320 bytes.
    • arm64: filmgrain: Add NEON implementation of the fguv function · 90bcb331
      Martin Storsjö authored
      Relative speedup over C code:
                                    Cortex A53    A72    A73   Apple M1
      fguv_32x32xn_8bpc_420_csfl0_neon:   4.51   2.87   3.88   6.51
      fguv_32x32xn_8bpc_420_csfl1_neon:   3.74   2.96   2.96   3.49
      fguv_32x32xn_8bpc_422_csfl0_neon:   4.49   3.18   4.07   5.00
      fguv_32x32xn_8bpc_422_csfl1_neon:   3.74   3.03   3.04   2.67
      fguv_32x32xn_8bpc_444_csfl0_neon:   6.68   4.24   5.66   5.02
      fguv_32x32xn_8bpc_444_csfl1_neon:   5.40   3.69   4.22   3.61
  2. 15 Mar, 2021 1 commit
  3. 19 Feb, 2021 7 commits
  4. 12 Feb, 2021 1 commit
  5. 11 Feb, 2021 1 commit
    • Add minor SGR optimizations · c290c02e
      Henrik Gramner authored
      Split the 5x5, 3x3, and mix cases into separate functions.
      
      Shrink some tables.
      
      Move some scalar calculations out of the DSP function.
      
      Make Wiener and SGR share the same function prototype to
      eliminate a branch in lr_stripe().
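
The effect of a shared prototype can be illustrated with a small sketch. This is not dav1d code: the names `wiener`, `sgr`, `LR_FILTERS` and the `params` dictionary are hypothetical, and Python callables stand in for C function pointers. It only shows why a common signature lets the caller pick the filter once and then call it for every stripe without a per-call branch.

```python
def wiener(row, params):
    """Hypothetical stand-in for a Wiener filter: scale each sample."""
    return [v * params["gain"] for v in row]


def sgr(row, params):
    """Hypothetical stand-in for a self-guided filter: add an offset."""
    return [v + params["offset"] for v in row]


# Both filters take (row, params), so they fit in one lookup table.
LR_FILTERS = {"wiener": wiener, "sgr": sgr}


def restore_stripes(stripes, filter_type, params):
    """Select the filter once, then call it for each stripe without branching."""
    lr_fn = LR_FILTERS[filter_type]
    return [lr_fn(stripe, params) for stripe in stripes]
```

With differing prototypes, the loop body would instead need an `if` on the filter type to pass each function its own argument list.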
  6. 10 Feb, 2021 1 commit
    • arm64: itx16: Use usqadd to avoid separate clamping of negative values · 6f9f3391
      Martin Storsjö authored
      Before:                                Cortex A53     A72      A73
      inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       40.7    23.0     24.0
      inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0    71.5     78.2
      inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       85.7    50.7     53.8
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      287.0   203.5    215.2
      inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    255.7   129.1    140.4
      inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1401.4  1026.7   1039.2
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1913.2  1407.3   1479.6
      After:
      inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       38.7    21.5     22.2
      inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0    71.3     77.2
      inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       76.7    44.7     43.5
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      278.0   203.0    203.9
      inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    236.9   106.2    116.2
      inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1368.7   999.7   1008.4
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1880.5  1381.2   1459.4
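
The idea in the commit title can be modelled in scalar form. The sketch below is an assumption-laden illustration, not the assembly from the commit: `usqadd16` models a single 16-bit lane of the AArch64 USQADD instruction (unsigned accumulator plus signed addend, saturated to the unsigned range), and the helper names and the 10 bpc maximum of 1023 are chosen for the example.

```python
def usqadd16(acc, addend):
    """Model one 16-bit lane of USQADD: unsigned acc plus signed addend,
    saturated to the unsigned range [0, 0xFFFF]."""
    return max(0, min(0xFFFF, acc + addend))


def add_clamp_both(pixel, residual, pixel_max=1023):
    """Reference behaviour: plain add, then clamp at both ends."""
    return min(max(pixel + residual, 0), pixel_max)


def add_usqadd(pixel, residual, pixel_max=1023):
    """USQADD already floors the sum at 0, so only the upper clamp remains."""
    return min(usqadd16(pixel, residual), pixel_max)
```

For pixels in the 10 bpc range and residuals in the signed 16-bit range, both helpers return the same value, but the second needs one clamp fewer.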
  7. 09 Feb, 2021 2 commits
    • arm64: looprestoration: Rewrite the wiener functions · 2e73051c
      Martin Storsjö authored
      Make them operate in a more cache friendly manner, interleaving
      horizontal and vertical filtering (reducing the amount of stack
      used from 51 KB to 4 KB), similar to what was done for x86 in
      78d27b7d.
      
      This also adds separate 5tap versions of the filters and unrolls
      the vertical filter a bit more (which could probably have been
      done without the rewrite).
      
      This does, however, increase the compiled code size by around
      3.5 KB.
      
      Before:                Cortex A53       A72       A73
      wiener_5tap_8bpc_neon:   136855.6   91446.2   87363.6
      wiener_7tap_8bpc_neon:   136861.6   91454.9   87374.5
      wiener_5tap_10bpc_neon:  167685.3  114720.3  116522.1
      wiener_5tap_12bpc_neon:  167677.5  114724.7  116511.9
      wiener_7tap_10bpc_neon:  167681.6  114738.5  116567.0
      wiener_7tap_12bpc_neon:  167673.8  114720.8  116515.4
      After:
      wiener_5tap_8bpc_neon:    87102.1   60460.6   66803.8
      wiener_7tap_8bpc_neon:   110831.7   78489.0   82015.9
      wiener_5tap_10bpc_neon:  109999.2   90259.0   89238.0
      wiener_5tap_12bpc_neon:  109978.3   90255.7   89220.7
      wiener_7tap_10bpc_neon:  137877.6  107578.5  103435.6
      wiener_7tap_12bpc_neon:  137868.8  107568.9  103390.4
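
The interleaving described in this commit can be sketched in scalar form. This is a simplified illustration under stated assumptions, not the NEON implementation: the function names are hypothetical, edges are handled by clamping, and rounding and clipping are omitted. It shows that a separable filter only needs a rolling window of `len(taps)` horizontally filtered rows rather than a full intermediate image, which is the kind of change that shrinks the temporary buffer.

```python
from collections import deque


def hfilter_row(row, taps):
    """Horizontally filter one row, clamping coordinates at the edges."""
    n, r = len(row), len(taps) // 2
    return [sum(t * row[min(max(x + k - r, 0), n - 1)] for k, t in enumerate(taps))
            for x in range(n)]


def filter_two_pass(img, taps):
    """Filter all rows horizontally first, then vertically (full intermediate)."""
    h, w, r = len(img), len(img[0]), len(taps) // 2
    mid = [hfilter_row(row, taps) for row in img]
    return [[sum(t * mid[min(max(y + k - r, 0), h - 1)][x] for k, t in enumerate(taps))
             for x in range(w)] for y in range(h)]


def filter_interleaved(img, taps):
    """Keep only len(taps) filtered rows alive while producing output rows."""
    h, w, r = len(img), len(img[0]), len(taps) // 2
    window = deque(maxlen=len(taps))
    # Prime the window with the rows above the first output row.
    for k in range(-r, r):
        window.append(hfilter_row(img[min(max(k, 0), h - 1)], taps))
    out = []
    for y in range(h):
        # Add the newest row; the deque drops the oldest one automatically.
        window.append(hfilter_row(img[min(y + r, h - 1)], taps))
        out.append([sum(t * window[k][x] for k, t in enumerate(taps))
                    for x in range(w)])
    return out
```

Both functions produce identical output; the interleaved version trades a few repeated edge-row computations for a temporary buffer whose size no longer depends on the image height.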
    • arm64: mc: Improve first tap for inorder cores · 4e869495
      Kyle Siefring authored
      Change the order of the multiply-accumulates so that in-order
      cores can forward the results.
  8. 08 Feb, 2021 1 commit
    • arm32: mc: Optimize warp by doing horz filtering in 8 bit · 0477fcf1
      Martin Storsjö authored
      Additionally, reschedule the load instructions to reduce stalls
      on in-order cores.
      
      This applies the changes from a3b8157e to the arm32 version.
      
      Before:             Cortex A7      A8      A9     A53     A72     A73
      warp_8x8_8bpc_neon:    3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
      warp_8x8t_8bpc_neon:   3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
      warp_8x8_16bpc_neon:   4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
      warp_8x8t_16bpc_neon:  3973.9  2137.1  2299.6  2413.2  1282.8  1369.6
      After:
      warp_8x8_8bpc_neon:    2920.8  1269.8  1410.3  1767.3   860.2  1004.8
      warp_8x8t_8bpc_neon:   2904.9  1283.9  1397.5  1743.7   863.6  1024.7
      warp_8x8_16bpc_neon:   3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
      warp_8x8t_16bpc_neon:  3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
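
To read the Before/After tables as speedups, divide the old cycle count by the new one. The snippet below does this for the Cortex A7 column only; the dictionary and function names are illustrative, and the numbers are copied from the tables above.

```python
# Cortex A7 cycle counts copied from the Before/After tables.
before = {
    "warp_8x8_8bpc": 3659.3,
    "warp_8x8t_8bpc": 3650.8,
    "warp_8x8_16bpc": 4039.4,
    "warp_8x8t_16bpc": 3973.9,
}
after = {
    "warp_8x8_8bpc": 2920.8,
    "warp_8x8t_8bpc": 2904.9,
    "warp_8x8_16bpc": 3895.5,
    "warp_8x8t_16bpc": 3822.7,
}


def speedups(before, after):
    """Return before/after ratios; values above 1.0 mean the new code is faster."""
    return {name: before[name] / after[name] for name in before}


a7 = speedups(before, after)
for name, ratio in a7.items():
    print(f"{name}: {ratio:.2f}x")
```

On this core the 8 bpc functions improve by roughly a quarter, while the 16 bpc functions change by only a few percent.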
  9. 06 Feb, 2021 1 commit
  10. 05 Feb, 2021 3 commits
    • Martin Storsjö's avatar
      505e9990
    • arm64: warped motion: Various optimizations · a3b8157e
      Kyle Siefring authored
      - Reorder loads of filters to benefit in-order cores.
      - Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the
         first stage which will hurt performance on some older big cores.
      - Rework horz stage for 8 bit mode:
          * Use smull instead of mul
          * Replace existing narrow and long instructions
          * Replace mov after calling with right shift
      
      Before:            Cortex A55    A53     A72     A73
      warp_8x8_8bpc_neon:    1683.2  1860.6  1065.0  1102.6
      warp_8x8t_8bpc_neon:   1673.2  1846.4  1057.0  1098.4
      warp_8x8_16bpc_neon:   1870.7  2031.7  1147.3  1220.7
      warp_8x8t_16bpc_neon:  1848.0  2006.2  1121.6  1188.0
      After:
      warp_8x8_8bpc_neon:    1267.2  1446.2   807.0   871.5
      warp_8x8t_8bpc_neon:   1245.4  1422.0   810.2   868.4
      warp_8x8_16bpc_neon:   1769.8  1929.3  1132.0  1238.2
      warp_8x8t_16bpc_neon:  1747.3  1904.1  1101.5  1207.9
    • arm64: loopfilter: Avoid leaving 8-bits · 833382b3
      Kyle Siefring authored
      Avoid moving between 8-bit and 16-bit vectors where possible.
  11. 04 Feb, 2021 2 commits
  12. 28 Jan, 2021 4 commits
  13. 20 Jan, 2021 1 commit
    • arm64: cdef_dir: Preload rows to prevent stalling · 11cb2efa
      Kyle Siefring authored
      Before:            Cortex A53     A55     A72     A73
      cdef_dir_8bpc_neon:     400.0   391.2   269.7   282.9
      cdef_dir_16bpc_neon:    417.7   413.0   303.8   313.6
      
      After:
      cdef_dir_8bpc_neon:     369.0   360.2   248.4   273.4
      cdef_dir_16bpc_neon:    388.7   384.0   272.2   290.7
  14. 07 Jan, 2021 9 commits
  15. 16 Dec, 2020 4 commits
    • arm32: mc: Add NEON implementation of emu_edge for 16 bpc · 38df0efa
      Martin Storsjö authored
      Checkasm benchmarks:    Cortex  A7       A8      A53      A72     A73
      emu_edge_w4_16bpc_neon:      375.0    312.6    268.3    159.3   170.0
      emu_edge_w8_16bpc_neon:      619.3    425.5    435.5    249.5   291.1
      emu_edge_w16_16bpc_neon:     719.1    568.3    506.9    324.2   314.4
      emu_edge_w32_16bpc_neon:    2112.2   1677.7   1396.2   1050.5  1009.6
      emu_edge_w64_16bpc_neon:    5046.8   4322.5   3693.7   3953.8  2682.8
      emu_edge_w128_16bpc_neon:  16311.1  14341.3  12877.8  26183.5  8924.9
      
      Corresponding numbers for arm64, for comparison:
                                               Cortex A53      A72      A73
      emu_edge_w4_16bpc_neon:                       302.5    174.9    159.2
      emu_edge_w8_16bpc_neon:                       344.6    292.3    273.2
      emu_edge_w16_16bpc_neon:                      601.0    461.2    316.8
      emu_edge_w32_16bpc_neon:                      974.2   1274.7    960.5
      emu_edge_w64_16bpc_neon:                     2853.1   3527.6   2633.5
      emu_edge_w128_16bpc_neon:                   14633.5  26776.6   7236.0
    • arm32: mc: Add NEON implementations of the w_mask functions for 16 bpc · cf74bdec
      Martin Storsjö authored
      Checkasm numbers:           Cortex A7        A8       A53       A72       A73
      w_mask_420_w4_16bpc_neon:       350.3     216.4     215.4     141.7     134.5
      w_mask_420_w8_16bpc_neon:       926.7     590.9     529.1     373.8     354.5
      w_mask_420_w16_16bpc_neon:     2956.7    1880.4    1654.8    1186.1    1134.1
      w_mask_420_w32_16bpc_neon:    11489.3    7426.4    6314.1    4599.8    4398.6
      w_mask_420_w64_16bpc_neon:    28175.9   17898.1   16002.8   11079.0   10551.8
      w_mask_420_w128_16bpc_neon:   71599.4   44630.9   40696.9   28057.3   27836.5
      w_mask_422_w4_16bpc_neon:       339.0     210.1     206.7     137.3     134.7
      w_mask_422_w8_16bpc_neon:       887.2     573.3     499.6     361.6     353.5
      w_mask_422_w16_16bpc_neon:     2918.0    1841.6    1593.0    1194.0    1157.9
      w_mask_422_w32_16bpc_neon:    11313.8    7238.7    6043.4    4577.1    4469.6
      w_mask_422_w64_16bpc_neon:    27746.5   17427.2   15386.9   11082.6   10693.8
      w_mask_422_w128_16bpc_neon:   70521.4   43864.9   39209.3   29045.7   28305.5
      w_mask_444_w4_16bpc_neon:       325.6     202.9     198.4     135.2     129.3
      w_mask_444_w8_16bpc_neon:       860.7     534.9     474.8     358.0     352.2
      w_mask_444_w16_16bpc_neon:     2764.3    1714.4    1517.8    1160.6    1133.1
      w_mask_444_w32_16bpc_neon:    10719.8    6738.3    5746.7    4458.6    4347.1
      w_mask_444_w64_16bpc_neon:    26407.9   16224.1   14783.9   10784.3   10371.4
      w_mask_444_w128_16bpc_neon:   67226.1   41060.1   37823.1   41696.1   27722.2
      
      Corresponding numbers for arm64, for comparison:
                                                     Cortex A53       A72       A73
      w_mask_420_w4_16bpc_neon:                           173.6     123.6     120.3
      w_mask_420_w8_16bpc_neon:                           484.0     344.0     329.4
      w_mask_420_w16_16bpc_neon:                         1436.3    1025.7    1028.7
      w_mask_420_w32_16bpc_neon:                         5597.0    3994.8    3981.2
      w_mask_420_w64_16bpc_neon:                        13953.4    9700.8    9579.9
      w_mask_420_w128_16bpc_neon:                       35833.7   25519.3   24277.8
      w_mask_422_w4_16bpc_neon:                           159.4     111.7     114.2
      w_mask_422_w8_16bpc_neon:                           453.4     326.2     326.7
      w_mask_422_w16_16bpc_neon:                         1398.2    1063.3    1052.6
      w_mask_422_w32_16bpc_neon:                         5532.7    4143.0    4026.3
      w_mask_422_w64_16bpc_neon:                        13885.3    9978.0    9689.8
      w_mask_422_w128_16bpc_neon:                       35763.3   25822.4   24610.9
      w_mask_444_w4_16bpc_neon:                           152.9     110.0     112.8
      w_mask_444_w8_16bpc_neon:                           437.2     332.0     325.8
      w_mask_444_w16_16bpc_neon:                         1399.3    1068.9    1041.7
      w_mask_444_w32_16bpc_neon:                         5410.9    4139.7    4136.9
      w_mask_444_w64_16bpc_neon:                        13648.7   10011.8   10004.6
      w_mask_444_w128_16bpc_neon:                       35639.6   26910.8   25631.0
    • arm32: mc: Add NEON implementation of the blend functions for 16 bpc · f809edb4
      Martin Storsjö authored
      Checkasm numbers:      Cortex A7      A8     A53     A72     A73
      blend_h_w2_16bpc_neon:     190.0   163.0   135.5    67.4    71.2
      blend_h_w4_16bpc_neon:     204.4   119.1   140.3    61.2    74.9
      blend_h_w8_16bpc_neon:     247.6   126.2   159.5    86.1    88.4
      blend_h_w16_16bpc_neon:    391.6   186.5   230.7   134.9   149.4
      blend_h_w32_16bpc_neon:    734.9   354.2   454.1   248.1   270.9
      blend_h_w64_16bpc_neon:   1290.8   611.7   801.1   456.6   491.3
      blend_h_w128_16bpc_neon:  2876.4  1354.2  1788.6  1083.4  1092.0
      blend_v_w2_16bpc_neon:     264.4   325.2   206.8   107.6   123.0
      blend_v_w4_16bpc_neon:     471.8   358.7   356.9   187.0   229.9
      blend_v_w8_16bpc_neon:     616.9   365.3   445.4   218.2   248.5
      blend_v_w16_16bpc_neon:    928.3   517.1   629.1   325.0   358.0
      blend_v_w32_16bpc_neon:   1771.6   790.1  1106.1   631.2   584.7
      blend_w4_16bpc_neon:       128.8    66.6    95.5    33.5    42.0
      blend_w8_16bpc_neon:       238.7   118.0   156.8    76.5    84.5
      blend_w16_16bpc_neon:      809.7   360.9   482.3   268.5   298.3
      blend_w32_16bpc_neon:     2015.7   916.6  1177.0   682.1   730.9
      
      Corresponding numbers for arm64, for comparison:
                                            Cortex A53     A72     A73
      blend_h_w2_16bpc_neon:                     109.3    83.1    56.8
      blend_h_w4_16bpc_neon:                     114.1    61.1    62.3
      blend_h_w8_16bpc_neon:                     133.3    80.8    81.0
      blend_h_w16_16bpc_neon:                    215.6   132.7   149.5
      blend_h_w32_16bpc_neon:                    390.4   253.9   235.8
      blend_h_w64_16bpc_neon:                    715.8   455.8   454.0
      blend_h_w128_16bpc_neon:                  1649.7  1034.7  1066.2
      blend_v_w2_16bpc_neon:                     185.9   176.3   178.3
      blend_v_w4_16bpc_neon:                     338.3   184.4   234.3
      blend_v_w8_16bpc_neon:                     427.0   214.5   252.7
      blend_v_w16_16bpc_neon:                    680.4   358.1   389.2
      blend_v_w32_16bpc_neon:                   1100.7   615.5   690.1
      blend_w4_16bpc_neon:                        76.0    32.3    32.1
      blend_w8_16bpc_neon:                       134.4    76.3    71.5
      blend_w16_16bpc_neon:                      476.3   268.8   301.5
      blend_w32_16bpc_neon:                     1226.8   659.9   782.8
    • Martin Storsjö's avatar
      eeb03a73