  1. Sep 03, 2021
  2. Sep 02, 2021
  3. Sep 01, 2021
    • arm32: filmgrain: Add NEON implementation of gen_grain for 8 bpc · 0beeaa93
      Martin Storsjö authored
      Relative speedup over C code:
      
                                   Cortex A7     A8     A9    A53    A72    A73
      gen_grain_uv_ar0_8bpc_420_neon:   6.13   7.81   8.17   6.78   6.62  11.13
      gen_grain_uv_ar0_8bpc_422_neon:   6.34   7.64   8.00   6.83   6.93  10.31
      gen_grain_uv_ar0_8bpc_444_neon:   7.09   8.29   8.55   7.95   7.89  11.05
      gen_grain_uv_ar1_8bpc_420_neon:   3.39   2.26   3.06   4.13   3.41   4.95
      gen_grain_uv_ar1_8bpc_422_neon:   3.40   2.23   3.02   4.18   3.36   4.73
      gen_grain_uv_ar1_8bpc_444_neon:   3.46   2.18   2.95   4.46   3.57   4.91
      gen_grain_uv_ar2_8bpc_420_neon:   3.88   3.00   3.32   4.74   3.57   5.31
      gen_grain_uv_ar2_8bpc_422_neon:   3.92   3.04   3.36   4.82   3.57   5.06
      gen_grain_uv_ar2_8bpc_444_neon:   4.32   3.14   3.62   5.56   3.90   5.43
      gen_grain_uv_ar3_8bpc_420_neon:   4.35   3.53   4.05   5.35   4.44   5.56
      gen_grain_uv_ar3_8bpc_422_neon:   4.38   3.49   4.17   5.41   4.48   5.36
      gen_grain_uv_ar3_8bpc_444_neon:   4.84   3.70   4.36   5.95   4.87   5.82
      gen_grain_y_ar0_8bpc_neon:        5.18   5.57   7.65   5.93   7.13   9.01
      gen_grain_y_ar1_8bpc_neon:        2.64   1.66   2.48   3.32   3.15   3.77
      gen_grain_y_ar2_8bpc_neon:        3.57   2.64   3.21   4.59   3.68   4.64
      gen_grain_y_ar3_8bpc_neon:        4.27   3.93   4.12   5.41   4.63   5.17
      
      (A73 is benched against C code compiled with a different C compiler,
      which can explain the slightly differing numbers there.)
      
      Absolute numbers:
      
                                       Cortex A7         A8         A9        A53       A72        A73
      gen_grain_uv_ar0_8bpc_420_neon:    19614.6    13396.4    12320.4    15030.7    8288.1     8754.4
      gen_grain_uv_ar0_8bpc_422_neon:    34660.9    24315.5    22225.3    26809.2   14549.8    15804.6
      gen_grain_uv_ar0_8bpc_444_neon:    55625.6    39914.5    37100.2    44658.3   22917.3    27369.6
      gen_grain_uv_ar1_8bpc_420_neon:    50049.5    63179.4    44793.1    36406.7   22690.3    25401.9
      gen_grain_uv_ar1_8bpc_422_neon:    93289.5   117755.0    82815.4    67081.4   43133.1    46698.0
      gen_grain_uv_ar1_8bpc_444_neon:   170880.0   223259.2   156241.5   122760.0   78655.6    85604.9
      gen_grain_uv_ar2_8bpc_420_neon:    68185.5    78123.2    61457.3    47886.7   31526.2    36519.6
      gen_grain_uv_ar2_8bpc_422_neon:   129195.2   148653.9   114133.2    89822.7   60242.6    70160.1
      gen_grain_uv_ar2_8bpc_444_neon:   233133.7   272277.4   214108.7   161589.5  109069.3   127763.7
      gen_grain_uv_ar3_8bpc_420_neon:    96374.4    94372.2    79663.8    70832.0   43065.3    50593.9
      gen_grain_uv_ar3_8bpc_422_neon:   186324.8   184321.8   151490.1   136200.1   83758.0    98378.7
      gen_grain_uv_ar3_8bpc_444_neon:   335596.6   336811.6   279755.5   247251.5  151657.2   178906.0
      gen_grain_y_ar0_8bpc_neon:         46109.3    36022.2    28476.2    36478.5   18740.1    20660.4
      gen_grain_y_ar1_8bpc_neon:        165054.2   217090.4   152578.9   118409.4   74357.2    83794.5
      gen_grain_y_ar2_8bpc_neon:        226576.9   268320.3   210924.6   157829.4  105956.5   124293.2
      gen_grain_y_ar3_8bpc_neon:        328337.2   330421.3   275110.1   242097.3  148538.7   177270.8
      
      Corresponding numbers for the original arm64 version:
      
                                                                       Cortex A53       A72        A73
      gen_grain_uv_ar0_8bpc_420_neon:                                     14874.7    7765.5     8536.0
      gen_grain_uv_ar0_8bpc_422_neon:                                     26510.9   13685.3    15308.2
      gen_grain_uv_ar0_8bpc_444_neon:                                     43189.6   21565.3    24312.0
      gen_grain_uv_ar1_8bpc_420_neon:                                     33715.7   21669.8    22758.3
      gen_grain_uv_ar1_8bpc_422_neon:                                     63955.3   41581.4    42852.5
      gen_grain_uv_ar1_8bpc_444_neon:                                    117390.1   76503.5    78446.4
      gen_grain_uv_ar2_8bpc_420_neon:                                     42779.0   27794.3    29677.9
      gen_grain_uv_ar2_8bpc_422_neon:                                     82283.8   53446.7    58232.2
      gen_grain_uv_ar2_8bpc_444_neon:                                    147773.8   98492.7   103754.1
      gen_grain_uv_ar3_8bpc_420_neon:                                     56698.8   35697.1    40695.9
      gen_grain_uv_ar3_8bpc_422_neon:                                    110132.4   69829.1    79196.8
      gen_grain_uv_ar3_8bpc_444_neon:                                    196642.7  124174.9   141812.5
      gen_grain_y_ar0_8bpc_neon:                                          36461.0   17782.0    19827.0
      gen_grain_y_ar1_8bpc_neon:                                         113202.7   72457.7    75995.8
      gen_grain_y_ar2_8bpc_neon:                                         142894.0   94450.9   100304.5
      gen_grain_y_ar3_8bpc_neon:                                         191697.7  120674.9   137223.8
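
To make the arm32 versus arm64 comparison easier to read, the following minimal sketch divides the arm32 absolute numbers by the arm64 ones on the cores both tables share. The figures are copied from three rows of the tables above; the dictionary and function names are illustrative only and are not part of the commit.

```python
# Absolute numbers copied from the tables above.
# Tuple order: (Cortex A53, A72, A73).
CORES = ("A53", "A72", "A73")
ARM32 = {
    "gen_grain_uv_ar0_8bpc_420_neon": (15030.7, 8288.1, 8754.4),
    "gen_grain_y_ar0_8bpc_neon": (36478.5, 18740.1, 20660.4),
    "gen_grain_y_ar3_8bpc_neon": (242097.3, 148538.7, 177270.8),
}
ARM64 = {
    "gen_grain_uv_ar0_8bpc_420_neon": (14874.7, 7765.5, 8536.0),
    "gen_grain_y_ar0_8bpc_neon": (36461.0, 17782.0, 19827.0),
    "gen_grain_y_ar3_8bpc_neon": (191697.7, 120674.9, 137223.8),
}


def arm32_over_arm64(name):
    """Return {core: arm32 / arm64}; values above 1.0 mean arm32 is slower."""
    return {
        core: a32 / a64
        for core, a32, a64 in zip(CORES, ARM32[name], ARM64[name])
    }


if __name__ == "__main__":
    for name in ARM32:
        ratios = arm32_over_arm64(name)
        print(name, " ".join(f"{c}={r:.2f}" for c, r in ratios.items()))
```

For these three rows, the ar0 functions come out close to parity between the two architectures, while the ar3 row shows a clearly larger arm32 cost.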
    • arm64: filmgrain: Reorder two instructions in the inner loop · 4643c6a7
      Martin Storsjö authored
      This should improve scheduling on in-order cores.
  4. Aug 31, 2021
  5. Aug 30, 2021
  6. Aug 26, 2021
  7. Aug 24, 2021
    • arm: Add NEON implementations of splat_mv · 5d14b4e6
      Martin Storsjö authored
      Relative speedup over C code, for arm64:
      
                     Cortex A53    A72    A73   Apple M1
      splat_mv_w1_neon:    1.09   0.95   1.22   -
      splat_mv_w2_neon:    1.76   1.32   1.74   -
      splat_mv_w4_neon:    2.78   2.19   2.19  15.00
      splat_mv_w8_neon:    3.59   2.06   2.59  12.00
      splat_mv_w16_neon:   4.12   1.72   2.53   3.14
      splat_mv_w32_neon:   4.07   1.60   2.40   3.00
      
      (The resolution of the timer used on Apple M1 isn't enough to
      measure the small versions of this function.)
      
      Relative speedup over C code, for arm32:
      
                      Cortex A7     A8     A9    A53    A72    A73
      splat_mv_w1_neon:    0.70   1.12   0.91   0.65   1.01   1.06
      splat_mv_w2_neon:    0.94   2.16   2.01   0.99   2.52   1.63
      splat_mv_w4_neon:    1.27   2.04   1.49   1.52   1.75   2.18
      splat_mv_w8_neon:    1.75   2.47   1.16   2.88   1.95   2.58
      splat_mv_w16_neon:   2.00   2.44   1.12   3.25   1.85   2.65
      splat_mv_w32_neon:   1.43   2.28   1.19   3.55   1.77   2.65
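
A relative speedup below 1.00 means the NEON version measured slower than the C code. As a small sketch, the arm32 table above can be scanned for such entries; the values are copied from the table, and the helper name `slower_than_c` is hypothetical.

```python
# Relative speedups copied from the arm32 table above.
CORES = ("A7", "A8", "A9", "A53", "A72", "A73")
ARM32_SPEEDUP = {
    "splat_mv_w1_neon": (0.70, 1.12, 0.91, 0.65, 1.01, 1.06),
    "splat_mv_w2_neon": (0.94, 2.16, 2.01, 0.99, 2.52, 1.63),
    "splat_mv_w4_neon": (1.27, 2.04, 1.49, 1.52, 1.75, 2.18),
    "splat_mv_w8_neon": (1.75, 2.47, 1.16, 2.88, 1.95, 2.58),
    "splat_mv_w16_neon": (2.00, 2.44, 1.12, 3.25, 1.85, 2.65),
    "splat_mv_w32_neon": (1.43, 2.28, 1.19, 3.55, 1.77, 2.65),
}


def slower_than_c(table):
    """List (function, core, speedup) entries where NEON is slower than C."""
    return [
        (name, core, speedup)
        for name, row in table.items()
        for core, speedup in zip(CORES, row)
        if speedup < 1.0
    ]


if __name__ == "__main__":
    for name, core, speedup in slower_than_c(ARM32_SPEEDUP):
        print(f"{name} on Cortex {core}: {speedup:.2f}")
```

In this table only the two narrowest widths (w1 and w2) fall below 1.00, and only on some cores.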
  8. Aug 23, 2021
  9. Aug 19, 2021
  10. Aug 17, 2021
  11. Aug 16, 2021
  12. Aug 13, 2021
    • arm64: filmgrain16: Add NEON implementation of gen_grain for 16 bpc · 0141476d
      Martin Storsjö authored
      Relative speedup over C code:
                                   Cortex A53    A72    A73   Apple M1
      gen_grain_uv_ar0_16bpc_420_neon:   2.90   4.13   5.43   5.80
      gen_grain_uv_ar0_16bpc_422_neon:   3.23   4.51   5.52   5.83
      gen_grain_uv_ar0_16bpc_444_neon:   4.01   4.97   6.08   5.87
      gen_grain_uv_ar1_16bpc_420_neon:   2.94   2.80   3.56   3.48
      gen_grain_uv_ar1_16bpc_422_neon:   3.14   3.07   3.68   3.47
      gen_grain_uv_ar1_16bpc_444_neon:   3.54   3.51   3.93   2.61
      gen_grain_uv_ar2_16bpc_420_neon:   3.92   3.69   4.40   3.98
      gen_grain_uv_ar2_16bpc_422_neon:   4.13   3.96   4.42   3.92
      gen_grain_uv_ar2_16bpc_444_neon:   4.69   4.33   4.84   3.25
      gen_grain_uv_ar3_16bpc_420_neon:   5.05   5.39   5.42   4.74
      gen_grain_uv_ar3_16bpc_422_neon:   5.25   5.68   5.57   4.67
      gen_grain_uv_ar3_16bpc_444_neon:   6.02   6.33   6.35   4.38
      gen_grain_y_ar0_16bpc_neon:        4.67   5.23   5.22  10.11
      gen_grain_y_ar1_16bpc_neon:        3.32   3.03   3.28   2.24
      gen_grain_y_ar2_16bpc_neon:        4.59   3.95   4.64   3.52
      gen_grain_y_ar3_16bpc_neon:        5.89   5.93   6.36   4.79
      
      Absolute numbers:
                                       Cortex A53       A72       A73    Apple M1
      gen_grain_uv_ar0_16bpc_420_neon:    19797.2    9725.0    9234.0    29.7
      gen_grain_uv_ar0_16bpc_422_neon:    34899.4   16875.3   17021.6    57.7
      gen_grain_uv_ar0_16bpc_444_neon:    53776.6   28470.1   28773.1   107.8
      gen_grain_uv_ar1_16bpc_420_neon:    37998.2   24631.2   24754.0    84.2
      gen_grain_uv_ar1_16bpc_422_neon:    70817.5   44642.5   46323.1   166.3
      gen_grain_uv_ar1_16bpc_444_neon:   123333.0   77316.4   83523.1   427.5
      gen_grain_uv_ar2_16bpc_420_neon:    49115.8   33053.7   33249.9    93.6
      gen_grain_uv_ar2_16bpc_422_neon:    92965.3   59663.8   64741.9   187.9
      gen_grain_uv_ar2_16bpc_444_neon:   160899.7  108845.6  115422.4   441.8
      gen_grain_uv_ar3_16bpc_420_neon:    65786.6   41924.3   45562.1   108.1
      gen_grain_uv_ar3_16bpc_422_neon:   126232.3   78691.6   87351.5   217.6
      gen_grain_uv_ar3_16bpc_444_neon:   218702.6  140197.8  151294.8   454.3
      gen_grain_y_ar0_16bpc_neon:         35867.9   17653.6   20770.7   108.0
      gen_grain_y_ar1_16bpc_neon:        118781.8   74777.1   81338.6   426.0
      gen_grain_y_ar2_16bpc_neon:        155919.9  102145.8  109698.1   438.5
      gen_grain_y_ar3_16bpc_neon:        213348.1  133054.8  144726.0   447.9
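
The two tables above can be combined to estimate the C baseline, since the relative speedup is the C figure divided by the NEON figure. The sketch below does this for three Cortex A53 rows. The speedups are rounded to two decimals in the source, so the results are approximate, and the function name is illustrative.

```python
# (relative speedup, NEON absolute number) for Cortex A53,
# copied from the two tables above.
A53_16BPC = {
    "gen_grain_uv_ar0_16bpc_420_neon": (2.90, 19797.2),
    "gen_grain_y_ar0_16bpc_neon": (4.67, 35867.9),
    "gen_grain_y_ar3_16bpc_neon": (5.89, 213348.1),
}


def implied_c_time(speedup, neon_time):
    """Approximate C figure, given speedup = c_time / neon_time."""
    return speedup * neon_time


if __name__ == "__main__":
    for name, (speedup, neon_time) in A53_16BPC.items():
        print(f"{name}: C is roughly {implied_c_time(speedup, neon_time):.0f}")
```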
      
      Corresponding numbers for 8bpc:
                                       Cortex A53       A72       A73    Apple M1
      gen_grain_uv_ar0_8bpc_420_neon:     15086.1    8384.7    8556.6    29.4
      gen_grain_uv_ar0_8bpc_422_neon:     26800.6   14354.4   15526.5    56.6
      gen_grain_uv_ar0_8bpc_444_neon:     43749.6   22408.6   24627.9   108.3
      gen_grain_uv_ar1_8bpc_420_neon:     33706.3   21892.6   22835.9    87.1
      gen_grain_uv_ar1_8bpc_422_neon:     63897.0   41820.1   43468.9   171.8
      gen_grain_uv_ar1_8bpc_444_neon:    117345.1   76372.5   79938.3   370.0
      gen_grain_uv_ar2_8bpc_420_neon:     42808.8   28493.8   29932.8    92.2
      gen_grain_uv_ar2_8bpc_422_neon:     82282.5   53969.4   58191.1   181.8
      gen_grain_uv_ar2_8bpc_444_neon:    147641.4   98136.4  103157.6   430.2
      gen_grain_uv_ar3_8bpc_420_neon:     56784.3   36342.0   40812.3   102.2
      gen_grain_uv_ar3_8bpc_422_neon:    110249.7   70215.6   79716.0   200.5
      gen_grain_uv_ar3_8bpc_444_neon:    196461.7  125802.8  141781.5   440.1
      gen_grain_y_ar0_8bpc_neon:          36451.7   17794.4   19839.3   109.5
      gen_grain_y_ar1_8bpc_neon:         113155.6   71811.9   77296.8   370.2
      gen_grain_y_ar2_8bpc_neon:         142812.3   95042.4  100434.4   431.8
      gen_grain_y_ar3_8bpc_neon:         191608.6  121199.5  136946.4   437.2
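
To see how much extra the 16 bpc version costs relative to 8 bpc, the absolute numbers can be divided row by row. A minimal sketch for three Cortex A53 rows follows; the values are copied from the tables above, and the short keys and function name are made up for this example.

```python
# (16 bpc, 8 bpc) absolute numbers for Cortex A53, copied from the tables above.
A53 = {
    "uv_ar0_420": (19797.2, 15086.1),
    "y_ar0": (35867.9, 36451.7),
    "y_ar3": (213348.1, 191608.6),
}


def cost_ratio_16_over_8(key):
    """Return 16 bpc / 8 bpc; values above 1.0 mean 16 bpc is slower."""
    t16, t8 = A53[key]
    return t16 / t8


if __name__ == "__main__":
    for key in A53:
        print(f"{key}: {cost_ratio_16_over_8(key):.3f}")
```

On these rows the 16 bpc cost ranges from roughly equal to 8 bpc (y_ar0) to about 30% higher (uv_ar0_420).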
    • arm64: filmgrain: Deduplicate the sum_lagN functions · fcf148b3
      Martin Storsjö authored
      No difference in generated code, but >210 fewer lines of duplicated
      source code.
    • arm64: filmgrain: Deduplicate the output_lag functions · 54a22a4f
      Martin Storsjö authored
      No practical difference in generated code (or the size of it), but
      less source code to handle.
    • arm64: filmgrain: Remove two stray ret instructions · caa2ede5
      Martin Storsjö authored
      These are never executed as they come after an unconditional branch.
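
The reasoning can be illustrated with a toy model: an instruction that follows an unconditional branch and has no label on it cannot be reached. The sketch below is a simplified illustration and not the tooling used for this commit. The instruction list, the `UNCONDITIONAL` set and the function name are all hypothetical, and it assumes every label is a possible branch target.

```python
# Mnemonics treated as unconditional control transfers in this toy model.
UNCONDITIONAL = {"b", "br", "ret"}


def dead_instructions(instrs):
    """Return indices of unreachable instructions.

    instrs is a list of (label_or_None, mnemonic) tuples. Code becomes
    unreachable after an unconditional transfer and stays unreachable
    until the next label (assumed to be a branch target).
    """
    dead = []
    reachable = True
    for index, (label, mnemonic) in enumerate(instrs):
        if label is not None:
            reachable = True
        if not reachable:
            dead.append(index)
        elif mnemonic in UNCONDITIONAL:
            reachable = False
    return dead


if __name__ == "__main__":
    toy = [
        (None, "add"),
        (None, "b"),    # unconditional branch
        (None, "ret"),  # stray: follows the branch, has no label
        ("1", "ld1"),
        (None, "ret"),
    ]
    print(dead_instructions(toy))
```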
    • arm64: filmgrain: Uninline the get_grain_2 macro · 513e4c24
      Martin Storsjö authored
      This shrinks the code section by 288 bytes.
    • db4a486d
    • checkasm: Improve register preservation checking on x86 · f3b2599f
      Henrik Gramner authored
      Improve the error message on failure to specify which registers
      have been clobbered.
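
The idea of naming the clobbered registers can be sketched as follows. This is a hypothetical illustration in Python and does not reproduce checkasm's implementation, which works at the assembly level. The register list assumes the System V x86-64 callee-saved general-purpose registers, and the message wording is invented for this example.

```python
# Callee-saved general-purpose registers in the System V x86-64 ABI
# (assumption for this sketch; other ABIs use a different set).
CALLEE_SAVED = ("rbx", "rbp", "r12", "r13", "r14", "r15")


def clobbered(before, after, regs=CALLEE_SAVED):
    """Return the registers whose value differs between two snapshots."""
    return [reg for reg in regs if before[reg] != after[reg]]


def failure_message(before, after):
    """Build a message naming each clobbered register, or None if all match."""
    bad = clobbered(before, after)
    if not bad:
        return None
    plural = "s" if len(bad) > 1 else ""
    return f"failed to preserve register{plural}: " + ", ".join(bad)


if __name__ == "__main__":
    before = {reg: 0xDEADBEEF for reg in CALLEE_SAVED}
    after = dict(before, rbx=0, r12=1)
    print(failure_message(before, after))
```

Listing the registers by name, rather than reporting a bare failure, points directly at the instruction sequence that needs a save and restore.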