1. 12 Jun, 2021 1 commit
    • Martin Storsjö's avatar
      arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc · ddbbfde1
      Martin Storsjö authored
      Relative speedup over C code:
                                      Cortex A7     A8     A9    A53    A72    A73
      fguv_32x32xn_16bpc_420_csfl0_neon:   3.47   1.72   2.99   4.18   2.68   6.19
      fguv_32x32xn_16bpc_420_csfl1_neon:   3.24   1.36   2.58   3.78   2.73   5.27
      fguv_32x32xn_16bpc_422_csfl0_neon:   3.57   2.07   3.05   4.32   2.74   6.20
      fguv_32x32xn_16bpc_422_csfl1_neon:   3.33   1.44   2.62   3.89   2.71   5.28
      fguv_32x32xn_16bpc_444_csfl0_neon:   3.48   1.69   3.06   4.48   2.97   6.69
      fguv_32x32xn_16bpc_444_csfl1_neon:   3.06   1.16   2.36   3.85   2.75   5.19
      fgy_32x32xn_16bpc_neon:              2.89   1.05   2.29   3.49   2.49   3.15
      
      Absolute numbers:
                                        Cortex A7       A8       A9      A53      A72      A73
      fguv_32x32xn_16bpc_420_csfl0_neon:   6237.3  12701.0   6687.1   4525.8   3220.8   3195.4
      fguv_32x32xn_16bpc_420_csfl1_neon:   5143.2  11684.8   5926.4   3857.2   2604.7   2556.5
      fguv_32x32xn_16bpc_422_csfl0_neon:   6347.3  11005.2   6797.5   4582.4   3300.4   3250.5
      fguv_32x32xn_16bpc_422_csfl1_neon:   5275.2  11594.8   5992.6   3931.1   2668.7   2607.3
      fguv_32x32xn_16bpc_444_csfl0_neon:   5181.6  11310.0   5575.4   3629.7   2383.8   2530.0
      fguv_32x32xn_16bpc_444_csfl1_neon:   4081.9  10958.8   4868.5   2962.9   1870.3   2034.2
      fgy_32x32xn_16bpc_neon:             15439.1  43129.0  19406.6  11542.3   7463.9   7827.8
      
      Corresponding numbers for arm64:
                                                                  Cortex A53      A72      A73
      fguv_32x32xn_16bpc_420_csfl0_neon:                              4019.2   3247.4   3259.6
      fguv_32x32xn_16bpc_420_csfl1_neon:                              3460.1   2628.7   2640.8
      fguv_32x32xn_16bpc_422_csfl0_neon:                              4034.4   3329.9   3287.5
      fguv_32x32xn_16bpc_422_csfl1_neon:                              3468.3   2749.3   2686.6
      fguv_32x32xn_16bpc_444_csfl0_neon:                              3117.7   2447.4   2539.8
      fguv_32x32xn_16bpc_444_csfl1_neon:                              2641.2   1977.2   2132.8
      fgy_32x32xn_16bpc_neon:                                         9873.5   7605.7   7656.2
  2. 11 Jun, 2021 2 commits
    • Ronald S. Bultje's avatar
      Add 10/12-bit deblock SSSE3 implementation · f7043e47
      Ronald S. Bultje authored
      Currently 64-bit only.
    • Martin Storsjö's avatar
      arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc · c187e704
      Martin Storsjö authored
      Relative speedup over C code:
                                     Cortex A7     A8     A9    A53    A72    A73
      fguv_32x32xn_8bpc_420_csfl0_neon:   4.20   2.19   3.48   4.93   3.60   5.93
      fguv_32x32xn_8bpc_420_csfl1_neon:   3.92   1.52   2.84   4.34   3.82   5.93
      fguv_32x32xn_8bpc_422_csfl0_neon:   4.27   2.13   3.58   5.02   4.04   5.95
      fguv_32x32xn_8bpc_422_csfl1_neon:   3.99   1.56   2.91   4.43   3.89   6.00
      fguv_32x32xn_8bpc_444_csfl0_neon:   4.48   2.08   3.89   5.66   4.07   6.51
      fguv_32x32xn_8bpc_444_csfl1_neon:   4.45   1.41   2.99   5.28   3.63   6.09
      fgy_32x32xn_8bpc_neon:              3.61   1.10   2.62   4.35   3.06   3.74
      
      Absolute numbers:
                                       Cortex A7       A8       A9      A53      A72      A73
      fguv_32x32xn_8bpc_420_csfl0_neon:   5318.8  11167.7   6024.6   3909.9   2945.2   2993.5
      fguv_32x32xn_8bpc_420_csfl1_neon:   4351.0  10929.7   5269.5   3316.8   2166.5   2256.9
      fguv_32x32xn_8bpc_422_csfl0_neon:   5387.9  11746.7   6080.0   3945.8   2988.1   3046.3
      fguv_32x32xn_8bpc_422_csfl1_neon:   4396.0  11083.2   5300.8   3354.9   2216.4   2291.4
      fguv_32x32xn_8bpc_444_csfl0_neon:   4347.9  10595.0   5134.4   3079.1   2277.7   2392.9
      fguv_32x32xn_8bpc_444_csfl1_neon:   3295.0  10518.2   4442.6   2476.3   1716.3   1829.2
      fgy_32x32xn_8bpc_neon:             12376.2  41046.9  17259.7   9153.1   6610.4   7005.3
      
      Corresponding numbers for arm64:
                                                                 Cortex A53      A72      A73
      fguv_32x32xn_8bpc_420_csfl0_neon:                              3822.9   2920.0   2935.7
      fguv_32x32xn_8bpc_420_csfl1_neon:                              3209.7   2231.7   2335.4
      fguv_32x32xn_8bpc_422_csfl0_neon:                              3807.9   2886.5   2966.7
      fguv_32x32xn_8bpc_422_csfl1_neon:                              3197.1   2187.9   2355.9
      fguv_32x32xn_8bpc_444_csfl0_neon:                              2757.8   2227.4   2334.4
      fguv_32x32xn_8bpc_444_csfl1_neon:                              2244.6   1719.1   1786.7
      fgy_32x32xn_8bpc_neon:                                         8192.2   6563.3   6969.1
  3. 10 Jun, 2021 5 commits
    • Martin Storsjö's avatar
      checkasm: Validate the benchmark call configurations even if not benchmarking · ea9c5afa
      Martin Storsjö authored
      This should help catch issues like the one fixed in
      185194be, by making sure that we
      call the benchmarked function at least once with the given parameters,
      even if not benchmarking. Otherwise the benchmark codepath is
      essentially dead, untested code until somebody works on that piece
      of code.
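
The idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual checkasm code: `bench_fn`, `dummy_kernel`, `bench_or_validate`, and the run count of 1000 are all hypothetical names and values chosen for the example.

```c
/* Hypothetical stand-ins for illustration; not the real checkasm macros. */
typedef void (*bench_fn)(int w, int h);

static int call_count;

/* Example function under test: it only records that it was called. */
static void dummy_kernel(int w, int h) {
    (void)w;
    (void)h;
    call_count++;
}

/* Call fn with the benchmark parameters. When benchmarking is disabled,
 * still call it exactly once, so that the benchmark call configuration
 * is exercised (and can crash or fail) during ordinary test runs. */
static void bench_or_validate(bench_fn fn, int w, int h, int benchmarking) {
    const int runs = benchmarking ? 1000 : 1; /* 1000 is an arbitrary choice */
    for (int i = 0; i < runs; i++)
        fn(w, h);
}
```

With this shape, a wrong parameter set in the benchmark call shows up in a normal test run instead of only when somebody enables benchmarking.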
    • Martin Storsjö's avatar
      arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc · 5c5860ab
      Martin Storsjö authored
      Relative speedup over C code:
                                     Cortex A53    A72    A73   Apple M1
      fguv_32x32xn_16bpc_420_csfl0_neon:   4.57   2.08   3.57   7.61
      fguv_32x32xn_16bpc_420_csfl1_neon:   4.92   2.89   3.96   4.26
      fguv_32x32xn_16bpc_422_csfl0_neon:   4.59   2.14   3.61   5.88
      fguv_32x32xn_16bpc_422_csfl1_neon:   4.92   2.90   3.90   5.00
      fguv_32x32xn_16bpc_444_csfl0_neon:   3.64   1.89   2.86   4.72
      fguv_32x32xn_16bpc_444_csfl1_neon:   3.59   2.26   2.76   3.22
    • Martin Storsjö's avatar
      arm64: filmgrain: Stray cosmetic fixes · f65d3271
      Martin Storsjö authored
    • Martin Storsjö's avatar
      arm64: filmgrain: Do the right amount of gathers for subsampled fguv · 64926847
      Martin Storsjö authored
      Previously we did 32 gathers even though only 16 are
      needed.
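
A scalar sketch of why 16 lookups suffice, assuming a block that spans 32 luma pixels and chroma that is horizontally subsampled by a factor of two. The helper names (`lookups_per_row`, `gather_row`) and the simplified lookup index are assumptions made for illustration; the real NEON code derives the index differently and works on vectors.

```c
#include <stdint.h>

/* Number of chroma pixels (and thus scaling-table lookups) per row for a
 * block spanning 32 luma pixels. ss_x is 1 when chroma is horizontally
 * subsampled (4:2:0, 4:2:2) and 0 for 4:4:4. */
static int lookups_per_row(int ss_x) {
    return 32 >> ss_x;
}

/* Scalar "gather": one table lookup per pixel that actually exists.
 * scaling is a hypothetical 256-entry lookup table for 8 bpc input, and
 * indexing it directly by the source pixel is a simplification. */
static void gather_row(uint8_t *dst, const uint8_t *src,
                       const uint8_t scaling[256], int ss_x) {
    const int n = lookups_per_row(ss_x);
    for (int x = 0; x < n; x++)
        dst[x] = scaling[src[x]];
}
```

This matches the pattern in the numbers: the 420 and 422 cases get faster, while the 444 case, which has no horizontal subsampling, is essentially unchanged.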
      
      Before:                          Cortex A53      A72      A73   Apple M1
      fguv_32x32xn_8bpc_420_csfl0_neon:    5352.1   3985.0   4068.9   8.3
      fguv_32x32xn_8bpc_420_csfl1_neon:    4738.2   3297.8   3633.0   8.2
      fguv_32x32xn_8bpc_422_csfl0_neon:    5386.0   4036.8   4093.5   8.3
      fguv_32x32xn_8bpc_422_csfl1_neon:    4779.9   3392.6   3641.6   8.2
      fguv_32x32xn_8bpc_444_csfl0_neon:    3068.4   2422.0   2436.5   4.9
      fguv_32x32xn_8bpc_444_csfl1_neon:    2558.3   1908.4   1926.6   4.4
      After:
      fguv_32x32xn_8bpc_420_csfl0_neon:    4330.4   3118.5   3224.6   5.3
      fguv_32x32xn_8bpc_420_csfl1_neon:    3731.8   2416.9   2619.6   4.7
      fguv_32x32xn_8bpc_422_csfl0_neon:    4364.7   3129.3   3247.6   5.4
      fguv_32x32xn_8bpc_422_csfl1_neon:    3762.5   2450.2   2661.8   4.7
      fguv_32x32xn_8bpc_444_csfl0_neon:    3075.1   2376.4   2429.4   4.9
      fguv_32x32xn_8bpc_444_csfl1_neon:    2564.5   1865.9   1952.8   4.4
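
To read the before/after numbers as a speedup factor, divide the old figure by the new one. A minimal sketch, where `speedup` is a hypothetical helper and not part of the codebase:

```c
/* Speedup factor: how many times faster "after" is than "before",
 * given two runtime measurements in the same unit. */
static double speedup(double before, double after) {
    return before / after;
}
```

For example, `speedup(5352.1, 4330.4)` (420_csfl0 on the Cortex A53) comes out at about 1.24, while `speedup(3068.4, 3075.1)` (444_csfl0 on the same core) is about 1.00, which is within measurement noise.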
  4. 09 Jun, 2021 2 commits
  5. 07 Jun, 2021 1 commit
  6. 05 Jun, 2021 2 commits
  7. 04 Jun, 2021 1 commit
  8. 02 Jun, 2021 1 commit
    • Martin Storsjö's avatar
      checkasm: Remove an unused variable/parameter · 90dad3ee
      Martin Storsjö authored
      Clang 13 got support for warning about variables that are set but
      not used. We disable warnings for unused parameters, but in this case,
      the parameter variable is updated within the function too, which
      Clang warns about.
  9. 31 May, 2021 6 commits
  10. 27 May, 2021 1 commit
  11. 25 May, 2021 1 commit
    • Martin Storsjö's avatar
      arm64: filmgrain: Fix overflows in gen_grain · c389d895
      Martin Storsjö authored
      After multiplying two int8_t, the maximum possible output is
      -128*-128 = 16384. One can't add two such values in an int16_t (even if
      all the products of all other int8_t combinations can be).
      
      Previously the summing used 16 bit intermediates for the sum of two
      products and only lengthened the result to 32 bit when accumulating
      three or more products.
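
The arithmetic can be checked with a small scalar sketch. The function names below are hypothetical and only illustrate the range problem; they are not taken from the NEON implementation.

```c
#include <stdint.h>

/* Sum of two int8_t products, accumulated in 32 bits (always exact). */
static int32_t sum2_wide(int8_t a, int8_t b, int8_t c, int8_t d) {
    return (int32_t)a * b + (int32_t)c * d;
}

/* Returns 1 if the sum of the two products fits in an int16_t. */
static int sum2_fits_int16(int8_t a, int8_t b, int8_t c, int8_t d) {
    const int32_t s = sum2_wide(a, b, c, d);
    return s >= INT16_MIN && s <= INT16_MAX;
}
```

With all four inputs at -128, the sum is 16384 + 16384 = 32768, one more than INT16_MAX (32767). The next largest case, -128*-128 + 127*127 = 32513, still fits, which is why the problem only appears for that single extreme input combination.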
      
      Before:                    Cortex A53       A72       A73   Apple M1
      gen_grain_y_ar1_8bpc_neon:   112598.5   71309.2   74889.8   372.2
      gen_grain_y_ar2_8bpc_neon:   139932.4   91442.3   95788.4   387.3
      gen_grain_y_ar3_8bpc_neon:   185607.6  115691.6  131655.8   403.0
      After:
      gen_grain_y_ar1_8bpc_neon:   112968.8   71897.9   76171.2   371.2
      gen_grain_y_ar2_8bpc_neon:   142768.8   94517.9   97934.4   387.5
      gen_grain_y_ar3_8bpc_neon:   191625.2  121083.0  135975.3   405.6
  12. 18 May, 2021 1 commit
  13. 16 May, 2021 1 commit
  14. 14 May, 2021 1 commit
  15. 13 May, 2021 4 commits
  16. 12 May, 2021 2 commits
  17. 11 May, 2021 1 commit
  18. 10 May, 2021 1 commit
  19. 04 May, 2021 6 commits