  1. Sep 01, 2020
    • cli: Use proper integer math in Y4M PAR calculations · 3bfe8c7c
      Henrik Gramner authored and Henrik Gramner committed
      The previous floating-point implementation produced results that were
      sometimes slightly off due to rounding errors.
      
      For example, a frame size of 432x240 with a render size of 176x240
      previously resulted in a PAR of 98:240 instead of the correct 11:27.
      
      Also reduce fractions to produce more readable numbers.
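
A minimal sketch of the integer approach described in this commit message, not the actual dav1d code. It assumes the PAR is the ratio (render_w * frame_h) : (render_h * frame_w), reduced by the greatest common divisor. The names `gcd_u64` and `par_from_sizes` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Greatest common divisor via Euclid's algorithm (hypothetical helper). */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: derive a reduced pixel aspect ratio from the coded
 * frame size and the render size using integer math only.
 * Assumption: PAR = (render_w * frame_h) : (render_h * frame_w). */
static void par_from_sizes(uint32_t frame_w, uint32_t frame_h,
                           uint32_t render_w, uint32_t render_h,
                           uint64_t *par_num, uint64_t *par_den) {
    /* Widen before multiplying so the products cannot overflow. */
    uint64_t num = (uint64_t)render_w * frame_h;
    uint64_t den = (uint64_t)render_h * frame_w;
    uint64_t g = gcd_u64(num, den);
    if (g != 0) { /* g is 0 only if both products are 0 */
        num /= g;
        den /= g;
    }
    *par_num = num;
    *par_den = den;
}
```

With the sizes from the example above (frame 432x240, render 176x240), the products are 42240 and 103680, their greatest common divisor is 3840, and the reduced ratio is 11:27.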
  2. Aug 30, 2020
  3. Aug 29, 2020
    • arm32: mc: NEON implementation of avg/mask/w_avg for 16 bpc · 80aa7823
      Martin Storsjö authored
                            Cortex A7       A8       A9      A53      A72      A73
      avg_w4_16bpc_neon:        131.4     81.8    117.3    111.0     50.9     58.8
      avg_w8_16bpc_neon:        291.9    173.1    293.1    230.9    114.7    128.8
      avg_w16_16bpc_neon:       803.3    480.1    821.4    645.8    345.7    384.9
      avg_w32_16bpc_neon:      3350.0   1833.1   3188.1   2343.5   1343.9   1500.6
      avg_w64_16bpc_neon:      8185.9   4390.6  10448.2   6078.8   3303.6   3466.7
      avg_w128_16bpc_neon:    22384.3  10901.2  33721.9  16782.7   8165.1   8416.5
      w_avg_w4_16bpc_neon:      251.3    165.8    203.9    158.3     99.6    106.9
      w_avg_w8_16bpc_neon:      638.4    427.8    555.7    365.1    283.2    277.4
      w_avg_w16_16bpc_neon:    1912.3   1257.5   1623.4   1056.5    879.5    841.8
      w_avg_w32_16bpc_neon:    7461.3   4889.6   6383.8   3966.3   3286.8   3296.8
      w_avg_w64_16bpc_neon:   18689.3  11698.1  18487.3  10134.1   8156.2   7939.5
      w_avg_w128_16bpc_neon:  48776.6  28989.0  53203.3  26004.1  20055.2  20049.4
      mask_w4_16bpc_neon:       298.6    189.2    242.3    191.6    115.2    129.6
      mask_w8_16bpc_neon:       768.6    501.5    646.1    432.4    302.9    326.8
      mask_w16_16bpc_neon:     2320.5   1480.9   1873.0   1270.2    932.2    976.1
      mask_w32_16bpc_neon:     9412.0   5791.9   7348.5   4875.1   3896.4   3821.1
      mask_w64_16bpc_neon:    23385.9  13875.6  21383.8  12235.9   9469.2   9160.2
      mask_w128_16bpc_neon:   60466.4  34762.6  61055.9  31214.0  23299.0  23324.5
      
      For comparison, the corresponding numbers for the existing arm64
      implementation:
      
      avg_w4_16bpc_neon:                                    78.0     38.5     50.0
      avg_w8_16bpc_neon:                                   198.3    105.4    117.8
      avg_w16_16bpc_neon:                                  614.9    339.9    376.7
      avg_w32_16bpc_neon:                                 2313.8   1391.1   1487.7
      avg_w64_16bpc_neon:                                 5733.3   3269.1   3648.4
      avg_w128_16bpc_neon:                               15105.9   8143.5   8970.4
      w_avg_w4_16bpc_neon:                                 119.2     87.7     92.9
      w_avg_w8_16bpc_neon:                                 322.9    252.3    263.5
      w_avg_w16_16bpc_neon:                               1016.8    794.0    828.6
      w_avg_w32_16bpc_neon:                               3910.9   3159.6   3308.3
      w_avg_w64_16bpc_neon:                               9499.6   7933.9   8026.5
      w_avg_w128_16bpc_neon:                             24508.3  19502.0  20389.8
      mask_w4_16bpc_neon:                                  138.9     98.7    106.7
      mask_w8_16bpc_neon:                                  375.5    301.1    302.7
      mask_w16_16bpc_neon:                                1217.2   1064.6    954.4
      mask_w32_16bpc_neon:                                4821.0   4018.4   3825.7
      mask_w64_16bpc_neon:                               12262.7   9471.3   9169.7
      mask_w128_16bpc_neon:                              31356.6  22657.6  23324.5
  4. Aug 28, 2020
  5. Aug 22, 2020
  6. Aug 21, 2020
  7. Aug 07, 2020
    • checkasm: Add ifdefs around the readtime check · 5bbd9632
      Martin Storsjö authored
      This fixes building in configurations where no readtime implementation
      is available at all, such as MSVC targeting 32 bit ARM.
      
      This was missed when the check was added in
      95a19254.
    • checkasm: Enforce declare_func to be outside of check_func · 0b824944
      Martin Storsjö authored
      Move the declaration of func_ref/func_new into declare_func. This
      enforces that declare_func is a scope outside of/before check_func.
      
      This ensures that if the signal handler is triggered, we rewind
      to a scope outside of check_func, where check_func makes sure we
      don't rerun the test that just triggered the signal handler.
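
The control flow described here can be sketched as follows. This is a simplified illustration, not the checkasm implementation: the signal handler is replaced by a direct `longjmp` call (`simulated_crash`), and all names (`context_buf`, `check_func_sketch`, `run_tests_sketch`) are hypothetical. The `setjmp` call stands in for the scope that declare_func sets up, outside the loop that calls the check function.

```c
#include <setjmp.h>

enum { NUM_FUNCS = 3 };

static jmp_buf context_buf;    /* rewind target (hypothetical name) */
static int seen[NUM_FUNCS];    /* functions already handed out for testing */
static int failed[NUM_FUNCS];  /* functions that crashed */
static int current = -1;       /* function most recently handed out */

/* Stand-in for the signal handler: jump back to the saved context. */
static void simulated_crash(void) {
    longjmp(context_buf, 1);
}

/* Returns 1 only the first time a given function is seen, so a function
 * that crashed is not tested again after the rewind. */
static int check_func_sketch(int id) {
    if (seen[id])
        return 0;
    seen[id] = 1;
    current = id;
    return 1;
}

/* Function 1 plays the role of a crashing assembly function. */
static void func_under_test(int id) {
    if (id == 1)
        simulated_crash();
}

/* Returns the number of functions that were marked as failed. */
int run_tests_sketch(void) {
    /* "declare_func" scope: the context is saved outside of the loop,
     * so this frame is still live when the crash rewinds to it. */
    if (setjmp(context_buf))
        failed[current] = 1;

    /* After a rewind the loop starts over; functions that were already
     * seen are skipped by check_func_sketch. */
    for (int id = 0; id < NUM_FUNCS; id++) {
        if (check_func_sketch(id))
            func_under_test(id);
    }

    int count = 0;
    for (int i = 0; i < NUM_FUNCS; i++)
        count += failed[i];
    return count;
}
```

In this sketch, the crash in function 1 rewinds to the `setjmp` point, the loop restarts, functions 0 and 1 are skipped, and function 2 still gets tested. Had the context been saved inside the block guarded by `check_func_sketch`, the rewind would land after the point where the crashed function is filtered out.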
  8. Aug 06, 2020
  9. Aug 05, 2020
  10. Jul 20, 2020
  11. Jul 13, 2020
  12. Jul 10, 2020
  13. Jul 09, 2020
  14. Jul 04, 2020
  15. Jul 02, 2020
    • arm32: ipred: Port 8 bpc NEON implementations of remaining arm64 functions · 8dd9c651
      Martin Storsjö authored
      This matches what is implemented for arm64 so far.
      
      Align the dav1d_sm_weights table to allow aligned loads from it.
      
      Relative speedups over C code (vs potentially autovectorized code, built
      with Clang):
      
                                      Cortex A7     A8     A9    A53    A72    A73
      intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
      intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
      intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
      intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
      intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
      intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
      intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
      intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
      intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
      intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
      intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
      intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
      intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
      intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
      intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
      intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
      intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
      intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
      intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
      intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
      intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
      intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
      intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
      intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
      pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
      pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
      pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
      pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
      pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
      cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
      cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
      cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
      cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
      cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
      cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
      cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
      cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
      cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
      cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
      cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
      cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
      cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
      cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
      cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
      cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
      cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
      cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
      cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
      cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
      cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
      cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
      cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
      cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
      cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
      cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83
    • arm32: ipred: Optimize ipred_dc_w32 · b4291523
      Martin Storsjö authored
      Do the horizontal summing in the same way as for other cases of
      32 pixel summing.
      
      This doesn't seem to affect the runtime significantly (checkasm
      benchmarks vary by a couple of cycles), but it is at least 5
      instructions shorter.
    • 8fd0bc90
    • arm32: ipred: Fix comment formatting · f4a0127a
      Martin Storsjö authored
      This matches the arm64 original. The comment isn't about the condition,
      but about the state after the conditional branch.
    • arm32: ipred: Remove unnecessary operations in ipred_dc_w4 · d00a0227
      Martin Storsjö authored
      These came from matching some parts too closely to the arm64 version
      (where the summation can be done efficiently with uaddlv by zeroing
      the upper half of the register).
      
      Before:                  Cortex A7     A8     A9    A53   A72    A73
      intra_pred_dc_w4_8bpc_neon:  124.5   65.1   90.2  100.4  48.1   50.4
      After:
      intra_pred_dc_w4_8bpc_neon:  120.3   60.7   83.6   94.0  44.1   47.9
    • arm32: ipred: Mark a few more loads as aligned · 74d5cf57
      Martin Storsjö authored
      This speeds things up a bit on older cores.
      
      Also do a load that duplicates the input over the whole register
      instead of just loading a single lane in ipred_v_w4. This can be a
      bit faster on Cortex A8.
      
      Before:                         Cortex A7      A8      A9     A53    A72     A73
      intra_pred_v_w4_8bpc_neon:           54.0    38.4    46.4    47.7   20.4    18.1
      intra_pred_h_w4_8bpc_neon:           66.3    43.1    55.0    57.0   27.9    22.2
      intra_pred_h_w8_8bpc_neon:           81.0    60.2    76.7    66.5   31.1    30.1
      intra_pred_dc_left_w4_8bpc_neon:     91.0    49.0    72.8    77.7   35.4    38.5
      intra_pred_dc_left_w8_8bpc_neon:    103.8    73.5    90.2    84.7   42.8    47.1
      intra_pred_dc_left_w16_8bpc_neon:   156.1   101.8   186.1   119.4   77.7    92.6
      intra_pred_dc_left_w32_8bpc_neon:   270.5   200.5   381.6   191.7  152.6   170.3
      intra_pred_dc_left_w64_8bpc_neon:   560.7   439.1   877.0   375.4  333.5   343.6
      
      After:
      intra_pred_v_w4_8bpc_neon:           53.9    38.0    46.4    47.7   19.8    19.2
      intra_pred_h_w4_8bpc_neon:           66.5    39.2    52.6    57.0   27.7    22.2
      intra_pred_h_w8_8bpc_neon:           80.5    55.8    72.9    66.5   31.4    30.1
      intra_pred_dc_left_w4_8bpc_neon:     91.0    48.2    71.8    77.7   34.9    38.6
      intra_pred_dc_left_w8_8bpc_neon:    103.8    69.6    89.2    84.7   43.2    47.3
      intra_pred_dc_left_w16_8bpc_neon:   182.3    99.9   184.9   118.8   77.7    85.8
      intra_pred_dc_left_w32_8bpc_neon:   355.4   198.9   380.1   190.6  152.9   161.0
      intra_pred_dc_left_w64_8bpc_neon:   517.5   437.4   876.9   375.7  333.3   347.7
    • arm64: ipred: 16 bpc NEON implementation of the cfl_ac 444 function · 72db6607
      Martin Storsjö authored
      Relative speedup over C code:
                             Cortex A53    A72    A73
      cfl_ac_444_w4_16bpc_neon:    8.03   9.41  10.48
      cfl_ac_444_w8_16bpc_neon:   10.17  10.54  10.38
      cfl_ac_444_w16_16bpc_neon:  10.73  10.38   9.73
      cfl_ac_444_w32_16bpc_neon:  10.18   9.43   9.77
    • arm64: ipred: 8 bpc NEON implementation of the cfl_ac 444 function · 9b40bb95
      Martin Storsjö authored
      Relative speedup over C code:
                            Cortex A53    A72    A73
      cfl_ac_444_w4_8bpc_neon:    8.72   8.75  10.50
      cfl_ac_444_w8_8bpc_neon:   13.10  10.77  11.23
      cfl_ac_444_w16_8bpc_neon:  13.08   9.95  10.49
      cfl_ac_444_w32_8bpc_neon:  12.58   9.43  10.63