  1. Jul 20, 2020
  2. Jul 13, 2020
  3. Jul 10, 2020
  4. Jul 09, 2020
  5. Jul 04, 2020
  6. Jul 02, 2020
    • Martin Storsjö's avatar
      arm32: ipred: Port 8 bpc NEON implementations of remaining arm64 functions · 8dd9c651
      Martin Storsjö authored
      This matches what is implemented for arm64 so far.
      
      Align the dav1d_sm_weights table to allow aligned loads from it.
      
      Relative speedups over C code (vs potentially autovectorized code, built
      with Clang):
      
                                      Cortex A7     A8     A9    A53    A72    A73
      intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
      intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
      intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
      intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
      intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
      intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
      intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
      intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
      intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
      intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
      intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
      intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
      intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
      intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
      intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
      intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
      intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
      intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
      intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
      intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
      intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
      intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
      intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
      intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
      pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
      pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
      pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
      pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
      pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
      cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
      cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
      cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
      cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
      cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
      cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
      cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
      cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
      cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
      cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
      cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
      cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
      cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
      cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
      cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
      cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
      cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
      cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
      cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
      cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
      cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
      cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
      cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
      cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
      cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
      cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83
      8dd9c651
    • Martin Storsjö's avatar
      arm32: ipred: Optimize ipred_dc_w32 · b4291523
      Martin Storsjö authored
      Do the horizontal summing in the same way as for other cases of
      32 pixel summing.
      
      This doesn't seem to affect the runtime significantly though (checkasm
      benchmarks vary by a couple cycles), but it's 5 instructions shorter
      at least.
      b4291523
    • Martin Storsjö's avatar
      8fd0bc90
    • Martin Storsjö's avatar
      arm32: ipred: Fix comment formatting · f4a0127a
      Martin Storsjö authored
      This matches the arm64 original. The comment isn't about the condition,
      but about the state after the conditional branch.
      f4a0127a
    • Martin Storsjö's avatar
      arm32: ipred: Remove unnecessary operations in ipred_dc_w4 · d00a0227
      Martin Storsjö authored
      These came from matching some parts too closely to the arm64 version
      (where the summation can be done efficiently with uaddlv by zeroing
      the upper half of the register).
      
      Before:                  Cortex A7     A8     A9    A53   A72    A73
      intra_pred_dc_w4_8bpc_neon:  124.5   65.1   90.2  100.4  48.1   50.4
      After:
      intra_pred_dc_w4_8bpc_neon:  120.3   60.7   83.6   94.0  44.1   47.9
      d00a0227
    • Martin Storsjö's avatar
      arm32: ipred: Mark a few more loads as aligned · 74d5cf57
      Martin Storsjö authored
      This speeds things up a bit on older cores.
      
      Also do a load that duplicates the input over the whole register
      instead of just loading a single lane in ipred_v_w4. This can be a
      bit faster on Cortex A8.
      
      Before:                         Cortex A7      A8      A9     A53    A72     A73
      intra_pred_v_w4_8bpc_neon:           54.0    38.4    46.4    47.7   20.4    18.1
      intra_pred_h_w4_8bpc_neon:           66.3    43.1    55.0    57.0   27.9    22.2
      intra_pred_h_w8_8bpc_neon:           81.0    60.2    76.7    66.5   31.1    30.1
      intra_pred_dc_left_w4_8bpc_neon:     91.0    49.0    72.8    77.7   35.4    38.5
      intra_pred_dc_left_w8_8bpc_neon:    103.8    73.5    90.2    84.7   42.8    47.1
      intra_pred_dc_left_w16_8bpc_neon:   156.1   101.8   186.1   119.4   77.7    92.6
      intra_pred_dc_left_w32_8bpc_neon:   270.5   200.5   381.6   191.7  152.6   170.3
      intra_pred_dc_left_w64_8bpc_neon:   560.7   439.1   877.0   375.4  333.5   343.6
      
      After:
      intra_pred_v_w4_8bpc_neon:           53.9    38.0    46.4    47.7   19.8    19.2
      intra_pred_h_w4_8bpc_neon:           66.5    39.2    52.6    57.0   27.7    22.2
      intra_pred_h_w8_8bpc_neon:           80.5    55.8    72.9    66.5   31.4    30.1
      intra_pred_dc_left_w4_8bpc_neon:     91.0    48.2    71.8    77.7   34.9    38.6
      intra_pred_dc_left_w8_8bpc_neon:    103.8    69.6    89.2    84.7   43.2    47.3
      intra_pred_dc_left_w16_8bpc_neon:   182.3    99.9   184.9   118.8   77.7    85.8
      intra_pred_dc_left_w32_8bpc_neon:   355.4   198.9   380.1   190.6  152.9   161.0
      intra_pred_dc_left_w64_8bpc_neon:   517.5   437.4   876.9   375.7  333.3   347.7
      74d5cf57
    • Martin Storsjö's avatar
      arm64: ipred: 16 bpc NEON implementation of the cfl_ac 444 function · 72db6607
      Martin Storsjö authored
      Relative speedup over C code:
                             Cortex A53    A72    A73
      cfl_ac_444_w4_16bpc_neon:    8.03   9.41  10.48
      cfl_ac_444_w8_16bpc_neon:   10.17  10.54  10.38
      cfl_ac_444_w16_16bpc_neon:  10.73  10.38   9.73
      cfl_ac_444_w32_16bpc_neon:  10.18   9.43   9.77
      72db6607
    • Martin Storsjö's avatar
      arm64: ipred: 8 bpc NEON implementation of the cfl_ac 444 function · 9b40bb95
      Martin Storsjö authored
      Relative speedup over C code:
                            Cortex A53    A72    A73
      cfl_ac_444_w4_8bpc_neon:    8.72   8.75  10.50
      cfl_ac_444_w8_8bpc_neon:   13.10  10.77  11.23
      cfl_ac_444_w16_8bpc_neon:  13.08   9.95  10.49
      cfl_ac_444_w32_8bpc_neon:  12.58   9.43  10.63
      9b40bb95
    • Martin Storsjö's avatar
      arm64: ipred: Remove an unnecessary branch in cfl_ac_420 · 2e271c49
      Martin Storsjö authored
      The branch target is directly afterwards, so the branch isn't needed.
      2e271c49
    • Martin Storsjö's avatar
      arm64: ipred: Remove an accidental leftover instruction · a903642a
      Martin Storsjö authored
      It became unused in 38629906.
      a903642a
    • Martin Storsjö's avatar
      arm64: ipred: Optimize the w16/w32 loop of pred_filter a bit · 2e36a3be
      Martin Storsjö authored
      Before:                        Cortex A53     A72     A73
      intra_pred_filter_w16_8bpc_neon:    540.2   573.8   580.2
      intra_pred_filter_w32_8bpc_neon:   1223.1  1364.1  1292.9
      After:
      intra_pred_filter_w16_8bpc_neon:    531.4   559.8   565.4
      intra_pred_filter_w32_8bpc_neon:   1243.0  1308.6  1270.9
      
      This does give a minor slowdown for the w32 case on A53, but helps
      on w16 and quite notably in all cases on A72 and A73. Doing the same
      modification on ipred16.S doesn't give quite as clear gains (the gains
      on A72 and A73 are smaller, and the regression on A53 on w32 is a
      bit bigger), so not doing the same adjustment there.
      2e36a3be
    • Martin Storsjö's avatar
      arm64: ipred: Fix a comment typo · a26882d2
      Martin Storsjö authored
      a26882d2
  7. Jul 01, 2020
  8. Jun 29, 2020
  9. Jun 25, 2020
  10. Jun 24, 2020
  11. Jun 23, 2020
    • Henrik Gramner's avatar
      x86inc: Add template defines for EVEX broadcasts · 8ec5ff0e
      Henrik Gramner authored and Henrik Gramner committed
      Broadcasting a memory operand is a binary flag, you either broadcast
      or you don't, and there's only a single possible element size for
      any given instruction.
      
      The instruction syntax however requires the broadcast semantics
      to be explicitly defined, which is an issue when using macros to
      template code for multiple register widths.
      
      Add some helper defines to alleviate the issue.
      8ec5ff0e
    • Ronald S. Bultje's avatar
      Accumulate leb128 value using uint64_t as intermediate type · 47daa4df
      Ronald S. Bultje authored
      The shift-amount can be up to 56, and left-shifting 32-bit integers
      by values >=32 is undefined behaviour. Therefore, use 64-bit integers
      instead. Also slightly rewrite so we only call dav1d_get_bits() once
      for the combined more|bits value, and mask the relevant portions
      out instead of reading twice. Lastly, move the overflow check out of
      the loop (as suggested by @wtc)
      
      Fixes #341.
      47daa4df
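The accumulation described above can be sketched as follows. This is a minimal illustration that reads from a plain byte buffer, not dav1d's bit reader: `read_uleb128`, its parameters, and its return convention are hypothetical names chosen for this example. It shows the three points from the commit message: a 64-bit accumulator so that shifts of 32 or more stay well defined, a single read per byte from which the "more" flag and the 7 payload bits are masked out, and one overflow check after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not dav1d's API): decode one unsigned LEB128 value
 * from buf[0..len). Returns 0 and stores the value in *out on success;
 * returns -1 if the input is truncated, the encoding is longer than
 * 8 bytes, or the value does not fit in 32 bits. */
static int read_uleb128(const uint8_t *buf, size_t len, uint32_t *out) {
    assert(buf != NULL && out != NULL);

    uint64_t val = 0;    /* 64-bit accumulator: shifts >= 32 are defined */
    unsigned shift = 0;
    unsigned more;
    size_t pos = 0;

    do {
        if (pos >= len) return -1;       /* ran out of input */
        const unsigned v = buf[pos++];   /* one read: flag plus 7 value bits */
        more = v & 0x80;                 /* continuation flag */
        val |= (uint64_t)(v & 0x7F) << shift;
        shift += 7;
    } while (more && shift < 56);

    /* Overflow check performed once, outside the loop. */
    if (more || val > UINT32_MAX) return -1;

    *out = (uint32_t)val;
    return 0;
}
```

For instance, the byte sequence `0xE5 0x8E 0x26` decodes to 624485, while `0xFF 0xFF 0xFF 0xFF 0x1F` is rejected because the value needs 33 bits.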
  12. Jun 21, 2020
  13. Jun 20, 2020
  14. Jun 19, 2020
    • Henrik Gramner's avatar
      x86: Branch before waiting on popcnt in ipred_z AVX2 functions · bf7adb75
      Henrik Gramner authored and Henrik Gramner committed
      Some specific Haswell CPUs have a hardware bug where the popcnt
      instruction doesn't set the zero flag correctly, which causes the wrong
      branch to be taken.
      
      popcnt also has a 3-cycle latency on Intel CPUs, so doing the branch
      on the input value instead of the output reduces the amount of time
      wasted going down the wrong code path in case of branch mispredictions.
      bf7adb75
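The idea of branching on the input rather than on the popcnt result relies on a simple equivalence: the population count of a value is zero exactly when the value itself is zero. The sketch below illustrates only that logical equivalence in C; the actual change is in AVX2 assembly, and a C compiler is free to schedule either variant as it likes. All function names here are hypothetical, and `popcount32` is a portable stand-in for the popcnt instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the popcnt instruction. */
static unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Variant A: the branch depends on the popcount result, so it cannot
 * resolve until the count is available. */
static unsigned count_or_fallback_a(uint32_t mask, unsigned fallback) {
    const unsigned n = popcount32(mask);
    if (n == 0)
        return fallback;
    return n;
}

/* Variant B: the branch depends only on the input value; the count is
 * computed independently of the branch condition. */
static unsigned count_or_fallback_b(uint32_t mask, unsigned fallback) {
    if (mask == 0)
        return fallback;
    return popcount32(mask);
}

/* Both variants must agree for every input. */
static void check_equivalent(uint32_t mask, unsigned fallback) {
    assert(count_or_fallback_a(mask, fallback) ==
           count_or_fallback_b(mask, fallback));
}
```

Variant B also never consults a flag produced by the count, which is the property that sidesteps the hardware bug described above.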
    • Martin Storsjö's avatar
      arm32: Add a NEON implementation of MSAC · 53e7b21e
      Martin Storsjö authored
      Only use this in the cases when NEON can be used unconditionally
      without runtime detection (when __ARM_NEON is defined).
      
      The speedup over the C code is very modest for the smaller functions
      (and the NEON version actually is a little slower than the C code
      on Cortex A7 for adapt4), but the speedup is around 2x for
      adapt16.
      
                                    Cortex A7     A8     A9    A53    A72    A73
      msac_decode_bool_c:                41.1   43.0   43.0   37.3   26.2   31.3
      msac_decode_bool_neon:             40.2   42.0   37.2   32.8   19.9   25.5
      msac_decode_bool_adapt_c:          65.1   70.4   58.5   54.3   33.2   40.8
      msac_decode_bool_adapt_neon:       56.8   52.4   49.3   42.6   27.1   33.7
      msac_decode_bool_equi_c:           36.9   37.2   42.8   32.6   22.7   42.3
      msac_decode_bool_equi_neon:        34.9   35.1   36.4   29.7   19.5   36.4
      msac_decode_symbol_adapt4_c:      114.2  139.0  111.6   99.9   65.5   83.5
      msac_decode_symbol_adapt4_neon:   119.2  128.3   95.7   82.2   58.2   57.5
      msac_decode_symbol_adapt8_c:      176.0  207.9  164.0  154.4   88.0  117.0
      msac_decode_symbol_adapt8_neon:   128.3  130.3  110.7   85.1   59.9   61.4
      msac_decode_symbol_adapt16_c:     292.1  320.5  256.4  246.4  129.1  173.3
      msac_decode_symbol_adapt16_neon:  162.2  144.3  129.0  104.2   69.2   69.9
      
      (Omitting msac_decode_hi_tok from the benchmark, as the "C" version
      measured there uses the NEON version of msac_decode_symbol_adapt4.)
      53e7b21e
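Gating an implementation on `__ARM_NEON` means the choice is made at compile time, with no runtime CPU detection. The sketch below shows that pattern on a deliberately simple function that sums 8 bytes; it is not the MSAC code, and `sum8_c`, `sum8_neon`, and `sum8` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C fallback: sum of 8 bytes. */
static unsigned sum8_c(const uint8_t *p) {
    assert(p != NULL);
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += p[i];
    return sum;
}

#if defined(__ARM_NEON)
/* NEON variant, compiled only when the compiler guarantees NEON
 * (so no runtime detection is needed). */
static unsigned sum8_neon(const uint8_t *p) {
    assert(p != NULL);
    const uint8x8_t  v   = vld1_u8(p);       /* load 8 bytes */
    const uint16x4_t s16 = vpaddl_u8(v);     /* pairwise widen-add: 8 -> 4 */
    const uint32x2_t s32 = vpaddl_u16(s16);  /* 4 -> 2 */
    const uint64x1_t s64 = vpaddl_u32(s32);  /* 2 -> 1 */
    return (unsigned)vget_lane_u64(s64, 0);
}
#define sum8 sum8_neon
#else
#define sum8 sum8_c
#endif
```

Callers use `sum8` and get whichever variant the build selected; on targets where NEON is only optionally present, this pattern falls back to the C version.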
  15. Jun 18, 2020
    • Martin Storsjö's avatar
      arm64: msac: Add a special cased implementation of decode_hi_tok · 370200cd
      Martin Storsjö authored
      The speedup (over the normal version, that just calls the existing
      assembly version of symbol_adapt4) is not very impressive on
      bigger cores, but looks decent on small cores. It's an improvement
      though, in any case.
      
                                   Cortex A53    A72    A73
      msac_decode_hi_tok_c:             175.7  136.2  138.1
      msac_decode_hi_tok_neon:          146.8  129.4  125.9
      370200cd
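To make the "small cores versus bigger cores" comparison concrete, the timings above can be turned into speedup ratios. This is a small sketch that assumes lower timings are better and defines speedup as the C timing divided by the NEON timing; the struct and function names are hypothetical, and the numbers are copied from the table in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* One row of benchmark timings (lower is better). */
struct bench {
    const char *core;
    double c_time;     /* msac_decode_hi_tok_c */
    double neon_time;  /* msac_decode_hi_tok_neon */
};

/* Timings copied from the commit message above. */
static const struct bench hi_tok[] = {
    { "Cortex A53", 175.7, 146.8 },
    { "Cortex A72", 136.2, 129.4 },
    { "Cortex A73", 138.1, 125.9 },
};

/* Relative speedup of the optimized routine over the baseline. */
static double speedup(double baseline_time, double optimized_time) {
    assert(optimized_time > 0.0);
    return baseline_time / optimized_time;
}
```

With these figures the ratio is roughly 1.20 on the A53, 1.05 on the A72, and 1.10 on the A73.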