1. 31 Jan, 2023 2 commits
    • checkasm: Add an --affinity= option for selecting a CPU core · 77b39555
      Martin Storsjö authored
      Add an option for selecting the core where the single thread of
      checkasm runs. This allows benchmarking on specific CPU cores on
      heterogeneous CPUs, like ARM big.LITTLE configurations.
      On Linux, one can easily wrap an invocation of checkasm with
      "taskset -c <n> [...]" - so this option isn't very essential
      there - however it is quite useful on Windows.
      On Windows, it is somewhat possible to do the same by launching
      the tool with "start /B /affinity <hexmask> [...]", but that
      doesn't work well with scripting ("start" returns before the
      command has finished running, and it's not obvious how to
      invoke "start" from within WSL).
      Using "taskset" to launch processes on specific cores within WSL
      on Windows doesn't work - regardless of the Linux level affinity,
      the process ends up running on the performance cores anyway.
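As a rough illustration of what pinning the single benchmark thread involves, here is a minimal Python sketch. It is not checkasm's C implementation, which this log does not show. `pin_to_core` and `affinity_hexmask` are hypothetical helper names: the first relies on `os.sched_setaffinity`, which Python only provides on some platforms such as Linux, and the second builds the kind of hex bitmask that `start /affinity <hexmask>` expects.

```python
import os


def affinity_hexmask(core: int) -> str:
    """Return the hex bitmask with only bit `core` set (bit n selects CPU n)."""
    if core < 0:
        raise ValueError("core index must be non-negative")
    return format(1 << core, "X")


def pin_to_core(core: int) -> bool:
    """Try to restrict the calling process to one logical CPU.

    Returns False where the call is unavailable or the core is rejected.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. Windows or macOS builds of Python
    try:
        os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    except (OSError, ValueError):
        return False
    return True
```

For example, `affinity_hexmask(4)` yields the mask selecting only core 4, and `pin_to_core(4)` attempts the equivalent of `taskset -c 4` for the current process.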
    • arm64: ipred: 8 bpc NEON implementation of the Z3 function · 99956c73
      Martin Storsjö authored
      The implementation is a hybrid between two approaches; one generic
      (but non-ideal) for cases with large max_base_y, which fills two
      pixel columns at a time, i.e. it loops over pixels first vertically,
      then horizontally, which is a non-optimal order.
      For cases with smaller max_base_y, it does two rows at a time, essentially
      doing gathers with the TBX instruction.
      Relative speedup over the C code:
                               Cortex A53    A55    A72    A73    A76   Apple M1
      intra_pred_z3_w4_8bpc_neon:    3.32   2.89   2.78   3.52   2.52   9.67
      intra_pred_z3_w8_8bpc_neon:    6.24   5.55   4.76   5.60   4.11   6.40
      intra_pred_z3_w16_8bpc_neon:   7.64   7.07   4.37   6.23   4.18   8.60
      intra_pred_z3_w32_8bpc_neon:   7.51   7.21   4.34   5.92   4.27   7.88
      intra_pred_z3_w64_8bpc_neon:   6.82   6.25   4.08   5.83   3.52   7.31
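To make the "gathers with the TBX instruction" remark more concrete, the following Python sketch models only the byte-wise semantics of TBX: each index selects a byte from a table, and an out-of-range index leaves the destination byte unchanged. The function name `tbx` and the sample values are illustrative assumptions; how the actual assembly builds its index vectors is not shown in this log.

```python
def tbx(dest, table, indices):
    """Model a byte-wise TBX lookup.

    For each lane, take table[index] when the index is inside the table;
    otherwise keep the value already present in dest.
    """
    if len(dest) != len(indices):
        raise ValueError("dest and indices must have the same number of lanes")
    return [
        table[i] if 0 <= i < len(table) else d
        for d, i in zip(dest, indices)
    ]


if __name__ == "__main__":
    edge = list(range(100, 116))   # stand-in for 16 edge pixels
    lanes = [9, 9, 9, 9]           # previous register contents
    print(tbx(lanes, edge, [3, 5, 7, 20]))
```

Keeping the old lane for out-of-range indices is what distinguishes TBX from TBL, which writes zero in that case.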
  2. 27 Jan, 2023 2 commits
    • arm64: ipred: 8 bpc NEON implementation of the Z1 function · fd4f348e
      Martin Storsjö authored
      Relative speedup over the C code:
                               Cortex A53    A55    A72    A73    A76  Apple M1
      intra_pred_z1_w4_8bpc_neon:    4.09   3.15   3.63   4.16   3.27  13.00
      intra_pred_z1_w8_8bpc_neon:    6.93   5.66   5.57   6.76   5.51   5.50
      intra_pred_z1_w16_8bpc_neon:   7.81   6.85   6.24   7.78   6.59   9.00
      intra_pred_z1_w32_8bpc_neon:  10.56   9.95   8.72  10.95   8.28  13.33
      intra_pred_z1_w64_8bpc_neon:  11.00  11.38   9.11  11.62   8.65  14.61
      (The speedup numbers for M1 are kinda noisy due to the very coarse
      granularity of the timer used there.)
    • checkasm: ipred: Iterate 5 times for each Z1/Z2/Z3 function · 2e990b37
      Martin Storsjö authored
      These functions contain a number of different codepaths; try to
      make sure that we hit most codepaths for each size combination.
      This both gives better test coverage in a single run of checkasm
      and should give a better averaged runtime in benchmarks.
  3. 26 Jan, 2023 2 commits
  4. 12 Jan, 2023 1 commit
  5. 14 Dec, 2022 4 commits
  6. 13 Dec, 2022 3 commits
  7. 09 Dec, 2022 5 commits
  8. 04 Dec, 2022 1 commit
  9. 21 Nov, 2022 2 commits
  10. 10 Nov, 2022 1 commit
  11. 30 Oct, 2022 2 commits
  12. 27 Oct, 2022 1 commit
  13. 26 Oct, 2022 1 commit
  14. 20 Oct, 2022 1 commit
    • threading: Fix a race around frame completion (frame-mt) · 3e7886db
      Victorien Le Couviour--Tuffet authored
      The completion of the first frame to decode while an async reset
      request on that same frame is pending will render it stale. The
      processing of such a stale request is likely to result in a hang.
      One reason this happens is the skip condition at the beginning of
      => Consume the async request before that check.
      Another reason is several threads producing async reset requests in
      parallel: an async request for the first frame could cascade through the
      other threads (other frames) during completion of that frame, and so
      not be caught by the last synchronous reset_task_cur() after
      signaling the main thread and before releasing the lock.
      => To solve this we need to add protections at the racy locations. That
      means after we increase `first`, before returning from
      reset_task_cur_async(), and after consuming the async request.
  15. 10 Oct, 2022 1 commit
    • Handle host_machine.system() 'ios' and 'tvos' the same way as 'darwin' · 5b07b425
      Sebastian Dröge authored
      Despite not being documented in Meson's list of canonical system names,
      Meson does accept 'ios', mostly as a synonym for 'darwin'.
      Using 'ios' instead of 'darwin' allows distinguishing between the
      two in the cases where that is necessary. Therefore, within dav1d, allow
      using the 'ios' name as an alias for 'darwin' as the system name, to allow
      using cross files that make this distinction.
      Meson itself also allows 'tvos' in addition to 'ios' in the internal
      `is_darwin()` function; as such, all 3 are handled the same here.
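The rule described above amounts to treating three system names as one family. A minimal Python sketch of that check follows; the names `DARWIN_LIKE` and `is_darwin_like` are hypothetical and only mirror the idea, not the actual Meson build code in dav1d.

```python
# System names that the commit message says are handled identically.
DARWIN_LIKE = frozenset({"darwin", "ios", "tvos"})


def is_darwin_like(system: str) -> bool:
    """Return True when `system` should take the Darwin code path."""
    return system in DARWIN_LIKE
```

A cross file can then keep using 'ios' or 'tvos' to distinguish targets elsewhere, while anything gated on this check behaves as it does for 'darwin'.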
  16. 30 Sep, 2022 2 commits
  17. 28 Sep, 2022 3 commits
  18. 26 Sep, 2022 1 commit
  19. 19 Sep, 2022 4 commits
    • arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths · 345127a7
      Martin Storsjö authored
      This fixes conformance with the argon test samples, in particular
      with these samples:
      This gives a pretty notable slowdown to these transforms - some
      Before:                                 Cortex A53       A72       A73    Apple M1
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       365.7     290.2     299.8    0.3
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    1865.2    1384.1    1457.5    2.6
      inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   33976.3   26817.0   24864.2   40.4
      After:
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       397.7     322.2     335.1    0.4
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    2121.9    1336.7    1664.6    2.6
      inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   38569.4   27622.6   28176.0   51.0
      Thus, for the transforms alone, it makes them around 10-13% slower
      (the Apple M1 measurements are too noisy to be conclusive here).
      Measured on actual full decoding, it makes decoding of 10 bpc
      Chimera roughly 1% slower on an Apple M1, which is close to
      measurement noise anyway.
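The per-transform slowdown can be recomputed from the timings listed in the commit message. The sketch below assumes the first three rows of timings are the measurements before the change and the second three rows are the measurements after it; the Apple M1 column is left out because the message calls it too noisy. The names `BEFORE`, `AFTER`, `slowdown_percent` and `slowdown_table` are illustrative.

```python
# Timings copied from the commit message, one tuple per transform size.
CORES = ("Cortex A53", "A72", "A73")
BEFORE = {
    "8x8": (365.7, 290.2, 299.8),
    "16x16": (1865.2, 1384.1, 1457.5),
    "64x64": (33976.3, 26817.0, 24864.2),
}
AFTER = {
    "8x8": (397.7, 322.2, 335.1),
    "16x16": (2121.9, 1336.7, 1664.6),
    "64x64": (38569.4, 27622.6, 28176.0),
}


def slowdown_percent(before: float, after: float) -> float:
    """Relative increase in runtime, in percent (negative means faster)."""
    return (after / before - 1.0) * 100.0


def slowdown_table() -> dict:
    """Map each transform size to its per-core slowdown percentages."""
    return {
        size: tuple(
            slowdown_percent(b, a) for b, a in zip(BEFORE[size], AFTER[size])
        )
        for size in BEFORE
    }


if __name__ == "__main__":
    for size, row in slowdown_table().items():
        cells = ", ".join(f"{core}: {pct:+.1f}%" for core, pct in zip(CORES, row))
        print(f"{size}: {cells}")
```

Individual cells vary around the summary figure quoted in the message, and the A72 16x16 entry comes out slightly negative, which suggests some run-to-run noise in the raw numbers.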
    • x86: Fix overflows in 12bpc AVX2 DC-only IDCT · 49b1c3c5
      Henrik Gramner authored
      Using smaller immediates also results in a small code size reduction in
      some cases, so apply those changes to the (10bpc-only) SSE code as well.
    • x86: Fix clipping in high bit-depth AVX2 4x16 IDCT · 0c8a3461
      Henrik Gramner authored
      Certain clips were incorrectly performed on negated values, which
      caused things to be off-by-one in both directions. Correct this by
      negating such values prior to clipping instead of afterwards.
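The off-by-one follows from the asymmetry of a two's-complement range: the minimum has a larger magnitude than the maximum, so clipping and negation do not commute. The Python sketch below demonstrates this with a hypothetical 18-bit signed range; the width and the function names are assumptions for illustration and are not taken from the assembly.

```python
BITS = 18  # hypothetical intermediate width, chosen only for illustration
LO = -(1 << (BITS - 1))
HI = (1 << (BITS - 1)) - 1


def clip(value: int, lo: int = LO, hi: int = HI) -> int:
    """Clamp value into the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def negate_after_clip(v: int) -> int:
    """Problematic order: the clip is applied to the not-yet-negated value."""
    return -clip(v)


def negate_before_clip(v: int) -> int:
    """Corrected order: negate first, then clip the value actually stored."""
    return clip(-v)


if __name__ == "__main__":
    for v in (LO, HI + 1):
        print(v, negate_after_clip(v), negate_before_clip(v))
```

With the problematic order, an input of `LO` produces `HI + 1` (one above the allowed maximum), and an input of `HI + 1` produces `LO + 1` (one short of the allowed minimum), which matches the "off-by-one in both directions" description.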
  20. 15 Sep, 2022 1 commit