1. 16 Apr, 2021 1 commit
    • dav1d: add event flags to the decoding process · a98f5e60
      James Almer authored
      Also add a function to fetch them. This should be useful for signalling
      changes in the bitstream that the user may want to know about.
      
      Starting with two flags, DAV1D_EVENT_FLAG_NEW_SEQUENCE and
      DAV1D_EVENT_FLAG_NEW_OP_PARAMS_INFO, which signal, respectively, a new
      sequence header and new operating parameter info in the last returned
      (or to-be-returned) picture.
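      A minimal usage sketch of the new API, assuming the fetch-and-clear
      behaviour described in dav1d's headers; error handling trimmed:

          #include <stdio.h>
          #include <dav1d/dav1d.h>

          /* Call after dav1d_get_picture() succeeds; dav1d_get_event_flags()
           * returns the flags accumulated since the last call and clears them. */
          static void check_events(Dav1dContext *c)
          {
              enum Dav1dEventFlags flags;
              if (dav1d_get_event_flags(c, &flags) < 0)
                  return;
              if (flags & DAV1D_EVENT_FLAG_NEW_SEQUENCE)
                  printf("new sequence header in the last picture\n");
              if (flags & DAV1D_EVENT_FLAG_NEW_OP_PARAMS_INFO)
                  printf("new operating parameter info in the last picture\n");
          }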
  2. 14 Apr, 2021 3 commits
  3. 15 Mar, 2021 1 commit
  4. 19 Feb, 2021 9 commits
  5. 17 Feb, 2021 2 commits
  6. 16 Feb, 2021 1 commit
  7. 15 Feb, 2021 3 commits
    • Eliminate 1D scan tables · 5faff383
      Henrik Gramner authored
    • Optimize non-qmatrix coefficient decoding · 989057fb
      Henrik Gramner authored
      Not having a quantizer matrix is the most common case, so it's
      worth having a separate code path for it that eliminates some
      calculations and table lookups.
      
      Without a qm we can skip calculating dq * qm, and since only
      Exp-Golomb-coded coefficients have the potential to overflow, we can
      also skip clipping for the vast majority of coefficients.
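      A rough C sketch of the split described above; the names, shift and
      clip range are illustrative, not dav1d's actual internals:

          #include <stdint.h>

          static inline int iclip(int v, int min, int max)
          {
              return v < min ? min : v > max ? max : v;
          }

          static inline int dequant(int coef, int dq, const uint8_t *qm, int pos)
          {
              if (qm) {
                  /* General path: per-position qm scaling; the product can
                   * overflow the coefficient range and must be clipped. */
                  int64_t v = ((int64_t)coef * dq * qm[pos]) >> 5;
                  return iclip((int)v, -(1 << 23), (1 << 23) - 1);
              }
              /* No qm: skip the dq * qm multiply, and clip only the rare
               * Exp-Golomb-coded coefficients (handled by the caller). */
              return coef * dq;
          }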
    • Optimize decoding of non-zero coefficients · a92e307f
      Henrik Gramner authored
      Cache indices of non-zero coefficients during the AC token decoding
      loop in order to speed up the sign decoding/dequant loop later.
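      The idea in sketch form, with hypothetical helpers standing in for the
      real entropy decoder: remember which positions were non-zero while
      decoding tokens, then touch only those in the second pass:

          #include <stdint.h>

          /* Hypothetical stand-ins for the actual decoder routines. */
          int decode_token(int i);
          int decode_sign(void);
          int dequant_coef(int level);

          void decode_coefs(int32_t *coef, uint8_t *levels, int eob)
          {
              /* Pass 1: the token loop records positions of non-zero levels. */
              uint16_t nz_pos[64 * 64];
              int nnz = 0;
              for (int i = 0; i <= eob; i++) {
                  int level = decode_token(i);
                  levels[i] = (uint8_t)level;
                  if (level) nz_pos[nnz++] = (uint16_t)i;
              }

              /* Pass 2: the sign/dequant loop visits only the cached
               * positions instead of re-scanning everything up to eob. */
              for (int j = 0; j < nnz; j++) {
                  int i = nz_pos[j];
                  int v = dequant_coef(levels[i]);
                  coef[i] = decode_sign() ? -v : v;
              }
          }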
  8. 13 Feb, 2021 1 commit
  9. 12 Feb, 2021 1 commit
  10. 11 Feb, 2021 4 commits
    • Set thread names on Haiku · b44ec453
      Emmanuel Gil Peyrot authored
    • x86: Rewrite SGR AVX2 asm · fe2bb774
      Henrik Gramner authored
      The previous implementation did multiple passes in the horizontal
      and vertical directions, with the intermediate values being stored
      in buffers on the stack. This caused bad cache thrashing.

      By interleaving all the different passes, in combination with a
      ring buffer that stores only a few rows at a time, performance
      is improved by a significant amount.

      Also slightly speed up the neighbor calculations by packing the a
      and b values into a single 32-bit unsigned integer, which allows
      performing calculations on both values simultaneously.
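      A scalar sketch of the packing trick, under assumed value ranges and
      layout (the real AVX2 code may pack differently):

          #include <stdint.h>

          /* Pack a into the high half and b into the low half.  As long as
           * the partial sums of b stay below 1 << 16, adding packed words
           * sums both halves at once, with no carry crossing the boundary. */
          static inline uint32_t pack_ab(uint16_t a, uint16_t b)
          {
              return ((uint32_t)a << 16) | b;
          }

          /* One add per packed word now handles both neighbor sums. */
          static inline uint32_t sum3(uint32_t above, uint32_t cur, uint32_t below)
          {
              return above + cur + below;
          }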
    • Add minor SGR optimizations · c290c02e
      Henrik Gramner authored
      Split the 5x5, 3x3, and mix cases into separate functions.
      
      Shrink some tables.
      
      Move some scalar calculations out of the DSP function.
      
      Make Wiener and SGR share the same function prototype to
      eliminate a branch in lr_stripe().
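      How a shared prototype removes the branch, sketched with an assumed
      signature (not dav1d's actual one):

          #include <stddef.h>
          #include <stdint.h>

          /* Assumed shared signature for both restoration filters. */
          typedef void (*lr_fn)(uint8_t *dst, ptrdiff_t stride,
                                const uint8_t *src, int w, int h,
                                const void *params);

          /* lr_stripe() can then select the filter once, with no branch on
           * the restoration type inside the filtering loop:
           *
           *     lr_fn lr = is_wiener ? wiener_filter : sgr_filter;
           *     lr(dst, stride, src, w, h, params);
           */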
    • x86inc: Add stack probing on Windows · c36b191a
      Henrik Gramner authored
      Large stack allocations on Windows need to use stack probing in order
      to guarantee that all stack memory is committed before accessing it.
      This is done by ensuring that the guard page(s) at the end of the
      currently committed pages are touched prior to any pages beyond that.
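      The same idea in C terms (x86inc does this in asm as part of the
      stack allocation; 4 KB is the usual Windows page size):

          #include <stddef.h>

          #define PAGE_SIZE 4096

          /* Touch one byte per page, in order, so the OS commits each guard
           * page before any access lands beyond the committed region. */
          static void stack_probe(volatile unsigned char *buf, size_t size)
          {
              for (size_t off = 0; off < size; off += PAGE_SIZE)
                  buf[off] = 0;
          }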
  11. 10 Feb, 2021 1 commit
    • arm64: itx16: Use usqadd to avoid separate clamping of negative values · 6f9f3391
      Martin Storsjö authored
      Before:                                Cortex A53     A72      A73
      inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       40.7    23.0     24.0
      inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0    71.5     78.2
      inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       85.7    50.7     53.8
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      287.0   203.5    215.2
      inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    255.7   129.1    140.4
      inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1401.4  1026.7   1039.2
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1913.2  1407.3   1479.6
      After:
      inv_txfm_add_4x4_dct_dct_0_10bpc_neon:       38.7    21.5     22.2
      inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      116.0    71.3     77.2
      inv_txfm_add_8x8_dct_dct_0_10bpc_neon:       76.7    44.7     43.5
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      278.0   203.0    203.9
      inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    236.9   106.2    116.2
      inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   1368.7   999.7   1008.4
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   1880.5  1381.2   1459.4
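      A rough C model of what usqadd does on a 16-bit lane, to show why the
      separate clamp of negative values becomes unnecessary:

          #include <stdint.h>

          /* Add a signed residual to an unsigned accumulator, saturating at
           * 0 and 65535 (what usqadd gives for free).  The clamp to the
           * 10/12-bit pixel maximum still happens afterwards. */
          static inline uint16_t usqadd16(uint16_t u, int16_t s)
          {
              int32_t v = (int32_t)u + s;
              return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t)v;
          }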
  12. 09 Feb, 2021 2 commits
    • arm64: looprestoration: Rewrite the wiener functions · 2e73051c
      Martin Storsjö authored
      Make them operate in a more cache-friendly manner, interleaving
      horizontal and vertical filtering (reducing the amount of stack
      used from 51 KB to 4 KB), similar to what was done for x86 in
      78d27b7d.

      This also adds separate 5-tap versions of the filters and unrolls
      the vertical filter a bit more (which maybe could have been done
      without doing the rewrite).

      This does, however, increase the compiled code size by around
      3.5 KB.
      
      Before:                Cortex A53       A72       A73
      wiener_5tap_8bpc_neon:   136855.6   91446.2   87363.6
      wiener_7tap_8bpc_neon:   136861.6   91454.9   87374.5
      wiener_5tap_10bpc_neon:  167685.3  114720.3  116522.1
      wiener_5tap_12bpc_neon:  167677.5  114724.7  116511.9
      wiener_7tap_10bpc_neon:  167681.6  114738.5  116567.0
      wiener_7tap_12bpc_neon:  167673.8  114720.8  116515.4
      After:
      wiener_5tap_8bpc_neon:    87102.1   60460.6   66803.8
      wiener_7tap_8bpc_neon:   110831.7   78489.0   82015.9
      wiener_5tap_10bpc_neon:  109999.2   90259.0   89238.0
      wiener_5tap_12bpc_neon:  109978.3   90255.7   89220.7
      wiener_7tap_10bpc_neon:  137877.6  107578.5  103435.6
      wiener_7tap_12bpc_neon:  137868.8  107568.9  103390.4
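      A C sketch of the interleaved structure, with illustrative stand-ins
      for the real filters (names, sizes and edge handling are assumptions;
      the source is assumed padded to h + TAPS - 1 rows):

          #include <stddef.h>
          #include <stdint.h>

          #define MAX_W 384 /* illustrative stripe width */
          #define TAPS  7
          #define RING  8   /* ring size >= TAPS         */

          void filter_h(int16_t *mid, const uint8_t *src, int w); /* stand-in */
          void filter_v(uint8_t *dst, int16_t **rows, int w);     /* stand-in */

          /* Only RING horizontally filtered rows are live at any time (a few
           * KB), instead of a buffer covering the whole stripe (tens of KB). */
          void wiener_stripe(uint8_t *dst, ptrdiff_t dst_stride,
                             const uint8_t *src, ptrdiff_t src_stride,
                             int w, int h)
          {
              int16_t ring[RING][MAX_W];

              for (int y = 0; y < h + TAPS - 1; y++) {
                  filter_h(ring[y % RING], src + y * src_stride, w);
                  if (y >= TAPS - 1) {
                      int16_t *rows[TAPS];
                      for (int i = 0; i < TAPS; i++)
                          rows[i] = ring[(y - (TAPS - 1) + i) % RING];
                      filter_v(dst + (y - (TAPS - 1)) * dst_stride, rows, w);
                  }
              }
          }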
    • arm64: mc: Improve first tap for inorder cores · 4e869495
      Kyle Siefring authored
      Change the order of the multiply-accumulates to allow in-order cores
      to forward the results.
  13. 08 Feb, 2021 2 commits
    • arm32: mc: Optimize warp by doing horz filtering in 8 bit · 0477fcf1
      Martin Storsjö authored
      Additionally, reschedule the load instructions to reduce stalls
      on in-order cores.

      This applies the changes from a3b8157e
      to the arm32 version.
      
      Before:             Cortex A7      A8      A9     A53     A72     A73
      warp_8x8_8bpc_neon:    3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
      warp_8x8t_8bpc_neon:   3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
      warp_8x8_16bpc_neon:   4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
      warp_8x8t_16bpc_neon:  3973.9  2137.1  2299.6  2413.2  1282.8  1369.6
      After:
      warp_8x8_8bpc_neon:    2920.8  1269.8  1410.3  1767.3   860.2  1004.8
      warp_8x8t_8bpc_neon:   2904.9  1283.9  1397.5  1743.7   863.6  1024.7
      warp_8x8_16bpc_neon:   3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
      warp_8x8t_16bpc_neon:  3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
    • lf_mask: Align an array that is accessed via aliasing structures · 0a577fd2
      Martin Storsjö authored
      This fixes bus errors due to missing alignment, when built with GCC 9
      for arm32 with -mfpu=neon.
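      The pattern in miniature (illustrative types, not the actual lf_mask
      layout):

          #include <stdalign.h>
          #include <stdint.h>

          /* A wider view that other code uses to read the byte array. */
          typedef struct { uint32_t u32[4]; } wide_view;

          /* Without the alignas, GCC may emit alignment-assuming NEON loads
           * for the aliasing accesses and fault on an underaligned array. */
          static alignas(wide_view) uint8_t buf[64];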
  14. 06 Feb, 2021 2 commits
  15. 05 Feb, 2021 4 commits
    • Fix potential deadlock · 8b1a96e4
      Victorien Le Couviour--Tuffet authored
      If allocation of the postfilter tasks fails, a deadlock could occur.
    • 505e9990
      Martin Storsjö authored
    • arm64: warped motion: Various optimizations · a3b8157e
      Kyle Siefring authored
      - Reorder the loads of the filters to benefit in-order cores.
      - Use full 128-bit vectors to transpose the 8x8 bytes. zip1 is called
        in the first stage, which will hurt performance on some older big
        cores.
      - Rework the horizontal stage for 8-bit mode:
          * Use smull instead of mul
          * Replace existing narrowing and long instructions
          * Replace the mov after calling with a right shift
      
      Before:            Cortex A55    A53     A72     A73
      warp_8x8_8bpc_neon:    1683.2  1860.6  1065.0  1102.6
      warp_8x8t_8bpc_neon:   1673.2  1846.4  1057.0  1098.4
      warp_8x8_16bpc_neon:   1870.7  2031.7  1147.3  1220.7
      warp_8x8t_16bpc_neon:  1848.0  2006.2  1121.6  1188.0
      After:
      warp_8x8_8bpc_neon:    1267.2  1446.2   807.0   871.5
      warp_8x8t_8bpc_neon:   1245.4  1422.0   810.2   868.4
      warp_8x8_16bpc_neon:   1769.8  1929.3  1132.0  1238.2
       warp_8x8t_16bpc_neon:  1747.3  1904.1  1101.5  1207.9
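      A scalar view of the smull change, with signs and widths chosen for
      illustration: use a widening multiply so the products land in 16 bits
      directly, instead of widening the inputs first and multiplying at
      16 bits:

          #include <stdint.h>

          /* Widening multiply-accumulate: 8-bit x 8-bit -> 16-bit, the
           * scalar analogue of smull/smlal replacing widen-then-mul. */
          static int16_t filter8(const int8_t *src, const int8_t *filt)
          {
              int16_t acc = 0;
              for (int i = 0; i < 8; i++)
                  acc += (int16_t)src[i] * filt[i];
              return acc;
          }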
    • arm64: loopfilter: Avoid leaving 8-bits · 833382b3
      Kyle Siefring authored
      Avoid moving between 8- and 16-bit vectors where possible.
  16. 04 Feb, 2021 2 commits
  17. 02 Feb, 2021 1 commit