1. 05 Mar, 2020 3 commits
    • Update NEWS for 0.6.0 · efd9e551
      Jean-Baptiste Kempf authored
      efd9e551
    • arm64: mc: NEON implementation of w_mask for 16 bpc · c8aaddea
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      Checkasm numbers:          Cortex A53       A72       A73
      w_mask_420_w4_16bpc_neon:       173.6     123.5     120.3
      w_mask_420_w8_16bpc_neon:       484.2     344.1     329.5
      w_mask_420_w16_16bpc_neon:     1411.2    1027.4    1035.1
      w_mask_420_w32_16bpc_neon:     5561.5    4093.2    3980.1
      w_mask_420_w64_16bpc_neon:    13809.6    9856.5    9581.0
      w_mask_420_w128_16bpc_neon:   35614.7   25553.8   24284.4
      w_mask_422_w4_16bpc_neon:       159.4     112.2     114.2
      w_mask_422_w8_16bpc_neon:       453.4     326.1     326.7
      w_mask_422_w16_16bpc_neon:     1394.6    1062.3    1050.2
      w_mask_422_w32_16bpc_neon:     5485.8    4219.6    4027.3
      w_mask_422_w64_16bpc_neon:    13701.2   10079.6    9692.6
      w_mask_422_w128_16bpc_neon:   35455.3   25892.5   24625.9
      w_mask_444_w4_16bpc_neon:       153.0     112.3     112.7
      w_mask_444_w8_16bpc_neon:       437.2     331.8     325.8
      w_mask_444_w16_16bpc_neon:     1395.1    1069.1    1041.7
      w_mask_444_w32_16bpc_neon:     5370.1    4213.5    4138.1
      w_mask_444_w64_16bpc_neon:    13482.6   10190.5   10004.6
      w_mask_444_w128_16bpc_neon:   35583.7   26911.2   25638.8
      
      Corresponding numbers for 8 bpc for comparison:
      
      w_mask_420_w4_8bpc_neon:        126.6      79.1      87.7
      w_mask_420_w8_8bpc_neon:        343.9     195.0     211.5
      w_mask_420_w16_8bpc_neon:       886.3     540.3     577.7
      w_mask_420_w32_8bpc_neon:      3558.6    2152.4    2216.7
      w_mask_420_w64_8bpc_neon:      8894.9    5161.2    5297.0
      w_mask_420_w128_8bpc_neon:    22520.1   13514.5   13887.2
      w_mask_422_w4_8bpc_neon:        112.9      68.2      77.0
      w_mask_422_w8_8bpc_neon:        314.4     175.5     208.7
      w_mask_422_w16_8bpc_neon:       835.5     565.0     608.3
      w_mask_422_w32_8bpc_neon:      3381.3    2231.8    2287.6
      w_mask_422_w64_8bpc_neon:      8499.4    5343.6    5460.8
      w_mask_422_w128_8bpc_neon:    21823.3   14206.5   14249.1
      w_mask_444_w4_8bpc_neon:        104.6      65.8      72.7
      w_mask_444_w8_8bpc_neon:        290.4     173.7     196.6
      w_mask_444_w16_8bpc_neon:       831.4     586.7     591.7
      w_mask_444_w32_8bpc_neon:      3320.8    2300.6    2251.0
      w_mask_444_w64_8bpc_neon:      8300.0    5480.5    5346.8
      w_mask_444_w128_8bpc_neon:    21633.8   15981.3   14384.8
      c8aaddea
    • CI: run a selection of jobs on a node with avx2 · bce8fae9
      Janne Grunau authored
      Switches build-debian (for avx2 checkasm coverage) and test-win64 and
      test-debian-unaligned-stack (for testing asm '%if's).
      Refs #330, #333
      bce8fae9
  2. 04 Mar, 2020 6 commits
    • arm64: mc: NEON implementation of blend for 16bpc · fb348f64
      Martin Storsjö authored
      Checkasm numbers:     Cortex A53     A72     A73
      blend_h_w2_16bpc_neon:     109.3    83.1    56.7
      blend_h_w4_16bpc_neon:     114.1    61.4    62.3
      blend_h_w8_16bpc_neon:     133.3    80.8    81.1
      blend_h_w16_16bpc_neon:    215.6   132.7   149.5
      blend_h_w32_16bpc_neon:    390.4   254.2   235.8
      blend_h_w64_16bpc_neon:    719.1   456.3   453.8
      blend_h_w128_16bpc_neon:  1646.1  1112.3  1065.9
      blend_v_w2_16bpc_neon:     185.9   175.9   180.0
      blend_v_w4_16bpc_neon:     338.0   183.4   232.1
      blend_v_w8_16bpc_neon:     426.5   213.8   250.6
      blend_v_w16_16bpc_neon:    678.2   357.8   382.6
      blend_v_w32_16bpc_neon:   1098.3   686.2   695.6
      blend_w4_16bpc_neon:        75.7    31.5    32.0
      blend_w8_16bpc_neon:       134.0    75.0    75.8
      blend_w16_16bpc_neon:      467.9   267.3   310.0
      blend_w32_16bpc_neon:     1201.9   658.7   779.7
      
      Corresponding numbers for 8bpc for comparison:
      blend_h_w2_8bpc_neon:      104.1    55.9    60.8
      blend_h_w4_8bpc_neon:      108.9    58.7    48.2
      blend_h_w8_8bpc_neon:       99.3    64.4    67.4
      blend_h_w16_8bpc_neon:     145.2    93.4    85.1
      blend_h_w32_8bpc_neon:     262.2   157.5   148.6
      blend_h_w64_8bpc_neon:     466.7   278.9   256.6
      blend_h_w128_8bpc_neon:   1054.2   624.7   571.0
      blend_v_w2_8bpc_neon:      170.5   106.6   113.4
      blend_v_w4_8bpc_neon:      333.0   189.9   225.9
      blend_v_w8_8bpc_neon:      314.9   199.0   203.5
      blend_v_w16_8bpc_neon:     476.9   300.8   241.1
      blend_v_w32_8bpc_neon:     766.9   430.4   415.1
      blend_w4_8bpc_neon:         66.7    35.4    26.0
      blend_w8_8bpc_neon:        110.7    47.9    48.1
      blend_w16_8bpc_neon:       299.4   161.8   162.3
      blend_w32_8bpc_neon:       725.8   417.0   432.8
      fb348f64
    • arm: mc: Optimize blend_v · 52e9b435
      Martin Storsjö authored
      Use a post-increment with a register on the last increment, avoiding
      a separate increment. Avoid processing the last 8 pixels in the w32
      case when we only output 24 pixels.
      
      Before:
      ARM32                Cortex A7      A8      A9     A53     A72     A73
      blend_v_w4_8bpc_neon:    450.4   574.7   538.7   374.6   199.3   260.5
      blend_v_w8_8bpc_neon:    559.6   351.3   552.5   357.6   214.8   204.3
      blend_v_w16_8bpc_neon:   926.3   511.6   787.9   593.0   271.0   246.8
      blend_v_w32_8bpc_neon:  1482.5   917.0  1149.5   991.9   354.0   368.9
      ARM64
      blend_v_w4_8bpc_neon:                            351.1   200.0   224.1
      blend_v_w8_8bpc_neon:                            333.0   212.4   203.8
      blend_v_w16_8bpc_neon:                           495.2   302.0   247.0
      blend_v_w32_8bpc_neon:                           840.0   557.8   514.0
      
      After:
      ARM32
      blend_v_w4_8bpc_neon:    435.5   575.8   537.6   356.2   198.3   259.5
      blend_v_w8_8bpc_neon:    545.2   347.9   553.5   339.1   207.8   204.2
      blend_v_w16_8bpc_neon:   913.7   511.0   788.1   573.7   275.4   243.3
      blend_v_w32_8bpc_neon:  1445.3   951.2  1079.1   920.4   352.2   361.6
      ARM64
      blend_v_w4_8bpc_neon:                            333.0   191.3   225.9
      blend_v_w8_8bpc_neon:                            314.9   199.3   203.5
      blend_v_w16_8bpc_neon:                           476.9   301.3   241.1
      blend_v_w32_8bpc_neon:                           766.9   432.8   416.9
      52e9b435
    • arm64: mc: Fix indentation · 48ffb05e
      Martin Storsjö authored
      48ffb05e
    • arm64: mc: Use more intuitive lane specifications for loads/stores · 83c62716
      Martin Storsjö authored
      For loads and stores of a full or half register (as opposed to
      lanewise loads/stores), the lane specification itself doesn't
      matter, only its size.
      
      This doesn't change the generated code, but makes it more readable.
      83c62716
  3. 03 Mar, 2020 2 commits
  4. 02 Mar, 2020 3 commits
    • arm64: loopfilter: NEON implementation of loopfilter for 16 bpc · 360243c2
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      Checkasm runtimes:      Cortex A53     A72     A73
      lpf_h_sb_uv_w4_16bpc_neon:   919.0   795.0   714.9
      lpf_h_sb_uv_w6_16bpc_neon:  1267.7  1116.2  1081.9
      lpf_h_sb_y_w4_16bpc_neon:   1500.2  1543.9  1778.5
      lpf_h_sb_y_w8_16bpc_neon:   2216.1  2183.0  2568.1
      lpf_h_sb_y_w16_16bpc_neon:  2641.8  2630.4  2639.4
      lpf_v_sb_uv_w4_16bpc_neon:   836.5   572.7   667.3
      lpf_v_sb_uv_w6_16bpc_neon:  1130.8   709.1   955.5
      lpf_v_sb_y_w4_16bpc_neon:   1271.6  1434.4  1272.1
      lpf_v_sb_y_w8_16bpc_neon:   1818.0  1759.1  1664.6
      lpf_v_sb_y_w16_16bpc_neon:  1998.6  2115.8  1586.6
      
      Corresponding numbers for 8 bpc for comparison:
      lpf_h_sb_uv_w4_8bpc_neon:    799.4   632.8   695.4
      lpf_h_sb_uv_w6_8bpc_neon:   1067.3   613.6   767.5
      lpf_h_sb_y_w4_8bpc_neon:    1490.5  1179.1  1018.9
      lpf_h_sb_y_w8_8bpc_neon:    1892.9  1382.0  1172.0
      lpf_h_sb_y_w16_8bpc_neon:   2117.4  1625.4  1739.0
      lpf_v_sb_uv_w4_8bpc_neon:    447.1   447.7   446.0
      lpf_v_sb_uv_w6_8bpc_neon:    522.1   529.0   513.1
      lpf_v_sb_y_w4_8bpc_neon:    1043.7   785.0   775.9
      lpf_v_sb_y_w8_8bpc_neon:    1500.4  1115.9   881.2
      lpf_v_sb_y_w16_8bpc_neon:   1493.5  1371.4  1248.5
      360243c2
    • arm: loopfilter: Prepare for 16 bpc · ebbf91f4
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      ebbf91f4
    • arm: loopfilter: Fix a comment · ac492552
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      ac492552
  5. 25 Feb, 2020 2 commits
  6. 24 Feb, 2020 7 commits
  7. 21 Feb, 2020 2 commits
  8. 20 Feb, 2020 1 commit
  9. 18 Feb, 2020 1 commit
  10. 17 Feb, 2020 2 commits
    • arm: cdef: Do an 8 bit implementation for cases with all edges present · b33f46e8
      Martin Storsjö authored
      This increases the code size by around 3 KB on arm64.
      
      Before:
      ARM32:                    Cortex A7      A8      A9     A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
      cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
      cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
      cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
      cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4
      
      After:
      ARM32:
      cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
      cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
      cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
      cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
      cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0
      
      (The checkasm benchmark averages three different cases; the fully
      edged case is one of those three, while it's the most common case
      in actual video. The difference is much bigger if only benchmarking
      that particular case.)
      
      This apparently makes the code slightly slower for the w=4 cases
      on Cortex A8, while it is a significant speedup on all other cores.
      b33f46e8
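To read the before/after tables above as relative speedups, a small helper is enough. This is only an illustrative sketch: `speedup` is a hypothetical name, and the inputs are checkasm numbers copied from the tables in this commit message.

```c
/* Ratio of old runtime to new runtime: > 1.0 means the new code is
 * faster, < 1.0 means it got slower. */
static double speedup(double before, double after)
{
    return before / after;
}
```

For example, `speedup(1454.6, 1160.0)` (ARM64, Cortex A53, cdef_filter_8x8) comes out at roughly 1.25, while `speedup(517.0, 541.3)` (ARM32, Cortex A8, cdef_filter_4x4) is below 1.0, matching the note about the w=4 regression on that core.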
    • arm32: cdef: Fix a typo for consistency · aff9a210
      Martin Storsjö authored
      The signedness of elements doesn't matter for vsub; match the vsub.i16
      next to it.
      aff9a210
  11. 16 Feb, 2020 1 commit
    • cli: Implement line buffering in print_stats() · 09d90658
      Henrik Gramner authored
      Console output is incredibly slow on Windows, which is aggravated by
      the lack of line buffering. As a result, a significant percentage of
      overall runtime is actually spent displaying the decoding progress.
      
      Doing the line buffering manually alleviates most of the issue.
      09d90658
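The idea of manual line buffering can be sketched as follows. This is not the actual print_stats() code: `format_progress` and `print_progress` are hypothetical names, and the message format is an assumption chosen for illustration. The point is that the whole line is assembled in memory first and then handed to the stream in a single write, instead of issuing many small writes to a slow console.

```c
#include <stdio.h>

/* Hypothetical helper: format one complete progress line into buf.
 * Returns the number of bytes to write (excluding the terminator). */
static int format_progress(char *buf, size_t size,
                           unsigned done, unsigned total, double fps)
{
    const double pct = total ? 100.0 * done / total : 0.0;
    int n = snprintf(buf, size, "\rDecoded %u/%u frames (%.1f%%) - %.2f fps",
                     done, total, pct, fps);
    if (n < 0)
        return 0;                    /* encoding error: write nothing */
    if ((size_t)n >= size)
        n = (int)size - 1;           /* output was truncated to fit */
    return n;
}

/* Emit the line with one fwrite() and one flush. */
static void print_progress(FILE *out, unsigned done, unsigned total, double fps)
{
    char buf[128];
    const int n = format_progress(buf, sizeof(buf), done, total, fps);
    fwrite(buf, 1, (size_t)n, out);
    fflush(out);
}
```

A caller would invoke something like `print_progress(stderr, 5, 10, 24.0)` once per update; how often to update is a separate choice that this sketch does not address.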
  12. 13 Feb, 2020 1 commit
  13. 11 Feb, 2020 7 commits
    • arm64: looprestoration: NEON implementation of SGR for 10 bpc · e3dbf926
      Martin Storsjö authored
      This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can
      be int16_t for 10 bpc, but need to be int32_t for 12 bpc.
      
      Make actual templates out of the functions in looprestoration_tmpl.S,
      and add box3/5_h to looprestoration16.S.
      
      Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter
      (which is passed even in 8bpc mode), add a define to bitdepth.h for
      passing such a parameter in all modes. This makes this function
      a few instructions slower in 8bpc mode than it was before (overall impact
      seems to be around 1% of the total runtime of SGR), but allows using the
      same actual function instantiation for all modes, saving a bit of code
      size.
      
      Examples of checkasm runtimes:
                                 Cortex A53        A72        A73
      selfguided_3x3_10bpc_neon:   516755.8   389412.7   349058.7
      selfguided_5x5_10bpc_neon:   380699.9   293486.6   254591.6
      selfguided_mix_10bpc_neon:   878142.3   667495.9   587844.6
      
      Corresponding 8 b...
      e3dbf926
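One way to see why int16_t can suffice at 10 bpc but not at 12 bpc is to bound a box sum. The commit message does not spell out the exact reasoning, so the following is an assumption-laden sketch: it assumes the sum buffer holds plain sums of up to 5x5 = 25 pixels, and `max_box_sum` and `fits_int16` are hypothetical helper names.

```c
#include <stdint.h>

/* Largest possible sum of n_pixels pixels at the given bit depth. */
static int32_t max_box_sum(int bitdepth, int n_pixels)
{
    const int32_t pixel_max = (1 << bitdepth) - 1;
    return pixel_max * n_pixels;
}

/* Non-zero if the (non-negative) value is representable as int16_t. */
static int fits_int16(int32_t v)
{
    return v <= INT16_MAX;
}
```

Under that assumption, 10 bpc gives 1023 * 25 = 25575, which fits in int16_t (maximum 32767), while 12 bpc gives 4095 * 25 = 102375, which does not.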
    • arm64: looprestoration: Prepare for 16 bpc by splitting code to separate files · 7cf5d753
      Martin Storsjö authored
      looprestoration_common.S contains functions that can be used as is
      with one single instantiation of the functions for both 8 and 16 bpc.
      This file will be built once, regardless of which bitdepths are enabled.
      
      looprestoration_tmpl.S contains functions where the source can be shared
      and templated between 8 and 16 bpc. This will be included by the separate
      8/16bpc implementation files.
      7cf5d753
    • arm: looprestoration: Add 8bpc to existing function names, add HIGHBD_*_SUFFIX · 32e265a8
      Martin Storsjö authored
      Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as the same
      concrete function implementations can be shared for both 8 and 16 bpc
      for those functions.
      32e265a8
    • looprestoration: Add a bpc parameter to the init func · 96da9cc2
      Martin Storsjö authored
      This allows using completely different codepaths for 10 and 12 bpc,
      or just adding SIMD functions for either of them.
      96da9cc2
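A minimal sketch of what a bit-depth-aware init function makes possible is shown below. All names here (`LrDspContext`, `lr_dsp_init`, `sgr_c`, `sgr_10bpc_simd`) are hypothetical and the function signature is a placeholder; this is not the project's actual API, only an illustration of selecting an implementation per bit depth.

```c
/* Placeholder function-pointer type; real filters take arguments. */
typedef void (*sgr_fn)(void);

/* Hypothetical DSP context holding the selected implementation. */
typedef struct {
    sgr_fn selfguided;
} LrDspContext;

static void sgr_c(void) {}           /* stand-in for a portable C version */
static void sgr_10bpc_simd(void) {}  /* stand-in for a 10 bpc SIMD version */

/* With bpc available at init time, SIMD can be enabled for only the
 * bit depths that have an implementation. */
static void lr_dsp_init(LrDspContext *c, int bpc)
{
    c->selfguided = sgr_c;               /* fallback for every depth */
    if (bpc == 10)
        c->selfguided = sgr_10bpc_simd;  /* override where SIMD exists */
}
```

Here 12 bpc keeps the C fallback while 10 bpc gets the SIMD path, which mirrors the situation described for the 10 bpc-only SGR implementation.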
    • arm: looprestoration: Improve scheduling in box3/5_h slightly · 8fb30657
      Martin Storsjö authored
      Set flags further from the branch instructions that use them.
      8fb30657
    • arm: Use int16_t for the tmp intermediate buffer · 8e8fb84d
      Martin Storsjö authored
      For 8bpc and 10bpc, int16_t is enough here, and for 12bpc, other
      intermediate int16_t buffers also need to be made of size coef anyway.
      8e8fb84d
    • arm: looprestoration: Fix a comment · feeaf785
      Martin Storsjö authored
      feeaf785
  14. 10 Feb, 2020 2 commits