  1. Oct 19, 2023
  2. Oct 05, 2023
  3. Oct 04, 2023
  4. Oct 03, 2023
  5. Sep 08, 2023
    • fix: various errors in implementation of BTI · 769bd145
      André Kempe authored
      Amend call type in refmvs. Because these blocks are reached via
      blr x11, they need to be annotated.
      
      Add missing BTI landing pads in ipred.S and ipred16.S. Because the
      subroutines are called via a br from register, they need annotation with
      'bti j' (AARCH64_VALID_JUMP_TARGET).
  6. Aug 18, 2023
  7. Jul 25, 2023
  8. Jul 18, 2023
    • Account for chroma subsampling when allocating cbi buffers · 43a11ccb
      Henrik Gramner authored and committed
      Reduces memory usage (by 3 kB per sb128 for 4:2:0) when decoding
      streams with subsampled chroma when frame threading is enabled.
      
      This also simplifies the logic for calculating cbi indices.
      Both entropy decoding and reconstruction access the elements in
      the same order, so calculating block x/y positions is redundant
      and we can instead just store values sequentially and increase
      the pointer by one every time it's accessed.
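
      The sequential-access idea described above can be sketched in C as
      follows. This is only an illustration: BlockInfo, store_block and
      load_block are hypothetical names, not dav1d's actual types or
      functions. The writer and the reader each keep a cursor and advance
      it by one per block, so no x/y position is needed to find an entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block record; a stand-in, not dav1d's real layout. */
typedef struct {
    int16_t eob;
    uint8_t txtp;
} BlockInfo;

/* Writer side (think: entropy decoding). Appends one record and
 * advances the cursor by one element. */
static void store_block(BlockInfo **cursor, int16_t eob, uint8_t txtp)
{
    assert(cursor != NULL && *cursor != NULL);
    BlockInfo *slot = (*cursor)++;
    slot->eob = eob;
    slot->txtp = txtp;
}

/* Reader side (think: reconstruction). Consumes records in the same
 * order they were written, again advancing by one element. */
static BlockInfo load_block(const BlockInfo **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    return *(*cursor)++;
}
```

      Because both sides visit blocks in the same order, the n-th load
      returns what the n-th store wrote, and the buffer only needs one
      slot per block that actually exists.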
  9. Jul 12, 2023
  10. Jul 07, 2023
    • windows: Clarify unicode characters in RC files · a7e12b62
      Martin Storsjö authored
      Windows RC files can have strings expressed either as narrow
      chars in a specific codepage, or as wide unicode strings.
      Regardless of which way they are expressed, they are converted into
      unicode strings in the compiled resource files.
      
      When using narrow strings, even if using escaped chars like \251,
      those chars are interpreted according to a specific codepage. The
      codepage can be specified with arguments to the RC/windres tool
      (or with a pragma, but not all tools support the pragmas),
      but when no codepage is specified, the exact interpretation varies.
      
      llvm-rc takes a hard stance, defaulting to accepting only ASCII
      chars unless something else has been specified (and pragmas aren't
      supported). llvm-windres defaults to CP 1252 though, for compatibility
      with what most people probably intend.
      
      However, GNU windres and MS rc.exe actually default to what the
      system's current default codepage is. That means that if the resource
      file is built on a machine with e.g. Japanese as the default locale,
      the file gets built differently, with a different Unicode character
      than what was intended.
      
      By converting the strings to wide strings, it is unambiguous that
      \251 refers to the Unicode code point U+00A9 (octal 0251), i.e.
      copyright sign.
      
      This fixes building the RC files with llvm-rc. With GNU windres,
      llvm-windres and rc.exe, the files still generate the bitwise exact
      same output as before.
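
      The narrow versus wide distinction has a close analogue in C string
      literals, sketched below. This is C, not RC syntax, and the names
      are made up for the illustration. It assumes a platform where
      wchar_t holds Unicode values (as on Windows and common Unix
      systems). In the wide literal the escape \251 directly names the
      value 0xA9, while in the narrow literal it is just the byte 0xA9,
      whose meaning depends on whichever codepage is later applied.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Wide literal: the octal escape is the element value itself (0xA9). */
static const wchar_t copyright_wide[] = L"\251 example";

/* Narrow literal: the same escape yields a single byte 0xA9; which
 * character that byte represents is decided by a codepage. */
static const char copyright_narrow[] = "\251 example";

/* Returns the numeric value of the first element of a wide string. */
static unsigned long first_wide_value(const wchar_t *s)
{
    assert(s != NULL);
    return (unsigned long)s[0];
}

/* Returns the numeric value of the first byte of a narrow string. */
static unsigned first_narrow_byte(const char *s)
{
    assert(s != NULL);
    return (unsigned char)s[0];
}
```

      Both helpers return 0xA9 here; the difference is that only the wide
      form pins that number to a code point without consulting a codepage.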
    • Matthias Dressel's avatar
      fc40a0db
  11. Jul 06, 2023
  12. Jul 01, 2023
    • arm32: refmvs: Fix building with MS armasm · 616bfd15
      Martin Storsjö authored
      Add an explicit align before the jump table; this avoids armasm bugs
      in how label differences are calculated. This matches how all other
      jump tables are written in our 32 bit arm assembly.
  13. Jun 30, 2023
  14. Jun 28, 2023
    • arm64: refmvs: Fix building with MSVC · 189d47c2
      Martin Storsjö authored
      Binutils and LLVM assemblers can infer that this str instruction must
      be stur (and implicitly assemble it into that instruction), while MS
      armasm64 errored out with this message:
      
      src\libdav1d.a.p\refmvs.obj.asm(673) : error A2518: operand 2: Memory offset must be aligned
              str             q2, [x3, #(8*5-16)]
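
      Why the offset in the error above is a problem can be checked with
      a little arithmetic, sketched here in C. The helper names are
      hypothetical. The sketch relies on the A64 encoding rules that the
      unsigned-offset form of str takes a 12-bit immediate scaled by the
      access size (16 bytes for a q register), while stur takes an
      unscaled signed 9-bit immediate.

```c
#include <assert.h>
#include <stdbool.h>

/* str (unsigned offset): non-negative, a multiple of the access size,
 * and at most 4095 units of that size. */
static bool fits_scaled_imm12(long offset, long access_size)
{
    assert(access_size > 0);
    return offset >= 0 &&
           offset % access_size == 0 &&
           offset / access_size <= 4095;
}

/* stur: any byte offset in the signed 9-bit range. */
static bool fits_unscaled_imm9(long offset)
{
    return offset >= -256 && offset <= 255;
}
```

      For the instruction in the error message, 8*5-16 is 24, which is
      not a multiple of 16, so fits_scaled_imm12(24, 16) is false while
      fits_unscaled_imm9(24) is true. That is the case where the
      instruction has to be encoded as stur.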
  15. Jun 26, 2023
    • arm64: refmvs: Process two blocks at a time in save_tmvs · c39779f4
      Martin Storsjö authored
      Before:        Cortex A53       A55      A72      A73      A76  Apple M1
      save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4
      After:
      save_tmvs_neon:   73780.0   74339.2  70414.1  59102.0  35028.4  213.9
      
      The benefit from this is marginal on Cortex A53 and A55, and Apple
      M1, while this change actually makes the code notably slower on
      Cortex A72, A73 and A76.
    • arm64: refmvs: Add NEON implementation of save_tmvs · 6aa37aec
      Martin Storsjö authored
                     Cortex A53       A55      A72      A73      A76  Apple M1
      save_tmvs_c:     116768.4  122653.1  82587.7  90445.0  45386.8  242.1
      save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4
      
      Relative speedup compared with C:
                  Cortex A53    A55    A72    A73    A76   Apple M1
      save_tmvs_neon:   1.47   1.54   1.51   1.66   1.52   1.12
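
      The relative speedup row is the C timing divided by the NEON timing
      for each core. A minimal C sketch of that arithmetic, with a
      hypothetical helper name:

```c
#include <assert.h>

/* Ratio of the baseline timing to the optimized timing; values above
 * 1.0 mean the optimized version is faster. */
static double speedup(double baseline_time, double optimized_time)
{
    assert(optimized_time > 0.0);
    return baseline_time / optimized_time;
}
```

      For example, speedup(116768.4, 79184.7) for the Cortex A53 column
      comes out at roughly 1.47, and speedup(242.1, 216.4) for the Apple
      M1 column at roughly 1.12, matching the table.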
  16. Jun 22, 2023
    • arm64: looprestoration: Rewrite the SGR functions · c121b831
      Martin Storsjö authored
      Make them operate in a more cache friendly manner, interleaving the
      various passes, and merging some of the functions that operate on
      data in similar patterns.
      
      This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3,
      from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.
      
      This does however increase the size of the binary by about 12 KB. (The
      executable code generated from assembly actually shrinks by a little,
      but the higher level logic in C is quite nontrivial.)
      
      This is somewhat similar to what was done for x86 in
      fe2bb774.
      
      Benchmarks from checkasm:
      
      Before:             Cortex A53        A55        A72        A73        A76   Apple M1
      sgr_3x3_8bpc_neon:    493005.0   483133.2   365056.3   345197.9   202819.1   537.3
      sgr_5x5_8bpc_neon:    353152.6   349614.3   268962.2   248431.8   142302.4   385.9
      sgr_mix_8bpc_neon:    829903.9   815910.9   622858.5   577238.0   333362.9   881.7
      sgr_3x3_10bpc_neon:   504778.6   499851.6   379203.1   346695.2   199738.7   537.0
      sgr_5x5_10bpc_neon:   363111.9   362489.7   267903.1   247506.5   138417.2   351.3
      sgr_mix_10bpc_neon:   853053.7   846768.8   628349.6   584553.8   328399.5   843.6
      
      After:
      sgr_3x3_8bpc_neon:    387949.9   384216.4   294423.7   301968.2   184643.1   492.4
      sgr_5x5_8bpc_neon:    259854.7   257233.2   193983.7   198388.4   128497.0   341.2
      sgr_mix_8bpc_neon:    606401.5   595661.3   457209.7   462721.8   281906.7   738.6
      sgr_3x3_10bpc_neon:   392472.7   394100.5   296048.1   304339.4   184271.4   471.3
      sgr_5x5_10bpc_neon:   257248.3   257651.1   197552.5   199655.1   130739.7   322.9
      sgr_mix_10bpc_neon:   605263.3   611197.4   441789.3   461339.2   286320.1   721.4
      
      Speedup vs before:
                              27-41%     25-40%     23-42%     13-26%      5-18%   8-19%
    • arm64: looprestoration: Properly use 32 bit registers for 32 bit parameters · 3c2f2087
      Martin Storsjö authored
      This issue isn't caught by checkasm, since these functions are
      internal to the SGR implementation, and checkasm only affects
      the parameters on the external DSP function interface.
      
      This could potentially trigger errors with future compilers.
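
      The hazard can be modelled in plain C, as a sketch under the
      assumption (per the standard AArch64 procedure call convention)
      that the upper 32 bits of a register carrying a 32 bit argument
      are unspecified. The function names are hypothetical, and a
      uint64_t stands in for the register.

```c
#include <assert.h>  /* used by the checks in the usage note */
#include <stdint.h>

/* Models reading the whole 64 bit x register as if it held the
 * parameter; any garbage in the upper half leaks into the result. */
static int64_t read_as_x(uint64_t reg)
{
    return (int64_t)reg;
}

/* Models reading only the 32 bit w register and sign-extending it;
 * the upper half of the register is ignored. */
static int64_t read_as_w(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}
```

      With a register value of 0xDEADBEEF00000010, read_as_w yields 16
      while read_as_x does not. When the upper half happens to be zero
      both agree, which is why code like this can appear to work until a
      compiler starts leaving other bits there.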
  17. Jun 12, 2023
  18. Jun 09, 2023
  19. Jun 07, 2023
  20. Jun 06, 2023