  1. Jun 22, 2023
    • arm64: looprestoration: Rewrite the SGR functions · c121b831
      Martin Storsjö authored
      Make them operate in a more cache-friendly manner, interleaving the
      various passes, and merging some of the functions that operate on
      data in similar patterns.
      
      This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3,
      from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.
      
      This does, however, increase the size of the binary by about 12 KB. (The
      executable code generated from assembly actually shrinks slightly,
      but the higher-level logic in C is quite nontrivial.)
      
      This is somewhat similar to what was done for x86 in
      fe2bb774.
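
The interleaving idea can be illustrated with a minimal C sketch. This is not the dav1d code: it uses a toy 3x3 box sum in place of the real SGR passes, and the names (`hsum_row`, `vsum_row`, `box3_two_pass`, `box3_interleaved`, `W`, `MAX_H`) are hypothetical. It only shows how running the second pass as soon as its input rows exist lets the intermediate buffer shrink from one row per block row to a small ring of rows.

```c
#include <assert.h>
#include <stdint.h>

enum { W = 16, MAX_H = 12 }; /* hypothetical block size, for illustration only */

/* Pass 1: horizontal 3-tap box sum of one row, with edge pixels replicated. */
static void hsum_row(const uint8_t *src, int32_t *dst)
{
    for (int x = 0; x < W; x++) {
        int l = x > 0 ? x - 1 : 0;
        int r = x < W - 1 ? x + 1 : W - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Pass 2 for one output row: sum three pass-1 rows. */
static void vsum_row(const int32_t *a, const int32_t *b, const int32_t *c,
                     int32_t *dst)
{
    for (int x = 0; x < W; x++)
        dst[x] = a[x] + b[x] + c[x];
}

/* Whole-block version: pass 1 over every row, then pass 2.
 * Needs MAX_H rows of scratch space. */
static void box3_two_pass(const uint8_t *src, int32_t *out, int h)
{
    int32_t tmp[MAX_H][W];
    assert(h >= 1 && h <= MAX_H);
    for (int y = 0; y < h; y++)
        hsum_row(src + y * W, tmp[y]);
    for (int y = 0; y < h; y++) {
        int a = y > 0 ? y - 1 : 0;
        int b = y < h - 1 ? y + 1 : h - 1;
        vsum_row(tmp[a], tmp[y], tmp[b], out + y * W);
    }
}

/* Interleaved version: only 3 rows of scratch, reused as a ring buffer. */
static void box3_interleaved(const uint8_t *src, int32_t *out, int h)
{
    int32_t ring[3][W];
    assert(h >= 1);
    hsum_row(src, ring[0]);
    for (int y = 0; y < h; y++) {
        /* Produce the next pass-1 row; its slot held row y-2, no longer needed. */
        if (y + 1 < h)
            hsum_row(src + (y + 1) * W, ring[(y + 1) % 3]);
        int a = y > 0 ? y - 1 : 0;
        int b = y < h - 1 ? y + 1 : h - 1;
        vsum_row(ring[a % 3], ring[y % 3], ring[b % 3], out + y * W);
    }
}
```

Both functions produce the same output. The interleaved one touches each intermediate row while it is still likely to be in cache, and its scratch space no longer grows with the block height.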
      
      Benchmarks from checkasm:
      
      Before:              Cortex A53        A55        A72        A73        A76   Apple M1
      sgr_3x3_8bpc_neon:     493005.0   483133.2   365056.3   345197.9   202819.1      537.3
      sgr_5x5_8bpc_neon:     353152.6   349614.3   268962.2   248431.8   142302.4      385.9
      sgr_mix_8bpc_neon:     829903.9   815910.9   622858.5   577238.0   333362.9      881.7
      sgr_3x3_10bpc_neon:    504778.6   499851.6   379203.1   346695.2   199738.7      537.0
      sgr_5x5_10bpc_neon:    363111.9   362489.7   267903.1   247506.5   138417.2      351.3
      sgr_mix_10bpc_neon:    853053.7   846768.8   628349.6   584553.8   328399.5      843.6

      After:               Cortex A53        A55        A72        A73        A76   Apple M1
      sgr_3x3_8bpc_neon:     387949.9   384216.4   294423.7   301968.2   184643.1      492.4
      sgr_5x5_8bpc_neon:     259854.7   257233.2   193983.7   198388.4   128497.0      341.2
      sgr_mix_8bpc_neon:     606401.5   595661.3   457209.7   462721.8   281906.7      738.6
      sgr_3x3_10bpc_neon:    392472.7   394100.5   296048.1   304339.4   184271.4      471.3
      sgr_5x5_10bpc_neon:    257248.3   257651.1   197552.5   199655.1   130739.7      322.9
      sgr_mix_10bpc_neon:    605263.3   611197.4   441789.3   461339.2   286320.1      721.4

      Speedup vs before:       27-41%     25-40%     23-42%     13-26%      5-18%      8-19%
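
The speedup ranges appear consistent with computing `before / after - 1` for each function and rounding down. That formula is inferred from the numbers, not stated in the commit. A small sketch of the arithmetic, with `speedup_percent` as a hypothetical helper name:

```c
#include <assert.h>

/* Relative speedup in percent, given checkasm timings before and after.
 * Assumes lower timings are better and that "after" is positive. */
static double speedup_percent(double before, double after)
{
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For example, `speedup_percent(493005.0, 387949.9)` (sgr_3x3_8bpc_neon on the Cortex A53) comes out at roughly 27%, the lower end of that column's range.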
    • arm64: looprestoration: Properly use 32 bit registers for 32 bit parameters · 3c2f2087
      Martin Storsjö authored
      This issue isn't caught by checkasm, since these functions are
      internal to the SGR implementation, and checkasm only checks
      parameter handling on the external DSP function interface.
      
      This could potentially trigger errors with future compilers.
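
Why the register width matters can be modelled in plain C. Under the AArch64 procedure call standard, as far as I understand it, a 32-bit argument occupies the low half of a 64-bit register and the upper half is unspecified. The sketch below simulates that; the names (`make_reg`, `read_as_x`, `read_as_w`, `example`) are hypothetical and none of this is code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Model a 64-bit register carrying a 32-bit argument. The caller is free
 * to leave anything in the upper 32 bits. */
static uint64_t make_reg(uint32_t value, uint32_t garbage_upper)
{
    return ((uint64_t)garbage_upper << 32) | value;
}

/* Reading the full register (an "x" register access) picks up the garbage. */
static uint64_t read_as_x(uint64_t reg)
{
    return reg;
}

/* Reading only the low half (a "w" register access) gives the real value. */
static uint32_t read_as_w(uint64_t reg)
{
    return (uint32_t)reg;
}

static void example(void)
{
    uint64_t reg = make_reg(64, 0xdeadbeefu);
    assert(read_as_w(reg) == 64); /* correct: upper bits ignored */
    assert(read_as_x(reg) != 64); /* wrong: garbage leaks into the value */
}
```

If today's compilers happen to leave the upper half zero, the full-register read works by accident, which is presumably why such a bug can stay hidden until a compiler changes its code generation.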