Commit c121b831 authored by Martin Storsjö

arm64: looprestoration: Rewrite the SGR functions

Make them operate in a more cache-friendly manner, interleaving the
various passes, and merging some of the functions that operate on
data in similar patterns.
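
As an illustration of what interleaving passes can look like, here is a
minimal C sketch of a 3x3 box sum (one building block of a self-guided
filter). It is not the dav1d code: the names hsum3_row,
box3x3_interleaved and MAX_W are hypothetical, and edge handling is
simplified to plain replication. Rather than running the horizontal pass
over the whole frame into a w*h intermediate buffer and then running the
vertical pass, it keeps only three rows of intermediate sums and rotates
them as it moves down the image.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 64 /* hypothetical maximum row width for this sketch */

/* Pass 1: horizontal 3-tap box sum of one row, with replicated edges. */
static void hsum3_row(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        int l = x > 0 ? x - 1 : 0;
        int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Passes 1 and 2 interleaved: the vertical sum for row y is produced as
 * soon as the horizontal sums of rows y-1, y and y+1 are available, so
 * only three rows of intermediate data live on the stack. */
static void box3x3_interleaved(int32_t *dst, const uint8_t *src,
                               int w, int h) {
    assert(w >= 1 && w <= MAX_W && h >= 1);
    int32_t rows[3][MAX_W];
    int32_t *above = rows[0], *cur = rows[1], *below = rows[2];

    hsum3_row(cur, src, w);
    memcpy(above, cur, sizeof(int32_t) * (size_t)w); /* replicate top edge */

    for (int y = 0; y < h; y++) {
        int yb = y + 1 < h ? y + 1 : h - 1; /* replicate bottom edge */
        hsum3_row(below, src + (size_t)yb * w, w);

        for (int x = 0; x < w; x++)
            dst[(size_t)y * w + x] = above[x] + cur[x] + below[x];

        /* Rotate the ring: the oldest row buffer is reused next. */
        int32_t *tmp = above;
        above = cur;
        cur = below;
        below = tmp;
    }
}
```

In this sketch the intermediate storage is 3 * MAX_W values regardless of
the image height, and each row of horizontal sums is consumed while it is
likely still in cache.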

This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3,
from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.

This does, however, increase the size of the binary by about 12 KB. (The
executable code generated from assembly actually shrinks slightly,
but the higher-level logic in C is quite nontrivial.)

This is somewhat similar to what was done for x86 in
fe2bb774.

Benchmarks from checkasm:

Before:             Cortex A53        A55        A72        A73        A76   Apple M1
sgr_3x3_8bpc_neon:    493005.0   483133.2   365056.3   345197.9   202819.1   537.3
sgr_5x5_8bpc_neon:    353152.6   349614.3   268962.2   248431.8   142302.4   385.9
sgr_mix_8bpc_neon:    829903.9   815910.9   622858.5   577238.0   333362.9   881.7
sgr_3x3_10bpc_neon:   504778.6   499851.6   379203.1   346695.2   199738.7   537.0
sgr_5x5_10bpc_neon:   363111.9   362489.7   267903.1   247506.5   138417.2   351.3
sgr_mix_10bpc_neon:   853053.7   846768.8   628349.6   584553.8   328399.5   843.6

After:
sgr_3x3_8bpc_neon:    387949.9   384216.4   294423.7   301968.2   184643.1   492.4
sgr_5x5_8bpc_neon:    259854.7   257233.2   193983.7   198388.4   128497.0   341.2
sgr_mix_8bpc_neon:    606401.5   595661.3   457209.7   462721.8   281906.7   738.6
sgr_3x3_10bpc_neon:   392472.7   394100.5   296048.1   304339.4   184271.4   471.3
sgr_5x5_10bpc_neon:   257248.3   257651.1   197552.5   199655.1   130739.7   322.9
sgr_mix_10bpc_neon:   605263.3   611197.4   441789.3   461339.2   286320.1   721.4

Speedup vs before:
                        27-41%     25-40%     23-42%     13-26%      5-18%   8-19%
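
The percentages above appear consistent with computing
before / after - 1 for each benchmark and reporting the smallest and
largest value per column. A small C sketch of that arithmetic follows;
the helper name speedup_percent is hypothetical and the formula is an
assumption inferred from the numbers, not something stated by the commit.

```c
#include <assert.h>

/* Relative speedup in percent, assuming lower benchmark values are
 * better: (before / after - 1) * 100. */
static double speedup_percent(double before, double after) {
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For example, speedup_percent(493005.0, 387949.9), the Cortex A53
sgr_3x3_8bpc figures, gives roughly 27%, which matches the lower bound
listed for that column.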
parent 3c2f2087
Merge request: !1545 arm64: looprestoration: Rewrite the SGR functions
Pipeline #357130 passed with stages in 30 minutes and 35 seconds