arm32: looprestoration: Rewrite the SGR functions
Switch to the same cache-friendly algorithm that was adopted for arm64 in c121b831.
This uses much less stack memory and accesses data in a much more cache-friendly manner. In this form, most of the individual asm functions operate on a single row of data at a time.
Some of the functions used to be unrolled to operate on two rows at a time, whereas they now operate on only one. In practice, the change is still a large performance win because of the improved cache behavior.
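As a rough illustration of the row-at-a-time idea (not the actual dav1d code), the C sketch below computes a 3x3 box sum while keeping only three rows of intermediate data in a small rotating buffer. The helper names, the MAX_W limit and the edge clamping are all assumptions made for this example; the real functions are written in NEON assembly and also track sums of squares.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_W 64 /* hypothetical width limit, only for this sketch */

/* Hypothetical helper: horizontal 3-tap sum of one row, edges clamped. */
static void hsum3_row(int32_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Hypothetical helper: add three horizontal row sums into one output row. */
static void vsum3_row(int32_t *dst, const int32_t *a, const int32_t *b,
                      const int32_t *c, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = a[x] + b[x] + c[x];
}

/* 3x3 box sum computed one row at a time. Only three rows of
 * intermediate sums live on the stack, instead of a buffer covering
 * the whole block. Rows outside the image are clamped to the edge. */
static void box3_rowwise(int32_t *dst, const uint8_t *src, int w, int h)
{
    int32_t ring[3][MAX_W];
    int32_t *rows[3] = { ring[0], ring[1], ring[2] };

    if (w < 1 || w > MAX_W || h < 1)
        return;

    /* Prime the ring: row -1 (clamped to row 0) and row 0. */
    hsum3_row(rows[0], src, w);
    hsum3_row(rows[1], src, w);

    for (int y = 0; y < h; y++) {
        const int below = y + 1 < h ? y + 1 : h - 1;
        hsum3_row(rows[2], src + (size_t)below * w, w);
        vsum3_row(dst + (size_t)y * w, rows[0], rows[1], rows[2], w);

        /* Rotate the pointers; the oldest row buffer is reused next. */
        int32_t *tmp = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = tmp;
    }
}
```

Each input row is read once while it is still hot in the cache, and the intermediate storage stays at a few rows regardless of the block height, which is the general effect described above.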
This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB.
Before:              Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0

After:
sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
With this change in place, dav1d can run with around 72 KB of stack on arm targets.