arm32: looprestoration: Rewrite the SGR functions

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-sgr-rewrite into master

Switch to the same cache-friendly algorithm that was adopted for arm64 in c121b831.

This uses much less stack memory and is much more cache-friendly. In this form, most of the individual asm functions operate on only a single row of data at a time.

Some of the functions used to be unrolled to operate on two rows at a time, whereas they now operate on only one. In practice, the rewrite is still a large performance win, because data is accessed in a much more cache-friendly manner.
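To illustrate the row-at-a-time idea, here is a minimal C sketch of a 3x3 box sum that keeps only three rows of intermediate data in a small ring of buffers instead of a full-plane intermediate. It is not the dav1d implementation: the names `hsum3_row`, `box3_streaming` and `MAX_W`, the edge replication, and the choice of a plain box sum are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 64 /* hypothetical maximum row width for this sketch */

/* Horizontal 3-tap sum of a single row; edge pixels are replicated. */
static void hsum3_row(const uint8_t *src, int w, uint16_t *dst)
{
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = (uint16_t)(src[l] + src[x] + src[r]);
    }
}

/* 3x3 box sum computed one row at a time. Only three rows of
 * horizontal sums are live at any point, so the working set stays
 * small regardless of the image height. Requires w <= MAX_W. */
static void box3_streaming(const uint8_t *src, ptrdiff_t stride,
                           int w, int h, uint16_t *dst)
{
    uint16_t ring[3][MAX_W];
    /* rows[0] = row above, rows[1] = current row, rows[2] = row below */
    uint16_t *rows[3] = { ring[0], ring[1], ring[2] };

    hsum3_row(src, w, rows[1]);
    memcpy(rows[0], rows[1], (size_t)w * sizeof(uint16_t)); /* top edge */

    for (int y = 0; y < h; y++) {
        if (y + 1 < h)
            hsum3_row(src + (y + 1) * stride, w, rows[2]);
        else /* bottom edge: replicate the last row */
            memcpy(rows[2], rows[1], (size_t)w * sizeof(uint16_t));

        /* Vertical 3-tap sum for a single output row. */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = (uint16_t)(rows[0][x] + rows[1][x] + rows[2][x]);

        /* Rotate the row pointers instead of copying row data. */
        uint16_t *tmp = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = tmp;
    }
}
```

In this sketch each helper touches one row per call, and the intermediate storage is `3 * MAX_W` values rather than one value per pixel of the plane, which is the kind of trade-off the description above refers to.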

This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB.

Before:              Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
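
As a sanity check on the quoted range, the speedup for any cell can be derived from the table above. The small helper below is a sketch that assumes lower figures are better and that the speedup is defined as `before / after - 1`; the name `speedup_percent` is hypothetical.

```c
/* Speedup in percent, assuming lower benchmark figures are better:
 * (before / after - 1) * 100. */
static double speedup_percent(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}
```

With this definition, `speedup_percent(543410.2, 530596.1)` (sgr_3x3_8bpc on the A53) gives roughly 2.4, and `speedup_percent(1259111.4, 914395.7)` (sgr_mix_10bpc on the A8) gives roughly 37.7, which matches the two ends of the 2-37% range.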

With this change in place, dav1d can run with around 72 KB of stack on arm targets.
