arm32: looprestoration: Rewrite the SGR functions (30c3dd8e) · Commits · VideoLAN / dav1d

Commit 30c3dd8e authored 1 year ago by Martin Storsjö

arm32: looprestoration: Rewrite the SGR functions

Switch to the same cache-friendly algorithm as was done for arm64
in c121b831.

This uses much less stack memory, and is much more cache friendly.
In this form, most of the individual asm functions only operate on
one single row of data at a time.

Some of the functions used to be unrolled to operate on two rows
at a time; they now operate on only one. In practice, the rewrite
is still a large performance win, as data is accessed in a much
more cache-friendly manner.
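
As a rough illustration of the row-at-a-time idea (not the actual dav1d
code), the sketch below computes 3x3 box sums while keeping only three
intermediate rows in a small ring buffer, instead of a full-height
intermediate plane. The names box3_row_h, box3_row_v, box3_rolling and
MAX_W, as well as the edge replication, are assumptions made for this
example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_W 64 /* hypothetical row width limit, for this sketch only */

/* Horizontal 3-tap sum of one row; edge pixels are replicated (assumption). */
static void box3_row_h(int32_t *dst, const uint8_t *src, size_t w)
{
    for (size_t x = 0; x < w; x++) {
        const uint8_t l = src[x > 0 ? x - 1 : 0];
        const uint8_t r = src[x + 1 < w ? x + 1 : w - 1];
        dst[x] = l + src[x] + r;
    }
}

/* Vertical sum of three rows that were already summed horizontally. */
static void box3_row_v(int32_t *dst, const int32_t *r0, const int32_t *r1,
                       const int32_t *r2, size_t w)
{
    for (size_t x = 0; x < w; x++)
        dst[x] = r0[x] + r1[x] + r2[x];
}

/* 3x3 box sums for a w*h image, holding only three intermediate rows.
 * The horizontal sums of image row r live in ring[(r + 1) % 3]. */
static void box3_rolling(int32_t *out, const uint8_t *src, size_t w, size_t h)
{
    int32_t ring[3][MAX_W];
    assert(w >= 1 && w <= MAX_W && h >= 1);

    box3_row_h(ring[0], src, w); /* row -1, replicated from row 0 */
    box3_row_h(ring[1], src, w); /* row 0 */
    for (size_t y = 0; y < h; y++) {
        /* Row y+1 (clamped at the bottom edge) overwrites the slot of row y-2. */
        const size_t next = y + 1 < h ? y + 1 : h - 1;
        box3_row_h(ring[(y + 2) % 3], src + next * w, w);
        box3_row_v(out + y * w, ring[y % 3], ring[(y + 1) % 3],
                   ring[(y + 2) % 3], w);
    }
}
```

The intermediate storage here is three rows regardless of the image
height, and each helper touches a single row per call, which is the
general property the rewrite relies on for lower stack use and better
cache behaviour.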

This gives a 2-37% speedup, and reduces the peak amount of stack
used for these functions from 255 KB to 33 KB.

Before:              Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6
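
The quoted speedup range can be cross-checked against the table with a
small helper. This assumes that lower figures are better and that the
speedup is defined as before / after - 1; speedup_percent and
print_speedups are names made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage speedup, assuming lower benchmark figures are better. */
static double speedup_percent(double before, double after)
{
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}

/* Prints the smallest and largest improvements found in the table. */
static void print_speedups(void)
{
    printf("A53 sgr_3x3_8bpc:  %.1f%%\n",
           speedup_percent(543410.2, 530596.1));
    printf("A8  sgr_mix_10bpc: %.1f%%\n",
           speedup_percent(1259111.4, 914395.7));
}
```

Under that definition, the two entries come out at roughly 2.4% and
37.7%, in line with the range stated above.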

With this change in place, dav1d can run with around 72 KB of stack
on arm targets.

Not all functions have been merged in the same way as they were for
arm64 in c121b831, so some minor differences remain; it's possible to
incrementally optimize this, e.g. to fuse box3/5_row_v with
calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1, and
make a version of finish_filter_row1 that produces 2 rows, like is
done for arm64.
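
To show what fusing two row passes means in general terms, here is a
minimal sketch in C. It is not the dav1d implementation: stage2 is a
placeholder for whatever per-element step follows the vertical sum (it
does not reproduce the real calc_row_ab math), and two_pass and fused
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder second stage; NOT the real calc_row_ab computation. */
static inline int32_t stage2(int32_t sum)
{
    return sum * 2 + 1;
}

/* Unfused: the vertical sum is stored to tmp, then read back again. */
static void two_pass(int32_t *out, int32_t *tmp, const int32_t *r0,
                     const int32_t *r1, const int32_t *r2, size_t w)
{
    assert(out && tmp && r0 && r1 && r2);
    for (size_t x = 0; x < w; x++)
        tmp[x] = r0[x] + r1[x] + r2[x];
    for (size_t x = 0; x < w; x++)
        out[x] = stage2(tmp[x]);
}

/* Fused: the sum stays in a register and goes straight into stage 2,
 * so the temporary row and one store/load round trip disappear. */
static void fused(int32_t *out, const int32_t *r0, const int32_t *r1,
                  const int32_t *r2, size_t w)
{
    assert(out && r0 && r1 && r2);
    for (size_t x = 0; x < w; x++)
        out[x] = stage2(r0[x] + r1[x] + r2[x]);
}
```

Both variants produce the same output; the fused one simply avoids
writing and re-reading the intermediate row.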

It's also possible to rewrite the logic for calculating sgr_x_by_x
in the same way as was done for arm64 in 79db1624.

parent 1b7f1263

Merge request !1765: arm32: looprestoration: Rewrite the SGR functions

Pipeline #536208 passed with stages in 1 hour, 14 minutes, and 18 seconds

Showing with 611 additions and 1175 deletions
