arm32: looprestoration: Rewrite the SGR functions
Switch to the same cache-friendly algorithm as was done for arm64 in
c121b831.

This uses much less stack memory, and is much more cache friendly.
In this form, most of the individual asm functions only operate on
one single row of data at a time.

Some of the functions used to be unrolled to operate on two rows
at a time, while they now only operate on one at a time. In practice,
this is still a large performance win, as data is accessed in a much
more cache friendly manner.

This gives a 2-37% speedup, and reduces the peak amount of stack
used for these functions from 255 KB to 33 KB.

Before:               Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:     873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:    909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:     591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:    637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:    1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:   1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:     836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:    850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:     577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:    600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:    1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:   1340112.6   914395.7   873342.4   574815.7   554681.6

With this change in place, dav1d can run with around 72 KB of stack
on arm targets.

Not all functions have been merged in the same way as they were
for arm64 in c121b831, so some minor differences remain; it's
possible to incrementally optimize this, e.g. to fuse box3/5_row_v
with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1,
and make a version of finish_filter_row1 that produces 2 rows, like
is done for arm64. It's also possible to rewrite the logic for
calculating sgr_x_by_x in the same way as was done for arm64 in
79db1624.
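As a rough illustration of why processing one row at a time needs so little scratch memory, here is a minimal C sketch of a 3x3 box sum that keeps only three intermediate rows live and rotates them as it walks down the image. It is not the dav1d implementation: the function signatures, the `MAX_W` limit, the clamped edge handling and the omission of the sum-of-squares pass are all assumptions made for the example. Only the name `box3_row_v` is borrowed from the message above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 64 /* hypothetical row width limit, for this sketch only */

/* Horizontal 3-tap sum of one row; edge pixels are replicated (assumption). */
static void box3_row_h(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        int l = x > 0 ? x - 1 : 0;
        int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Vertical sum of three rows of horizontal sums. */
static void box3_row_v(int32_t *dst, int32_t *const rows[3], int w) {
    for (int x = 0; x < w; x++)
        dst[x] = rows[0][x] + rows[1][x] + rows[2][x];
}

/* 3x3 box sum of a w x h image (w <= MAX_W) using three scratch rows
 * instead of a full-size intermediate buffer. */
static void box3_streaming(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t buf[3][MAX_W];
    int32_t *rows[3] = { buf[0], buf[1], buf[2] };

    /* Prime the window: row 0, and a replicated copy standing in for row -1. */
    box3_row_h(rows[1], src, w);
    memcpy(rows[0], rows[1], (size_t)w * sizeof(int32_t));

    for (int y = 0; y < h; y++) {
        /* Bring in the next row (replicating the last row at the bottom). */
        int next = y + 1 < h ? y + 1 : h - 1;
        box3_row_h(rows[2], src + (size_t)next * w, w);

        box3_row_v(dst + (size_t)y * w, rows, w);

        /* Rotate the window: the oldest row buffer is reused next time. */
        int32_t *oldest = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = oldest;
    }
}
```

In this sketch the scratch space is 3 rows regardless of image height, and each source row is read once while it is still hot in the cache, which is the general shape of the trade-off the message describes.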
parent
1b7f1263
Changed files:
- src/arm/32/looprestoration.S: 187 additions, 231 deletions
- src/arm/32/looprestoration16.S: 199 additions, 241 deletions
- src/arm/32/looprestoration_common.S: 75 additions, 310 deletions
- src/arm/32/looprestoration_tmpl.S: 93 additions, 247 deletions
- src/arm/looprestoration.h: 57 additions, 146 deletions