arm: looprestoration: Port the ARM64 SGR NEON assembly to 32-bit ARM

The code is mostly a 1:1 port of the ARM64 code, with slightly worse
scheduling because fewer temporary registers are available. For the
same reason, the sgr_finish_filter1_neon function (used in the 3x3 and
mix cases) processes 4 pixels at a time, while the ARM64 version
processes 8.

Relative speedup over C code:
                      Cortex A7     A8     A9    A53    A72    A73
selfguided_3x3_8bpc_neon:  2.12   2.89   1.79   2.61   2.03   3.87
selfguided_5x5_8bpc_neon:  2.50   3.41   2.16   3.14   2.74   4.64
selfguided_mix_8bpc_neon:  2.24   2.98   1.94   2.82   2.28   4.14

Comparison to the original ARM64 assembly:

ARM64:                   Cortex A53        A72        A73
selfguided_3x3_8bpc_neon:  486215.5   359445.6   341317.7
selfguided_5x5_8bpc_neon:  351210.8   267427.2   243399.3
selfguided_mix_8bpc_neon:  820489.1   610909.8   569946.6
ARM32:
selfguided_3x3_8bpc_neon:  542958.8   379448.8   353229.1
selfguided_5x5_8bpc_neon:  351299.6   263685.2   242415.9
selfguided_mix_8bpc_neon:  881587.6   629934.0   580121.2
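
For reference, the ARM32/ARM64 comparison can be reduced to per-core
ratios with a small script. This is only a sketch: it assumes the
figures above are benchmark timings where lower is better (the units
are not stated here), and the helper name arm32_over_arm64 is made up
for this illustration.

```python
# Figures copied from the ARM64 and ARM32 tables above.
# Assumption: they are timings, so a ratio above 1.0 means ARM32 is slower.
CORES = ("A53", "A72", "A73")

ARM64 = {
    "selfguided_3x3_8bpc_neon": (486215.5, 359445.6, 341317.7),
    "selfguided_5x5_8bpc_neon": (351210.8, 267427.2, 243399.3),
    "selfguided_mix_8bpc_neon": (820489.1, 610909.8, 569946.6),
}

ARM32 = {
    "selfguided_3x3_8bpc_neon": (542958.8, 379448.8, 353229.1),
    "selfguided_5x5_8bpc_neon": (351299.6, 263685.2, 242415.9),
    "selfguided_mix_8bpc_neon": (881587.6, 629934.0, 580121.2),
}


def arm32_over_arm64(arm32, arm64):
    """Return {function: {core: ARM32 figure / ARM64 figure}}."""
    return {
        fn: {
            core: a32 / a64
            for core, a32, a64 in zip(CORES, arm32[fn], arm64[fn])
        }
        for fn in arm64
    }


if __name__ == "__main__":
    for fn, per_core in arm32_over_arm64(ARM32, ARM64).items():
        cells = "  ".join(f"{core}={r:.3f}" for core, r in per_core.items())
        print(f"{fn}: {cells}")
```

Under that lower-is-better assumption, the 5x5 case lands within
about 1.5% of the ARM64 figures on all three cores, while the 3x3 and
mix cases, which use sgr_finish_filter1_neon, come out higher, with
the largest gap (about 12%) for 3x3 on the Cortex A53.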