arm: looprestoration: Port the ARM64 SGR NEON assembly to 32 bit arm

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-sgr into master

The code is mostly a 1:1 port of the ARM64 code, with slightly worse scheduling because fewer temporary registers are available. The sgr_finish_filter1_neon function (used in the 3x3 and mix cases) processes 4 pixels at a time, whereas the ARM64 version processes 8, because there are not enough registers available.

Relative speedup over C code:

                       Cortex A7     A8     A9    A53    A72    A73
selfguided_3x3_8bpc_neon:   2.12   2.89   1.79   2.61   2.03   3.87
selfguided_5x5_8bpc_neon:   2.50   3.41   2.16   3.14   2.74   4.64
selfguided_mix_8bpc_neon:   2.24   2.98   1.94   2.82   2.28   4.14

Comparison to the original ARM64 assembly:

ARM64:                    Cortex A53        A72        A73
selfguided_3x3_8bpc_neon:   486215.5   359445.6   341317.7
selfguided_5x5_8bpc_neon:   351210.8   267427.2   243399.3
selfguided_mix_8bpc_neon:   820489.1   610909.8   569946.6
ARM32:
selfguided_3x3_8bpc_neon:   542958.8   379448.8   353229.1
selfguided_5x5_8bpc_neon:   351299.6   263685.2   242415.9
selfguided_mix_8bpc_neon:   881587.6   629934.0   580121.2
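
To make the comparison above easier to read, here is a minimal sketch that divides each ARM32 figure by the matching ARM64 figure. The numbers are copied verbatim from the two tables; the dictionary layout and the helper name `arm32_over_arm64` are purely illustrative and not part of dav1d. The description does not state the unit of these figures, so the sketch assumes they are timing counts where lower is better, meaning a ratio above 1.0 indicates the ARM32 port is slower on that core.

```python
# Benchmark figures copied from the comparison tables above.
# Assumption: these are timing counts, so lower is better.
CORES = ("A53", "A72", "A73")

ARM64 = {
    "selfguided_3x3_8bpc_neon": (486215.5, 359445.6, 341317.7),
    "selfguided_5x5_8bpc_neon": (351210.8, 267427.2, 243399.3),
    "selfguided_mix_8bpc_neon": (820489.1, 610909.8, 569946.6),
}

ARM32 = {
    "selfguided_3x3_8bpc_neon": (542958.8, 379448.8, 353229.1),
    "selfguided_5x5_8bpc_neon": (351299.6, 263685.2, 242415.9),
    "selfguided_mix_8bpc_neon": (881587.6, 629934.0, 580121.2),
}


def arm32_over_arm64(name):
    """Return the per-core ARM32/ARM64 ratio for one function (hypothetical helper)."""
    return tuple(a32 / a64 for a32, a64 in zip(ARM32[name], ARM64[name]))


if __name__ == "__main__":
    print(f"{'':26s}" + "".join(f"{core:>8s}" for core in CORES))
    for name in ARM64:
        ratios = arm32_over_arm64(name)
        print(f"{name + ':':26s}" + "".join(f"{r:8.3f}" for r in ratios))
```

Under that assumption, the 3x3 case on the Cortex A53 comes out at roughly 1.12, while the 5x5 case stays within about 1.5% of the ARM64 figures on all three cores. That pattern fits the note that the 4-pixel sgr_finish_filter1_neon is used only in the 3x3 and mix cases.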