
arm: 32: Port the arm64 NEON loopfilter to arm32

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-lpf into master

The code is a fairly exact 1:1 port of the ARM64 code, but operating on 8 pixels at a time, instead of 16.

Relative speedup over C code according to checkasm:

                       Cortex A7     A8     A9    A53    A72
lpf_h_sb_uv_w4_8bpc_neon:   1.37   1.43   1.25   1.72   1.56
lpf_h_sb_uv_w6_8bpc_neon:   2.18   2.17   1.72   2.63   2.31
lpf_h_sb_y_w4_8bpc_neon:    1.48   1.44   1.19   1.91   1.49
lpf_h_sb_y_w8_8bpc_neon:    2.34   2.08   1.78   2.84   2.41
lpf_h_sb_y_w16_8bpc_neon:   2.13   1.85   1.64   2.51   2.09
lpf_v_sb_uv_w4_8bpc_neon:   1.69   1.63   1.60   2.17   2.27
lpf_v_sb_uv_w6_8bpc_neon:   2.68   2.41   2.19   3.56   3.44
lpf_v_sb_y_w4_8bpc_neon:    1.77   1.76   1.42   2.34   2.12
lpf_v_sb_y_w8_8bpc_neon:    2.91   2.48   2.19   3.56   3.18
lpf_v_sb_y_w16_8bpc_neon:   2.59   2.14   1.97   3.25   2.81

Comparison to the original ARM64 assembly:

ARM64:                        A53     A72
lpf_h_sb_uv_w4_8bpc_neon:   701.2   503.7
lpf_h_sb_uv_w6_8bpc_neon:   994.7   669.3
lpf_h_sb_y_w4_8bpc_neon:   1655.4  1247.4
lpf_h_sb_y_w8_8bpc_neon:   2149.8  1578.2
lpf_h_sb_y_w16_8bpc_neon:  2323.0  1770.1
lpf_v_sb_uv_w4_8bpc_neon:   447.1   281.7
lpf_v_sb_uv_w6_8bpc_neon:   595.6   400.2
lpf_v_sb_y_w4_8bpc_neon:   1210.8   848.4
lpf_v_sb_y_w8_8bpc_neon:   1623.1  1183.9
lpf_v_sb_y_w16_8bpc_neon:  1697.5  1239.1

ARM32:                        A53     A72
lpf_h_sb_uv_w4_8bpc_neon:   820.5   501.2
lpf_h_sb_uv_w6_8bpc_neon:  1237.4   719.7
lpf_h_sb_y_w4_8bpc_neon:   2207.0  1371.9
lpf_h_sb_y_w8_8bpc_neon:   3137.3  1803.1
lpf_h_sb_y_w16_8bpc_neon:  3295.3  1848.4
lpf_v_sb_uv_w4_8bpc_neon:   623.7   323.7
lpf_v_sb_uv_w6_8bpc_neon:   855.9   446.8
lpf_v_sb_y_w4_8bpc_neon:   1735.3   947.8
lpf_v_sb_y_w8_8bpc_neon:   2339.9  1251.9
lpf_v_sb_y_w16_8bpc_neon:  2358.3  1276.6
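
To make the ARM64/ARM32 comparison easier to read, the ratio between the two ports can be computed directly from the numbers above. The following is only a small sketch: it assumes the figures are checkasm timings where lower is faster, it copies just the two `w16` luma rows from the tables, and the `slowdown` helper is a hypothetical name used for illustration, not part of dav1d or checkasm.

```python
# Timings copied from the tables above, as (A53, A72) per function.
# Assumption: these are checkasm timings, so lower is faster.
arm64 = {
    "lpf_h_sb_y_w16": (2323.0, 1770.1),
    "lpf_v_sb_y_w16": (1697.5, 1239.1),
}
arm32 = {
    "lpf_h_sb_y_w16": (3295.3, 1848.4),
    "lpf_v_sb_y_w16": (2358.3, 1276.6),
}


def slowdown(name):
    """Return the ARM32/ARM64 timing ratio for (A53, A72), rounded to 2 places."""
    return tuple(
        round(t32 / t64, 2) for t32, t64 in zip(arm32[name], arm64[name])
    )


for name in arm64:
    a53, a72 = slowdown(name)
    print(f"{name}: A53 {a53:.2f}x  A72 {a72:.2f}x")
```

For these two rows the ARM32 port comes out roughly 1.4x slower than ARM64 on the A53, but within about 5% on the A72.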

This is a WIP; I can test it on iOS and add A73 checkasm numbers next week.

Edited by Jean-Baptiste Kempf
