arm: 32: Port the arm64 NEON loopfilter to arm32
The code is a nearly 1:1 port of the ARM64 code, but it operates on 8 pixels at a time instead of 16.
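As a rough illustration of what halving the chunk width means, here is a minimal scalar sketch in C. It is not the NEON code from this port: `edge_mask_chunk`, `edge_mask`, and the threshold test are hypothetical stand-ins for a per-pixel loop filter decision. It only shows that walking an edge 8 pixels per pass gives the same result as 16 pixels per pass, at the cost of twice as many passes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-pixel edge test (an assumption for illustration, not
 * the real filter): flag a pixel when |p0 - q0| exceeds `limit`. */
static void edge_mask_chunk(const uint8_t *p0, const uint8_t *q0,
                            uint8_t *mask, int lanes, int limit)
{
    for (int i = 0; i < lanes; i++)
        mask[i] = abs(p0[i] - q0[i]) > limit ? 0xff : 0x00;
}

/* Walk `len` pixels in chunks of `lanes` pixels and return the number of
 * passes. `len` is assumed to be a multiple of `lanes`. */
static int edge_mask(const uint8_t *p0, const uint8_t *q0, uint8_t *mask,
                     int len, int lanes, int limit)
{
    int passes = 0;
    for (int x = 0; x + lanes <= len; x += lanes) {
        edge_mask_chunk(p0 + x, q0 + x, mask + x, lanes, limit);
        passes++;
    }
    return passes;
}
```

Calling `edge_mask` with `lanes = 16` and then with `lanes = 8` on the same input fills identical masks; only the pass count differs.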
Relative speedup over C code according to checkasm:
                         Cortex A7     A8     A9    A53    A72
lpf_h_sb_uv_w4_8bpc_neon:     1.37   1.43   1.25   1.72   1.56
lpf_h_sb_uv_w6_8bpc_neon:     2.18   2.17   1.72   2.63   2.31
lpf_h_sb_y_w4_8bpc_neon:      1.48   1.44   1.19   1.91   1.49
lpf_h_sb_y_w8_8bpc_neon:      2.34   2.08   1.78   2.84   2.41
lpf_h_sb_y_w16_8bpc_neon:     2.13   1.85   1.64   2.51   2.09
lpf_v_sb_uv_w4_8bpc_neon:     1.69   1.63   1.60   2.17   2.27
lpf_v_sb_uv_w6_8bpc_neon:     2.68   2.41   2.19   3.56   3.44
lpf_v_sb_y_w4_8bpc_neon:      1.77   1.76   1.42   2.34   2.12
lpf_v_sb_y_w8_8bpc_neon:      2.91   2.48   2.19   3.56   3.18
lpf_v_sb_y_w16_8bpc_neon:     2.59   2.14   1.97   3.25   2.81
Comparison to the original ARM64 assembly (checkasm runtimes, lower is better):

ARM64:                    Cortex A53      A72
lpf_h_sb_uv_w4_8bpc_neon:      701.2    503.7
lpf_h_sb_uv_w6_8bpc_neon:      994.7    669.3
lpf_h_sb_y_w4_8bpc_neon:      1655.4   1247.4
lpf_h_sb_y_w8_8bpc_neon:      2149.8   1578.2
lpf_h_sb_y_w16_8bpc_neon:     2323.0   1770.1
lpf_v_sb_uv_w4_8bpc_neon:      447.1    281.7
lpf_v_sb_uv_w6_8bpc_neon:      595.6    400.2
lpf_v_sb_y_w4_8bpc_neon:      1210.8    848.4
lpf_v_sb_y_w8_8bpc_neon:      1623.1   1183.9
lpf_v_sb_y_w16_8bpc_neon:     1697.5   1239.1
ARM32:
lpf_h_sb_uv_w4_8bpc_neon:      820.5    501.2
lpf_h_sb_uv_w6_8bpc_neon:     1237.4    719.7
lpf_h_sb_y_w4_8bpc_neon:      2207.0   1371.9
lpf_h_sb_y_w8_8bpc_neon:      3137.3   1803.1
lpf_h_sb_y_w16_8bpc_neon:     3295.3   1848.4
lpf_v_sb_uv_w4_8bpc_neon:      623.7    323.7
lpf_v_sb_uv_w6_8bpc_neon:      855.9    446.8
lpf_v_sb_y_w4_8bpc_neon:      1735.3    947.8
lpf_v_sb_y_w8_8bpc_neon:      2339.9   1251.9
lpf_v_sb_y_w16_8bpc_neon:     2358.3   1276.6
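To read the ARM64 and ARM32 figures against each other, one option is to divide the ARM32 timing by the ARM64 timing for the same function and core. The sketch below does that for three rows copied from the tables above. The `timing` struct and `arm32_over_arm64` helper are names made up for this example, and it assumes both ARM32 columns are Cortex A53 and A72, in the same order as the ARM64 header.

```c
/* One row of the comparison: checkasm timings copied from the tables
 * above. Index 0 is Cortex A53, index 1 is Cortex A72 (assumed to be
 * the column order for both ARM64 and ARM32). */
struct timing {
    const char *name;
    double arm64[2];
    double arm32[2];
};

static const struct timing timings[] = {
    { "lpf_h_sb_uv_w4_8bpc_neon", {  701.2,  503.7 }, {  820.5,  501.2 } },
    { "lpf_h_sb_y_w16_8bpc_neon", { 2323.0, 1770.1 }, { 3295.3, 1848.4 } },
    { "lpf_v_sb_y_w16_8bpc_neon", { 1697.5, 1239.1 }, { 2358.3, 1276.6 } },
};

/* ARM32 time divided by ARM64 time for one core (0 = A53, 1 = A72).
 * A value above 1.0 means the ARM32 version took longer. */
static double arm32_over_arm64(const struct timing *t, int core)
{
    return t->arm32[core] / t->arm64[core];
}
```

For example, `arm32_over_arm64(&timings[0], 0)` gives the A53 ratio for `lpf_h_sb_uv_w4_8bpc_neon`; the remaining rows can be added to `timings` in the same way.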
This is a WIP. I can test it on iOS and add A73 checkasm numbers next week.
Edited by Jean-Baptiste Kempf