Martin Storsjö authored
The code is a fairly exact 1:1 port of the ARM64 code, but operating
on 8 pixels at a time, instead of 16.

Relative speedup over C code according to checkasm:

                        Cortex A7     A8     A9    A53    A72    A73
lpf_h_sb_uv_w4_8bpc_neon:    1.36   1.40   1.25   1.71   1.55   1.59
lpf_h_sb_uv_w6_8bpc_neon:    2.18   2.11   1.74   2.65   2.32   2.34
lpf_h_sb_y_w4_8bpc_neon:     1.48   1.43   1.20   1.91   1.49   1.64
lpf_h_sb_y_w8_8bpc_neon:     2.34   2.05   1.78   2.84   2.35   2.69
lpf_h_sb_y_w16_8bpc_neon:    2.13   1.83   1.63   2.51   2.10   2.35
lpf_v_sb_uv_w4_8bpc_neon:    1.69   1.66   1.60   2.16   2.24   2.24
lpf_v_sb_uv_w6_8bpc_neon:    2.68   2.43   2.22   3.53   3.44   3.35
lpf_v_sb_y_w4_8bpc_neon:     1.74   1.74   1.43   2.34   2.14   2.18
lpf_v_sb_y_w8_8bpc_neon:     2.92   2.47   2.19   3.55   3.22   3.54
lpf_v_sb_y_w16_8bpc_neon:    2.62   2.19   1.98   3.25   2.80   3.10

Comparison to the original ARM64 assembly:

ARM64:                  Cortex A53     A72     A73
lpf_h_sb_uv_w4_8bpc_neon:    702.5   518.2   529.1
lpf_h_sb_uv_w6_8bpc_neon:   1007.3   672.6   736.6
lpf_h_sb_y_w4_8bpc_neon:    1652.8  1261.2  1276.5
lpf_h_sb_y_w8_8bpc_neon:    2144.7  1559.8  1638.7
lpf_h_sb_y_w16_8bpc_neon:   2318.3  1757.2  1792.8
lpf_v_sb_uv_w4_8bpc_neon:    447.1   302.0   292.4
lpf_v_sb_uv_w6_8bpc_neon:    600.0   397.7   406.9
lpf_v_sb_y_w4_8bpc_neon:    1212.6   840.1   818.4
lpf_v_sb_y_w8_8bpc_neon:    1623.3  1167.4  1156.7
lpf_v_sb_y_w16_8bpc_neon:   1694.9  1237.9  1182.3

ARM32:
lpf_h_sb_uv_w4_8bpc_neon:    821.2   501.1   500.8
lpf_h_sb_uv_w6_8bpc_neon:   1232.0   715.7   746.6
lpf_h_sb_y_w4_8bpc_neon:    2208.1  1373.2  1414.7
lpf_h_sb_y_w8_8bpc_neon:    3138.3  1843.1  1915.2
lpf_h_sb_y_w16_8bpc_neon:   3293.1  1842.5  1975.9
lpf_v_sb_uv_w4_8bpc_neon:    619.9   326.7   324.9
lpf_v_sb_uv_w6_8bpc_neon:    855.9   446.7   468.2
lpf_v_sb_y_w4_8bpc_neon:    1737.6   935.5  1007.0
lpf_v_sb_y_w8_8bpc_neon:    2346.7  1232.8  1298.3
lpf_v_sb_y_w16_8bpc_neon:   2353.4  1283.4  1379.9
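The ARM32 and ARM64 figures above are easier to compare as ratios. The following is a minimal sketch, not part of the commit, assuming the numbers are checkasm timing counts where lower is faster. The dictionary names and the `slowdown` helper are hypothetical, and only a few rows are copied in for brevity.

```python
# Illustrative only: a subset of the checkasm figures quoted above,
# stored as (Cortex-A53, Cortex-A72, Cortex-A73) per function.
ARM64 = {
    "lpf_h_sb_uv_w4_8bpc_neon": (702.5, 518.2, 529.1),
    "lpf_h_sb_y_w16_8bpc_neon": (2318.3, 1757.2, 1792.8),
    "lpf_v_sb_uv_w4_8bpc_neon": (447.1, 302.0, 292.4),
    "lpf_v_sb_y_w16_8bpc_neon": (1694.9, 1237.9, 1182.3),
}
ARM32 = {
    "lpf_h_sb_uv_w4_8bpc_neon": (821.2, 501.1, 500.8),
    "lpf_h_sb_y_w16_8bpc_neon": (3293.1, 1842.5, 1975.9),
    "lpf_v_sb_uv_w4_8bpc_neon": (619.9, 326.7, 324.9),
    "lpf_v_sb_y_w16_8bpc_neon": (2353.4, 1283.4, 1379.9),
}
CORES = ("A53", "A72", "A73")


def slowdown(name):
    """Return the ARM32/ARM64 timing ratio for each core (hypothetical helper)."""
    return tuple(a32 / a64 for a32, a64 in zip(ARM32[name], ARM64[name]))


if __name__ == "__main__":
    for name in ARM64:
        ratios = ", ".join(
            f"{core}: {ratio:.2f}" for core, ratio in zip(CORES, slowdown(name))
        )
        print(f"{name:26s}{ratios}")
```

Under that assumption, a ratio above 1.0 means the ARM32 version took longer than the ARM64 version on the same core, and a ratio below 1.0 means it was faster.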
9a100261