arm32: mc: 16 bpc blend, w_mask, emu_edge
Dec 16, 2020
Martin Storsjö authored
Checkasm benchmarks:         Cortex A7        A8       A53       A72       A73
emu_edge_w4_16bpc_neon:          375.0     312.6     268.3     159.3     170.0
emu_edge_w8_16bpc_neon:          619.3     425.5     435.5     249.5     291.1
emu_edge_w16_16bpc_neon:         719.1     568.3     506.9     324.2     314.4
emu_edge_w32_16bpc_neon:        2112.2    1677.7    1396.2    1050.5    1009.6
emu_edge_w64_16bpc_neon:        5046.8    4322.5    3693.7    3953.8    2682.8
emu_edge_w128_16bpc_neon:      16311.1   14341.3   12877.8   26183.5    8924.9

Corresponding numbers for arm64, for comparison:
                            Cortex A53       A72       A73
emu_edge_w4_16bpc_neon:          302.5     174.9     159.2
emu_edge_w8_16bpc_neon:          344.6     292.3     273.2
emu_edge_w16_16bpc_neon:         601.0     461.2     316.8
emu_edge_w32_16bpc_neon:         974.2    1274.7     960.5
emu_edge_w64_16bpc_neon:        2853.1    3527.6    2633.5
emu_edge_w128_16bpc_neon:      14633.5   26776.6    7236.0
38df0efa -
Martin Storsjö authored
Checkasm numbers:            Cortex A7        A8       A53       A72       A73
w_mask_420_w4_16bpc_neon:        350.3     216.4     215.4     141.7     134.5
w_mask_420_w8_16bpc_neon:        926.7     590.9     529.1     373.8     354.5
w_mask_420_w16_16bpc_neon:      2956.7    1880.4    1654.8    1186.1    1134.1
w_mask_420_w32_16bpc_neon:     11489.3    7426.4    6314.1    4599.8    4398.6
w_mask_420_w64_16bpc_neon:     28175.9   17898.1   16002.8   11079.0   10551.8
w_mask_420_w128_16bpc_neon:    71599.4   44630.9   40696.9   28057.3   27836.5
w_mask_422_w4_16bpc_neon:        339.0     210.1     206.7     137.3     134.7
w_mask_422_w8_16bpc_neon:        887.2     573.3     499.6     361.6     353.5
w_mask_422_w16_16bpc_neon:      2918.0    1841.6    1593.0    1194.0    1157.9
w_mask_422_w32_16bpc_neon:     11313.8    7238.7    6043.4    4577.1    4469.6
w_mask_422_w64_16bpc_neon:     27746.5   17427.2   15386.9   11082.6   10693.8
w_mask_422_w128_16bpc_neon:    70521.4   43864.9   39209.3   29045.7   28305.5
w_mask_444_w4_16bpc_neon:        325.6     202.9     198.4     135.2     129.3
w_mask_444_w8_16bpc_neon:        860.7     534.9     474.8     358.0     352.2
w_mask_444_w16_16bpc_neon:      2764.3    1714.4    1517.8    1160.6    1133.1
w_mask_444_w32_16bpc_neon:     10719.8    6738.3    5746.7    4458.6    4347.1
w_mask_444_w64_16bpc_neon:     26407.9   16224.1   14783.9   10784.3   10371.4
w_mask_444_w128_16bpc_neon:    67226.1   41060.1   37823.1   41696.1   27722.2

Corresponding numbers for arm64, for comparison:
                            Cortex A53       A72       A73
w_mask_420_w4_16bpc_neon:        173.6     123.6     120.3
w_mask_420_w8_16bpc_neon:        484.0     344.0     329.4
w_mask_420_w16_16bpc_neon:      1436.3    1025.7    1028.7
w_mask_420_w32_16bpc_neon:      5597.0    3994.8    3981.2
w_mask_420_w64_16bpc_neon:     13953.4    9700.8    9579.9
w_mask_420_w128_16bpc_neon:    35833.7   25519.3   24277.8
w_mask_422_w4_16bpc_neon:        159.4     111.7     114.2
w_mask_422_w8_16bpc_neon:        453.4     326.2     326.7
w_mask_422_w16_16bpc_neon:      1398.2    1063.3    1052.6
w_mask_422_w32_16bpc_neon:      5532.7    4143.0    4026.3
w_mask_422_w64_16bpc_neon:     13885.3    9978.0    9689.8
w_mask_422_w128_16bpc_neon:    35763.3   25822.4   24610.9
w_mask_444_w4_16bpc_neon:        152.9     110.0     112.8
w_mask_444_w8_16bpc_neon:        437.2     332.0     325.8
w_mask_444_w16_16bpc_neon:      1399.3    1068.9    1041.7
w_mask_444_w32_16bpc_neon:      5410.9    4139.7    4136.9
w_mask_444_w64_16bpc_neon:     13648.7   10011.8   10004.6
w_mask_444_w128_16bpc_neon:    35639.6   26910.8   25631.0
cf74bdec -
Martin Storsjö authored
Checkasm numbers:            Cortex A7        A8       A53       A72       A73
blend_h_w2_16bpc_neon:           190.0     163.0     135.5      67.4      71.2
blend_h_w4_16bpc_neon:           204.4     119.1     140.3      61.2      74.9
blend_h_w8_16bpc_neon:           247.6     126.2     159.5      86.1      88.4
blend_h_w16_16bpc_neon:          391.6     186.5     230.7     134.9     149.4
blend_h_w32_16bpc_neon:          734.9     354.2     454.1     248.1     270.9
blend_h_w64_16bpc_neon:         1290.8     611.7     801.1     456.6     491.3
blend_h_w128_16bpc_neon:        2876.4    1354.2    1788.6    1083.4    1092.0
blend_v_w2_16bpc_neon:           264.4     325.2     206.8     107.6     123.0
blend_v_w4_16bpc_neon:           471.8     358.7     356.9     187.0     229.9
blend_v_w8_16bpc_neon:           616.9     365.3     445.4     218.2     248.5
blend_v_w16_16bpc_neon:          928.3     517.1     629.1     325.0     358.0
blend_v_w32_16bpc_neon:         1771.6     790.1    1106.1     631.2     584.7
blend_w4_16bpc_neon:             128.8      66.6      95.5      33.5      42.0
blend_w8_16bpc_neon:             238.7     118.0     156.8      76.5      84.5
blend_w16_16bpc_neon:            809.7     360.9     482.3     268.5     298.3
blend_w32_16bpc_neon:           2015.7     916.6    1177.0     682.1     730.9

Corresponding numbers for arm64, for comparison:
                            Cortex A53       A72       A73
blend_h_w2_16bpc_neon:           109.3      83.1      56.8
blend_h_w4_16bpc_neon:           114.1      61.1      62.3
blend_h_w8_16bpc_neon:           133.3      80.8      81.0
blend_h_w16_16bpc_neon:          215.6     132.7     149.5
blend_h_w32_16bpc_neon:          390.4     253.9     235.8
blend_h_w64_16bpc_neon:          715.8     455.8     454.0
blend_h_w128_16bpc_neon:        1649.7    1034.7    1066.2
blend_v_w2_16bpc_neon:           185.9     176.3     178.3
blend_v_w4_16bpc_neon:           338.3     184.4     234.3
blend_v_w8_16bpc_neon:           427.0     214.5     252.7
blend_v_w16_16bpc_neon:          680.4     358.1     389.2
blend_v_w32_16bpc_neon:         1100.7     615.5     690.1
blend_w4_16bpc_neon:              76.0      32.3      32.1
blend_w8_16bpc_neon:             134.4      76.3      71.5
blend_w16_16bpc_neon:            476.3     268.8     301.5
blend_w32_16bpc_neon:           1226.8     659.9     782.8
f809edb4 -
Martin Storsjö authored
eeb03a73 -
Martin Storsjö authored
f3197c1a -
Martin Storsjö authored
9257a961 -
Martin Storsjö authored
This is one cycle faster on some (old) cores when the other lanes don't need to be preserved.
85de1c3b -
Martin Storsjö authored
9381637a -
Martin Storsjö authored
c6df7491 -
Martin Storsjö authored
b0c97120 -