aarch64: Improve scheduling in sad_x3/sad_x4
                 Cortex A53    A72    A73
8 bpc:
Before:
sad_x3_4x4_neon:        580    303    204
sad_x3_4x8_neon:       1065    516    323
sad_x3_8x4_neon:        668    262    282
sad_x3_8x8_neon:       1238    454    471
sad_x3_8x16_neon:      2378    842    847
sad_x3_16x8_neon:      2136    738    776
sad_x3_16x16_neon:     4162   1378   1463
After:
sad_x3_4x4_neon:        477    298    206
sad_x3_4x8_neon:        842    515    327
sad_x3_8x4_neon:        603    260    279
sad_x3_8x8_neon:       1110    451    464
sad_x3_8x16_neon:      2125    841    843
sad_x3_16x8_neon:      2124    730    766
sad_x3_16x16_neon:     4145   1370   1434

10 bpc:
Before:
sad_x3_4x4_neon:        632    247    254
sad_x3_4x8_neon:       1162    419    443
sad_x3_8x4_neon:        890    358    416
sad_x3_8x8_neon:       1670    632    759
sad_x3_8x16_neon:      3230   1179   1458
sad_x3_16x8_neon:      3070   1209   1403
sad_x3_16x16_neon:     6030   2333   2699
After:
sad_x3_4x4_neon:        522    253    255
sad_x3_4x8_neon:        932    443    431
sad_x3_8x4_neon:        880    354    406
sad_x3_8x8_neon:       1660    626    736
sad_x3_8x16_neon:      3220   1170   1397
sad_x3_16x8_neon:      3060   1184   1362
sad_x3_16x16_neon:     6020   2272   2579

Thus, this is around a 20-25% speedup on Cortex A53 for the small sizes (much smaller difference for bigger sizes though), while it doesn't make much of a difference at all (mostly within measurement noise) for the out-of-order cores (A72 and A73).
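As a rough cross-check of the percentage quoted above, the Cortex A53 figures can be run through a short script. This is only a sketch: it assumes the numbers are timings where lower is better, and it defines speedup as before / after - 1, which the message itself does not spell out. The names `a53_timings` and `speedup_percent` are made up for this illustration.

```python
# Cortex A53 figures copied from the lists above as (before, after) pairs.
# Assumption: these are timings where a lower number is better.
a53_timings = {
    "8 bpc sad_x3_4x4_neon": (580, 477),
    "8 bpc sad_x3_4x8_neon": (1065, 842),
    "8 bpc sad_x3_16x16_neon": (4162, 4145),
    "10 bpc sad_x3_4x4_neon": (632, 522),
    "10 bpc sad_x3_4x8_neon": (1162, 932),
    "10 bpc sad_x3_16x16_neon": (6030, 6020),
}


def speedup_percent(before, after):
    """Return the speedup in percent, defined here as (before / after - 1) * 100."""
    return (before / after - 1.0) * 100.0


if __name__ == "__main__":
    for name, (before, after) in a53_timings.items():
        print(f"{name}: {speedup_percent(before, after):.1f}%")
```

Under this definition the 4x4 and 4x8 sizes land in or close to the quoted range, while the 16x16 sizes come out at well under 1%.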
parent
d46938de