aarch64: Improve scheduling in sad_x3/sad_x4

                    Cortex A53    A72    A73
8 bpc:
Before:
sad_x3_4x4_neon:           580    303    204
sad_x3_4x8_neon:          1065    516    323
sad_x3_8x4_neon:           668    262    282
sad_x3_8x8_neon:          1238    454    471
sad_x3_8x16_neon:         2378    842    847
sad_x3_16x8_neon:         2136    738    776
sad_x3_16x16_neon:        4162   1378   1463
After:
sad_x3_4x4_neon:           477    298    206
sad_x3_4x8_neon:           842    515    327
sad_x3_8x4_neon:           603    260    279
sad_x3_8x8_neon:          1110    451    464
sad_x3_8x16_neon:         2125    841    843
sad_x3_16x8_neon:         2124    730    766
sad_x3_16x16_neon:        4145   1370   1434

10 bpc:
Before:
sad_x3_4x4_neon:           632    247    254
sad_x3_4x8_neon:          1162    419    443
sad_x3_8x4_neon:           890    358    416
sad_x3_8x8_neon:          1670    632    759
sad_x3_8x16_neon:         3230   1179   1458
sad_x3_16x8_neon:         3070   1209   1403
sad_x3_16x16_neon:        6030   2333   2699
After:
sad_x3_4x4_neon:           522    253    255
sad_x3_4x8_neon:           932    443    431
sad_x3_8x4_neon:           880    354    406
sad_x3_8x8_neon:          1660    626    736
sad_x3_8x16_neon:         3220   1170   1397
sad_x3_16x8_neon:         3060   1184   1362
sad_x3_16x16_neon:        6020   2272   2579

This gives a speedup of around 20-25% on Cortex A53 for the 4xN sizes,
and around 10% for the 8xN sizes at 8 bpc; the difference for the
remaining sizes is much smaller. On the out-of-order cores (A72 and
A73) it makes little difference, mostly within measurement noise.
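
For reference, the percentages quoted above can be reproduced from the
tables with a few lines of Python. This is only a sketch: it assumes
that "speedup" means before/after - 1, and the helper name
speedup_percent is made up for this illustration. The cycle counts are
the Cortex A53 8 bpc figures copied from this message.

```python
# Cortex A53 cycle counts for 8 bpc, copied from the tables in this message.
before = {"4x4": 580, "4x8": 1065, "8x4": 668, "8x8": 1238,
          "8x16": 2378, "16x8": 2136, "16x16": 4162}
after = {"4x4": 477, "4x8": 842, "8x4": 603, "8x8": 1110,
         "8x16": 2125, "16x8": 2124, "16x16": 4145}


def speedup_percent(old, new):
    """Hypothetical helper: speedup as (old / new - 1) in percent.

    Positive values mean the new code needs fewer cycles.
    """
    return (old / new - 1.0) * 100.0


speedups = {size: speedup_percent(before[size], after[size])
            for size in before}

for size, pct in speedups.items():
    print(f"sad_x3_{size}_neon: {pct:5.1f}%")
```

Swapping in the 10 bpc or A72/A73 columns shows the same pattern as
described above: the gain is concentrated in the small block sizes on
the in-order core.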