Provide implementations for sad, sad_xN, ssd functions using dotprod instructions on aarch64
Based on the groundwork of !169 (merged), provide implementations for functions using the instructions SDOT/UDOT in the DotProd Armv8 extension.
Functions implemented: sad_16x8, sad_16x16, sad_x3_16x8_neon, sad_x3_16x16_neon, sad_x4_16x8_neon, sad_x4_16x16_neon, ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16, pixel_vsad
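To illustrate why a dot-product instruction fits these functions, here is a minimal scalar C model of how SAD and SSD can be expressed with UDOT. It is a sketch and is not taken from the assembly in this MR: the helper names (`udot_model`, `uabd_model`, `sad_16xh_model`, `ssd_16xh_model`) are hypothetical, and the sketch assumes the usual approach of taking per-byte absolute differences first, then accumulating them with UDOT against either an all-ones vector (SAD) or the differences themselves (SSD).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one UDOT on 128-bit registers: each of the four 32-bit
 * accumulator lanes gains the dot product of the matching group of four
 * unsigned bytes from a and b. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int i = 0; i < 4; i++)
            acc[lane] += (uint32_t)a[4 * lane + i] * b[4 * lane + i];
}

/* Scalar model of UABD: per-byte absolute difference. */
static void uabd_model(uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)abs((int)a[i] - (int)b[i]);
}

/* SAD of a 16-pixel-wide block (hypothetical helper): the absolute
 * differences are summed by a dot product with an all-ones vector. */
static uint32_t sad_16xh_model(const uint8_t *p1, intptr_t s1,
                               const uint8_t *p2, intptr_t s2, int height)
{
    uint8_t ones[16], d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };

    assert(height > 0);
    for (int i = 0; i < 16; i++)
        ones[i] = 1;
    for (int y = 0; y < height; y++) {
        uabd_model(d, p1 + y * s1, p2 + y * s2);
        udot_model(acc, d, ones);
    }
    /* Final horizontal add of the four lanes. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* SSD of a 16-pixel-wide block (hypothetical helper): the absolute
 * differences are squared and summed by a dot product with themselves. */
static uint32_t ssd_16xh_model(const uint8_t *p1, intptr_t s1,
                               const uint8_t *p2, intptr_t s2, int height)
{
    uint8_t d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };

    assert(height > 0);
    for (int y = 0; y < height; y++) {
        uabd_model(d, p1 + y * s1, p2 + y * s2);
        udot_model(acc, d, d);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In this model a single UDOT replaces the widening multiply and pairwise-add steps that a plain Neon version would need, which is one plausible reason the ssd functions gain the most.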
Performance improvement over Neon ranges from about 5% (sad_16x8) to about 88% (ssd_16x16, i.e. 1.88x the Neon speed).
Following is the output of ./checkasm8 --bench (run on a Graviton4 system):
sad_16x8_c: 1324
sad_16x8_neon: 222
sad_16x8_dotprod: 211
sad_16x16_c: 2535
sad_16x16_neon: 344
sad_16x16_dotprod: 325
sad_x3_16x8_c: 3837
sad_x3_16x8_neon: 415
sad_x3_16x8_dotprod: 329
sad_x3_16x16_c: 7724
sad_x3_16x16_neon: 722
sad_x3_16x16_dotprod: 546
sad_x4_16x8_c: 5080
sad_x4_16x8_neon: 438
sad_x4_16x8_dotprod: 377
sad_x4_16x16_c: 10263
sad_x4_16x16_neon: 784
sad_x4_16x16_dotprod: 652
ssd_8x4_c: 381
ssd_8x4_neon: 163
ssd_8x4_dotprod: 133
ssd_8x4_sve: 150
ssd_8x8_c: 695
ssd_8x8_neon: 237
ssd_8x8_dotprod: 158
ssd_8x8_sve: 228
ssd_8x16_c: 1335
ssd_8x16_neon: 387
ssd_8x16_dotprod: 260
ssd_16x8_c: 1342
ssd_16x8_neon: 285
ssd_16x8_dotprod: 167
ssd_16x16_c: 2622
ssd_16x16_neon: 503
ssd_16x16_dotprod: 267
vsad_c: 2782
vsad_neon: 287
vsad_dotprod: 229
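The relative improvement of each dotprod function over its Neon counterpart can be derived from the figures above. The following is a small sketch of that arithmetic, assuming the checkasm numbers are timings where lower is better and defining improvement as neon / dotprod - 1. The struct and function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the benchmark output above (Neon and dotprod columns only). */
struct bench {
    const char *name;
    double neon;
    double dotprod;
};

static const struct bench results[] = {
    { "sad_16x8",     222, 211 }, { "sad_16x16",    344, 325 },
    { "sad_x3_16x8",  415, 329 }, { "sad_x3_16x16", 722, 546 },
    { "sad_x4_16x8",  438, 377 }, { "sad_x4_16x16", 784, 652 },
    { "ssd_8x4",      163, 133 }, { "ssd_8x8",      237, 158 },
    { "ssd_8x16",     387, 260 }, { "ssd_16x8",     285, 167 },
    { "ssd_16x16",    503, 267 }, { "vsad",         287, 229 },
};
static const size_t num_results = sizeof results / sizeof results[0];

/* Percentage by which the dotprod version is faster than the Neon one. */
static double improvement_percent(const struct bench *b)
{
    assert(b->neon > 0 && b->dotprod > 0);
    return (b->neon / b->dotprod - 1.0) * 100.0;
}

/* Find the rows with the smallest and largest improvement. */
static void find_extremes(size_t *min_idx, size_t *max_idx)
{
    *min_idx = *max_idx = 0;
    for (size_t i = 1; i < num_results; i++) {
        double p = improvement_percent(&results[i]);
        if (p < improvement_percent(&results[*min_idx]))
            *min_idx = i;
        if (p > improvement_percent(&results[*max_idx]))
            *max_idx = i;
    }
}
```

With these numbers the smallest gain is sad_16x8 at roughly 5% and the largest is ssd_16x16 at roughly 88%, which corresponds to a 1.88x ratio.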
The ssd ones are faster than the _sve ones, which raises the question of how to choose between the functions when both implementations are available (e.g. on a Graviton3/4 system).
Activity
requested review from @mstorsjo
Overall, please squash your changes so that they are presented in the form in which you'd want them to be merged in the end. That is, don't add and remove things back and forth in your branch; each commit should be a working unit.
Then add the benchmarks to the commit message of the commit(s) that add the function implementations. In this case, your MR should only need to touch files in common//pixel.
> The ssd ones are faster than the _sve ones, which brings up the point of how to choose the functions when both implementations are available (e.g. on a Graviton3/4 system).
This isn't very surprising... The SVE functions here don't really use/benefit much from SVE; the main difference is that they use expanding loads (allowing loading e.g. 8 bit values into the lower byte of 16 bit element vectors, avoiding a separate `uxtl`); the performance gain over plain NEON is very modest.

So if we do add these significantly faster dotprod implementations, I would suggest we should consider simply dropping the corresponding SVE functions that don't bring much extra value. But that should be done as a separate later change in that case.
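As an illustration of the selection question discussed here, one common pattern is to fill a function pointer in priority order, so that whichever implementation is assigned last wins when a CPU reports several features. The sketch below is hypothetical: the flag names, the stub functions and the chosen priority (dotprod after SVE) are assumptions made for the example and do not reproduce the project's actual init code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU feature bits, for illustration only. */
enum { CPU_NEON = 1 << 0, CPU_DOTPROD = 1 << 1, CPU_SVE = 1 << 2 };

typedef int (*ssd_fn)(const uint8_t *, intptr_t, const uint8_t *, intptr_t);

/* Shared scalar reference; the stubs below stand in for real assembly. */
static int ssd_8x8_ref(const uint8_t *p1, intptr_t s1,
                       const uint8_t *p2, intptr_t s2)
{
    int sum = 0;
    assert(p1 && p2);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            int d = p1[y * s1 + x] - p2[y * s2 + x];
            sum += d * d;
        }
    return sum;
}

static int ssd_8x8_c(const uint8_t *a, intptr_t sa, const uint8_t *b, intptr_t sb)
{ return ssd_8x8_ref(a, sa, b, sb); }
static int ssd_8x8_neon(const uint8_t *a, intptr_t sa, const uint8_t *b, intptr_t sb)
{ return ssd_8x8_ref(a, sa, b, sb); }
static int ssd_8x8_sve(const uint8_t *a, intptr_t sa, const uint8_t *b, intptr_t sb)
{ return ssd_8x8_ref(a, sa, b, sb); }
static int ssd_8x8_dotprod(const uint8_t *a, intptr_t sa, const uint8_t *b, intptr_t sb)
{ return ssd_8x8_ref(a, sa, b, sb); }

/* Later assignments override earlier ones, so the order encodes priority.
 * Placing dotprod last is an assumption based on the benchmarks above. */
static ssd_fn select_ssd_8x8(unsigned cpu)
{
    ssd_fn fn = ssd_8x8_c;
    if (cpu & CPU_NEON)
        fn = ssd_8x8_neon;
    if (cpu & CPU_SVE)
        fn = ssd_8x8_sve;
    if (cpu & CPU_DOTPROD)
        fn = ssd_8x8_dotprod;
    return fn;
}
```

Under this scheme, dropping the SVE versions as suggested would amount to deleting one assignment; no other selection logic would need to change.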
From your benchmarks:
sad_16x8_c: 1324
sad_16x8_neon: 222
sad_16x8_neon_dotprod: 211
This printout doesn't seem to match what I'm getting here; I'm just getting `sad_16x16_dotprod`, not `sad_16x16_neon_dotprod`.
added 1 commit
- eca273b8 - Enable compilation/assembly of code using dotprod only when available
added 1 commit
- 4c4b352a - Added dotprod pixel_vsad implementation, 25% faster than Neon
added 1 commit
- dbef5535 - removed #ifdef HAVE_DOTPROD, rescheduled instructions for better performance