Provide implementations for sad, sad_xN, ssd functions using dotprod instructions on aarch64

requested review from @mstorsjo

Overall, please squash your changes so that they are presented in the form you'd want them merged in the end. That is, don't add and remove things back and forth in your branch; each commit should be a working unit.

Then add the benchmarks to the commit message of the commit(s) that add the function implementations. In this case, your MR should only need to touch files in common//pixel.

The ssd ones are faster than the _sve ones, which brings up the point of how to choose the functions when both implementations are available (e.g. on a Graviton3/4 system).

This isn't very surprising... The SVE functions here don't really use/benefit much from SVE, the main difference is that they use expanding loads (allowing loading e.g. 8 bit values into the lower byte of 16 bit element vectors, avoiding a separate uxtl); the performance gain over plain NEON is very modest.
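For comparison, here is a rough scalar model of how a dot-product instruction can accumulate a SAD without any widening step. It assumes the usual pattern of taking byte-wise absolute differences and feeding them to `udot` against a vector of ones; whether the MR does exactly this is an assumption, and all function names below are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain scalar SAD over a 16-pixel-wide block, used as the reference. */
static uint32_t sad_16xh_ref(const uint8_t *a, int a_stride,
                             const uint8_t *b, int b_stride, int h)
{
    uint32_t sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 16; x++)
            sum += (uint32_t)abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

/* Scalar model of one 128-bit UDOT: each of the four 32-bit lanes
 * accumulates the dot product of four adjacent byte pairs. */
static void udot_model(uint32_t acc[4], const uint8_t n[16], const uint8_t m[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int i = 0; i < 4; i++)
            acc[lane] += (uint32_t)n[4 * lane + i] * m[4 * lane + i];
}

/* Hypothetical sketch: SAD as (absolute difference) dot (all ones).
 * The bytes go straight into 32-bit accumulators, so no separate
 * widening instruction is modelled. */
static uint32_t sad_16xh_dotprod_model(const uint8_t *a, int a_stride,
                                       const uint8_t *b, int b_stride, int h)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };
    uint8_t ones[16], absdiff[16];

    for (int i = 0; i < 16; i++)
        ones[i] = 1;

    for (int y = 0; y < h; y++) {
        /* Models a byte-wise absolute difference of one row. */
        for (int x = 0; x < 16; x++)
            absdiff[x] = (uint8_t)abs(a[y * a_stride + x] - b[y * b_stride + x]);
        udot_model(acc, absdiff, ones);
    }

    /* Models the final horizontal add across the four lanes. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Both functions should return the same value for any input; the model only illustrates where the accumulation happens, not the actual instruction scheduling.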

So if we do add these significantly faster dotprod implementations, I would suggest simply dropping the corresponding SVE functions, which don't bring much extra value. That should be done as a separate, later change, though.
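Until then, the choice between implementations could be sketched as an ordered sequence of capability checks, where the preferred variant is assigned last. This is only an illustration: the flag names, the function names and the selection helper below are hypothetical stand-ins, not the project's actual identifiers.

```c
#include <stdint.h>

/* Hypothetical capability bits; the project's real flag names may differ. */
enum {
    CPU_NEON    = 1 << 0,
    CPU_SVE     = 1 << 1,
    CPU_DOTPROD = 1 << 2
};

typedef int (*ssd_fn)(const uint8_t *a, intptr_t a_stride,
                      const uint8_t *b, intptr_t b_stride);

/* Scalar 8x8 sum of squared differences. */
static int ssd_8x8_scalar(const uint8_t *a, intptr_t a_stride,
                          const uint8_t *b, intptr_t b_stride)
{
    int sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            int d = a[y * a_stride + x] - b[y * b_stride + x];
            sum += d * d;
        }
    return sum;
}

/* Stand-ins for the assembly variants; here they just call the scalar code. */
static int ssd_8x8_neon(const uint8_t *a, intptr_t as, const uint8_t *b, intptr_t bs)
{ return ssd_8x8_scalar(a, as, b, bs); }
static int ssd_8x8_sve(const uint8_t *a, intptr_t as, const uint8_t *b, intptr_t bs)
{ return ssd_8x8_scalar(a, as, b, bs); }
static int ssd_8x8_dotprod(const uint8_t *a, intptr_t as, const uint8_t *b, intptr_t bs)
{ return ssd_8x8_scalar(a, as, b, bs); }

/* Later checks override earlier ones, so dotprod wins when both
 * SVE and dotprod are reported by the CPU. */
static ssd_fn select_ssd_8x8(uint32_t cpu)
{
    ssd_fn fn = ssd_8x8_scalar;
    if (cpu & CPU_NEON)
        fn = ssd_8x8_neon;
    if (cpu & CPU_SVE)
        fn = ssd_8x8_sve;
    if (cpu & CPU_DOTPROD)
        fn = ssd_8x8_dotprod;
    return fn;
}
```

With this ordering, a system that reports NEON, SVE and dotprod together ends up with the dotprod variant, while an SVE-only system still gets the SVE one.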

From your benchmarks:

sad_16x8_c: 1324
sad_16x8_neon: 222
sad_16x8_neon_dotprod: 211

This printout doesn't seem to match what I'm getting here; I only get sad_16x16_dotprod, not sad_16x16_neon_dotprod.

added 1 commit

eca273b8 - Enable compilation/assembly of code using dotprod only when available

changed the description

added 1 commit

4c4b352a - Added dotprod pixel_vsad implementation, 25% faster than Neon

changed the description

added 1 commit

dbef5535 - removed #ifdef HAVE_DOTPROD, rescheduled instructions for better performance
