
Provide implementations for sad, sad_xN, ssd functions using dotprod instructions on aarch64

Based on the groundwork of !169 (merged), provide implementations of these functions using the SDOT/UDOT instructions from the Armv8 DotProd extension.

Functions implemented: sad_16x8, sad_16x16, sad_x3_16x8, sad_x3_16x16, sad_x4_16x8, sad_x4_16x16, ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16, pixel_vsad
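The assembly itself is not reproduced on this page, so the following is only a sketch of one common way to build SAD and SSD on top of UDOT: take the per-byte absolute difference (as UABD does), then accumulate it with UDOT against a vector of ones for SAD, or against itself for SSD. It models the instruction semantics in portable scalar C rather than using intrinsics. All function names here are hypothetical and do not come from the MR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of UDOT on one 128-bit register pair: each of the four
 * 32-bit lanes accumulates the dot product of four unsigned bytes. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i] += (uint32_t)a[4 * i + j] * b[4 * i + j];
}

/* Scalar model of UABD: per-byte absolute difference of 16 pixels. */
static void uabd_model(uint8_t d[16], const uint8_t *p, const uint8_t *q)
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)abs(p[i] - q[i]);
}

/* Hypothetical helper: SAD of a 16-pixel-wide block of the given height. */
uint32_t sad_16xh_model(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2, int height)
{
    uint8_t ones[16], d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };

    assert(height > 0);
    for (int i = 0; i < 16; i++)
        ones[i] = 1;
    for (int y = 0; y < height; y++) {
        uabd_model(d, pix1 + y * stride1, pix2 + y * stride2);
        udot_model(acc, d, ones);   /* |a-b| * 1 sums the differences */
    }
    /* Horizontal add of the four lanes (what ADDV would do). */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Hypothetical helper: SSD of a 16-pixel-wide block of the given height. */
uint32_t ssd_16xh_model(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2, int height)
{
    uint8_t d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };

    assert(height > 0);
    for (int y = 0; y < height; y++) {
        uabd_model(d, pix1 + y * stride1, pix2 + y * stride2);
        udot_model(acc, d, d);      /* |a-b| * |a-b| sums the squares */
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The point of the sketch is that UDOT accumulates directly into 32-bit lanes, so no separate widening step is needed between the difference and the accumulation.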

Performance improvement against Neon ranges from about 5% (sad_16x8) to about 88% (ssd_16x16, a 1.88x speedup).

Following is the output of ./checkasm8 --bench (run on a Graviton4 system):

sad_16x8_c: 1324
sad_16x8_neon: 222
sad_16x8_dotprod: 211
sad_16x16_c: 2535
sad_16x16_neon: 344
sad_16x16_dotprod: 325
sad_x3_16x8_c: 3837
sad_x3_16x8_neon: 415
sad_x3_16x8_dotprod: 329
sad_x3_16x16_c: 7724
sad_x3_16x16_neon: 722
sad_x3_16x16_dotprod: 546
sad_x4_16x8_c: 5080
sad_x4_16x8_neon: 438
sad_x4_16x8_dotprod: 377
sad_x4_16x16_c: 10263
sad_x4_16x16_neon: 784
sad_x4_16x16_dotprod: 652
ssd_8x4_c: 381
ssd_8x4_neon: 163
ssd_8x4_dotprod: 133
ssd_8x4_sve: 150
ssd_8x8_c: 695
ssd_8x8_neon: 237
ssd_8x8_dotprod: 158
ssd_8x8_sve: 228
ssd_8x16_c: 1335
ssd_8x16_neon: 387
ssd_8x16_dotprod: 260
ssd_16x8_c: 1342
ssd_16x8_neon: 285
ssd_16x8_dotprod: 167
ssd_16x16_c: 2622
ssd_16x16_neon: 503
ssd_16x16_dotprod: 267
vsad_c: 2782
vsad_neon: 287
vsad_dotprod: 229
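For reading the figures above, a small sketch of the arithmetic may help. It assumes the checkasm values are timings, so lower is faster; the function name is hypothetical. A ratio such as 503/267 ≈ 1.88 for ssd_16x16 can be read either as running at 188% of the Neon speed or as being 88% faster.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which `candidate` is faster than
 * `baseline`, given two timings where lower means faster. */
double improvement_percent(double baseline, double candidate)
{
    assert(candidate > 0.0);
    return (baseline / candidate - 1.0) * 100.0;
}
```

With the numbers listed, sad_16x8 (222 vs 211) comes out at roughly 5%, vsad (287 vs 229) at roughly 25%, and ssd_16x16 (503 vs 267) at roughly 88%.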

The ssd ones are faster than the _sve ones, which brings up the point of how to choose the functions when both implementations are available (e.g. on a Graviton3/4 system).

Edited by Konstantinos Margaritis


Activity

  • Resolved by Konstantinos Margaritis

    Overall: please squash your changes so that they are presented in the form you want them to be merged in the end. That is, don't add and remove things back and forth in your branch; each commit should be a working unit.

    Then add the benchmarks to the commit message of the commit(s) that add the function implementations. In this case, your MR should only need to touch files in common//pixel.

  • Martin Storsjö
  • The ssd ones are faster than the _sve ones, which brings up the point of how to choose the functions when both implementations are available (e.g. on a Graviton3/4 system).

    This isn't very surprising... The SVE functions here don't really use/benefit much from SVE, the main difference is that they use expanding loads (allowing loading e.g. 8 bit values into the lower byte of 16 bit element vectors, avoiding a separate uxtl); the performance gain over plain NEON is very modest.

    So if we do add these significantly faster dotprod implementations, I would suggest we should consider simply dropping the corresponding SVE functions that don't bring much extra value. But that should be done as a separate later change in that case.
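As an illustration of the selection question, here is a minimal sketch of a function-pointer table filled in by CPU feature, where later assignments override earlier ones, so the order of the checks decides which implementation wins when several are available. The flag names, the table layout, and the stub functions are all hypothetical and are not taken from the actual code base.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature flags, for illustration only. */
enum { CPU_NEON = 1 << 0, CPU_SVE = 1 << 1, CPU_DOTPROD = 1 << 2 };

/* Identifiers returned by the stubs so the chosen one can be inspected. */
enum { IMPL_NEON = 1, IMPL_SVE = 2, IMPL_DOTPROD = 3 };

typedef int (*ssd_fn)(void);

/* Stubs standing in for the real assembly routines. */
static int ssd_8x8_neon(void)    { return IMPL_NEON; }
static int ssd_8x8_sve(void)     { return IMPL_SVE; }
static int ssd_8x8_dotprod(void) { return IMPL_DOTPROD; }

typedef struct {
    ssd_fn ssd_8x8;
} pixel_functions;

/* Fill the table; each later check overrides the earlier choice. */
void pixel_init(unsigned cpu, pixel_functions *pf)
{
    assert(pf != NULL);
    pf->ssd_8x8 = NULL;
    if (cpu & CPU_NEON)
        pf->ssd_8x8 = ssd_8x8_neon;
    if (cpu & CPU_SVE)
        pf->ssd_8x8 = ssd_8x8_sve;
    if (cpu & CPU_DOTPROD)          /* checked last, so it wins over SVE */
        pf->ssd_8x8 = ssd_8x8_dotprod;
}
```

Under this scheme, dropping the SVE variants later would amount to removing the middle check, leaving Neon as the fallback for systems without DotProd.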

  • added 1 commit

    • eca273b8 - Enable compilation/assembly of code using dotprod only when available


  • Konstantinos Margaritis changed the description


  • added 1 commit

    • 4c4b352a - Added dotprod pixel_vsad implementation, 25% faster than Neon


  • Konstantinos Margaritis changed the description


  • Martin Storsjö
  • added 1 commit

    • dbef5535 - removed #ifdef HAVE_DOTPROD, rescheduled instructions for better performance


  • Martin Storsjö