ARM64: Various optimizations for symbol decode

Merged. Kyle Siefring requested to merge KyleSiefring/dav1d:arm64_msac_impr4_review into master.

Changes stem from redesigning the reduction stage of the multisymbol decode function.

  • No longer use adapt4 for 5 possible symbol values
  • Specialize reduction for 4/8/16 decode functions
  • Modify control flow
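
For readers unfamiliar with the term, the "reduction stage" can be pictured as follows. This is a minimal scalar C sketch, not dav1d's code: it assumes the vector path has already produced one threshold per symbol and compares the decoder's current value against all of them, and that the reduction turns those per-lane results into a single symbol index. The names `reduce_mask_to_symbol`, `c`, and `v` are hypothetical, and the thresholds are assumed to be strictly decreasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of dav1d.
 * Returns the decoded symbol index given per-symbol thresholds v[0..n-1]
 * in strictly decreasing order (an assumption of this sketch).
 * A SIMD version would perform all n comparisons at once and then
 * "reduce" the resulting lane mask to a count; this loop models that count. */
static unsigned reduce_mask_to_symbol(uint16_t c, const uint16_t *v, size_t n)
{
    assert(v != NULL);

    unsigned symbol = 0;
    for (size_t i = 0; i < n; i++)
        symbol += (c < v[i]);   /* each "true" lane contributes 1 */
    return symbol;
}
```

Because the thresholds decrease, the lanes where `c < v[i]` holds form a prefix, so counting them gives the index of the first threshold that `c` reaches. How that count is obtained for 4, 8, or 16 lanes is the part this merge request specializes; the sketch does not attempt to reproduce it.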

Activity

  • WIP:

    • Numbers for the 4- and 8-wide functions look good on Cortex A72, Neoverse N1, and Neoverse V1 (Graviton 1-3).
    • Numbers for the 16-wide function are a small improvement on Neoverse V1, but worse on A72 and N1.
    • The control flow changes seem to improve performance of the boolean symbol functions on A72 and N1.

    I will probably end up reverting the new 16-wide version. Its performance is already pretty close to that of the 8-wide version, so I don't expect major performance improvements.

    Edited by Kyle Siefring
  • Kyle Siefring added 1 commit

    • c1d4ee52 - ARM64: Various optimizations for symbol decode

  • Cortex A72
    Old:
    msac_decode_bool_c:                  33.0 ( 1.00x)
    msac_decode_bool_neon:               39.3 ( 0.84x)
    msac_decode_bool_adapt_c:            69.5 ( 1.00x)
    msac_decode_bool_adapt_neon:         41.6 ( 1.67x)
    msac_decode_bool_equi_c:             30.5 ( 1.00x)
    msac_decode_bool_equi_neon:          35.0 ( 0.87x)
    msac_decode_hi_tok_c:               130.1 ( 1.00x)
    msac_decode_hi_tok_neon:            119.0 ( 1.09x)
    msac_decode_symbol_adapt4_c:         92.0 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      67.8 ( 1.36x)
    msac_decode_symbol_adapt8_c:        116.2 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      76.6 ( 1.52x)
    msac_decode_symbol_adapt16_c:       153.9 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     77.5 ( 1.98x)
    
    New:
    msac_decode_bool_c:                  33.0 ( 1.00x)
    msac_decode_bool_neon:               29.0 ( 1.14x)
    msac_decode_bool_adapt_c:            60.0 ( 1.00x)
    msac_decode_bool_adapt_neon:         37.6 ( 1.60x)
    msac_decode_bool_equi_c:             30.6 ( 1.00x)
    msac_decode_bool_equi_neon:          26.3 ( 1.17x)
    msac_decode_hi_tok_c:               104.6 ( 1.00x)
    msac_decode_hi_tok_neon:            110.7 ( 0.95x)
    msac_decode_symbol_adapt4_c:         97.7 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      61.9 ( 1.58x)
    msac_decode_symbol_adapt8_c:        110.6 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      68.0 ( 1.63x)
    msac_decode_symbol_adapt16_c:       153.9 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     75.6 ( 2.04x)

    Neoverse N1
    Old:
    msac_decode_bool_c:                  14.9 ( 1.00x)
    msac_decode_bool_neon:               14.3 ( 1.05x)
    msac_decode_bool_adapt_c:            23.2 ( 1.00x)
    msac_decode_bool_adapt_neon:         17.5 ( 1.32x)
    msac_decode_bool_equi_c:             14.3 ( 1.00x)
    msac_decode_bool_equi_neon:          14.0 ( 1.02x)
    msac_decode_hi_tok_c:                73.4 ( 1.00x)
    msac_decode_hi_tok_neon:             65.2 ( 1.13x)
    msac_decode_symbol_adapt4_c:         36.5 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      28.4 ( 1.29x)
    msac_decode_symbol_adapt8_c:         52.5 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      29.0 ( 1.81x)
    msac_decode_symbol_adapt16_c:        84.3 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     33.3 ( 2.54x)
    
    New:
    msac_decode_bool_c:                  16.0 ( 1.00x)
    msac_decode_bool_neon:               14.0 ( 1.14x)
    msac_decode_bool_adapt_c:            21.4 ( 1.00x)
    msac_decode_bool_adapt_neon:         16.8 ( 1.28x)
    msac_decode_bool_equi_c:             14.3 ( 1.00x)
    msac_decode_bool_equi_neon:          11.5 ( 1.24x)
    msac_decode_hi_tok_c:                59.4 ( 1.00x)
    msac_decode_hi_tok_neon:             51.5 ( 1.15x)
    msac_decode_symbol_adapt4_c:         36.8 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      22.8 ( 1.62x)
    msac_decode_symbol_adapt8_c:         52.9 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      29.3 ( 1.81x)
    msac_decode_symbol_adapt16_c:        84.3 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     38.0 ( 2.22x)

    Neoverse V1
    Old:
    msac_decode_bool_c:                  15.3 ( 1.00x)
    msac_decode_bool_neon:               13.0 ( 1.18x)
    msac_decode_bool_adapt_c:            19.1 ( 1.00x)
    msac_decode_bool_adapt_neon:         15.4 ( 1.24x)
    msac_decode_bool_equi_c:             13.3 ( 1.00x)
    msac_decode_bool_equi_neon:          11.3 ( 1.18x)
    msac_decode_hi_tok_c:                73.7 ( 1.00x)
    msac_decode_hi_tok_neon:             63.3 ( 1.17x)
    msac_decode_symbol_adapt4_c:         30.0 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      28.6 ( 1.05x)
    msac_decode_symbol_adapt8_c:         41.9 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      29.5 ( 1.42x)
    msac_decode_symbol_adapt16_c:        64.4 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     31.6 ( 2.04x)
    
    New:
    msac_decode_bool_c:                  14.5 ( 1.00x)
    msac_decode_bool_neon:               12.9 ( 1.13x)
    msac_decode_bool_adapt_c:            18.4 ( 1.00x)
    msac_decode_bool_adapt_neon:         15.0 ( 1.23x)
    msac_decode_bool_equi_c:             13.1 ( 1.00x)
    msac_decode_bool_equi_neon:          10.5 ( 1.25x)
    msac_decode_hi_tok_c:                57.8 ( 1.00x)
    msac_decode_hi_tok_neon:             42.8 ( 1.35x)
    msac_decode_symbol_adapt4_c:         30.1 ( 1.00x)
    msac_decode_symbol_adapt4_neon:      22.6 ( 1.33x)
    msac_decode_symbol_adapt8_c:         41.6 ( 1.00x)
    msac_decode_symbol_adapt8_neon:      25.6 ( 1.63x)
    msac_decode_symbol_adapt16_c:        65.0 ( 1.00x)
    msac_decode_symbol_adapt16_neon:     29.1 ( 2.23x)
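
As a reading aid for the tables above: the figure in parentheses appears to be the C time divided by the NEON time for the same function, so values above 1.00x mean the NEON version is faster. The sketch below, with hypothetical helper names, reproduces that ratio and also compares an old and a new timing directly. It assumes both timings use the same units; the ratios printed in the tables may differ in the last digit, presumably because they are computed from unrounded measurements.

```c
/* Hypothetical helper: the ratio shown in parentheses,
 * i.e. reference (C) time divided by optimized (NEON) time. */
static double speedup(double c_time, double neon_time)
{
    return c_time / neon_time;
}

/* Hypothetical helper: percentage of time saved going from an
 * old timing to a new timing of the same function. */
static double time_saved_percent(double old_time, double new_time)
{
    return 100.0 * (old_time - new_time) / old_time;
}
```

For example, `speedup(33.0, 39.3)` is about 0.84 and `speedup(33.0, 29.0)` is about 1.14, matching the old and new Cortex A72 rows for `msac_decode_bool_neon`. Comparing the two NEON timings directly with `time_saved_percent(39.3, 29.0)` gives roughly 26%.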
  • Kyle Siefring added 1 commit

    • 1539d51a - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • d66c7375 - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • 2846a1a1 - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • 4e477ee6 - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • 742126ac - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • f04db1b8 - ARM64: Various optimizations for symbol decode

  • Kyle Siefring marked this merge request as ready

  • requested review from @mstorsjo

  • Kyle Siefring added 1 commit

    • 26b15fca - ARM64: Various optimizations for symbol decode

  • Kyle Siefring added 1 commit

    • 2f4dc609 - ARM64: Various optimizations for symbol decode

  • changed milestone to %1.4.2

  • Kyle Siefring added 1 commit

    • c2800a5d - ARM64: Various optimizations for symbol decode

  • Added updated performance metrics to the commit message.

    I also gathered metrics for the total impact on Neoverse V1 (Graviton 3).

    Procedure

    1. Create a new branch, rebase it, and drop my msac patches.
    2. Create a branch with all patches applied.
    3. Compare performance on clips by swapping between branches.

    Everything was run with a single thread. The clips are the ones Arm used in their mv patch, plus Chimera.

    Results

    Nature - 1.90% faster
    Models - 1.04% faster
    Balloons - 1.12% faster
    Mountain Bike - 2.7% faster
    Chimera-AV1-8bit-1920x1080-6736kbps - 3.78% faster

    The first four are from YouTube, so they are low bitrate.

  • Kyle Siefring added 1 commit

    • 5615cf28 - ARM64: Various optimizations for symbol decode

  • Martin Storsjö requested review from @gramner and removed review request for @mstorsjo

  • Henrik Gramner approved this merge request

  • Kyle Siefring added 14 commits

  • Kyle Siefring added 5 commits
