ARM64: Various optimizations for symbol decode
Changes stem from redesigning the reduction stage of the multisymbol decode function.
- No longer use adapt4 for 5 possible symbol values
- Specialize reduction for 4/8/16 decode functions
- Modify control flow
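For readers unfamiliar with the term, the following is a minimal sketch of what a "reduction stage" can look like in a multisymbol search. It is not taken from this patch: the function names, the threshold values, and the count-the-lanes strategy are all assumptions chosen for illustration. The idea is that a scalar decoder walks the thresholds one by one, while a vectorized decoder compares every threshold at once and then reduces the resulting mask to a single symbol index.

```python
def find_symbol_scalar(c, thresholds):
    """Walk the thresholds in order and stop at the first one that c reaches.

    `thresholds` is assumed to be non-increasing and to end in 0, so the
    loop always terminates.
    """
    for index, v in enumerate(thresholds):
        if c >= v:
            return index
    raise ValueError("thresholds must end in 0")


def find_symbol_reduced(c, thresholds):
    """Compare every threshold at once, then reduce the mask to an index.

    Because the thresholds are non-increasing, the lanes where c < v form a
    prefix of the mask, so counting them yields the same index as the loop.
    """
    mask = [c < v for v in thresholds]  # one comparison per "lane"
    return sum(mask)                    # the reduction step


# Hypothetical thresholds for a 4-symbol alphabet (illustrative values only).
thresholds = [30000, 20000, 5000, 0]
for c in (31000, 25000, 12000, 100):
    assert find_symbol_scalar(c, thresholds) == find_symbol_reduced(c, thresholds)
```

How the mask is reduced (count, horizontal add, bit scan, and so on) is where a 4-, 8-, or 16-lane variant could plausibly differ, which is one reason per-width specialization may pay off.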
WIP:
- Numbers for 4 and 8 look good for A72, Neoverse N1, and Neoverse V1 (Graviton 1-3).
- Numbers for 16-wide are a small improvement for Neoverse V1, but worse for A72 and N1.
- The control flow changes seem to improve performance for A72/N1 on boolean symbol functions.
I will probably end up reverting the new 16-wide version. Its performance is already pretty close to the 8-wide version, so I don't expect major improvements.
Edited by Kyle Siefring
added 1 commit
- c1d4ee52 - ARM64: Various optimizations for symbol decode
Cortex A72
Old:
msac_decode_bool_c: 33.0 ( 1.00x)
msac_decode_bool_neon: 39.3 ( 0.84x)
msac_decode_bool_adapt_c: 69.5 ( 1.00x)
msac_decode_bool_adapt_neon: 41.6 ( 1.67x)
msac_decode_bool_equi_c: 30.5 ( 1.00x)
msac_decode_bool_equi_neon: 35.0 ( 0.87x)
msac_decode_hi_tok_c: 130.1 ( 1.00x)
msac_decode_hi_tok_neon: 119.0 ( 1.09x)
msac_decode_symbol_adapt4_c: 92.0 ( 1.00x)
msac_decode_symbol_adapt4_neon: 67.8 ( 1.36x)
msac_decode_symbol_adapt8_c: 116.2 ( 1.00x)
msac_decode_symbol_adapt8_neon: 76.6 ( 1.52x)
msac_decode_symbol_adapt16_c: 153.9 ( 1.00x)
msac_decode_symbol_adapt16_neon: 77.5 ( 1.98x)
New:
msac_decode_bool_c: 33.0 ( 1.00x)
msac_decode_bool_neon: 29.0 ( 1.14x)
msac_decode_bool_adapt_c: 60.0 ( 1.00x)
msac_decode_bool_adapt_neon: 37.6 ( 1.60x)
msac_decode_bool_equi_c: 30.6 ( 1.00x)
msac_decode_bool_equi_neon: 26.3 ( 1.17x)
msac_decode_hi_tok_c: 104.6 ( 1.00x)
msac_decode_hi_tok_neon: 110.7 ( 0.95x)
msac_decode_symbol_adapt4_c: 97.7 ( 1.00x)
msac_decode_symbol_adapt4_neon: 61.9 ( 1.58x)
msac_decode_symbol_adapt8_c: 110.6 ( 1.00x)
msac_decode_symbol_adapt8_neon: 68.0 ( 1.63x)
msac_decode_symbol_adapt16_c: 153.9 ( 1.00x)
msac_decode_symbol_adapt16_neon: 75.6 ( 2.04x)
Neoverse N1
Old:
msac_decode_bool_c: 14.9 ( 1.00x)
msac_decode_bool_neon: 14.3 ( 1.05x)
msac_decode_bool_adapt_c: 23.2 ( 1.00x)
msac_decode_bool_adapt_neon: 17.5 ( 1.32x)
msac_decode_bool_equi_c: 14.3 ( 1.00x)
msac_decode_bool_equi_neon: 14.0 ( 1.02x)
msac_decode_hi_tok_c: 73.4 ( 1.00x)
msac_decode_hi_tok_neon: 65.2 ( 1.13x)
msac_decode_symbol_adapt4_c: 36.5 ( 1.00x)
msac_decode_symbol_adapt4_neon: 28.4 ( 1.29x)
msac_decode_symbol_adapt8_c: 52.5 ( 1.00x)
msac_decode_symbol_adapt8_neon: 29.0 ( 1.81x)
msac_decode_symbol_adapt16_c: 84.3 ( 1.00x)
msac_decode_symbol_adapt16_neon: 33.3 ( 2.54x)
New:
msac_decode_bool_c: 16.0 ( 1.00x)
msac_decode_bool_neon: 14.0 ( 1.14x)
msac_decode_bool_adapt_c: 21.4 ( 1.00x)
msac_decode_bool_adapt_neon: 16.8 ( 1.28x)
msac_decode_bool_equi_c: 14.3 ( 1.00x)
msac_decode_bool_equi_neon: 11.5 ( 1.24x)
msac_decode_hi_tok_c: 59.4 ( 1.00x)
msac_decode_hi_tok_neon: 51.5 ( 1.15x)
msac_decode_symbol_adapt4_c: 36.8 ( 1.00x)
msac_decode_symbol_adapt4_neon: 22.8 ( 1.62x)
msac_decode_symbol_adapt8_c: 52.9 ( 1.00x)
msac_decode_symbol_adapt8_neon: 29.3 ( 1.81x)
msac_decode_symbol_adapt16_c: 84.3 ( 1.00x)
msac_decode_symbol_adapt16_neon: 38.0 ( 2.22x)
Neoverse V1
Old:
msac_decode_bool_c: 15.3 ( 1.00x)
msac_decode_bool_neon: 13.0 ( 1.18x)
msac_decode_bool_adapt_c: 19.1 ( 1.00x)
msac_decode_bool_adapt_neon: 15.4 ( 1.24x)
msac_decode_bool_equi_c: 13.3 ( 1.00x)
msac_decode_bool_equi_neon: 11.3 ( 1.18x)
msac_decode_hi_tok_c: 73.7 ( 1.00x)
msac_decode_hi_tok_neon: 63.3 ( 1.17x)
msac_decode_symbol_adapt4_c: 30.0 ( 1.00x)
msac_decode_symbol_adapt4_neon: 28.6 ( 1.05x)
msac_decode_symbol_adapt8_c: 41.9 ( 1.00x)
msac_decode_symbol_adapt8_neon: 29.5 ( 1.42x)
msac_decode_symbol_adapt16_c: 64.4 ( 1.00x)
msac_decode_symbol_adapt16_neon: 31.6 ( 2.04x)
New:
msac_decode_bool_c: 14.5 ( 1.00x)
msac_decode_bool_neon: 12.9 ( 1.13x)
msac_decode_bool_adapt_c: 18.4 ( 1.00x)
msac_decode_bool_adapt_neon: 15.0 ( 1.23x)
msac_decode_bool_equi_c: 13.1 ( 1.00x)
msac_decode_bool_equi_neon: 10.5 ( 1.25x)
msac_decode_hi_tok_c: 57.8 ( 1.00x)
msac_decode_hi_tok_neon: 42.8 ( 1.35x)
msac_decode_symbol_adapt4_c: 30.1 ( 1.00x)
msac_decode_symbol_adapt4_neon: 22.6 ( 1.33x)
msac_decode_symbol_adapt8_c: 41.6 ( 1.00x)
msac_decode_symbol_adapt8_neon: 25.6 ( 1.63x)
msac_decode_symbol_adapt16_c: 65.0 ( 1.00x)
msac_decode_symbol_adapt16_neon: 29.1 ( 2.23x)
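A note on reading the benchmark listings above: the figure in parentheses appears to be the C time divided by the time of the listed implementation, so values above 1.00x mean the NEON version is faster than C. The listings do not directly show how much the NEON code itself changed between Old and New, which is the more relevant comparison for this patch. A small sketch that derives both ratios from figures quoted above (the helper name `speedup` is made up for illustration):

```python
def speedup(reference, candidate):
    """Return how many times faster `candidate` is than `reference`.

    Both arguments are timings in the same unit; lower is faster.
    """
    return reference / candidate


# Cortex A72, msac_decode_bool: C vs NEON, before and after the patch.
print(f"old NEON vs C: {speedup(33.0, 39.3):.2f}x")  # prints 0.84x
print(f"new NEON vs C: {speedup(33.0, 29.0):.2f}x")  # prints 1.14x

# NEON old vs NEON new, which removes the noise in the C baseline.
print(f"A72 bool, new vs old NEON:   {speedup(39.3, 29.0):.2f}x")
print(f"N1 adapt4, new vs old NEON:  {speedup(28.4, 22.8):.2f}x")
print(f"N1 adapt16, new vs old NEON: {speedup(33.3, 38.0):.2f}x")
```

Comparing NEON against NEON is useful here because the C baseline moves between runs (for example, msac_decode_hi_tok_c on A72 goes from 130.1 to 104.6), which shifts the parenthesized ratios even where the NEON timing barely changes.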
added 1 commit
- 1539d51a - ARM64: Various optimizations for symbol decode
added 1 commit
- d66c7375 - ARM64: Various optimizations for symbol decode
added 1 commit
- 2846a1a1 - ARM64: Various optimizations for symbol decode
added 1 commit
- 4e477ee6 - ARM64: Various optimizations for symbol decode
added 1 commit
- 742126ac - ARM64: Various optimizations for symbol decode
added 1 commit
- f04db1b8 - ARM64: Various optimizations for symbol decode
requested review from @mstorsjo
added 1 commit
- 26b15fca - ARM64: Various optimizations for symbol decode
added 1 commit
- 2f4dc609 - ARM64: Various optimizations for symbol decode
changed milestone to %1.4.2
added ARM performance labels
added 1 commit
- c2800a5d - ARM64: Various optimizations for symbol decode
Added updated performance metrics to the commit message.
I also gathered metrics for the total impact on Neoverse V1 (Graviton 3).
Procedure
- Create new branch. Rebase and drop my msac patches.
- Create a branch with all patches applied.
- Compare performance on clips by swapping between branches.
Everything was run with a single thread. I used the clips Arm used in their mv patch, plus Chimera.
Results
- Nature: 1.90% faster
- Models: 1.04% faster
- Balloons: 1.12% faster
- Mountain Bike: 2.7% faster
- Chimera-AV1-8bit-1920x1080-6736kbps: 3.78% faster
The first four are from YouTube, so they are low bitrate.
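For anyone repeating the branch comparison, here is a minimal sketch of turning repeated timings into a "% faster" figure. The exact formula behind the results above is not stated, so this assumes the reduction in wall-clock time relative to the baseline branch. The timings below are invented placeholders, not measurements from this merge request.

```python
from statistics import median


def percent_faster(baseline_s, patched_s):
    """Reduction in run time, as a percentage of the baseline run time."""
    return (baseline_s - patched_s) / baseline_s * 100.0


# Placeholder wall-clock times in seconds for one clip, three runs per branch.
baseline_runs = [10.02, 10.00, 10.05]  # branch with the msac patches dropped
patched_runs = [9.81, 9.80, 9.83]      # branch with all patches applied

# Using the median of several runs damps outliers from a noisy machine.
gain = percent_faster(median(baseline_runs), median(patched_runs))
print(f"{gain:.2f}% faster")
```

An alternative definition, baseline time divided by patched time minus one, gives slightly larger numbers, so it helps to state which one is used when reporting results.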
added 1 commit
- 5615cf28 - ARM64: Various optimizations for symbol decode
added 14 commits
- 5615cf28...8141546d - 13 commits from branch videolan:master
- 25d9f916 - ARM64: Various optimizations for symbol decode
added 5 commits
- 25d9f916...d835c6bf - 4 commits from branch videolan:master
- 7f68f23c - ARM64: Various optimizations for symbol decode