AArch64: Faster method for calculating sgr table
No speed tests yet.
Merge request reports
Activity
Here are some bench numbers:
Before: Cortex A53 A55 A72 A76 Apple M1 sgr_3x3_8bpc_neon: 387485.0 382215.6 295115.3 185532.2 462.7 sgr_5x5_8bpc_neon: 260326.7 256158.9 197389.1 128890.7 326.7 sgr_mix_8bpc_neon: 593261.1 590692.4 450166.9 282033.8 698.6 After: sgr_3x3_8bpc_neon: 393581.8 387151.7 306943.7 179458.2 441.5 sgr_5x5_8bpc_neon: 261657.7 254394.9 196568.8 120481.1 313.3 sgr_mix_8bpc_neon: 605526.6 594022.4 454753.8 268776.3 663.2
As this is benchmarking a very large function, the numbers can be a bit noisy. But the general trend does seem like it is slower on A53/A55/A72 but a bit faster on A76 and Apple M1.
A55 seems to have 1/N throughput for each table register. A72 has latency issues with tbl.
Should still use less cycles, so it's probably a latency issue for those cores. Not 100% on A53.
I could try doubling the loop width or moving around the first additions into the loop. For the later, I would need to try the same on the old code to compare.
Edited by Kyle SiefringThe update does indeed seem quite promising:
Initial (master): Cortex A53 A55 A72 A73 A76 Apple M1 sgr_3x3_8bpc_neon: 387702.8 383154.2 295742.4 302100.1 185420.7 472.2 sgr_5x5_8bpc_neon: 261725.1 256919.8 194205.1 197585.6 128311.3 332.9 sgr_mix_8bpc_neon: 628085.0 593664.2 453551.8 450553.8 281956.0 711.2 Current: sgr_3x3_8bpc_neon: 368331.4 363949.7 275499.0 272056.3 169614.4 432.7 sgr_5x5_8bpc_neon: 257866.7 255265.5 195962.5 199557.8 120481.3 319.2 sgr_mix_8bpc_neon: 598234.1 572896.4 418500.4 438910.7 258977.7 659.3
In this form, the 3x3 is a clear speedup across the board. The 5x5 is also either a small speedup, or practically equal to before. And the mix version benefits from the faster 3x3.
(I wonder what the speedup of the 3x3 would be if you'd do the same widening of the existing code, compared to this? Or are we out of register space in the existing implementation to widen it?)
added 1 commit
- 89038c6d - AArch64: New method for calculating sgr table
added 1 commit
- 3c3b1982 - AArch64: New method for calculating sgr table
requested review from @mstorsjo
- Automatically resolved by Kyle Siefring
- Automatically resolved by Kyle Siefring
- Resolved by Martin Storsjö
32 // In the comments, let RefTable denote the original, reference table. 33 const x_by_x_tables 34 // RangeMins 35 // 36 // Min(RefTable[i*8:i*8+8]) 37 // First two values are zeroed. 38 // 39 // Lookup using RangeMins[(x >> 3)] 40 .byte 0, 0, 11, 8, 6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2 41 .byte 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 42 43 // DiffMasks 44 // 45 // Diff from next RefTable entry encoded in bits. Diffs for the first 16 elements are excluded (zeroed). 46 // Using popcount, we can integrate the diff bit field. By shifting away bits in a byte, we can refine the range of 47 // the integral. Finally, adding the integral to RangeMins[(x>>3)] reconstructs RefTable (for x > 15). I wasn't quite able to grasp how this works, just by reading this, but when reading the pseudocode below, I got it. Very clever!
I'm wondering if we could rephrase it, to ease understanding on first read. What about, if we'd replace the first sentence with this:
This contains a bit pattern, indicating at which index positions the value of RefTable changes. For each range in the RangeMins table (covering 8 RefTable entries), we have one byte; each bit indicates whether the value of RefTable changes at that particular index.
- Automatically resolved by Kyle Siefring
- Automatically resolved by Kyle Siefring
- Automatically resolved by Kyle Siefring
added 28 commits
-
3c3b1982...d2687884 - 27 commits from branch
videolan:master
- 360e046e - AArch64: New method for calculating sgr table
-
3c3b1982...d2687884 - 27 commits from branch
added 1 commit
- 9fe221f7 - AArch64: New method for calculating sgr table
added 6 commits
-
9fe221f7...ec5c3052 - 5 commits from branch
videolan:master
- 79db1624 - AArch64: New method for calculating sgr table
-
9fe221f7...ec5c3052 - 5 commits from branch
enabled an automatic merge when the pipeline for 79db1624 succeeds