
AArch64: Faster method for calculating sgr table

Merged Kyle Siefring requested to merge KyleSiefring/dav1d:popcount_aarch64_sgr into master

No speed tests yet.

Activity

  • Here are some bench numbers:

    Before:            Cortex A53        A55        A72       A76   Apple M1
    sgr_3x3_8bpc_neon:   387485.0   382215.6   295115.3  185532.2   462.7
    sgr_5x5_8bpc_neon:   260326.7   256158.9   197389.1  128890.7   326.7
    sgr_mix_8bpc_neon:   593261.1   590692.4   450166.9  282033.8   698.6
    After:
    sgr_3x3_8bpc_neon:   393581.8   387151.7   306943.7  179458.2   441.5
    sgr_5x5_8bpc_neon:   261657.7   254394.9   196568.8  120481.1   313.3
    sgr_mix_8bpc_neon:   605526.6   594022.4   454753.8  268776.3   663.2

    As this benchmarks a very large function, the numbers can be a bit noisy. But the general trend does seem to be that it is slower on the A53/A55/A72 and a bit faster on the A76 and Apple M1.

    • The A55 seems to have 1/N throughput for each table register, and the A72 has latency issues with tbl.

      It should still use fewer cycles, so it's probably a latency issue on those cores. Not 100% sure about the A53.

      I could try doubling the loop width, or moving the first additions into the loop. For the latter, I would need to try the same change on the old code to compare.

      Edited by Kyle Siefring
    • The update does indeed seem quite promising:

      Initial (master):  Cortex A53        A55        A72        A73       A76   Apple M1
      sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1  185420.7   472.2
      sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6  128311.3   332.9
      sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8  281956.0   711.2
      
      Current:
      sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3  169614.4   432.7
      sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8  120481.3   319.2
      sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7  258977.7   659.3

      In this form, the 3x3 is a clear speedup across the board. The 5x5 is also either a small speedup, or practically equal to before. And the mix version benefits from the faster 3x3.

      (I wonder what the speedup of the 3x3 would be if you'd do the same widening of the existing code, compared to this? Or are we out of register space in the existing implementation to widen it?)

    • If widening the existing code is possible, it would definitely require more work. The 5x5 version has more loads in the tbl code and can almost keep up, so I suspect the problem is mostly due to latency.

  • Kyle Siefring added 2 commits

    • 1f8e7fee - AArch64: Faster method for calculating sgr table
    • 94d08384 - PoC commit for wide sgr3x3 loop


  • Kyle Siefring added 1 commit

    • 89038c6d - AArch64: New method for calculating sgr table


  • Kyle Siefring added 1 commit

    • 3c3b1982 - AArch64: New method for calculating sgr table


  • Moved a tbl instruction around for consistency. Should be close enough to reuse the profiling numbers you ran.

  • Ronald S. Bultje requested review from @mstorsjo

    Martin Storsjö @mstorsjo started a thread on commit 3c3b1982
  • // In the comments, let RefTable denote the original, reference table.
    const x_by_x_tables
    // RangeMins
    //
    // Min(RefTable[i*8:i*8+8])
    // First two values are zeroed.
    //
    // Lookup using RangeMins[(x >> 3)]
    .byte 0, 0, 11, 8, 6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2
    .byte 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0

    // DiffMasks
    //
    // Diff from next RefTable entry encoded in bits. Diffs for the first 16 elements are excluded (zeroed).
    // Using popcount, we can integrate the diff bit field. By shifting away bits in a byte, we can refine the range of
    // the integral. Finally, adding the integral to RangeMins[(x>>3)] reconstructs RefTable (for x > 15).
    • I wasn't quite able to grasp how this works, just by reading this, but when reading the pseudocode below, I got it. Very clever!

      I'm wondering if we could rephrase it, to ease understanding on first read. What about, if we'd replace the first sentence with this:

      This contains a bit pattern, indicating at which index positions the value of RefTable changes. For each range in the RangeMins table (covering 8 RefTable entries), we have one byte; each bit indicates whether the value of RefTable changes at that particular index.

  • Kyle Siefring added 28 commits


  • Kyle Siefring added 1 commit

    • 9fe221f7 - AArch64: New method for calculating sgr table


  • LGTM now, thanks!

  • Martin Storsjö approved this merge request

  • Martin Storsjö added 6 commits


  • Martin Storsjö enabled an automatic merge when the pipeline for 79db1624 succeeds
