Skip to content
Snippets Groups Projects

arm/itx16: Add 12-bit wht

Merged Nathan E. Egge requested to merge unlord/dav1d:aarch64_tx4 into master
3 unresolved threads

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Nathan E. Egge resolved all threads

    resolved all threads

  • Nathan E. Egge added 1 commit

    added 1 commit

    • d3288e4d - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes

    Compare with previous version

  • Nathan E. Egge added 1 commit

    added 1 commit

    • aaefb78e - arm32/itx: Trim dav1d_inv_wht4_1d_c, saves 68 bytes

    Compare with previous version

  • Martin Storsjö
    Martin Storsjö @mstorsjo started a thread on commit 85716d6d
  • 579 579 b L(itx_4x4_end)
    580 580 1:
    581 581 .endif
    582 adr x4, inv_\txfm1\()_4s_x4_neon
    583 movrel x5, X(inv_\txfm2\()_4h_x4_neon)
    582 adr x5, inv_\txfm1\()_4s_x4_neon
    583 movrel x6, X(inv_\txfm2\()_4h_x4_neon)
    • I'm not thrilled about deviating from the register allocation scheme that is used everywhere else - but I guess the alternative would be to waste one extra instruction on a mov w<somethingelse>, w4 to store the value in a different register. Such instructions are kinda cheap, but still they do take space.

      So I guess this is the most reasonable compromise.

    • Author Developer

      I went through the same calculus and reached the same conclusion. Would a comment saying that 4x4 deviates from the rest of the functions help? I don't know if similar changes will be needed for 12bpc transforms.

    • If we'd go through with making 12 bpc versions of all the other functions, we would probably want to change the register layout everywhere, yeah. (Depending on whether we're feeling lazy or not, we could also choose to just expend the one extra instruction to move the argument to another register, if it would feel like too much work to renumber the register use throughout it all.)

      A comment in the 4x4 implementation, somewhere, saying that we're using a different register layout due to keeping the bitdepth_max value, probably woudl be good.

    • Please register or sign in to reply
  • Martin Storsjö
    Martin Storsjö @mstorsjo started a thread on commit 2dc14e9f
  • 598 598 vld1.16 {d3}, [r0, :64], r1
    599 599
    600 600 L(itx_4x4_end):
    601 vmvn.i16 q15, #0xfc00 // 0x3ff
    601 ldr r4, [sp, #44]
    • Hmm, generally, this feels quite brittle - loading arguments from the stack, not from the current function but from one further up in the call stack. Like, sure, it works, and if it breaks it should show up in checkasm, but it's still not something I'd normally do.

      Do we have spare GPRs so we can do the load in the frontend function? I guess that costs more instructions as we'd have to do it in every entry point function though...

    • Author Developer

      Yes, it will break checkasm and we should now if anything changes. The call signature won't change, but someone may later save and restore other registers so this constant could change.

    • Right. I'm a bit undecided here - if we'd do 12 bpc versions of all the transforms, we probably would want to fetch this argument earlier, in the right function scope, as we probably can be called in a variety of cases there. If we'd do that, we would need to renumber the registers just like for arm64 though (not sure how strained we are for GPRs here though).

      As long as it is just this one case for 4x4, I can be convinced that this is ok, but I don't condone of the practice overall :-) Please leave a comment saying that we're fetching this argument from the stack of the parent fucntion.

    • Fwiw accessing the caller's stack is something we do on x86 in a bunch of places, and I don't think there's anything inherently wrong with it as long as everything is written in asm where you have full control over the stack layout.

      But yes, a comment pointing it out is useful to the future reader of the code.

    • Please register or sign in to reply
  • Martin Storsjö
    • Author Developer

      Yes, it is a bit much and a similar mess for x86, see line 100 of src/cpu.h.

      As far as I can tell, this set of conditions was first introduced in fc3777b4 and cut and paste since. I do not have 32-bit ARM Windows or Apple devices to test on, but if this can be simplified we should fix all of the places like this in a separate commit.

    • a similar mess for x86

      Fwiw the main reason for the x86 mess in src/cpu.h is that MSVC doesn't define any of the SSE macros, so every flag can't simply be checked by itself in isolation (although it could probably be simplified a little bit).

    • Please register or sign in to reply
  • Nathan E. Egge added 19 commits

    added 19 commits

    • aaefb78e...006ca01d - 15 commits from branch videolan:master
    • 3b852b15 - arm64/itx16: Add 4x4 12bpc NEON wht_wht transform
    • ec695854 - arm32/itx16: Add 4x4 12bpc NEON wht_wht transform
    • 485413b0 - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes
    • 61d16e07 - arm32/itx: Trim dav1d_inv_wht4_1d_c, saves 68 bytes

    Compare with previous version

  • Martin Storsjö approved this merge request

    approved this merge request

  • Looks ok to me, thanks!

  • merged

  • Please register or sign in to reply
    Loading