arm/itx16: Add 12-bit wht
Merge request reports
Activity
changed milestone to %1.4.1
added ARM performance labels
requested review from @mstorsjo
added 1 commit
- 9e1990f5 - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes
added 1 commit
- 77eb2023 - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes
- Automatically resolved by Nathan E. Egge
added 1 commit
- d3288e4d - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes
added 1 commit
- aaefb78e - arm32/itx: Trim dav1d_inv_wht4_1d_c, saves 68 bytes
579 579 b L(itx_4x4_end) 580 580 1: 581 581 .endif 582 adr x4, inv_\txfm1\()_4s_x4_neon 583 movrel x5, X(inv_\txfm2\()_4h_x4_neon) 582 adr x5, inv_\txfm1\()_4s_x4_neon 583 movrel x6, X(inv_\txfm2\()_4h_x4_neon) I'm not thrilled about deviating from the register allocation scheme that is used everywhere else - but I guess the alternative would be to waste one extra instruction on a
mov w<somethingelse>, w4
to store the value in a different register. Such instructions are kinda cheap, but still they do take space.So I guess this is the most reasonable compromise.
If we'd go through with making 12 bpc versions of all the other functions, we would probably want to change the register layout everywhere, yeah. (Depending on whether we're feeling lazy or not, we could also choose to just expend the one extra instruction to move the argument to another register, if it would feel like too much work to renumber the register use throughout it all.)
A comment in the 4x4 implementation, somewhere, saying that we're using a different register layout due to keeping the
bitdepth_max
value, probably woudl be good.
598 598 vld1.16 {d3}, [r0, :64], r1 599 599 600 600 L(itx_4x4_end): 601 vmvn.i16 q15, #0xfc00 // 0x3ff 601 ldr r4, [sp, #44] Hmm, generally, this feels quite brittle - loading arguments from the stack, not from the current function but from one further up in the call stack. Like, sure, it works, and if it breaks it should show up in checkasm, but it's still not something I'd normally do.
Do we have spare GPRs so we can do the load in the frontend function? I guess that costs more instructions as we'd have to do it in every entry point function though...
Right. I'm a bit undecided here - if we'd do 12 bpc versions of all the transforms, we probably would want to fetch this argument earlier, in the right function scope, as we probably can be called in a variety of cases there. If we'd do that, we would need to renumber the registers just like for arm64 though (not sure how strained we are for GPRs here though).
As long as it is just this one case for 4x4, I can be convinced that this is ok, but I don't condone of the practice overall :-) Please leave a comment saying that we're fetching this argument from the stack of the parent fucntion.
Fwiw accessing the caller's stack is something we do on x86 in a bunch of places, and I don't think there's anything inherently wrong with it as long as everything is written in asm where you have full control over the stack layout.
But yes, a comment pointing it out is useful to the future reader of the code.
- Resolved by Martin Storsjö
Yes, it is a bit much and a similar mess for x86, see line 100 of src/cpu.h.
As far as I can tell, this set of conditions was first introduced in fc3777b4 and cut and paste since. I do not have 32-bit ARM Windows or Apple devices to test on, but if this can be simplified we should fix all of the places like this in a separate commit.
added 19 commits
-
aaefb78e...006ca01d - 15 commits from branch
videolan:master
- 3b852b15 - arm64/itx16: Add 4x4 12bpc NEON wht_wht transform
- ec695854 - arm32/itx16: Add 4x4 12bpc NEON wht_wht transform
- 485413b0 - arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes
- 61d16e07 - arm32/itx: Trim dav1d_inv_wht4_1d_c, saves 68 bytes
Toggle commit list-
aaefb78e...006ca01d - 15 commits from branch