x86/deblock: make hbd/ssse3 implementations 32bit-compatible

Ronald S. Bultje requested to merge rbultje/dav1d:hbd_ssse3_deblock_32bit into master

Potential improvements / future directions:

  • wd=16 modifies only 6 pixels per side, so the 6x2x2=24 bytes could be handled in 1.5 transposes instead of 2 (i.e. one transpose8x8w + 2 transpose4x4w, instead of 2 transpose8x8w), with the accompanying writes changed to match (movu+[movq or movhps] instead of 2xmova). This would also save a small amount of stack space, since p7/q7 would no longer need to be saved. The disadvantage is that the writes would be unaligned.
  • I've modified flat6/8 to use psubw new, old; pand new, mask; paddw new, old instead of the original pandn old, mask; pand new, mask; por new, old sequence. This made the 32bit code easier to write and also turned out to be slightly faster. I have not (yet) done this for flat16.
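
For reference, here is a minimal scalar sketch of why the two flat6/8 blend sequences give the same result. It models a single 16-bit lane in plain C rather than the actual SIMD code; the function names (`blend_logic`, `blend_arith`, `check_blends`) are hypothetical and not from dav1d, and it assumes the mask lane is either all-ones or all-zeros, as a compare result would be.

```c
#include <assert.h>
#include <stdint.h>

/* Logical select, modelling: pandn old, mask; pand new, mask; por new, old.
 * Keeps `old` where the mask is 0 and takes `new_` where the mask is 0xFFFF. */
static uint16_t blend_logic(uint16_t old, uint16_t new_, uint16_t mask) {
    return (uint16_t)((old & (uint16_t)~mask) | (new_ & mask));
}

/* Arithmetic select, modelling: psubw new, old; pand new, mask; paddw new, old.
 * All arithmetic wraps modulo 2^16, like psubw/paddw on one word lane. */
static uint16_t blend_arith(uint16_t old, uint16_t new_, uint16_t mask) {
    uint16_t diff = (uint16_t)(new_ - old); /* psubw: new - old */
    diff &= mask;                           /* pand: diff or 0 */
    return (uint16_t)(diff + old);          /* paddw: new or old */
}

/* Compare both forms over a small grid of sample values
 * for both possible mask states (assumed all-zeros or all-ones). */
static void check_blends(void) {
    static const uint16_t samples[] = { 0, 1, 255, 512, 1023, 4095, 65535 };
    static const uint16_t masks[] = { 0x0000, 0xFFFF };
    const int n = (int)(sizeof(samples) / sizeof(samples[0]));
    for (int m = 0; m < 2; m++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                assert(blend_logic(samples[i], samples[j], masks[m]) ==
                       blend_arith(samples[i], samples[j], masks[m]));
}
```

With an all-ones mask the difference survives the `pand` and adding `old` back yields `new`; with an all-zeros mask the difference is cleared and the result is `old`. The wraparound in the subtraction is harmless because the addition wraps the same way.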
Edited by Ronald S. Bultje
