src/arm/64/mc16.S · a3b8157edc3b8a055190ae33497666dec2df81d4 · VideoLAN / dav1d

arm64: warped motion: Various optimizations · a3b8157e

Kyle Siefring authored Feb 04, 2021 and Martin Storsjö committed Feb 05, 2021

- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the
   first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8-bit mode:
    * Use smull instead of mul
    * Replace existing narrow and long instructions
    * Replace mov after calling with right shift
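
The smull item above can be illustrated with a scalar model. This is a minimal sketch in C of the per-lane semantics, not the code from this commit: it assumes the general pattern of replacing "sign-extend both operands, then multiply at 16 bits" with a single widening multiply. The helper names (smull_lane, sxtl_mul_lane, check_lanes_agree) are hypothetical, and how the assembly prepares unsigned pixels for a signed multiply is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of SMULL Vd.8H, Vn.8B, Vm.8B: signed 8-bit x 8-bit -> 16-bit. */
static int16_t smull_lane(int8_t a, int8_t b)
{
    return (int16_t)(a * b); /* |a * b| <= 16384, always fits in int16_t */
}

/* One lane of the two-step pattern: SXTL on each operand, then MUL .8H.
 * MUL keeps the low 16 bits; for sign-extended 8-bit inputs nothing is lost. */
static int16_t sxtl_mul_lane(int8_t a, int8_t b)
{
    int16_t wa = a; /* sxtl */
    int16_t wb = b; /* sxtl */
    return (int16_t)(wa * wb);
}

/* Exhaustively check that both forms agree for every pair of 8-bit inputs. */
static void check_lanes_agree(void)
{
    for (int a = INT8_MIN; a <= INT8_MAX; a++)
        for (int b = INT8_MIN; b <= INT8_MAX; b++)
            assert(smull_lane((int8_t)a, (int8_t)b) ==
                   sxtl_mul_lane((int8_t)a, (int8_t)b));
}
```

Since both forms produce identical lanes, the single-instruction form can drop the explicit widening step, which is presumably part of the 8 bpc gain reported below.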

Before:            Cortex A55    A53     A72     A73
warp_8x8_8bpc_neon:    1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:   1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:   1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:  1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:    1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:   1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:   1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:  1747.3  1904.1  1101.5  1207.9

