-
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the
  first stage, which will hurt performance on some older big cores.
- Rework horz stage for 8 bit mode:
    * Use smull instead of mul
    * Replace existing narrow and long instructions
    * Replace mov after calling with right shift

Before:          Cortex     A55      A53      A72      A73
warp_8x8_8bpc_neon:      1683.2   1860.6   1065.0   1102.6
warp_8x8t_8bpc_neon:     1673.2   1846.4   1057.0   1098.4
warp_8x8_16bpc_neon:     1870.7   2031.7   1147.3   1220.7
warp_8x8t_16bpc_neon:    1848.0   2006.2   1121.6   1188.0

After:
warp_8x8_8bpc_neon:      1267.2   1446.2    807.0    871.5
warp_8x8t_8bpc_neon:     1245.4   1422.0    810.2    868.4
warp_8x8_16bpc_neon:     1769.8   1929.3   1132.0   1238.2
warp_8x8t_16bpc_neon:    1747.3   1904.1   1101.5   1207.9
a3b8157e
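
The switch from mul to smull in the 8 bit horz stage can be illustrated with a scalar C model. This is a sketch under assumptions, not the NEON code from this change: it assumes both operands are signed 8-bit values and that the earlier sequence sign-extended them to 16 bits before a same-width multiply. All function names here are hypothetical.

```c
#include <stdint.h>

#define LANES 8

/* Assumed model of the older pattern: sign-extend both 8-bit inputs to
 * 16 bits, then multiply at 16-bit width and keep the low 16 bits. */
void widen_then_mul(const int8_t *a, const int8_t *b, int16_t *out)
{
    for (int i = 0; i < LANES; i++) {
        int16_t wa = a[i];            /* separate widening step */
        int16_t wb = b[i];            /* separate widening step */
        out[i] = (int16_t)(wa * wb);  /* same-width multiply    */
    }
}

/* Model of a signed widening multiply (what smull does per lane):
 * 8-bit inputs in, 16-bit products out, with no separate widening. */
void smull_model(const int8_t *a, const int8_t *b, int16_t *out)
{
    for (int i = 0; i < LANES; i++)
        out[i] = (int16_t)((int)a[i] * (int)b[i]);
}

/* Every int8 x int8 product lies in [-16256, 16384], so it always fits
 * in int16 and the two models above give identical results. */
int products_fit_in_int16(void)
{
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++) {
            int p = a * b;
            if (p < INT16_MIN || p > INT16_MAX)
                return 0;
        }
    return 1;
}
```

Because the product of two signed bytes never overflows 16 bits, the widening multiply produces the same lanes while skipping the separate extension step, which is one plausible reason the 8 bpc numbers improve more than the 16 bpc ones.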