    arm64: warped motion: Various optimizations · a3b8157e
    Kyle Siefring authored
    - Reorder loads of filters to benefit in-order cores.
    - Use full 128-bit vectors to transpose 8x8 bytes. zip1 is used in the
       first stage, which will hurt performance on some older big cores.
    - Rework horizontal stage for 8-bit mode:
        * Use smull instead of mul
        * Replace existing narrow and long instructions
        * Replace mov after calling with right shift
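
For readers unfamiliar with the smull change, here is a minimal scalar sketch in C of what a widening signed multiply on 8-bit lanes computes, compared with widening each operand first and then multiplying in 16-bit lanes. It is not the code from mc.S. The function names `smull_8b` and `sxtl_mul_8h` are hypothetical, and the sketch only models instruction semantics, not how the warp filter uses them.

```c
#include <stdint.h>

/* Scalar model of SMULL Vd.8H, Vn.8B, Vm.8B: each pair of signed 8-bit
 * lanes is multiplied and the full product is kept in a 16-bit lane. */
static void smull_8b(const int8_t n[8], const int8_t m[8], int16_t d[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)((int16_t)n[i] * (int16_t)m[i]);
}

/* Scalar model of the longer route: sign-extend both operands to 16 bits
 * (SXTL twice), then multiply lane by lane in 16 bits (MUL on .8h). */
static void sxtl_mul_8h(const int8_t n[8], const int8_t m[8], int16_t d[8]) {
    int16_t wn[8], wm[8];
    for (int i = 0; i < 8; i++) {
        wn[i] = n[i]; /* sign extension of the first operand */
        wm[i] = m[i]; /* sign extension of the second operand */
    }
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)(wn[i] * wm[i]);
}
```

Both routes give the same 16-bit products, since the largest magnitude, (-128) * (-128) = 16384, fits in a signed 16-bit lane. The widening multiply simply gets there in one instruction instead of three.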
    
    Before:            Cortex A55    A53     A72     A73
    warp_8x8_8bpc_neon:    1683.2  1860.6  1065.0  1102.6
    warp_8x8t_8bpc_neon:   1673.2  1846.4  1057.0  1098.4
    warp_8x8_16bpc_neon:   1870.7  2031.7  1147.3  1220.7
    warp_8x8t_16bpc_neon:  1848.0  2006.2  1121.6  1188.0
    After:
    warp_8x8_8bpc_neon:    1267.2  1446.2   807.0   871.5
    warp_8x8t_8bpc_neon:   1245.4  1422.0   810.2   868.4
    warp_8x8_16bpc_neon:   1769.8  1929.3  1132.0  1238.2
    warp_8x8t_16bpc_neon:  1747.3  1904.1  1101.5  1207.9
    
mc.S 117 KB