    arm: mc: Optimize blend_v · 52e9b435
    Martin Storsjö authored
    Use a post-increment with a register on the last increment, avoiding
    a separate increment. Avoid processing the last 8 pixels in the w32
    case when we only output 24 pixels.
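
The w32 remark above can be pictured with a scalar model. This is a minimal sketch, not the NEON code from the commit: the function name `blend_v_sketch`, the 0..64 weight range, the rounding, and the packed `tmp` layout are all assumptions made for illustration. It only shows that when 3/4 of each row is written (24 of 32 pixels), the last 8 columns never need to be loaded or blended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one pixel; m is the weight given to b, assumed to lie in [0, 64]. */
static inline uint8_t blend_px(uint8_t a, uint8_t b, uint8_t m) {
    return (uint8_t)((a * (64 - m) + b * m + 32) >> 6);
}

/* Hypothetical scalar model: blends only the first 3/4 of each row
 * (24 of 32 pixels for w = 32) and leaves the remaining columns of dst
 * untouched. mask holds one weight per written column. */
static void blend_v_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *tmp, const uint8_t *mask,
                           int w, int h) {
    assert(w >= 4 && h > 0);
    const int out_w = (w * 3) >> 2; /* columns actually written */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < out_w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst += dst_stride;
        tmp += w; /* assumption: tmp rows are packed with stride w */
    }
}
```

In this model, a call with `w = 32` touches `dst[0..23]` of each row and leaves `dst[24..31]` as it was, which is why a vectorized version can skip the final 8-pixel chunk entirely.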
    
    Before:
    ARM32                Cortex A7      A8      A9     A53     A72     A73
    blend_v_w4_8bpc_neon:    450.4   574.7   538.7   374.6   199.3   260.5
    blend_v_w8_8bpc_neon:    559.6   351.3   552.5   357.6   214.8   204.3
    blend_v_w16_8bpc_neon:   926.3   511.6   787.9   593.0   271.0   246.8
    blend_v_w32_8bpc_neon:  1482.5   917.0  1149.5   991.9   354.0   368.9
    ARM64                                       Cortex A53     A72     A73
    blend_v_w4_8bpc_neon:                            351.1   200.0   224.1
    blend_v_w8_8bpc_neon:                            333.0   212.4   203.8
    blend_v_w16_8bpc_neon:                           495.2   302.0   247.0
    blend_v_w32_8bpc_neon:                           840.0   557.8   514.0
    
    After:
    ARM32                Cortex A7      A8      A9     A53     A72     A73
    blend_v_w4_8bpc_neon:    435.5   575.8   537.6   356.2   198.3   259.5
    blend_v_w8_8bpc_neon:    545.2   347.9   553.5   339.1   207.8   204.2
    blend_v_w16_8bpc_neon:   913.7   511.0   788.1   573.7   275.4   243.3
    blend_v_w32_8bpc_neon:  1445.3   951.2  1079.1   920.4   352.2   361.6
    ARM64                                       Cortex A53     A72     A73
    blend_v_w4_8bpc_neon:                            333.0   191.3   225.9
    blend_v_w8_8bpc_neon:                            314.9   199.3   203.5
    blend_v_w16_8bpc_neon:                           476.9   301.3   241.1
    blend_v_w32_8bpc_neon:                           766.9   432.8   416.9
mc.S 112 KB