  • AArch64: Optimize prep_neon function · d835c6bf
    Arpad Panyik authored and Martin Storsjö committed
    Optimize the widening copy part of the subpel filters (the prep_neon
    function). This patch combines widening shifts with widening
    multiplications in the inner loops to maximize throughput.
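
The idea of mixing the two instruction types can be sketched in scalar C. This is only an illustration, not the NEON code from the patch: it assumes 8-bit pixels widened to 16 bits with a left shift of 4, and neither value is stated in this message. All names here (`SKETCH_SHIFT`, `widen_by_shift`, `widen_by_mul`, `widen_mixed`, `sketch_variants_agree`, `sketch_self_check`) are hypothetical. A widening shift by `k` and a widening multiply by `1 << k` give the same result, so a loop can alternate between them. On cores where the two operations issue on different execution units, that mix may raise throughput.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration only: 8-bit input pixels and 4 extra bits of
 * intermediate precision. Neither value comes from the commit message. */
enum { SKETCH_SHIFT = 4 };

/* Widening shift: scalar analogue of a widening shift-left instruction. */
static void widen_by_shift(int16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)((uint16_t)src[i] << SKETCH_SHIFT);
}

/* Widening multiply by a power of two: same result as the shift above. */
static void widen_by_mul(int16_t *dst, const uint8_t *src, size_t n)
{
    const unsigned factor = 1u << SKETCH_SHIFT;
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)((uint16_t)src[i] * factor);
}

/* Mixed variant: even elements use the shift, odd elements the multiply.
 * In vector code the alternation would be per register, not per element. */
static void widen_mixed(int16_t *dst, const uint8_t *src, size_t n)
{
    const unsigned factor = 1u << SKETCH_SHIFT;
    for (size_t i = 0; i < n; i++) {
        if (i & 1)
            dst[i] = (int16_t)((uint16_t)src[i] * factor);
        else
            dst[i] = (int16_t)((uint16_t)src[i] << SKETCH_SHIFT);
    }
}

/* Returns 1 if all three variants agree for every possible 8-bit input. */
static int sketch_variants_agree(void)
{
    uint8_t src[256];
    int16_t a[256], b[256], c[256];

    for (size_t i = 0; i < 256; i++)
        src[i] = (uint8_t)i;

    widen_by_shift(a, src, 256);
    widen_by_mul(b, src, 256);
    widen_mixed(c, src, 256);

    for (size_t i = 0; i < 256; i++)
        if (a[i] != b[i] || a[i] != c[i])
            return 0;
    return 1;
}

/* Aborts if the variants ever disagree. */
static void sketch_self_check(void)
{
    assert(sketch_variants_agree());
}
```

Calling `sketch_self_check()` confirms the three variants are interchangeable. The scalar code shows only the arithmetic equivalence. Any speedup depends on how a given core schedules the two vector instructions, which is what the per-core numbers below measure.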
    
    The change will increase .text by 36 bytes.
    
    Relative performance of micro benchmarks (lower is better):
    
    Cortex-A55:
      mct_w4:   0.795x
      mct_w8:   0.913x
      mct_w16:  0.912x
      mct_w32:  0.838x
      mct_w64:  1.025x
      mct_w128: 1.002x
    
    Cortex-A510:
      mct_w4:   0.760x
      mct_w8:   0.636x
      mct_w16:  0.640x
      mct_w32:  0.854x
      mct_w64:  0.864x
      mct_w128: 0.995x
    
    Cortex-A72:
      mct_w4:   0.616x
      mct_w8:   0.854x
      mct_w16:  0.756x
      mct_w32:  1.052x
      mct_w64:  1.044x
      mct_w128: 0.702x
    
    Cortex-A76:
      mct_w4:   0.837x
      mct_w8:   0.797x
      mct_w16:  0.841x
      mct_w32:  0.804x
      mct_w64:  0.948x
      mct_w128: 0.904x
    
    Cortex-A78:
      mct_w16:  0.542x
      mct_w32:  0.725x
      mct_w64:  0.741x
      mct_w128: 0.745x
    
    Cortex-A715:
      mct_w16:  0.561x
      mct_w32:  0.720x
      mct_w64:  0.740x
      mct_w128: 0.748x
    
    Cortex-X1:
      mct_w32:  0.886x
      mct_w64:  0.882x
      mct_w128: 0.917x
    
    Cortex-X3:
      mct_w32:  0.835x
      mct_w64:  0.803x
      mct_w128: 0.808x