Arpad Panyik authored
Optimize the widening copy part of the subpel filters (the prep_neon
function). This patch combines widening shifts with widening
multiplications in the inner loops to maximize throughput.
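
As a rough illustration of why the two operations can be mixed, the scalar C sketch below models a widening copy in which half of each 16-pixel group is widened with a shift and the other half with a multiplication by the matching power of two. The function names, the 16-pixel grouping, and the shift amount of 4 are assumptions made for this example; they are not taken from the patch, and the actual NEON code may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shift amount for illustration; the commit does not state it. */
enum { PREP_SHIFT = 4 };

/* Reference version: widen every 8-bit pixel to 16 bits with a shift. */
static void prep_row_shift(int16_t *dst, const uint8_t *src, size_t w)
{
    for (size_t x = 0; x < w; x++)
        dst[x] = (int16_t)((uint16_t)src[x] << PREP_SHIFT);
}

/* Mixed version: within each 16-pixel group, the first 8 pixels use a
 * widening shift and the last 8 use a widening multiply by 1 << PREP_SHIFT.
 * The results are identical; in a SIMD implementation the two forms may be
 * issued to different execution units. */
static void prep_row_mixed(int16_t *dst, const uint8_t *src, size_t w)
{
    assert(w % 16 == 0); /* hypothetical restriction of this sketch */
    for (size_t x = 0; x < w; x += 16) {
        for (size_t i = 0; i < 8; i++)
            dst[x + i] = (int16_t)((uint16_t)src[x + i] << PREP_SHIFT);
        for (size_t i = 8; i < 16; i++)
            dst[x + i] = (int16_t)((uint16_t)src[x + i] * (1u << PREP_SHIFT));
    }
}
```

Both functions produce the same output for any row whose width is a multiple of 16, so the choice between shift and multiply can be made per vector purely on throughput grounds.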

The change will increase .text by 36 bytes.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct_w4:   0.795x
  mct_w8:   0.913x
  mct_w16:  0.912x
  mct_w32:  0.838x
  mct_w64:  1.025x
  mct_w128: 1.002x

Cortex-A510:
  mct_w4:   0.760x
  mct_w8:   0.636x
  mct_w16:  0.640x
  mct_w32:  0.854x
  mct_w64:  0.864x
  mct_w128: 0.995x

Cortex-A72:
  mct_w4:   0.616x
  mct_w8:   0.854x
  mct_w16:  0.756x
  mct_w32:  1.052x
  mct_w64:  1.044x
  mct_w128: 0.702x

Cortex-A76:
  mct_w4:   0.837x
  mct_w8:   0.797x
  mct_w16:  0.841x
  mct_w32:  0.804x
  mct_w64:  0.948x
  mct_w128: 0.904x

Cortex-A78:
  mct_w16:  0.542x
  mct_w32:  0.725x
  mct_w64:  0.741x
  mct_w128: 0.745x

Cortex-A715:
  mct_w16:  0.561x
  mct_w32:  0.720x
  mct_w64:  0.740x
  mct_w128: 0.748x

Cortex-X1:
  mct_w32:  0.886x
  mct_w64:  0.882x
  mct_w128: 0.917x

Cortex-X3:
  mct_w32:  0.835x
  mct_w64:  0.803x
  mct_w128: 0.808x
d835c6bf