Arpad Panyik
authored
Optimize the widening copy part of subpel filters (the prep_neon function). In this patch we combine widening shifts with widening multiplications in the inner loops to get maximum throughput.

The change will increase .text by 36 bytes.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct_w4:   0.795x
  mct_w8:   0.913x
  mct_w16:  0.912x
  mct_w32:  0.838x
  mct_w64:  1.025x
  mct_w128: 1.002x

Cortex-A510:
  mct_w4:   0.760x
  mct_w8:   0.636x
  mct_w16:  0.640x
  mct_w32:  0.854x
  mct_w64:  0.864x
  mct_w128: 0.995x

Cortex-A72:
  mct_w4:   0.616x
  mct_w8:   0.854x
  mct_w16:  0.756x
  mct_w32:  1.052x
  mct_w64:  1.044x
  mct_w128: 0.702x

Cortex-A76:
  mct_w4:   0.837x
  mct_w8:   0.797x
  mct_w16:  0.841x
  mct_w32:  0.804x
  mct_w64:  0.948x
  mct_w128: 0.904x

Cortex-A78:
  mct_w16:  0.542x
  mct_w32:  0.725x
  mct_w64:  0.741x
  mct_w128: 0.745x

Cortex-A715:
  mct_w16:  0.561x
  mct_w32:  0.720x
  mct_w64:  0.740x
  mct_w128: 0.748x

Cortex-X1:
  mct_w32:  0.886x
  mct_w64:  0.882x
  mct_w128: 0.917x

Cortex-X3:
  mct_w32:  0.835x
  mct_w64:  0.803x
  mct_w128: 0.808x
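
The idea of mixing widening shifts with widening multiplications can be illustrated with a small scalar sketch in portable C. This is not the patch's NEON code: the function names, the 16-element block size, and the shift amount of 4 are all assumptions made for illustration. It only shows that widening an 8-bit value with a left shift and widening it with a multiplication by the matching power of two give the same 16-bit result, which is what would allow an inner loop to alternate between the two kinds of instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 8-bit pixels are widened to 16 bits and
 * scaled by 2^4. The commit message does not state the shift amount. */
enum { PREP_SHIFT = 4 };

/* Baseline (hypothetical name): every element is widened with a shift,
 * the scalar analogue of a widening shift-left instruction. */
static void widen_shift(int16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)((uint16_t)src[i] << PREP_SHIFT);
}

/* Mixed variant (hypothetical name): in each block of 16 elements the
 * first half is widened with a shift and the second half with a
 * multiplication by 1 << PREP_SHIFT, the scalar analogue of a widening
 * multiply. The results are identical to widen_shift(). */
static void widen_mixed(int16_t *dst, const uint8_t *src, size_t n)
{
    const uint16_t scale = (uint16_t)(1u << PREP_SHIFT);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        for (size_t j = 0; j < 8; j++)
            dst[i + j] = (int16_t)((uint16_t)src[i + j] << PREP_SHIFT);
        for (size_t j = 8; j < 16; j++)
            dst[i + j] = (int16_t)((uint16_t)src[i + j] * scale);
    }

    /* Remaining elements that do not fill a whole block use the shift. */
    for (; i < n; i++)
        dst[i] = (int16_t)((uint16_t)src[i] << PREP_SHIFT);
}
```

Both functions produce the same output for any input, so the choice between them is purely a throughput question. Whether such a split pays off depends on the core, as the per-core numbers in the commit message suggest.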