AArch64: Optimize 2D i8mm subpel filters
Replace the accumulator initializations in the horizontal part of the 2D filters with zero-register fills. This can improve performance on out-of-order CPUs that can zero vector registers with zero latency. Zeroed accumulators require rounding shifts at the end of the filters.
The only exception is the very short *hv_filter4*, where the longer latency of the rounding shift could decrease performance.
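The trade-off described above can be sketched with a scalar C model. This is only an illustration under stated assumptions, not the dav1d code: the function names, the tap values and the shift amount are hypothetical. It shows that seeding the accumulator with a rounding bias and then truncating gives the same result as starting from zero and applying a rounding shift, which is why zeroed accumulators need a rounding shift at the end.

```c
#include <stdint.h>

/* Hypothetical scalar model, not taken from dav1d.
 * Variant A: the accumulator is pre-loaded with the rounding bias and
 * the final shift truncates. Requires shift >= 1.
 * Note: >> on negative values is assumed to be an arithmetic shift,
 * which holds on common compilers but is implementation-defined in C. */
static int32_t filter_bias_init(const int8_t *taps, const uint8_t *px,
                                int n, int shift)
{
    int32_t acc = 1 << (shift - 1);          /* bias in the accumulator */
    for (int i = 0; i < n; i++)
        acc += taps[i] * px[i];
    return acc >> shift;                     /* truncating shift */
}

/* Variant B: the accumulator starts at zero (cheap to materialize) and
 * the rounding happens in the final shift instead. */
static int32_t filter_zero_init(const int8_t *taps, const uint8_t *px,
                                int n, int shift)
{
    int32_t acc = 0;                         /* zeroed accumulator */
    for (int i = 0; i < n; i++)
        acc += taps[i] * px[i];
    return (acc + (1 << (shift - 1))) >> shift;  /* rounding shift */
}

/* Compares both variants on pseudo-random pixels.
 * Returns 1 if every result matches, 0 otherwise. */
static int check_equivalence(void)
{
    /* Illustrative 8-tap coefficients (assumption, sum is 64). */
    static const int8_t taps[8] = { 0, 1, -5, 36, 36, -5, 1, 0 };
    uint32_t state = 1;
    for (int iter = 0; iter < 1000; iter++) {
        uint8_t px[8];
        for (int i = 0; i < 8; i++) {
            state = state * 1103515245u + 12345u;   /* simple LCG */
            px[i] = (uint8_t)(state >> 16);
        }
        if (filter_bias_init(taps, px, 8, 2) !=
            filter_zero_init(taps, px, 8, 2))
            return 0;
    }
    return 1;
}
```

Both variants are arithmetically identical. The difference on real hardware is only where the cost lands: a register copy of the bias before the dot products versus a possibly slower rounding shift afterwards.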
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.982x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.979x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.972x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.969x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.942x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.935x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.988x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.982x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.981x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.975x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.998x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.996x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.006x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.993x
Cortex-A715:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.883x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.931x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.882x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.928x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.969x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.934x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.881x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.879x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.917x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.976x
mc_8tap_regular_w2_hv_8bpc_i8mm: 0.915x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.972x
Cortex-A510:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.994x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.949x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.987x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.947x
mct_8tap_regular_w4_hv_8bpc_i8mm: 1.002x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.999x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.989x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 1.003x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.986x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w4_hv_8bpc_i8mm: 1.007x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.005x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 1.000x
Activity
added ARM performance labels
requested review from @mstorsjo
added 3 commits
- 2da2f141...d1bdf4f1 - 2 commits from branch videolan:master
- 5e92e26b - AArch64: Optimize 2D i8mm subpel filters
After the merge of !1658 (merged), we can save an instruction in the shared code path of H and HV.
             smull           v0.4s, v16.4h, v7.h[1]
             smull2          v1.4s, v16.8h, v7.h[1]
    -.ifc \isa, neon_dotprod
    +.ifc \isa, neon_i8mm
    +        mov             v16.16b, v17.16b
    +        movi            v5.4s, #0
    +        movi            v6.4s, #0
    +.else
             sub             v23.16b, v23.16b, v24.16b
    -.endif
             mov             v16.16b, v17.16b
    -
             mov             v5.16b, v27.16b
             mov             v6.16b, v27.16b

Isn't this change here equivalent to doing this?

    .ifc \isa, neon_dotprod
            sub v23
    .endif
            mov v16, v17
    .ifc \isa, neon_i8mm
            mov v5, #0
            mov v6, #0
    .else
            mov v5, v27
            mov v6, v27
    .endif

When viewed as a diff, that would feel more straightforward - but you're probably right that the end result, with just one `.if/.else` instead of two, is better. So this is ok, I'm just trying to wrap my head around the diff here.
Fixed in 431ab0fc.
             smull           v0.4s, v16.4h, v7.h[0]
             smull2          v1.4s, v16.8h, v7.h[0]
             mov             v16.16b, v17.16b
    -.ifc \isa, neon_dotprod
    +.ifc \isa, neon_i8mm
    +        movi            v5.4s, #0
    +        movi            v6.4s, #0
    +        tbl             v2.16b, {v23.16b}, v28.16b

Fixed in 431ab0fc.
- Resolved by Arpad Panyik
added 2 commits
changed milestone to %1.4.2