Skip to content
Snippets Groups Projects

AArch64: Optimize 2D i8mm subpel filters

Merged Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_i8mm_hv into master
3 unresolved threads

Rewrite the accumulator initializations of the horizontal part of the 2D filters with zero register fills. It can improve the performance on out-of-order CPUs which can fill vector registers by zero with zero latency. Zeroed accumulators imply the usage of the rounding shifts at the end of filters.

The only exception is the very short *hv_filter4*, where the longer latency of rounding shift could decrease the performance.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:

mct_8tap_regular_w16_hv_8bpc_i8mm:  0.982x
mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.979x
mct_8tap_regular_w8_hv_8bpc_i8mm:   0.972x
mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.969x
mct_8tap_regular_w4_hv_8bpc_i8mm:   0.942x
mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.935x
mc_8tap_regular_w16_hv_8bpc_i8mm:   0.988x
mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.982x
mc_8tap_regular_w8_hv_8bpc_i8mm:    0.981x
mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.975x
mc_8tap_regular_w4_hv_8bpc_i8mm:    0.998x
mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.996x
mc_8tap_regular_w2_hv_8bpc_i8mm:    1.006x
mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.993x

Cortex-A715:

mct_8tap_regular_w16_hv_8bpc_i8mm:  0.883x
mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.931x
mct_8tap_regular_w8_hv_8bpc_i8mm:   0.882x
mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.928x
mct_8tap_regular_w4_hv_8bpc_i8mm:   0.969x
mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.934x
mc_8tap_regular_w16_hv_8bpc_i8mm:   0.881x
mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.925x
mc_8tap_regular_w8_hv_8bpc_i8mm:    0.879x
mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.925x
mc_8tap_regular_w4_hv_8bpc_i8mm:    0.917x
mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.976x
mc_8tap_regular_w2_hv_8bpc_i8mm:    0.915x
mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.972x

Cortex-A510:

mct_8tap_regular_w16_hv_8bpc_i8mm:  0.994x
mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.949x
mct_8tap_regular_w8_hv_8bpc_i8mm:   0.987x
mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.947x
mct_8tap_regular_w4_hv_8bpc_i8mm:   1.002x
mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.999x
mc_8tap_regular_w16_hv_8bpc_i8mm:   0.989x
mc_8tap_sharp_w16_hv_8bpc_i8mm:     1.003x
mc_8tap_regular_w8_hv_8bpc_i8mm:    0.986x
mc_8tap_sharp_w8_hv_8bpc_i8mm:      1.000x
mc_8tap_regular_w4_hv_8bpc_i8mm:    1.007x
mc_8tap_sharp_w4_hv_8bpc_i8mm:      1.000x
mc_8tap_regular_w2_hv_8bpc_i8mm:    1.005x
mc_8tap_sharp_w2_hv_8bpc_i8mm:      1.000x

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
939 968
940 969 smull v0.4s, v16.4h, v7.h[1]
941 970 smull2 v1.4s, v16.8h, v7.h[1]
942 .ifc \isa, neon_dotprod
971 .ifc \isa, neon_i8mm
972 mov v16.16b, v17.16b
973 movi v5.4s, #0
974 movi v6.4s, #0
975 .else
943 976 sub v23.16b, v23.16b, v24.16b
944 .endif
945 977 mov v16.16b, v17.16b
946
947 978 mov v5.16b, v27.16b
948 979 mov v6.16b, v27.16b
  • Isn't this change here equivalent to doing this?

    .ifc \isa, neon_dotprod
        sub v23
    .endif
        mov v16, v17
    .ifc \isa, neon_i8mm
        mov v5, #0
        mov v6, #0
    .else
        mov v5, v27
        mov v6, v27
    .endif
    When viewed as a diff, that would feel more straightforward - but you're probably right that the end result, with just one `.if/.else` instead of two, is better. So this is ok, I'm just trying to wrap my head around the diff here.
  • I just want to use the least possible GAS directives, but your version is clearly more readable, so I will fix it!

  • Fixed in 431ab0fc.

  • Please register or sign in to reply
  • Martin Storsjö
    Martin Storsjö @mstorsjo started a thread on commit 942a8fd7
  • 704 702 smull v0.4s, v16.4h, v7.h[0]
    705 703 smull2 v1.4s, v16.8h, v7.h[0]
    706 704 mov v16.16b, v17.16b
    707 .ifc \isa, neon_dotprod
    705 .ifc \isa, neon_i8mm
    706 movi v5.4s, #0
    707 movi v6.4s, #0
    708 tbl v2.16b, {v23.16b}, v28.16b
  • Martin Storsjö
    • Overall quite good, and quite straightforward similar to the other cases. A couple details I want mentioned in the commit message, and one detail to explain and/or generalize to dotprod if beneficial.

    • We've also found a bit enhancement for the *filter4* and *filter8* functions (inspired by the original Armv8.0-a version), where we can get rid of some vector mov instructions, but it affects DotProd and i8mm CPUs, so it will be a different MR.

    • Please register or sign in to reply
  • Arpad Panyik added 1 commit

    added 1 commit

    • 431ab0fc - AArch64: Optimize 2D i8mm subpel filters

    Compare with previous version

  • Arpad Panyik added 2 commits

    added 2 commits

    • b2eca1ac - 1 commit from branch videolan:master
    • 4831afe1 - AArch64: Optimize 2D i8mm subpel filters

    Compare with previous version

  • Arpad Panyik added 1 commit

    added 1 commit

    • e7fdeeac - AArch64: Optimize 2D i8mm subpel filters

    Compare with previous version

  • Arpad Panyik added 1 commit

    added 1 commit

    • 643195f5 - AArch64: Optimize 2D i8mm subpel filters

    Compare with previous version

  • Martin Storsjö approved this merge request

    approved this merge request

  • LGTM now, thanks!

  • changed milestone to %1.4.2

  • Please register or sign in to reply
    Loading