AArch64: Optimize lane load/store in MC functions

Merged Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_lane_neon into master
  1. Sep 06, 2024
    • AArch64: Optimize lane load/store in MC functions · ec5c3052
      Arpad Panyik authored and Martin Storsjö committed
      Partial register writes can create long dependency chains, which can
      reduce performance on out-of-order CPUs. This patch removes most of
      these dependencies in the MC functions by writing the full register
      before any subsequent lane loads.
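
      As an illustration of the load change, a minimal sketch follows. It is
      not taken from the patch: the registers (v0, x2 as source pointer, x3
      as stride) and the 4-byte row width are assumptions, and the real MC
      functions may use different instructions to fill the register.

      ```asm
      // Before: the first lane load writes only lane 0, so it must merge
      // with the old contents of v0 and depends on whatever produced them.
              ld1     {v0.s}[0], [x2], x3     // row 0 -> lane 0 (partial write)
              ld1     {v0.s}[1], [x2], x3     // row 1 -> lane 1

      // After: the first load writes the whole register (ld1r replicates
      // the element and zeroes the upper 64 bits), so no old value is
      // needed. The remaining lane is inserted afterwards.
              ld1r    {v0.2s}, [x2], x3       // row 0 -> both lanes (full write)
              ld1     {v0.s}[1], [x2], x3     // row 1 overwrites lane 1
      ```

      Both sequences leave the same two rows in the low 64 bits of v0; only
      the dependency on the previous value of v0 differs.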
      
      Most lane-extracting stores can also be replaced with FP scalar
      stores when it is lane 0 that is being stored.
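
      A hedged sketch of the store change, again with assumed registers (v0
      as data, x0 as destination pointer, x1 as stride) rather than code
      from the patch. Lane 0 of v0.s is the scalar register s0, so it can be
      written with an ordinary FP store:

      ```asm
      // Before: both rows are stored with lane-extracting stores.
              st1     {v0.s}[0], [x0], x1     // store lane 0, x0 += x1
              st1     {v0.s}[1], [x0], x1     // store lane 1, x0 += x1

      // After: lane 0 goes out through a scalar store. STR has no
      // register post-index form, so the stride is added separately.
              str     s0, [x0]                // store lane 0 (low 32 bits of v0)
              add     x0, x0, x1              // advance to the next row
              st1     {v0.s}[1], [x0], x1     // lane 1 still needs a lane store
      ```

      Both sequences write the same bytes and leave x0 advanced by two
      strides.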
      
      Relative runtime of micro benchmarks after this patch on some Neoverse
      and Cortex CPU cores (values below 1.000x are faster than before):
      
      8bpc neon                V2      V1      X3      X1    A715     A78     A76
       avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
       w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
       mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
       w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
       w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
       blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
       blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
       blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
       blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
       blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
       blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
       blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
       blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x
      
      16bpc neon               V2      V1      X3      X1    A715     A78     A76
       avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
       w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
       mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
       w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
       w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
       blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
       blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
       blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
       blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
       blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
       blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x
      
      bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
       mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
       mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
       mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
       mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
       mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
       mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
       mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
       mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x
      
      bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
       mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
       mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
       mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
       mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
       mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
       mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
       mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
       mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
       mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x
      
      8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
       mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
       mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
       mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
       mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
       mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
       mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
       mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
       mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
       mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
       mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
       mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
       mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
       mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
       mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
       mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
       mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
       mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
       mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
       mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x
      
      8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
       mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
       mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
       mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
       mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
       mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
       mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
       mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
       mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
       mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
       mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
       mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
       mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
       mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
       mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
       mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
       mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
       mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
       mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
       mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
       mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x