AArch64: Optimize lane load/store in MC functions
- Sep 06, 2024
Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.

Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some
Neoverse and Cortex CPU cores:

8bpc neon           V2      V1      X3      X1      A715    A78     A76
avg w8:             0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:           0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:            0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:      0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:      0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:           0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:           0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:         0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:         0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:         0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:         0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:         0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:         0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon          V2      V1      X3      X1      A715    A78     A76
avg w4:             0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:           0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:            0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:      0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:      0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:           0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:         0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:         0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:         0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:         0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:         0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon  V2      V1      X3      X1      A715    A78     A76
mct w4 0:           0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:            0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:           0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:            0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:            0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:           0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:           0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:          0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon V2      V1      X3      X1      A715    A78     A76
mc w2 0:            0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:           0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:            0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:           0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:            0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:            0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:           0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:           0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:          0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon      V2      V1      X3      X1      A715    A78     A76
mct regular w4 0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:    0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:      0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:    0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:      0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:     0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:    1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:      1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:    0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:      1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:     0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:     0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:     0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:  0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:    0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon     V2      V1      X3      X1      A715    A78     A76
mc regular w2 0:    0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:    0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:      0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:    0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:      0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:     0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:    1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:      1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:    0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:      0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:     0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:     0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:     0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:  0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:    0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
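
The two rewrites described in the message (a full-register fill before the remaining lane loads, and a scalar FP store for lane 0) can be illustrated with a minimal sketch. This is not a hunk from the patch: the function names, register assignments, and the two-row w4 copy are assumptions chosen only to show the pattern.

```asm
// Illustrative sketch only, not code from the patch.
// Assumed arguments: x0 = dst, x1 = dst stride, x2 = src, x3 = src stride.
        .text

// Before: every load is a lane insert. A lane insert keeps the other
// lanes, so the first load has to wait for the previous writer of v0
// even though that old value is never used.
        .global copy_2rows_w4_lane      // hypothetical name
copy_2rows_w4_lane:
        ld1     {v0.s}[0], [x2], x3     // partial write, merges into old v0
        ld1     {v0.s}[1], [x2]         // partial write, merges into v0
        st1     {v0.s}[0], [x0], x1     // lane-extracting store of lane 0
        st1     {v0.s}[1], [x0]
        ret

// After: the first load writes the whole register, which breaks the
// dependency on the old v0, and lane 0 is stored with a scalar FP store.
        .global copy_2rows_w4_full      // hypothetical name
copy_2rows_w4_full:
        ldr     s0, [x2]                // writes all of v0, upper bits zeroed
        add     x2, x2, x3              // ldr has no register post-index form
        ld1     {v0.s}[1], [x2]         // now depends only on the ldr above
        str     s0, [x0]                // scalar store instead of st1 lane 0
        add     x0, x0, x1
        st1     {v0.s}[1], [x0]         // lanes other than 0 still use st1
        ret
```

Both routines copy the same two 4-byte rows. The second costs two extra `add` instructions for the address updates, but its first load no longer depends on earlier contents of `v0`.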
ec5c3052