Commits on Source (21)
-
The 6-tap horizontal and the horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on Cortex CPU cores:

SBD mct h     X1      A78     A76     A72     A55
regular w8:   0.878x  0.894x  0.990x  0.923x  0.944x
regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv    X1      A78     A76     A72     A55
regular w8:   0.931x  0.970x  0.951x  0.950x  0.971x
regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x
a992a9be -
Partial register writes can create long dependency chains, which can reduce performance on out-of-order CPUs. This patch removes most of these kinds of problems in MC functions by filling the full register before other lane loading instructions. Most lane extracting stores can also be optimized using FP scalar stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse and Cortex CPU cores:

8bpc neon           V2      V1      X3      X1      A715    A78     A76
avg w8:             0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:           0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:            0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:      0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:      0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:           0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:           0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:         0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:         0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:         0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:         0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:         0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:         0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon          V2      V1      X3      X1      A715    A78     A76
avg w4:             0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:           0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:            0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:      0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:      0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:           0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:         0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:         0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:         0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:         0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:         0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon  V2      V1      X3      X1      A715    A78     A76
mct w4 0:           0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:            0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:           0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:            0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:            0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:           0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:           0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:          0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon V2      V1      X3      X1      A715    A78     A76
mc w2 0:            0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:           0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:            0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:           0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:            0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:            0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:           0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:           0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:          0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon      V2      V1      X3      X1      A715    A78     A76
mct regular w4 0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:    0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:      0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:    0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:      0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:     0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:    1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:      1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:    0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:      1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:     0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:     0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:     0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:  0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:    0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon     V2      V1      X3      X1      A715    A78     A76
mc regular w2 0:    0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:    0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:      0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:    0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:      0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:     0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:    1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:      1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:    0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:      0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:     0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:     0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:     0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:  0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:    0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
ec5c3052 -
For the 3x3 part, double the width of the vertical loop. This is done to provide more latency in the new sgr calculation.

Initial (master):
Cortex              A53       A55       A72       A73       A76       Apple M1
sgr_3x3_8bpc_neon:  387702.8  383154.2  295742.4  302100.1  185420.7  472.2
sgr_5x5_8bpc_neon:  261725.1  256919.8  194205.1  197585.6  128311.3  332.9
sgr_mix_8bpc_neon:  628085.0  593664.2  453551.8  450553.8  281956.0  711.2

Current:
sgr_3x3_8bpc_neon:  368331.4  363949.7  275499.0  272056.3  169614.4  432.7
sgr_5x5_8bpc_neon:  257866.7  255265.5  195962.5  199557.8  120481.3  319.2
sgr_mix_8bpc_neon:  598234.1  572896.4  418500.4  438910.7  258977.7  659.3

Include a minor improvement that gets rid of a dup instruction.
79db1624 -
Martin Storsjö authored
This allows executing all the tools within e.g. valgrind. This matches the "meson test --wrap <tool>" feature.
166e1df5 -
Shared object binary size reduction:
x86_64           : 16112 bytes
ARM64            : 16008 bytes
ARM64(+Os)       : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes

Size reduction of symbols:
x86_64           : 15712 bytes
ARM64            : 18688 bytes
ARM64(+Os)       : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes

Compiles were done with clang version 18.1.8 and symbol sizes were obtained using nm on the shared object.

Provides speed ups on older ARM64 cpus with very little impact on other cpus.

Speedup:
c7i (skylake)
Nature1080p      : x0.999
Chimera          : x0.998

odroid C4
Nature1080p      : x1.007
Chimera          : x1.016
Models1080p      : x1.005
MountainBike1080p: x1.009
Balloons1080p    : x1.008

Raspberry Pi 4
Nature1080p      : x1.005
Chimera          : x0.999
Models1080p      : x0.999
MountainBike1080p: x1.004
Balloons1080p    : x1.003

Raspberry Pi 2 (Cortex-A7): (using size optimized build)
Nature1080p      : x1.003
Models1080p      : x0.997
4385e7e1 -
Kacper Michajłow authored
Instead of generating version.h, move the so version there and parse it in meson.
74ccc936 -
Kacper Michajłow authored
This is possible, because we no longer generate version.h at compile time. Reverts header change from 7629402b to preserve the same behaviour as before.
f4a0d7cb -
There are some instruction sequences we could merge after the lane load/store patch (ec5c3052). This change will simplify the loading of filter weights to save 288 bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
82e9155c -
dd32cd50 -
Jean-Baptiste Kempf authored
bd875480 -
Jean-Baptiste Kempf authored
21235966 -
Luca Barbato authored
33b9d514 -
Luca Barbato authored
da51b123 -
Luca Barbato authored
b1d847be -
Luca Barbato authored
19e122ee -
Luca Barbato authored
Initial i32x4 version; it can be used as a base for high bit depth.
0bf331a1 -
Luca Barbato authored
75d3ad14 -
Luca Barbato authored
8d9b1e26 -
8e993f4d -
This is needed for GCC 4.7 and earlier, as well as Visual Studio 2022 version 17.9 and earlier.
a7a40a3f -
f2c3ccd6
Showing
- NEWS: 24 additions, 0 deletions
- examples/dp_renderer.h: 16 additions, 6 deletions
- examples/dp_renderer_placebo.c: 11 additions, 11 deletions
- examples/meson.build: 7 additions, 3 deletions
- include/common/intops.h: 2 additions, 2 deletions
- include/dav1d/dav1d.h: 4 additions, 4 deletions
- include/dav1d/meson.build: 1 addition, 10 deletions
- include/dav1d/version.h: 4 additions, 4 deletions
- meson.build: 73 additions, 73 deletions
- package/crossfiles/arm64-iPhoneOS.meson: 27 additions, 0 deletions
- package/crossfiles/x86_64-iPhoneSimulator.meson: 27 additions, 0 deletions
- src/arm/32/util.S: 22 additions, 4 deletions
- src/arm/64/looprestoration_common.S: 179 additions, 112 deletions
- src/arm/64/mc.S: 134 additions, 130 deletions
- src/arm/64/mc16.S: 47 additions, 58 deletions
- src/arm/arm-arch.h: 68 additions, 0 deletions
- src/arm/cpu.c: 3 additions, 3 deletions
- src/cpu.c: 4 additions, 4 deletions
- src/ctx.c: 65 additions, 0 deletions
- src/ctx.h: 42 additions, 44 deletions
package/crossfiles/arm64-iPhoneOS.meson
0 → 100644
src/arm/arm-arch.h
0 → 100644
src/ctx.c
0 → 100644