- Sep 06, 2024
-
-
Shared object binary size reduction:
x86_64           : 16112 bytes
ARM64            : 16008 bytes
ARM64(+Os)       : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes

Size reduction of symbols:
x86_64           : 15712 bytes
ARM64            : 18688 bytes
ARM64(+Os)       : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes

Compiles were done with clang version 18.1.8 and symbol sizes were obtained using nm on the shared object.

Provides speed ups on older ARM64 cpus with very little impact on other cpus.

Speedup:
c7i (skylake)
  Nature1080p      : x0.999
  Chimera          : x0.998
odroid C4
  Nature1080p      : x1.007
  Chimera          : x1.016
  Models1080p      : x1.005
  MountainBike1080p: x1.009
  Balloons1080p    : x1.008
Raspberry Pi 4
  Nature1080p      : x1.005
  Chimera          : x0.999
  Models1080p      : x0.999
  MountainBike1080p: x1.004
  Balloons1080p    : x1.003
Raspberry Pi 2 (Cortex-A7): (using size optimized build)
  Nature1080p      : x1.003
  Models1080p      : x0.997
-
Martin Storsjö authored
This allows executing all the tools within e.g. valgrind. This matches the "meson test --wrap <tool>" feature.
-
For the 3x3 part, double the width of the vertical loop. This is done to provide more latency in the new sgr calculation.

Initial (master):
Cortex                   A53       A55       A72       A73       A76  Apple M1
sgr_3x3_8bpc_neon:  387702.8  383154.2  295742.4  302100.1  185420.7     472.2
sgr_5x5_8bpc_neon:  261725.1  256919.8  194205.1  197585.6  128311.3     332.9
sgr_mix_8bpc_neon:  628085.0  593664.2  453551.8  450553.8  281956.0     711.2

Current:
sgr_3x3_8bpc_neon:  368331.4  363949.7  275499.0  272056.3  169614.4     432.7
sgr_5x5_8bpc_neon:  257866.7  255265.5  195962.5  199557.8  120481.3     319.2
sgr_mix_8bpc_neon:  598234.1  572896.4  418500.4  438910.7  258977.7     659.3

Include a minor improvement that gets rid of a dup instruction.
-
Partial register writes can create long dependency chains, which can reduce performance on out-of-order CPUs. This patch removes most of these kinds of problems in MC functions by filling the full register before other lane loading instructions. Most lane extracting stores can also be optimized using FP scalar stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse and Cortex CPU cores:

8bpc neon                 V2      V1      X3      X1    A715     A78     A76
avg w8:               0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:             0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:              0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:        0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:        0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:             0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:             0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:           0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:           0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:           0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:           0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:           0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:           0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon                V2      V1      X3      X1    A715     A78     A76
avg w4:               0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:             0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:              0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:        0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:        0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:             0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:           0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:           0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:           0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:           0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:           0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon        V2      V1      X3      X1    A715     A78     A76
mct w4 0:             0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:              0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:             0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:              0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:              0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:             0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:             0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:            0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon       V2      V1      X3      X1    A715     A78     A76
mc w2 0:              0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:             0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:              0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:             0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:              0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:              0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:             0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:             0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:            0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon            V2      V1      X3      X1    A715     A78     A76
mct regular w4 0:     0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:      0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:        0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:      0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:        0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:     0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:       0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:      1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:        1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:      0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:        1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:     0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:       0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:     0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:       0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:     0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:       0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:    0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:      0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon           V2      V1      X3      X1    A715     A78     A76
mc regular w2 0:      0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:     0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:      0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:        0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:      0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:        0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:     0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:       0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:      1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:        1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:      0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:        0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:     0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:       0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:     0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:       0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:     0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:       0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:    0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:      0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
-
The 6-tap horizontal and the horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on Cortex CPU cores:

SBD mct h          X1     A78     A76     A72     A55
regular w8:    0.878x  0.894x  0.990x  0.923x  0.944x
regular w16:   0.962x  0.931x  0.943x  0.949x  0.949x
regular w32:   0.937x  0.937x  0.972x  0.938x  0.947x
regular w64:   0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv         X1     A78     A76     A72     A55
regular w8:    0.931x  0.970x  0.951x  0.950x  0.971x
regular w16:   0.940x  0.971x  0.941x  0.952x  0.967x
regular w32:   0.943x  0.972x  0.946x  0.961x  0.974x
regular w64:   0.943x  0.973x  0.952x  0.944x  0.975x
-
The horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on Cortex CPU cores:

HBD mct hv         X1     A78     A76     A72     A55
regular w8:    0.952x  0.989x  0.924x  0.973x  0.976x
regular w16:   0.961x  0.993x  0.928x  0.952x  0.971x
regular w32:   0.964x  0.996x  0.930x  0.973x  0.972x
regular w64:   0.963x  0.997x  0.930x  0.969x  0.974x
-
The 6-tap horizontal subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores:

regular:       X1     A78     A76     A55
mc w8:     0.915x  0.937x  0.900x  0.982x
mc w16:    0.917x  0.947x  0.911x  0.971x
mc w32:    0.914x  0.938x  0.873x  0.961x
mc w64:    0.918x  0.932x  0.882x  0.964x
-
The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+SRSHL instruction sequences. In the horizontal case this can be rewritten using a single SQSHRUN instruction with an additional rounding value (34 for 10-bit and 40 for 12-bit).

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores:

regular:       X1     A78     A76     A55
mc w2:     0.847x  0.864x  0.822x  0.859x
mc w4:     0.889x  0.994x  0.868x  0.917x
mc w8:     0.857x  0.911x  0.915x  0.978x
mc w16:    0.890x  0.982x  0.868x  0.974x
mc w32:    0.904x  0.991x  0.873x  0.967x
mc w64:    0.919x  1.003x  0.860x  0.970x
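The rounding values 34 and 40 can be read as two rounding shifts folded into one. The following C model is a sketch under assumptions that the commit message does not state: that the two original shifts are by 2 then 4 bits for 10-bit and by 4 then 2 bits for 12-bit, and that the result is finally clamped to the pixel maximum. All function names are hypothetical and only model the arithmetic, not the actual assembly.

```c
#include <stdint.h>

/* Floor-style right shift that avoids relying on the implementation-defined
 * result of >> on negative signed values. */
static int32_t floor_shr(int32_t v, int s) {
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

/* Rounding right shift, modelling SRSHL with a negative shift amount. */
static int32_t round_shr(int32_t v, int s) {
    return floor_shr(v + (1 << (s - 1)), s);
}

static int32_t clamp(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Bias that folds both rounding constants into one addition:
 * 2^(s1-1) for the first shift plus 2^(s2-1) scaled up by 2^s1. */
static int combined_bias(int s1, int s2) {
    return (1 << (s1 - 1)) + (1 << (s2 - 1 + s1));
}

/* Model of SRSHL + SQXTUN + SRSHL (assumed shift split s1, s2),
 * followed by an assumed clamp to the pixel maximum. */
static int32_t reduce_three_insn(int32_t sum, int s1, int s2, int32_t pixel_max) {
    int32_t t = clamp(round_shr(sum, s1), 0, 65535);
    return clamp(round_shr(t, s2), 0, pixel_max);
}

/* Model of a single SQSHRUN by s1 + s2 after adding the combined bias,
 * followed by the same assumed clamp to the pixel maximum. */
static int32_t reduce_one_insn(int32_t sum, int s1, int s2, int32_t pixel_max) {
    int32_t t = clamp(floor_shr(sum + combined_bias(s1, s2), s1 + s2), 0, 65535);
    return clamp(t, 0, pixel_max);
}

/* Returns 1 if both models agree for every sum in [-2^21, 2^21]. */
static int reductions_agree(int s1, int s2, int32_t pixel_max) {
    for (int32_t sum = -(1 << 21); sum <= (1 << 21); sum++)
        if (reduce_three_insn(sum, s1, s2, pixel_max) !=
            reduce_one_insn(sum, s1, s2, pixel_max))
            return 0;
    return 1;
}
```

Under these assumptions, `combined_bias(2, 4)` gives 34 and `combined_bias(4, 2)` gives 40, matching the values quoted in the commit message, and `reductions_agree` can be called with a pixel maximum of 1023 or 4095 to compare the two models over the tested range.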
-
- Sep 05, 2024
-
-
-
This makes `#include <dav1d/dav1d.h>` work correctly, as we point to the parent include directory, the same as in a normal installation. It also fixes a conflict when including "version.h", which may already exist in the parent project or another subproject. Be more specific about the headers. Normally it works, but when building as a subproject, version.h is generated in the build directory, so it is no longer prioritized when included from dav1d.h, and another header with the same name may be included instead.
-
- Sep 04, 2024
-
-
- Sep 01, 2024
-
-
Cameron Cawley authored
-
- Aug 30, 2024
-
-
- Aug 29, 2024
-
-
-
-
-
-
Martin Storsjö authored
This should allow executing in environments where the executable memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support relocations for a 4 byte symbol difference across sections, which allows keeping the rest of the table lookup code similar to what it was before.

Referencing a symbol in an arbitrary location in the executable requires a two instruction sequence (adrp+add, via the movrel macro). Thus, the cost of this rewrite is doubling the size of the jump tables (which were quite small so far), and adding one instruction in each jump table setup prologue.

On an ELF build, the .text section shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes, i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load (using ldrsw rather than ldr, to avoid using the "sxtw" modifier on the add instruction), as extending ALU arithmetics have a higher latency.

MS armasm64 doesn't seem to support calculating symbol differences across sections (see [1]), so keep the jump tables in the text section there, to let the assembler calculate it at assembly time instead. (Keeping the condition as _WIN32 for simplicity, as we don't interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340
-
Martin Storsjö authored
-
-
-
- Aug 26, 2024
-
-
Martin Storsjö authored
WinSDK 10.0.26100 added these processor feature constants. Unfortunately, no constant was added for I8MM, but if SVE_I8MM is available, we can at least be sure that regular I8MM is available too.
-
- Aug 24, 2024
-
-
Martin Storsjö authored
Apparently, this case isn't actually ever executed, at least in most checkasm runs, but some tools could complain about the relocation against 160b, which pointed elsewhere than intended.
-
- Aug 23, 2024
-
-
Martin Storsjö authored
This does the same optimizations as 3329f8d1 and 1790e132 on the rest of the code.
-
Martin Storsjö authored
This makes the code behave as intended, when filling a rectangle with arbitrary width (filling with the largest power of two width until filled); previously, it accidentally fell back on writing 4 pixel wide stripes immediately. No measurable effect on checkasm benchmarks though.
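The intended strategy described above, filling with the largest power-of-two stripe that still fits until the width is covered, can be sketched in C as follows. This is only an illustration of the idea, not the assembly from the commit: the function names, the byte-sized pixels and the `max_stripe` cap are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Largest power of two that is <= w, capped at max_stripe.
 * Assumes w > 0 and that max_stripe is itself a power of two. */
static size_t largest_pow2_stripe(size_t w, size_t max_stripe) {
    size_t s = max_stripe;
    while (s > w)
        s >>= 1;
    return s;
}

/* Fill a w x h rectangle of bytes with `value`, one vertical stripe at a
 * time, always choosing the widest power-of-two stripe that still fits.
 * Returns the number of stripes written. */
static int fill_rect(uint8_t *dst, ptrdiff_t stride, size_t w, size_t h,
                     uint8_t value, size_t max_stripe) {
    int stripes = 0;
    size_t x = 0;
    while (x < w) {
        const size_t sw = largest_pow2_stripe(w - x, max_stripe);
        for (size_t y = 0; y < h; y++)
            memset(dst + (ptrdiff_t)y * stride + x, value, sw);
        x += sw;
        stripes++;
    }
    return stripes;
}
```

With a hypothetical cap of 16, a width of 13 is covered by stripes of 8, 4 and 1. The behaviour described as accidental in the commit message would correspond to dropping to narrow stripes right away instead.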
-
- Aug 22, 2024
-
-
MS armasm64 cannot compile some SVE instructions with immediate operands, e.g.:
    sub z0.h, z0.h, #8192
The proper form is:
    sub z0.h, z0.h, #32, lsl #8
This patch contains the needed fixes.
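The two spellings above denote the same value, since 8192 = 32 << 8. As a small sketch of how such an immediate splits into an 8-bit value plus an optional `lsl #8`, the C helper below models the split; the type and function names are hypothetical and this is not part of dav1d or of any assembler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: an 8-bit immediate and an optional "lsl #8". */
typedef struct {
    bool ok;       /* true if the value can be written in this form */
    uint8_t imm8;  /* the 8-bit immediate */
    bool lsl8;     /* true if the immediate is shifted left by 8 */
} SveArithImm;

/* Split a value into imm8 or imm8 << 8, if possible. */
static SveArithImm sve_split_imm(uint32_t value) {
    SveArithImm r = { false, 0, false };
    if (value <= 255) {
        r.ok = true;
        r.imm8 = (uint8_t)value;
    } else if ((value & 0xff) == 0 && (value >> 8) <= 255) {
        r.ok = true;
        r.imm8 = (uint8_t)(value >> 8);
        r.lsl8 = true;
    }
    return r;
}
```

For 8192 this yields `imm8 == 32` with `lsl8` set, which corresponds to the `#32, lsl #8` form quoted in the commit message; a value such as 8193 has no such split.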
-
Martin Storsjö authored
Don't include the BTI landing pad instruction in the loops. If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to a no-op instruction that indicates that indirect jumps can land there. But there's no need for the loops to include that instruction.
-
Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2.

This patch renames HBD prep/put_neon to prep/put_16bpc_neon and exports put_16bpc_neon.

Benchmarks show up-to 17% FPS increase depending on the input video and the CPU used.

This patch will increase the .text by around 8 KiB.

Relative performance to the C reference on some Cortex-A/X CPUs:

regular           A715    A720      X3      X4    A510    A520
w4 hv neon:      3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
w4 hv sve2:      4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
w8 hv neon:      1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
w8 hv sve2:      2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:     1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:     1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:     1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:     1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:     1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:     1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
w4 h neon:       5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
w4 h sve2:       7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
w8 h neon:       4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
w8 h sve2:       6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:      2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:      3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:      2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:      2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:      1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:      2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
w4 v neon:       5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
w4 v sve2:       5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
w8 v neon:       4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
w8 v sve2:       5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:      2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:      2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:      2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:      2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:      1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:      2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
w4 0 neon:       1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
w4 0 sve2:       4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
w8 0 neon:       3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
w8 0 sve2:       3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:      2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:      4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:      2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:      4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:      2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:      4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

sharp             A715    A720      X3      X4    A510    A520
w4 hv neon:      3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
w4 hv sve2:      4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
w8 hv neon:      1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
w8 hv sve2:      1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:     1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:     1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:     1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:     1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:     1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:     1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
w4 h neon:       5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
w4 h sve2:       7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
w8 h neon:       3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
w8 h sve2:       5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:      1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:      3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:      1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:      2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:      1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:      2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
w4 v neon:       4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
w4 v sve2:       5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
w8 v neon:       3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
w8 v sve2:       5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:      1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:      2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:      1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:      2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:      1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:      2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
-
- Aug 21, 2024
-
-
Arpad Panyik authored
Add 6-tap variant of standard bit-depth horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with 6-tap horizontal pass using USMMLA.

Benchmarks show up-to 6-7% FPS increase depending on the input video and the CPU used.

This patch will increase the .text by around 1.2 KiB.

Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores:

regular        V2      V1      X3    A720    A715    A520    A510
w8 hv:     0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
w16 hv:    0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
w32 hv:    0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
w64 hv:    0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
w8 h:      0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
w16 h:     0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
w32 h:     0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
w64 h:     0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
-
- Aug 12, 2024
-
-
Arpad Panyik authored
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a pointer instead of \lsrc. They refer to the same register, but in different contexts.
-
- Aug 04, 2024
-
-
Kyle Siefring authored
Performance Impact on Sapphire Rapids: Chimera: 0.46% Faster
-
- Jun 26, 2024
-
-
Arpad Panyik authored
The constants used for the subpel filters were placed in the .text section for simplicity and peak performance, but this does not work on systems with execute only .text sections (e.g.: OpenBSD). The performance cost of moving the constants to the .rodata section is small and mostly within the measurable noise.
-
- Jun 25, 2024
-
-
Martin Storsjö authored
The ldr instruction can only handle offsets that are a multiple of the element size; most assemblers implicitly produce the ldur instruction when a non-aligned offset is provided. Older versions of MS armasm64, however, error out on this. Since MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7 and earlier require explicitly writing the instruction as ldur.

Despite this, even older versions still fail to build the mc_dotprod.S sources, with errors like this:

src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
    mov x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer accept the negative value expression here.

In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure script at the moment, as it uses inline assembly to test for external assembler features.
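The choice between ldr and ldur that most assemblers make implicitly can be sketched as a small C decision function. This is a simplified model based on the A64 encodings for plain immediate-offset loads (scaled unsigned 12-bit offset for ldr, unscaled signed 9-bit offset for ldur); it ignores pre/post-indexed and register-offset forms, and the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical labels for the two immediate-offset load encodings. */
typedef enum {
    ENC_LDR_SCALED,     /* ldr: unsigned offset, multiple of the access size */
    ENC_LDUR_UNSCALED,  /* ldur: signed byte offset in [-256, 255] */
    ENC_NONE            /* neither form can express the offset */
} LoadEnc;

/* Pick an encoding for a load of `size` bytes at byte offset `offset`. */
static LoadEnc pick_load_encoding(int64_t offset, int size) {
    if (offset >= 0 && offset % size == 0 && offset / size <= 4095)
        return ENC_LDR_SCALED;
    if (offset >= -256 && offset <= 255)
        return ENC_LDUR_UNSCALED;
    return ENC_NONE;
}
```

In this model an 8-byte load at offset 4 or -8 falls through to the ldur form, which is the case that armasm64 versions before MSVC 2022 17.8 require to be written explicitly.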
-
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
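As a rough illustration of what such run-time detection can look like on Linux, the sketch below maps the kernel's hwcap bits to feature flags. It is not the code from the commit: the `SKETCH_` names are hypothetical, the bit positions are copied from the Linux arm64 uapi header as an assumption (HWCAP_ASIMDDP as bit 20 of AT_HWCAP, HWCAP2_I8MM as bit 13 of AT_HWCAP2), and other operating systems need different mechanisms.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#endif

/* Hypothetical feature flags for this sketch. */
enum {
    SKETCH_CPU_FLAG_DOTPROD = 1 << 0,
    SKETCH_CPU_FLAG_I8MM    = 1 << 1,
};

/* Assumed bit positions, mirroring the Linux arm64 hwcap definitions. */
#define SKETCH_HWCAP_ASIMDDP (1UL << 20)
#define SKETCH_HWCAP2_I8MM   (1UL << 13)

/* Pure mapping from hwcap words to flags, usable on any host. */
static unsigned flags_from_hwcap(unsigned long hwcap, unsigned long hwcap2) {
    unsigned flags = 0;
    if (hwcap & SKETCH_HWCAP_ASIMDDP)
        flags |= SKETCH_CPU_FLAG_DOTPROD;
    if (hwcap2 & SKETCH_HWCAP2_I8MM)
        flags |= SKETCH_CPU_FLAG_I8MM;
    return flags;
}

/* Query the running kernel where possible; report nothing elsewhere. */
static unsigned detect_cpu_flags(void) {
#if defined(__linux__) && defined(__aarch64__)
    return flags_from_hwcap(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Keeping the bit-to-flag mapping separate from the `getauxval` call makes the logic checkable on hosts that are not AArch64 Linux.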
-
Henrik Gramner authored
-
- Jun 17, 2024
-
-
Ronald S. Bultje authored
-