- Sep 01, 2024
-
-
Cameron Cawley authored
-
- Aug 30, 2024
-
-
- Aug 29, 2024
-
-
-
-
-
-
Martin Storsjö authored
This should allow executing in environments where the executable memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support relocations for a 4 byte symbol difference across sections, which allows keeping the rest of the table lookup code similar to what it was before.

Referencing a symbol in an arbitrary location in the executable requires a two instruction sequence (adrp+add, via the movrel macro).

Thus, the cost of this rewrite is doubling the size of the jump tables (which were quite small so far), and adding one instruction in each jump table setup prologue.

On an ELF build, the .text section shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes, i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load (using ldrsw rather than ldr, to avoid using the "sxtw" modifier on the add instruction), as extending ALU arithmetics have a higher latency.

MS armasm64 doesn't seem to support calculating symbol differences across sections (see [1]), so keep the jump tables in the text section there, to let the assembler calculate it at assembly time instead. (Keeping the condition as _WIN32 for simplicity, as we don't interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340
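The lookup described above can be modelled in C as a table of signed 32-bit offsets relative to a base address. This is only an illustrative sketch: `jump_target`, `base` and `table` are hypothetical names, and the real code is AArch64 assembly rather than C.

```c
#include <stddef.h>
#include <stdint.h>

/* Resolve entry `idx` of a table of signed 32-bit offsets.
 * Each offset is relative to `base` (hypothetical layout for this sketch). */
static const unsigned char *jump_target(const unsigned char *base,
                                        const int32_t *table, size_t idx)
{
    /* Widen to 64 bits as part of the load, which is the role ldrsw plays
     * in the assembly; the following add then needs no extension. */
    int64_t off = table[idx];
    return base + off;
}
```

Because the entries are signed, a target may lie before or after the base, which is what makes a cross-section symbol difference usable as a table entry.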
-
Martin Storsjö authored
-
-
-
- Aug 26, 2024
-
-
Martin Storsjö authored
WinSDK 10.0.26100 added these processor feature constants. Unfortunately, no constant was added for I8MM, but if SVE_I8MM is available, we can at least be sure that regular I8MM is available too.
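A minimal sketch of the inference described above, assuming the SDK constant is spelled `PF_ARM_SVE_I8MM_INSTRUCTIONS_AVAILABLE` (an assumption, which is why it is guarded with `#ifdef`). The function name `have_i8mm` is hypothetical and is not taken from the patch.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if I8MM can be assumed available, 0 if unknown or absent. */
static int have_i8mm(void)
{
#if defined(_WIN32) && defined(PF_ARM_SVE_I8MM_INSTRUCTIONS_AVAILABLE)
    /* There is no constant for plain I8MM, but SVE_I8MM implies it. */
    return IsProcessorFeaturePresent(PF_ARM_SVE_I8MM_INSTRUCTIONS_AVAILABLE) != 0;
#else
    /* Older SDK or non-Windows build: nothing can be concluded here. */
    return 0;
#endif
}
```

Guarding on the macro keeps the code building against SDKs older than 10.0.26100, where the constant is not defined.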
-
- Aug 24, 2024
-
-
Martin Storsjö authored
Apparently, this case isn't actually ever executed, at least in most checkasm runs, but some tools could complain about the relocation against 160b, which pointed elsewhere than intended.
-
- Aug 23, 2024
-
-
Martin Storsjö authored
This does the same optimizations as 3329f8d1 and 1790e132 on the rest of the code.
-
Martin Storsjö authored
This makes the code behave as intended, when filling a rectangle with arbitrary width (filling with the largest power of two width until filled); previously, it accidentally fell back on writing 4 pixel wide stripes immediately. No measurable effect on checkasm benchmarks though.
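The intended behaviour can be sketched in C as follows. This is a simplified model, not the assembly from the patch: `fill_rect` and `largest_pow2_le` are hypothetical names, and the pixels are assumed to be 8-bit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Largest power of two that is <= n, for n >= 1. */
static int largest_pow2_le(int n)
{
    int p = 1;
    while (p * 2 <= n)
        p *= 2;
    return p;
}

/* Fill a w x h rectangle with `val`, one power-of-two wide stripe at a time.
 * Returns the number of stripes written. */
static int fill_rect(uint8_t *dst, ptrdiff_t stride, int w, int h, uint8_t val)
{
    int stripes = 0;
    for (int x = 0; x < w; ) {
        const int sw = largest_pow2_le(w - x); /* widest stripe that still fits */
        for (int y = 0; y < h; y++)
            memset(dst + y * stride + x, val, (size_t)sw);
        x += sw;
        stripes++;
    }
    return stripes;
}
```

For a width of 22, this writes stripes of 16, 4 and 2 pixels, rather than falling back to narrow stripes straight away.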
-
- Aug 22, 2024
-
-
MS armasm64 cannot compile some SVE instructions with immediate operands, e.g.:

    sub z0.h, z0.h, #8192

The proper form is:

    sub z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
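The rewrite of the immediate is plain arithmetic: 8192 equals 32 shifted left by 8. A small C sketch of that split, assuming the immediate is an unsigned 8-bit value with an optional left shift by 8 (`split_sve_imm` is a hypothetical helper, not part of the patch):

```c
/* Split `value` into an 8-bit immediate and a shift of 0 or 8.
 * Returns 1 on success, 0 if the value cannot be expressed that way. */
static int split_sve_imm(unsigned value, unsigned *imm8, unsigned *shift)
{
    if (value <= 255) {
        *imm8 = value;
        *shift = 0;
        return 1;
    }
    if ((value & 0xff) == 0 && (value >> 8) <= 255) {
        *imm8 = value >> 8; /* e.g. 8192 >> 8 == 32 */
        *shift = 8;
        return 1;
    }
    return 0;
}
```

Writing the shifted form explicitly spares the assembler from having to perform this split itself.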
-
Martin Storsjö authored
Don't include the BTI landing pad instruction in the loops. If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to a no-op instruction that indicates that indirect jumps can land there. But there's no need for the loops to include that instruction.
-
Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2.

This patch renames HBD prep/put_neon to prep/put_16bpc_neon and exports put_16bpc_neon.

Benchmarks show up-to 17% FPS increase depending on the input video and the CPU used.

This patch will increase the .text by around 8 KiB.

Relative performance to the C reference on some Cortex-A/X CPUs:

regular          A715    A720      X3      X4    A510    A520
w4 hv neon:     3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
w4 hv sve2:     4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
w8 hv neon:     1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
w8 hv sve2:     2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x

w4 h neon:      5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
w4 h sve2:      7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
w8 h neon:      4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
w8 h sve2:      6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x

w4 v neon:      5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
w4 v sve2:      5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
w8 v neon:      4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
w8 v sve2:      5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x

w4 0 neon:      1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
w4 0 sve2:      4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
w8 0 neon:      3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
w8 0 sve2:      3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

sharp            A715    A720      X3      X4    A510    A520
w4 hv neon:     3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
w4 hv sve2:     4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
w8 hv neon:     1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
w8 hv sve2:     1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x

w4 h neon:      5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
w4 h sve2:      7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
w8 h neon:      3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
w8 h sve2:      5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x

w4 v neon:      4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
w4 v sve2:      5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
w8 v neon:      3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
w8 v sve2:      5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
-
- Aug 21, 2024
-
-
Arpad Panyik authored
Add 6-tap variant of standard bit-depth horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with 6-tap horizontal pass using USMMLA.

Benchmarks show up-to 6-7% FPS increase depending on the input video and the CPU used.

This patch will increase the .text by around 1.2 KiB.

Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores:

regular        V2      V1      X3    A720    A715    A520    A510
w8 hv:     0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
w16 hv:    0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
w32 hv:    0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
w64 hv:    0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x

w8 h:      0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
w16 h:     0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
w32 h:     0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
w64 h:     0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
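These figures are relative runtimes, so lower is better. A rough way to read them, sketched in C with hypothetical helper names: a relative runtime r corresponds to a speedup of 1/r and to (1 - r) * 100 percent less time.

```c
/* Convert a relative runtime (new time / old time) into a speedup factor. */
static double speedup_from_relative_runtime(double r)
{
    return 1.0 / r;
}

/* Percentage of the original runtime that is saved. */
static double percent_time_saved(double r)
{
    return (1.0 - r) * 100.0;
}
```

For the "w8 h" entry on V2 (0.746x), this gives a speedup of roughly 1.34x, or about 25% less time.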
-
- Aug 12, 2024
-
-
Arpad Panyik authored
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a pointer instead of \lsrc. They refer to the same register, but in different contexts.
-
- Aug 04, 2024
-
-
Kyle Siefring authored
Performance Impact on Sapphire Rapids: Chimera: 0.46% Faster
-
- Jun 26, 2024
-
-
Arpad Panyik authored
The constants used for the subpel filters were placed in the .text section for simplicity and peak performance, but this does not work on systems with execute-only .text sections (e.g. OpenBSD). The performance cost of moving the constants to the .rodata section is small and mostly within measurement noise.
-
- Jun 25, 2024
-
-
Martin Storsjö authored
The ldr instruction can only handle offsets that are a multiple of the element size; most assemblers implicitly produce the ldur instruction when a non-aligned offset is provided. Older versions of MS armasm64, however, error out on this.

Since MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7 and earlier require explicitly writing the instruction as ldur.

Despite this, even older versions still fail to build the mc_dotprod.S sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range

        mov x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer accept the negative value expression here.

In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure script at the moment, as it uses inline assembly to test for external assembler features.
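To see why this operand counts as a negative value expression, the constant can be evaluated by hand. The C sketch below mirrors the expression from the error message; `pack_offsets` is a hypothetical name, and a multiplication by 128 stands in for `<<7` because left-shifting a negative value is undefined behaviour in C.

```c
#include <stdint.h>

/* Mirrors ((hi << 7) | lo) from the assembler expression. */
static int64_t pack_offsets(int64_t hi, int64_t lo)
{
    return (hi * 128) | lo;
}
```

With hi = 0*15-1 = -1 and lo = 3*15-1 = 44, the result is -84, i.e. 0xFFFFFFFFFFFFFFAC as a 64-bit pattern.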
-
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
-
Henrik Gramner authored
-
- Jun 17, 2024
-
-
Ronald S. Bultje authored
-
- Jun 10, 2024
-
-
Nathan E. Egge authored
-
- Jun 05, 2024
-
-
Arpad Panyik authored
The DotProd/I8MM horizontal and HV/2D subpel filters use a -4 offset for sampling instead of -3, to be better aligned in some cases. This resulted in an out-of-bounds access, which led to crashes. This patch fixes it.
-
- May 27, 2024
-
-
Henrik Gramner authored
-
Henrik Gramner authored
The conditions for when to (re)allocate those buffers are identical, so they can be merged into a single branch. The allocation of the buffers themselves can also be combined to reduce the number of allocation calls.
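A minimal sketch of the idea in C, using a hypothetical `struct frame_bufs` with two buffers that share one size condition. The struct, its field names and `resize_bufs` are illustrative only and do not come from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct frame_bufs {
    int16_t *coef; /* start of the single allocation */
    uint8_t *mask; /* points into the same allocation, after coef */
    size_t n;      /* number of elements in each buffer */
};

/* (Re)allocate both buffers under one condition, with one malloc call.
 * Returns 0 on success, -1 on allocation failure.
 * Overflow of the size computation is not handled in this sketch. */
static int resize_bufs(struct frame_bufs *f, size_t n)
{
    if (n == f->n)
        return 0;   /* one branch covers both buffers */
    free(f->coef);  /* frees the whole block */
    f->coef = NULL;
    f->mask = NULL;
    f->n = 0;
    if (n == 0)
        return 0;
    int16_t *block = malloc(n * sizeof(*block) + n * sizeof(uint8_t));
    if (!block)
        return -1;
    f->coef = block;
    f->mask = (uint8_t *)(block + n); /* byte buffer follows the 16-bit one */
    f->n = n;
    return 0;
}
```

Placing the more strictly aligned buffer first keeps both pointers suitably aligned without extra padding.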
-
Henrik Gramner authored
It's only ever called on data which has already been zero-initialized.
-
Henrik Gramner authored
n_tc is always >= n_fc, so we only need to check the latter.
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
The number of nested macros caused by having to support SSE2 makes the code very difficult to maintain and modify. It is also of questionable value considering most other asm requires SSSE3.
-
Henrik Gramner authored
-
- May 25, 2024
-
-
Jean-Baptiste Kempf authored
-
- May 20, 2024
-
-
Use a slightly shorter series of instructions to compute cdf update rate.
-
Henrik Gramner authored
Error out early instead of producing bogus mismatch errors, for example in case of an incorrect CPU mask.
-
- May 19, 2024
-
-
Martin Storsjö authored
The ldr instruction can take an immediate offset which is a multiple of the loaded element size. If the ldr instruction is given an immediate offset which isn't a multiple of the element size, most assemblers implicitly generate a "ldur" instruction instead. Older versions of MS armasm64.exe don't do this, but instead error out with "error A2518: operand 2: Memory offset must be aligned". (Current versions don't do this but correctly generate "ldur" implicitly.) Switch this instruction to an explicit "ldur", like we do elsewhere, to fix building with these older tools.
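The choice an assembler makes here can be sketched in C. The ranges below are the AArch64 encodings as I understand them: a 12-bit unsigned offset scaled by the access size for ldr, and a 9-bit signed byte offset for ldur. The names `pick_load` and `enum load_form` are hypothetical.

```c
enum load_form { LOAD_NONE, LOAD_LDR, LOAD_LDUR };

/* Pick an encoding for a load of `size` bytes at immediate offset `off`. */
static enum load_form pick_load(long off, int size)
{
    /* ldr: non-negative multiple of the access size, scaled into 12 bits. */
    if (off >= 0 && off % size == 0 && off / size <= 4095)
        return LOAD_LDR;
    /* ldur: any byte offset that fits a signed 9-bit field. */
    if (off >= -256 && off <= 255)
        return LOAD_LDUR;
    return LOAD_NONE;
}
```

An 8-byte load at offset 3 therefore has to become ldur; writing ldur explicitly removes the dependency on the assembler making that substitution.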
-
- May 18, 2024
-
-
NDK 26 dropped support for API versions 19 and 20 (KitKat, Android 4.4). The minimum supported API is now 21 (Lollipop, Android 5.0).
-