- Mar 12, 2025
Konstantinos Margaritis authored
Provide implementations for functions using the SDOT/UDOT instructions from the Armv8 DotProd extension.
Functions implemented: sad_16x8, sad_16x16, sad_x3_16x8_neon, sad_x3_16x16_neon, sad_x4_16x8_neon, sad_x4_16x16_neon, ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16, pixel_vsad.
Performance improvement over Neon ranges from 5% to 188%. Following is the output of ./checkasm8 --bench (run on a Graviton4 system):

    sad_16x8_c: 1323
    sad_16x8_neon: 224
    sad_16x8_dotprod: 211
    sad_16x16_c: 2619
    sad_16x16_neon: 365
    sad_16x16_dotprod: 320
    sad_x3_16x8_c: 3836
    sad_x3_16x8_neon: 403
    sad_x3_16x8_dotprod: 317
    sad_x3_16x16_c: 7725
    sad_x3_16x16_neon: 714
    sad_x3_16x16_dotprod: 532
    sad_x4_16x8_c: 5080
    sad_x4_16x8_neon: 438
    sad_x4_16x8_dotprod: 375
    sad_x4_16x16_c: 10260
    sad_x4_16x16_neon: 794
    sad_x4_16x16_dotprod: 655
    ssd_8x4_c: 381
    ssd_8x4_neon: 157
    ssd_8x4_dotprod: 115
    ssd_8x4_sve: 150
    ssd_8x8_c: 695
    ssd_8x8_neon: 238
    ssd_8x8_dotprod: 161
    ssd_8x8_sve: 228
    ssd_8x16_c: 1335
    ssd_8x16_neon: 388
    ssd_8x16_dotprod: 267
    ssd_16x8_c: 1342
    ssd_16x8_neon: 285
    ssd_16x8_dotprod: 166
    ssd_16x16_c: 2623
    ssd_16x16_neon: 503
    ssd_16x16_dotprod: 277
    vsad_c: 2786
    vsad_neon: 311
    vsad_dotprod: 235
-
Martin Storsjö authored
-
Martin Storsjö authored
Also add code for detecting them on Linux.
-
Martin Storsjö authored
-
Martin Storsjö authored
By using .arch_extension (if supported) to enable the relevant extensions, we can also disable them afterwards, so we can e.g. cleanly enable one extension only for one subsection of a file. This also makes it easier to enable various combinations of supported architecture extensions.
-
Martin Storsjö authored
This hasn't been needed for SVE/SVE2, as all toolchains have supported just enabling it via ".arch armv8.2-a+sve". For other arch extensions, like dotprod/i8mm, there are more combinations of toolchain bugs in slightly older toolchains; try to detect what is supported. Additionally, when more than one architecture extension is involved, we may want to enable/disable individual extensions one at a time, without needing to specify the full list in one single .arch statement. This is a preparatory commit for adding support for the dotprod/i8mm extensions. We intentionally don't add AS_ARCH_LEVEL to the CONFIG_HAVE list, as this define isn't prefixed with "HAVE_", and we don't use the define except in the case where we actually do set it. (It's not a regular 0/1 define like the others.)
-
Martin Storsjö authored
This requires adding the "-c" flag to ASFLAGS before doing the check. This also makes sure to validate the gas-preprocessor is functional for MSVC configurations, by testing whether the "cmeq" instruction can be assembled at this point.
-
Martin Storsjö authored
This is more correct than using cc_check; we're going to assemble standalone external assembly, so check whether we can build it in that form rather than as inline assembly. This allows sharing checks with the MSVC codepath (where inline assembly isn't supported, and where assembly is built using a tool different from the regular compiler).
-
- Mar 11, 2025
Martin Storsjö authored
This updates the dependency information on each successive recompile. When building with MSVC, dependency information is generated with a separate command just like before, but now together with compiling each object file. (This is quite similar to how ffmpeg does the same.) This avoids the serial dependency generation step, which could take a notable amount of time in slow environments (in particular when using MSVC); this can now all be done in parallel. In one example, this reduces the time for a full build from clean with MSVC (wrapped in wine) from 23 seconds down to 9 seconds, thanks to parallelism. (For non-parallel builds, it doesn't make much of a difference.)
-
- Mar 04, 2025
Martin Storsjö authored
Previously, MSVC would warn that the .S source is unrecognized, and the script would only produce a dependency on the main source file itself.
-
- Jan 03, 2025
Anton Mitrofanov authored
-
- Dec 29, 2024
Brad Smith authored
https://android.googlesource.com/platform/bionic/+/72e6fd42421dca80fb2776a9185c186d4a04e5f7 Android has had sched_getaffinity since Android 3.0. Builds need to use _GNU_SOURCE.
-
Martin Storsjö authored
-
Brad Smith authored
Use __sync_fetch_and_add() wherever detected instead of being limited to just X86.
-
Use of hw.ncpu has long been deprecated.
-
- Nov 04, 2024
Brad Smith authored
-
- Oct 27, 2024
Brad Smith authored
Make use of _SC_NPROCESSORS_ONLN if it exists, and fall back to _SC_NPROCESSORS_CONF for really old operating systems. This adds support for retrieving the number of CPUs on a few more operating systems, such as NetBSD and DragonFly.
-
- Oct 26, 2024
- Oct 22, 2024
Anton Mitrofanov authored
fseeko() is not available before API 24 with _FILE_OFFSET_BITS=64. x264.c: x264cli.h must be included first, as it contains the _FILE_OFFSET_BITS define.
-
- Oct 20, 2024
Brad Smith authored
-
- Oct 17, 2024
Brad Smith authored
-
- Oct 07, 2024
Brad Smith authored
-
- Sep 17, 2024
Martin Storsjö authored
This is mostly supported in armasm64 since MSVC 2022 17.10.
-
- May 13, 2024
Henrik Gramner authored
PLT/GOT indirections are required in some cases. Most commonly when calling functions from other shared libraries, but also in some scenarios when calling functions with default symbol visibility even within the same component on certain elf64 platforms. On elf64, we can simply use PLT relocations for all calls to external functions. Since the linker is able to eliminate unnecessary PLT indirections, with the final output binary being identical to one using non-PLT relocations, there isn't really any downside to doing so. This mimics what regular compilers normally do for calls to external functions. On elf32 with PIC, we can use a function pointer from the GOT when calling external functions, similar to what regular compilers do when using -fno-plt. Since this both introduces overhead and clobbers one register, which could potentially have been used for custom calling conventions when calling other asm functions within the same library, it's only performed for functions declared using 'cextern_naked'.
-
- Mar 21, 2024
- Mar 14, 2024
Prior to this change, the scenario where the number of XMM registers spilled depends on whether a branch is taken was complicated to handle well. There were essentially three options:
1) Always spill the largest number of XMM registers. Results in unnecessary spills.
2) Do the spilling after the branch. Results in code duplication for the shared subset of spills.
3) Do the spilling manually. Optimal, but overly complex and vexing.
This adds an additional optional argument to the WIN64_SPILL_XMM and WIN64_PUSH_XMM macros to make it possible to allocate space for a certain number of registers but initially only push a subset of those, with the option of pushing additional registers later.
-
Allows the use of multiple independent stack allocations within a function without having to manually fiddle with stack offsets.
-
- Mar 12, 2024
Anton Mitrofanov authored
Use correct return type for pixel_sad_x3/x4 functions. Bug report by Dominik 'Rathann' Mierzejewski.
-
- Feb 28, 2024
This makes the code much simpler (especially for adding support for other instruction set extensions), avoids needing inline assembly for this feature, and is generally the more canonical way to do this.
The CPU feature detection was added in 9c3c7168, using HWCAP_CPUID. The argument for using that was that HWCAP_CPUID was added much earlier in the kernel (in Linux v4.11), while the HWCAP flags for individual features always come later. This allows detecting support for new CPU extensions before the kernel exposes information about them via hwcap flags. However, in practice there's probably quite little advantage in this. E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in v5.10, both later than HWCAP_CPUID, but there are probably very few practical cases where one would run a kernel older than that on a CPU that supports those instructions.
Additionally, we provide our own definitions of the flag values to check (as they are fixed constants anyway), with names not conflicting with the ones from system headers. This reduces the number of ifdefs needed, and allows detecting those features even if building with userland headers that lack the definitions of those flags.
Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04, do expose support for these features via HWCAP flags, but the emulated cpuid registers are missing the bits for exposing e.g. SVE2. (This issue is fixed in later versions of QEMU.)
Also drop the ifdef check for whether AT_HWCAP is defined; it was added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18, which also precedes when aarch64 came into common use, so don't guard the use of that with an ifdef either.
-
Anton Mitrofanov authored
Use same Docker images as VLC for contrib compilation.
-
Anton Mitrofanov authored
-
- Feb 19, 2024
Henrik Gramner authored
Automatically flag x86-64 asm object files as SHSTK-compatible. Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology (CET) which is a feature aimed at defending against ROP attacks by verifying that 'call' and 'ret' instructions are correctly matched. For well-written code this works transparently without any code changes, as return addresses popped from the shadow stack should match return addresses popped from the normal stack for performance reasons anyway.
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
-
Henrik Gramner authored
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
-