- Jan 03, 2025
Anton Mitrofanov authored
- Dec 29, 2024
Brad Smith authored
https://android.googlesource.com/platform/bionic/+/72e6fd42421dca80fb2776a9185c186d4a04e5f7 Android has had sched_getaffinity since Android 3.0. Builds need to use _GNU_SOURCE.
Martin Storsjö authored
Brad Smith authored
Use __sync_fetch_and_add() wherever detected instead of being limited to just X86.
Use of hw.ncpu has long been deprecated.
- Nov 04, 2024
Brad Smith authored
- Oct 27, 2024
Brad Smith authored
Make use of _SC_NPROCESSORS_ONLN if it exists and fall back to _SC_NPROCESSORS_CONF for really old operating systems. This adds support for retrieving the number of CPUs on a few OSes such as NetBSD, DragonFly and a few others.
- Oct 26, 2024
- Oct 22, 2024
Anton Mitrofanov authored
fseeko() is not available before API 24 with _FILE_OFFSET_BITS=64. In x264.c, x264cli.h must be included first since it contains the _FILE_OFFSET_BITS define.
- Oct 20, 2024
Brad Smith authored
- Oct 17, 2024
Brad Smith authored
- Oct 07, 2024
Brad Smith authored
- Sep 17, 2024
Martin Storsjö authored
This is mostly supported in armasm64 since MSVC 2022 17.10.
- May 13, 2024
Henrik Gramner authored
PLT/GOT indirections are required in some cases. Most commonly when calling functions from other shared libraries, but also in some scenarios when calling functions with default symbol visibility, even within the same component, on certain elf64 platforms.

On elf64 we can simply use PLT relocations for all calls to external functions. Since the linker is able to eliminate unnecessary PLT indirections, with the final output binary being identical to non-PLT relocations, there isn't really any downside to doing so. This mimics what regular compilers normally do for calls to external functions.

On elf32 with PIC we can use a function pointer from the GOT when calling external functions, similar to what regular compilers do when using -fno-plt. Since this both introduces overhead and clobbers one register, which could potentially have been used for custom calling conventions when calling other asm functions within the same library, it's only performed for functions declared using 'cextern_naked'.
- Mar 21, 2024
- Mar 14, 2024
Prior to this change, handling the scenario where the number of spilled XMM registers depends on whether or not a branch is taken was complicated to handle well. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in unnecessary spills.
2) Do the spilling after the branch. Results in code duplication for the shared subset of spills.
3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds an additional optional argument to the WIN64_SPILL_XMM and WIN64_PUSH_XMM macros to make it possible to allocate space for a certain number of registers but initially only push a subset of those, with the option of pushing additional registers later.
Allows the use of multiple independent stack allocations within a function without having to manually fiddle with stack offsets.
- Mar 12, 2024
Anton Mitrofanov authored
Use correct return type for pixel_sad_x3/x4 functions. Bug report by Dominik 'Rathann' Mierzejewski.
- Feb 28, 2024
This makes the code much simpler (especially for adding support for other instruction set extensions), avoids needing inline assembly for this feature, and generally is more of the canonical way to do this.

The CPU feature detection was added in 9c3c7168, using HWCAP_CPUID. The argument for using that was that HWCAP_CPUID was added much earlier in the kernel (in Linux v4.11), while the HWCAP flags for individual features always come later. This allows detecting support for new CPU extensions before the kernel exposes information about them via hwcap flags. In practice, however, there's probably quite little advantage in this. E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in v5.10, later than HWCAP_CPUID, but there are probably very few practical cases where one would run a kernel older than that on a CPU that supports those instructions.

Additionally, we provide our own definitions of the flag values to check (as they are fixed constants anyway), with names that don't conflict with the ones from system headers. This reduces the number of ifdefs needed and allows detecting those features even when building with userland headers that lack the definitions of those flags. Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04, do expose support for these features via HWCAP flags, but the emulated cpuid registers are missing the bits for exposing e.g. SVE2 (this issue is fixed in later versions of QEMU).

Also drop the ifdef check for whether AT_HWCAP is defined; it was added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18, which also precedes when aarch64 was commonly used anyway, so don't guard the use of that with an ifdef either.
Anton Mitrofanov authored
Use same Docker images as VLC for contrib compilation.
Anton Mitrofanov authored
- Feb 19, 2024
Henrik Gramner authored
Automatically flag x86-64 asm object files as SHSTK-compatible. Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology (CET), a feature aimed at defending against ROP attacks by verifying that 'call' and 'ret' instructions are correctly matched. For well-written code this works transparently without any code changes, since such code already keeps calls and returns matched for performance reasons, so return addresses popped from the shadow stack match those popped from the normal stack.
Henrik Gramner authored
Henrik Gramner authored
Henrik Gramner authored
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
Henrik Gramner authored
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax however requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
Henrik Gramner authored
- Jan 13, 2024
Anton Mitrofanov authored
- Nov 23, 2023
David Chen authored
Improve the performance of the NEON functions of aarch64/pixel-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results (checkasm cycle counts, lower is better).

Command executed: ./checkasm8 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  ssd_4x4_c: 235    ssd_4x4_neon: 226    ssd_4x4_sve: 151
  ssd_4x8_c: 409    ssd_4x8_neon: 363    ssd_4x8_sve: 201
  ssd_4x16_c: 781   ssd_4x16_neon: 653   ssd_4x16_sve: 313
  ssd_8x4_c: 402    ssd_8x4_neon: 192    ssd_8x4_sve: 192
  ssd_8x8_c: 728    ssd_8x8_neon: 275    ssd_8x8_sve: 275

Command executed: ./checkasm10 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  ssd_4x4_c: 256    ssd_4x4_neon: 226    ssd_4x4_sve: 153
  ssd_4x8_c: 460    ssd_4x8_neon: 369    ssd_4x8_sve: 215
  ssd_4x16_c: 852   ssd_4x16_neon: 651   ssd_4x16_sve: 340

Command executed: ./checkasm8 --bench=ssd
Testbed: AWS Graviton3
  ssd_4x4_c: 295    ssd_4x4_neon: 288    ssd_4x4_sve: 228
  ssd_4x8_c: 454    ssd_4x8_neon: 431    ssd_4x8_sve: 294
  ssd_4x16_c: 779   ssd_4x16_neon: 631   ssd_4x16_sve: 438
  ssd_8x4_c: 463    ssd_8x4_neon: 247    ssd_8x4_sve: 246
  ssd_8x8_c: 781    ssd_8x8_neon: 413    ssd_8x8_sve: 353

Command executed: ./checkasm10 --bench=ssd
Testbed: AWS Graviton3
  ssd_4x4_c: 322    ssd_4x4_neon: 335    ssd_4x4_sve: 240
  ssd_4x8_c: 522    ssd_4x8_neon: 448    ssd_4x8_sve: 294
  ssd_4x16_c: 832   ssd_4x16_neon: 603   ssd_4x16_sve: 440

Command executed: ./checkasm8 --bench=sa8d
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  sa8d_8x8_c: 2103  sa8d_8x8_neon: 619   sa8d_8x8_sve: 617

Command executed: ./checkasm8 --bench=sa8d
Testbed: AWS Graviton3
  sa8d_8x8_c: 2021  sa8d_8x8_neon: 597   sa8d_8x8_sve: 580

Command executed: ./checkasm8 --bench=var
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  var_8x8_c: 595    var_8x8_neon: 262    var_8x8_sve: 262
  var_8x16_c: 1193  var_8x16_neon: 435   var_8x16_sve: 419

Command executed: ./checkasm8 --bench=var
Testbed: AWS Graviton3
  var_8x8_c: 616    var_8x8_neon: 229    var_8x8_sve: 222
  var_8x16_c: 1207  var_8x16_neon: 399   var_8x16_sve: 389

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  hadamard_ac_8x8_c: 2330    hadamard_ac_8x8_neon: 635     hadamard_ac_8x8_sve: 635
  hadamard_ac_8x16_c: 4500   hadamard_ac_8x16_neon: 1152   hadamard_ac_8x16_sve: 1151
  hadamard_ac_16x8_c: 4499   hadamard_ac_16x8_neon: 1151   hadamard_ac_16x8_sve: 1150
  hadamard_ac_16x16_c: 8812  hadamard_ac_16x16_neon: 2187  hadamard_ac_16x16_sve: 2186

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: AWS Graviton3
  hadamard_ac_8x8_c: 2266    hadamard_ac_8x8_neon: 517     hadamard_ac_8x8_sve: 513
  hadamard_ac_8x16_c: 4444   hadamard_ac_8x16_neon: 867    hadamard_ac_8x16_sve: 849
  hadamard_ac_16x8_c: 4443   hadamard_ac_16x8_neon: 880    hadamard_ac_16x8_sve: 868
  hadamard_ac_16x16_c: 8595  hadamard_ac_16x16_neon: 1656  hadamard_ac_16x16_sve: 1622
David Chen authored
Place the NEON pixel-a macros and constants that are also intended for use by the SVE/SVE2 functions in a common file.
David Chen authored
Improve the performance of the NEON functions of aarch64/mc-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=avg
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  avg_4x2_c: 274    avg_4x2_neon: 215    avg_4x2_sve: 171
  avg_4x4_c: 461    avg_4x4_neon: 343    avg_4x4_sve: 225
  avg_4x8_c: 806    avg_4x8_neon: 619    avg_4x8_sve: 334
  avg_4x16_c: 1523  avg_4x16_neon: 1168  avg_4x16_sve: 558

Command executed: ./checkasm8 --bench=avg
Testbed: AWS Graviton3
  avg_4x2_c: 267    avg_4x2_neon: 213    avg_4x2_sve: 167
  avg_4x4_c: 467    avg_4x4_neon: 350    avg_4x4_sve: 221
  avg_4x8_c: 784    avg_4x8_neon: 624    avg_4x8_sve: 302
  avg_4x16_c: 1445  avg_4x16_neon: 1182  avg_4x16_sve: 485
David Chen authored
Place the NEON mc-a macros and functions that are also intended for use by the SVE/SVE2 functions in a common file.
- Nov 20, 2023
David Chen authored
Improve the performance of the NEON functions of aarch64/deblock-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  deblock_chroma[1]_c: 735  deblock_chroma[1]_neon: 427  deblock_chroma[1]_sve: 353

Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
  deblock_chroma[1]_c: 719  deblock_chroma[1]_neon: 442  deblock_chroma[1]_sve: 345
David Chen authored
Place the NEON deblock-a macros that are also intended for use by the SVE/SVE2 functions in a common file.
David Chen authored
Improve the performance of the NEON functions of aarch64/dct-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  sub4x4_dct_c: 528  sub4x4_dct_neon: 322  sub4x4_dct_sve: 247

Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
  sub4x4_dct_c: 562  sub4x4_dct_neon: 376  sub4x4_dct_sve: 255

Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  add4x4_idct_c: 698  add4x4_idct_neon: 386  add4x4_idct_sve2: 345

Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
  zigzag_interleave_8x8_cavlc_frame_c: 582
  zigzag_interleave_8x8_cavlc_frame_neon: 273
  zigzag_interleave_8x8_cavlc_frame_sve: 257

Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
  zigzag_interleave_8x8_cavlc_frame_c: 587
  zigzag_interleave_8x8_cavlc_frame_neon: 257
  zigzag_interleave_8x8_cavlc_frame_sve: 249
- Nov 18, 2023
David Chen authored
Place the NEON dct-a macros that are also intended for use by the SVE/SVE2 functions in a common file.