Commits · master · VideoLAN / x264

Mar 12, 2025

Provide implementations for functions using the instructions SDOT/UDOT in the... · fe9e4a7f

Konstantinos Margaritis authored 1 month ago

Provide implementations for functions using the instructions SDOT/UDOT in the DotProd Armv8 extension.

Functions implemented:
sad_16x8, sad_16x16,
sad_x3_16x8_neon, sad_x3_16x16_neon,
sad_x4_16x8_neon, sad_x4_16x16_neon,
ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16,
pixel_vsad
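
For readers unfamiliar with the instruction, the following is a portable scalar model in C of how a UDOT-style dot product can accumulate SAD and SSD over 16-pixel-wide rows. It is not the assembly from this commit. The helper names (`udot_model`, `absdiff_model`, `sad_16xh_model`, `ssd_16xh_model`) are hypothetical, and pairing an absolute-difference step with a dot product against a vector of ones (SAD) or against itself (SSD) is an assumption about the general technique, not a description of the committed code.

```c
#include <stdint.h>

/* Scalar model of one UDOT step: each of the four 32-bit accumulator
 * lanes gains the dot product of four adjacent unsigned byte pairs. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
}

/* Per-byte absolute difference of two 16-pixel rows. */
static void absdiff_model(uint8_t d[16], const uint8_t *p, const uint8_t *q)
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)(p[i] > q[i] ? p[i] - q[i] : q[i] - p[i]);
}

/* SAD: dot product of |p - q| with a vector of ones. */
uint32_t sad_16xh_model(const uint8_t *p, int p_stride,
                        const uint8_t *q, int q_stride, int h)
{
    uint8_t ones[16], d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < 16; i++)
        ones[i] = 1;
    for (int y = 0; y < h; y++) {
        absdiff_model(d, p + y * p_stride, q + y * q_stride);
        udot_model(acc, d, ones);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* SSD: dot product of |p - q| with itself. */
uint32_t ssd_16xh_model(const uint8_t *p, int p_stride,
                        const uint8_t *q, int q_stride, int h)
{
    uint8_t d[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };
    for (int y = 0; y < h; y++) {
        absdiff_model(d, p + y * p_stride, q + y * q_stride);
        udot_model(acc, d, d);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In this model the horizontal reduction happens only once, after all rows have been accumulated into the four lanes.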

Performance improvement against Neon ranges from 5% to 188%.
Following is the output of ./checkasm8 --bench (run on a Graviton4 system):

sad_16x8_c: 1323
sad_16x8_neon: 224
sad_16x8_dotprod: 211
sad_16x16_c: 2619
sad_16x16_neon: 365
sad_16x16_dotprod: 320
sad_x3_16x8_c: 3836
sad_x3_16x8_neon: 403
sad_x3_16x8_dotprod: 317
sad_x3_16x16_c: 7725
sad_x3_16x16_neon: 714
sad_x3_16x16_dotprod: 532
sad_x4_16x8_c: 5080
sad_x4_16x8_neon: 438
sad_x4_16x8_dotprod: 375
sad_x4_16x16_c: 10260
sad_x4_16x16_neon: 794
sad_x4_16x16_dotprod: 655
ssd_8x4_c: 381
ssd_8x4_neon: 157
ssd_8x4_dotprod: 115
ssd_8x4_sve: 150
ssd_8x8_c: 695
ssd_8x8_neon: 238
ssd_8x8_dotprod: 161
ssd_8x8_sve: 228
ssd_8x16_c: 1335
ssd_8x16_neon: 388
ssd_8x16_dotprod: 267
ssd_16x8_c: 1342
ssd_16x8_neon: 285
ssd_16x8_dotprod: 166
ssd_16x16_c: 2623
ssd_16x16_neon: 503
ssd_16x16_dotprod: 277
vsad_c: 2786
vsad_neon: 311
vsad_dotprod: 235

fe9e4a7f

aarch64: Add runtime detection of extensions on Windows and macOS · 570f6c70
Martin Storsjö authored 1 month ago

570f6c70
aarch64: Add flags for runtime detection of dotprod and i8mm · 0e48d072
Martin Storsjö authored 1 month ago
```
Also add code for detecting them on Linux.
```
0e48d072
configure: Check for the dotprod and i8mm aarch64 extensions · fc4012fb
Martin Storsjö authored 1 month ago

fc4012fb

aarch64: Use configure detected directives for enabling SVE/SVE2 · 87044b21

Martin Storsjö authored 1 month ago

By using .arch_extension (if supported) to enable the relevant
extensions, we can also disable them afterwards, so we can e.g.
cleanly enable one extension only for one subsection of a file.

This also makes it easier to enable various combinations of
supported architecture extensions.

87044b21

configure: Check for .arch and .arch_extension for enabling aarch64 extensions · f87ca183

Martin Storsjö authored 1 month ago

This hasn't been needed for SVE/SVE2, as all toolchains have
supported just enabling it via ".arch armv8.2-a+sve". For other
arch extensions, like dotprod/i8mm, there's more combinations of
toolchain bugs in slightly older toolchains; try to detect what is
supported.

Additionally, when involving more than one architecture extension,
we may want to enable/disable individual extensions one at a time,
without needing to specify the full list in one single .arch
statement.

This is a preparatory commit for adding support for the dotprod/i8mm
extensions.

We intentionally don't add AS_ARCH_LEVEL to the CONFIG_HAVE list,
as this define isn't prefixed with "HAVE_", and we don't use the
define except in the case where we actually do set it. (It's not
a regular 0/1 define like the others.)

f87ca183

configure: Use as_check for the main check for whether NEON is supported · 72ce1cde

Martin Storsjö authored 1 month ago

This requires adding the "-c" flag to ASFLAGS before doing the
check.

This also makes sure to validate the gas-preprocessor is functional
for MSVC configurations, by testing whether the "cmeq" instruction
can be assembled at this point.

72ce1cde

configure: Use as_check for checking for aarch64 features · a0191bd8

Martin Storsjö authored 1 month ago

This is more correct than using cc_check; we're going to assemble
standalone external assembly - thus check for whether we can
build it in that form, not using inline assembly.

This allows sharing checks with the MSVC codepath (where inline
assembly isn't supported, and where assembly is built using
a tool different from the regular compiler).

a0191bd8

Mar 11, 2025

Makefile: Generate dependency information implicitly while compiling · 27d83708

Martin Storsjö authored 4 weeks ago

This updates the dependency information on each successive recompile.

When building with MSVC, dependency information is generated with
a separate command just like before, but done together with
compiling each object file. (This is quite similar to how ffmpeg does
the same.)

This avoids the serial dependency generation step. In slow
environments (in particular if using MSVC) it could take a notable
amount of time; this can now all be done in parallel.

In one example, this reduces the time for a full build from clean
with MSVC (wrapped in wine) from 23 seconds down to 9 seconds,
thanks to parallelism. (For non-parallel builds, it doesn't make
much of a difference.)

27d83708

Mar 04, 2025

msvsdepend: Allow using the script for .S sources too · c80f8a28

Martin Storsjö authored 4 weeks ago

Previously, MSVC would warn that the .S source is unrecognized,
and the script would only produce a dependency on the main source
file itself.

c80f8a28

Jan 03, 2025
- Bump dates to 2025 · 373697b4
  Anton Mitrofanov authored 2 months ago
  
  373697b4
Dec 29, 2024
- Use sched_getaffinity on Android · 52f7694d
  Brad Smith authored 5 months ago
```
https://android.googlesource.com/platform/bionic/+/72e6fd42421dca80fb2776a9185c186d4a04e5f7

Android has had sched_getaffinity since Android 3.0. Builds need
to use _GNU_SOURCE.
```
  52f7694d
- ci: Test compiling for Android · 450946f9
  Martin Storsjö authored 3 months ago
  
  450946f9
- Enable use of __sync_fetch_and_add() wherever detected instead of just X86 · a64111b1
  Brad Smith authored 3 months ago
```
Use __sync_fetch_and_add() wherever detected instead of being limited to
just X86.
```
  a64111b1
- Use sysctlbyname(3) hw.logicalcpu on macOS · 938601b9
  Brad Smith authored 5 months ago and Anton Mitrofanov committed 3 months ago
```
Use of hw.ncpu has long been deprecated.
```
  938601b9
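
The `__sync_fetch_and_add()` entry in the list above refers to a GCC/Clang builtin that atomically adds to a variable and returns the value it held before the addition. A minimal sketch, assuming a GCC-compatible compiler; the function name `counter_fetch_add` is hypothetical and not taken from x264:

```c
/* Atomically add `delta` to *counter and return the value it held before.
 * Relies on the GCC/Clang __sync builtin; no header is required. */
int counter_fetch_add(int *counter, int delta)
{
    return __sync_fetch_and_add(counter, delta);
}
```

Because the builtin is provided by the compiler and not by a particular CPU architecture, a configure-time check for it can apply to any target where it compiles and links.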
Nov 04, 2024
- aarch64: defines involving bit shifts should be unsigned · 023112c6
  Brad Smith authored 4 months ago
  
  023112c6
Oct 27, 2024

Make use of sysconf(3) _SC_NPROCESSORS_ONLN and _SC_NPROCESSORS_CONF · da14df55

Brad Smith authored 5 months ago

Make use of _SC_NPROCESSORS_ONLN if it exists and fall back to
_SC_NPROCESSORS_CONF for really old operating systems. This adds
support for retrieving the number of CPUs on more OSes, such as
NetBSD and DragonFly.
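
A minimal sketch of the fallback described here, assuming a POSIX system where the `_SC_*` names are visible to the preprocessor (as they are on glibc). The function name `cpu_count_sysconf` is hypothetical, and returning 1 when nothing can be queried is a choice made for this example only:

```c
#include <unistd.h>

/* Returns the number of CPUs reported by sysconf(3), or 1 if it cannot
 * be determined. Prefers the online count over the configured count. */
long cpu_count_sysconf(void)
{
    long n = -1;
#if defined(_SC_NPROCESSORS_ONLN)
    n = sysconf(_SC_NPROCESSORS_ONLN);
#elif defined(_SC_NPROCESSORS_CONF)
    n = sysconf(_SC_NPROCESSORS_CONF);
#endif
    return n > 0 ? n : 1;
}
```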

da14df55

Oct 26, 2024
- Use getauxval() on Linux and elf_aux_info() on FreeBSD/OpenBSD on arm/ppc · b1d2de88
  Brad Smith authored 5 months ago and Anton Mitrofanov committed 5 months ago
  
  b1d2de88
Oct 22, 2024

Fix build with Android NDK and API < 24 for 32-bit targets · 3a21e97b

Anton Mitrofanov authored 5 months ago

fseeko() is not available before API 24 with _FILE_OFFSET_BITS=64.
x264.c: x264cli.h must be first as it contains _FILE_OFFSET_BITS define.
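
The ordering requirement can be illustrated with a small sketch: the `_FILE_OFFSET_BITS` define only takes effect if it appears before the first system header is included. The function name `stream_size` is hypothetical and does not come from x264.

```c
/* Must precede every system header so that off_t is 64-bit on 32-bit targets. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Returns the size of an open stream in bytes, or -1 on error.
 * The original file position is restored before returning. */
long long stream_size(FILE *f)
{
    off_t pos = ftello(f);
    if (pos < 0 || fseeko(f, 0, SEEK_END) != 0)
        return -1;
    off_t end = ftello(f);
    if (fseeko(f, pos, SEEK_SET) != 0)
        return -1;
    return (long long)end;
}
```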

3a21e97b

Oct 20, 2024
- configure: Add DragonFly support · 80c1c47c
  Brad Smith authored 10 months ago
  
  80c1c47c
Oct 17, 2024
- Provide x264_getauxval() wrapper for getauxval() and elf_aux_info() · 1243d9ff
  Brad Smith authored 5 months ago
  
  1243d9ff
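
A sketch of what such a wrapper could look like, assuming `getauxval()` on Linux and `elf_aux_info()` on FreeBSD/OpenBSD, both declared in `<sys/auxv.h>`. The name `example_getauxval` and the choice to return 0 on failure or on unsupported platforms are assumptions for illustration; the actual x264 implementation may differ.

```c
#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__)
#include <sys/auxv.h>
#endif

/* Hypothetical stand-in for a wrapper such as x264_getauxval().
 * Returns the auxiliary vector entry for `type`, or 0 if unavailable. */
unsigned long example_getauxval(unsigned long type)
{
#if defined(__linux__)
    return getauxval(type);
#elif defined(__FreeBSD__) || defined(__OpenBSD__)
    unsigned long value = 0;
    if (elf_aux_info((int)type, &value, (int)sizeof(value)) != 0)
        return 0;
    return value;
#else
    (void)type;
    return 0;
#endif
}
```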
Oct 07, 2024
- aarch64: Use elf_aux_info() for CPU feature detection on FreeBSD/OpenBSD · 3a8b5be2
  Brad Smith authored 6 months ago
  
  3a8b5be2
Sep 17, 2024
- configure: Check for SVE support in MS armasm64 via as_check · c24e06c2
  Martin Storsjö authored 1 year ago
```
This is mostly supported in armasm64 since MSVC 2022 17.10.
```
  c24e06c2
May 13, 2024

x86inc: Improve ELF PIC support for external function calls · 4613ac3c

Henrik Gramner authored 10 months ago

PLT/GOT indirections are required in some cases. Most commonly when
calling functions from other shared libraries, but also in some
scenarios when calling functions with default symbol visibility
even within the same component on certain elf64 platforms.

On elf64 we can simply use PLT relocations for all calls to external
functions. Since the linker is able to eliminate unnecessary PLT
indirections with the final output binary being identical to non-PLT
relocations there isn't really any downside to doing so. This mimics
what regular compilers normally do for calls to external functions.

On elf32 with PIC we can use a function pointer from the GOT when
calling external functions, similar to what regular compilers do when
using -fno-plt. Since this both introduces overhead and clobbers one
register, which could potentially have been used for custom calling
conventions when calling other asm functions within the same library,
it's only performed for functions declared using 'cextern_naked'.

4613ac3c

Mar 21, 2024
- loongarch: Enhance ultrafast encoding performance · 7ed753b1
  guxiwei authored 1 year ago and guxiwei committed 1 year ago
```
Using the following command, ultrafast encoding
has improved from 182fps to 189fps:
./x264 --preset ultrafast -o out.mkv yuv_1920x1080.yuv
```
  7ed753b1
- loongarch: Fixed pixel_sa8d_16x16_lasx · 16262286
  guxiwei authored 1 year ago and guxiwei committed 1 year ago
```
Save and restore FPR
```
  16262286
- loongarch: Add checkasm_call · 5a61afdb
  guxiwei authored 1 year ago and guxiwei committed 1 year ago
  
  5a61afdb
- loongarch: Update loongson_asm.S version to 0.4.0 · 982d3240
  guxiwei authored 1 year ago and guxiwei committed 1 year ago
  
  982d3240
Mar 14, 2024

x86inc: Improve XMM-spilling functionality on 64-bit Windows · 585e0199

Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

Prior to this change dealing with the scenario where the number of
XMM registers spilled depends on if a branch is taken or not was
complicated to handle well. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in
   unnecessary spills.

2) Do the spilling after the branch. Results in code duplication
   for the shared subset of spills.

3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds an additional optional argument to the WIN64_SPILL_XMM
and WIN64_PUSH_XMM macros to make it possible to allocate space
for a certain number of registers but initially only push a subset
of those, with the option of pushing additional registers later.

585e0199

x86inc: Restore the stack state between stack allocations · 4df71a75
Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
Allows the use of multiple independent stack allocations within
a function without having to manually fiddle with stack offsets.
```
4df71a75
x86inc: Fix warnings with old nasm versions · 3d8aff7e
Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

3d8aff7e

Mar 12, 2024

ppc: Fix incompatible pointer type errors · de1bea53

Anton Mitrofanov authored 1 year ago

Use correct return type for pixel_sad_x3/x4 functions.
Bug report by Dominik 'Rathann' Mierzejewski.

de1bea53

Feb 28, 2024

aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux · be4f0200

Martin Storsjö authored 1 year ago and Anton Mitrofanov committed 1 year ago

This makes the code much simpler (especially for adding support
for other instruction set extensions), avoids needing inline
assembly for this feature, and generally is more of the canonical
way to do this.

The CPU feature detection was added in
9c3c7168, using HWCAP_CPUID.

The argument for using that was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features always come later. This allows detecting support
for new CPU extensions before the kernel exposes information about
them via hwcap flags.

However, in practice there's probably little advantage in this.
E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
v5.10 - later than HWCAP_CPUID, but there are probably very few
practical cases where one would run a kernel older than that on a CPU
that supports those instructions.

Additionally, we provide our own definitions of the flag values to
check (as they are fixed constants anyway), with names not conflicting
with the ones from system headers. This reduces the number of ifdefs
needed, and allows detecting those features even if building with
userland headers that are lacking the definitions of those flags.

Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose support for these features via HWCAP flags, but the
emulated cpuid registers are missing the bits for exposing e.g. SVE2.
(This issue is fixed in later versions of QEMU though.)

Also drop the ifdef check for whether AT_HWCAP is defined; it was
added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
which also precedes when aarch64 was commonly used anyway, so
don't guard the use of that with an ifdef.
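
The approach can be sketched as follows: define the flag values locally, as unsigned constants under names that cannot clash with system headers, and test them against the values returned by `getauxval()`. The `EX_`-prefixed names and both functions are hypothetical. The bit positions are those of the Linux arm64 hwcap ABI (ASIMDDP bit 20 and SVE bit 22 in AT_HWCAP; SVE2 bit 1 and I8MM bit 13 in AT_HWCAP2).

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#endif

/* Locally named copies of the kernel's fixed aarch64 hwcap bits.
 * Unsigned constants avoid shifting into the sign bit. */
#define EX_HWCAP_AARCH64_ASIMDDP (1UL << 20)
#define EX_HWCAP_AARCH64_SVE     (1UL << 22)
#define EX_HWCAP2_AARCH64_SVE2   (1UL << 1)
#define EX_HWCAP2_AARCH64_I8MM   (1UL << 13)

/* Hypothetical result bits used only by this example. */
enum {
    EX_CPU_DOTPROD = 1,
    EX_CPU_I8MM    = 2,
    EX_CPU_SVE     = 4,
    EX_CPU_SVE2    = 8
};

/* Pure decoding step: map raw hwcap words to the example's feature mask. */
unsigned ex_decode_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap  & EX_HWCAP_AARCH64_ASIMDDP) flags |= EX_CPU_DOTPROD;
    if (hwcap2 & EX_HWCAP2_AARCH64_I8MM)   flags |= EX_CPU_I8MM;
    if (hwcap  & EX_HWCAP_AARCH64_SVE)     flags |= EX_CPU_SVE;
    if (hwcap2 & EX_HWCAP2_AARCH64_SVE2)   flags |= EX_CPU_SVE2;
    return flags;
}

/* Runtime detection; reports no features on anything but aarch64 Linux. */
unsigned ex_detect_cpu(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return ex_decode_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Keeping the decoding step separate from the `getauxval()` call means it can be exercised on any host, without inline assembly or ifdefs around individual flags.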

be4f0200

CI: Switch 32/64-bit windows builds to LLVM · 7241d020
Anton Mitrofanov authored 1 year ago
```
Use same Docker images as VLC for contrib compilation.
```
7241d020
CI: Add config.log to job artifacts · ea08f586
Anton Mitrofanov authored 1 year ago

ea08f586

Feb 19, 2024

x86inc: Add support for ELF CET properties · 12426f5f

Henrik Gramner authored 1 year ago

Automatically flag x86-64 asm object files as SHSTK-compatible.

Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology
(CET) which is a feature aimed at defending against ROP attacks by
verifying that 'call' and 'ret' instructions are correctly matched.

For well-written code this works transparently without any code changes,
as return addresses popped from the shadow stack should match return
addresses popped from the normal stack for performance reasons anyway.

12426f5f

x86inc.asm: Add the crc32 SSE4.2 GPR instruction · 6fc4480c
Henrik Gramner authored 1 year ago

6fc4480c
x86inc: Add a cpu flag for the Ice Lake AVX-512 subset · 87476b4c
Henrik Gramner authored 1 year ago

87476b4c
x86inc: Add CLMUL cpu flag · a6b56179
Henrik Gramner authored 1 year ago
```
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
```
a6b56179

x86inc: Add template defines for EVEX broadcasts · 5207a74e

Henrik Gramner authored 1 year ago

Broadcasting a memory operand is a binary flag, you either broadcast
or you don't, and there's only a single possible element size for
any given instruction.

The instruction syntax however requires the broadcast semantics
to be explicitly defined, which is an issue when using macros to
template code for multiple register widths.

Add some helper defines to alleviate the issue.

5207a74e