  1. Mar 12, 2025
    • Konstantinos Margaritis's avatar
      Provide implementations for functions using the instructions SDOT/UDOT in the... · fe9e4a7f
      Konstantinos Margaritis authored
      Provide implementations for functions using the instructions SDOT/UDOT in the DotProd Armv8 extension.
      
      Functions implemented:
      sad_16x8, sad_16x16,
      sad_x3_16x8_neon, sad_x3_16x16_neon,
      sad_x4_16x8_neon, sad_x4_16x16_neon,
      ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16,
      pixel_vsad
      
      Performance improvement against Neon ranges from 5% to 188%.
      Following is the output of ./checkasm8 --bench (run on a Graviton4 system):
      
      sad_16x8_c: 1323
      sad_16x8_neon: 224
      sad_16x8_dotprod: 211
      sad_16x16_c: 2619
      sad_16x16_neon: 365
      sad_16x16_dotprod: 320
      sad_x3_16x8_c: 3836
      sad_x3_16x8_neon: 403
      sad_x3_16x8_dotprod: 317
      sad_x3_16x16_c: 7725
      sad_x3_16x16_neon: 714
      sad_x3_16x16_dotprod: 532
      sad_x4_16x8_c: 5080
      sad_x4_16x8_neon: 438
      sad_x4_16x8_dotprod: 375
      sad_x4_16x16_c: 10260
      sad_x4_16x16_neon: 794
      sad_x4_16x16_dotprod: 655
      ssd_8x4_c: 381
      ssd_8x4_neon: 157
      ssd_8x4_dotprod: 115
      ssd_8x4_sve: 150
      ssd_8x8_c: 695
      ssd_8x8_neon: 238
      ssd_8x8_dotprod: 161
      ssd_8x8_sve: 228
      ssd_8x16_c: 1335
      ssd_8x16_neon: 388
      ssd_8x16_dotprod: 267
      ssd_16x8_c: 1342
      ssd_16x8_neon: 285
      ssd_16x8_dotprod: 166
      ssd_16x16_c: 2623
      ssd_16x16_neon: 503
      ssd_16x16_dotprod: 277
      vsad_c: 2786
      vsad_neon: 311
      vsad_dotprod: 235
      fe9e4a7f
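To make the SDOT/UDOT idea above concrete, here is a minimal scalar sketch in C of how a dot-product instruction can accumulate SAD and SSD over a 16-pixel-wide block. It is not the assembly from this commit. The helper names (`udot_model`, `absdiff_row`, `sad_16xh_model`, `ssd_16xh_model`) are hypothetical, and pairing a per-byte absolute difference with a dot product against a vector of ones (for SAD) or against itself (for SSD) is an assumption about how such routines are commonly built.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of UDOT on a 128-bit vector: each of the four 32-bit
 * accumulator lanes gains the dot product of its four u8 pairs. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int i = 0; i < 4; i++)
            acc[lane] += (uint32_t)a[4 * lane + i] * b[4 * lane + i];
}

/* Per-byte absolute difference of one 16-pixel row. */
static void absdiff_row(uint8_t d[16], const uint8_t *p1, const uint8_t *p2)
{
    for (int x = 0; x < 16; x++)
        d[x] = (uint8_t)abs(p1[x] - p2[x]);
}

/* Hypothetical helper: SAD of a 16 x h block.
 * Dotting |a - b| with a vector of ones sums the absolute differences. */
static uint32_t sad_16xh_model(const uint8_t *p1, intptr_t s1,
                               const uint8_t *p2, intptr_t s2, int h)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };
    uint8_t ones[16], d[16];

    assert(h > 0);
    for (int x = 0; x < 16; x++)
        ones[x] = 1;
    for (int y = 0; y < h; y++, p1 += s1, p2 += s2) {
        absdiff_row(d, p1, p2);
        udot_model(acc, d, ones);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Hypothetical helper: SSD of a 16 x h block.
 * Dotting |a - b| with itself sums the squared differences. */
static uint32_t ssd_16xh_model(const uint8_t *p1, intptr_t s1,
                               const uint8_t *p2, intptr_t s2, int h)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };
    uint8_t d[16];

    assert(h > 0);
    for (int y = 0; y < h; y++, p1 += s1, p2 += s2) {
        absdiff_row(d, p1, p2);
        udot_model(acc, d, d);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In this model the widening and pairwise-add steps that a plain Neon version would need are folded into the single multiply-accumulate, which is one plausible source of the gains in the figures above.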
    • Martin Storsjö's avatar
      aarch64: Add flags for runtime detection of dotprod and i8mm · 0e48d072
      Martin Storsjö authored
      Also add code for detecting them on Linux.
      0e48d072
    • Martin Storsjö's avatar
      aarch64: Use configure detected directives for enabling SVE/SVE2 · 87044b21
      Martin Storsjö authored
      By using .arch_extension (if supported) to enable the relevant
      extensions, we can also disable them afterwards, so we can e.g.
      cleanly enable one extension only for one subsection of a file.
      
      This also makes it easier to enable various combinations of
      supported architecture extensions.
      87044b21
    • Martin Storsjö's avatar
      configure: Check for .arch and .arch_extension for enabling aarch64 extensions · f87ca183
      Martin Storsjö authored
      This hasn't been needed for SVE/SVE2, as all toolchains have
      supported just enabling it via ".arch armv8.2-a+sve". For other
      arch extensions, like dotprod/i8mm, there are more combinations of
      toolchain bugs in slightly older toolchains; try to detect what is
      supported.
      
      Additionally, when involving more than one architecture extension,
      we may want to enable/disable individual extensions one at a time,
      without needing to specify the full list in one single .arch
      statement.
      
      This is a preparatory commit for adding support for the dotprod/i8mm
      extensions.
      
      We intentionally don't add AS_ARCH_LEVEL to the CONFIG_HAVE list,
      as this define isn't prefixed with "HAVE_", and we don't use the
      define except in the case where we actually do set it. (It's not
      a regular 0/1 define like the others.)
      f87ca183
    • Martin Storsjö's avatar
      configure: Use as_check for the main check for whether NEON is supported · 72ce1cde
      Martin Storsjö authored
      This requires adding the "-c" flag to ASFLAGS before doing the
      check.
      
      This also makes sure to validate that gas-preprocessor is functional
      for MSVC configurations, by testing whether the "cmeq" instruction
      can be assembled at this point.
      72ce1cde
    • Martin Storsjö's avatar
      configure: Use as_check for checking for aarch64 features · a0191bd8
      Martin Storsjö authored
      This is more correct than using cc_check; we're going to assemble
      standalone external assembly - thus check for whether we can
      build it in that form, not using inline assembly.
      
      This allows sharing checks with the MSVC codepath (where inline
      assembly isn't supported, and where assembly is built using
      a tool different from the regular compiler).
      a0191bd8
  2. Mar 11, 2025
    • Martin Storsjö's avatar
      Makefile: Generate dependency information implicitly while compiling · 27d83708
      Martin Storsjö authored
      This updates the dependency information on each successive recompile.
      
      When building with MSVC, dependency information is generated with
      a separate command just like before, but done together with
      compiling each object file. (This is quite similar to how ffmpeg does
      the same.)
      
      This avoids the serial dependency generation step. In slow
      environments (in particular if using MSVC) it could take a notable
      amount of time; this can now all be done in parallel.
      
      In one example, this reduces the time for a full build from clean
      with MSVC (wrapped in wine) from 23 seconds down to 9 seconds,
      thanks to parallelism. (For non-parallel builds, it doesn't make
      much of a difference.)
      27d83708
  3. Mar 04, 2025
  4. Jan 03, 2025
  5. Dec 29, 2024
  6. Nov 04, 2024
  7. Oct 27, 2024
  8. Oct 26, 2024
  9. Oct 22, 2024
  10. Oct 20, 2024
  11. Oct 17, 2024
  12. Oct 07, 2024
  13. Sep 17, 2024
  14. May 13, 2024
    • Henrik Gramner's avatar
      x86inc: Improve ELF PIC support for external function calls · 4613ac3c
      Henrik Gramner authored
      PLT/GOT indirections are required in some cases. Most commonly when
      calling functions from other shared libraries, but also in some
      scenarios when calling functions with default symbol visibility
      even within the same component on certain elf64 platforms.
      
      On elf64 we can simply use PLT relocations for all calls to external
      functions. Since the linker is able to eliminate unnecessary PLT
      indirections with the final output binary being identical to non-PLT
      relocations there isn't really any downside to doing so. This mimics
      what regular compilers normally do for calls to external functions.
      
      On elf32 with PIC we can use a function pointer from the GOT when
      calling external functions, similar to what regular compilers do when
      using -fno-plt. Since this both introduces overhead and clobbers one
      register, which could potentially have been used for custom calling
      conventions when calling other asm functions within the same library,
      it's only performed for functions declared using 'cextern_naked'.
      4613ac3c
  15. Mar 21, 2024
  16. Mar 14, 2024
    • Henrik Gramner's avatar
      x86inc: Improve XMM-spilling functionality on 64-bit Windows · 585e0199
      Henrik Gramner authored and Henrik Gramner committed
      Prior to this change, the scenario where the number of XMM registers
      spilled depends on whether a branch is taken was complicated to
      handle well. There were essentially three options:
      
      1) Always spill the largest number of XMM registers. Results in
         unnecessary spills.
      
      2) Do the spilling after the branch. Results in code duplication
         for the shared subset of spills.
      
      3) Do the spilling manually. Optimal, but overly complex and vexing.
      
      This adds an additional optional argument to the WIN64_SPILL_XMM
      and WIN64_PUSH_XMM macros to make it possible to allocate space
      for a certain number of registers but initially only push a subset
      of those, with the option of pushing additional registers later.
      585e0199
    • Henrik Gramner's avatar
      x86inc: Restore the stack state between stack allocations · 4df71a75
      Henrik Gramner authored and Henrik Gramner committed
      Allows the use of multiple independent stack allocations within
      a function without having to manually fiddle with stack offsets.
      4df71a75
    • Henrik Gramner's avatar
      x86inc: Fix warnings with old nasm versions · 3d8aff7e
      Henrik Gramner authored and Henrik Gramner committed
      3d8aff7e
  17. Mar 12, 2024
  18. Feb 28, 2024
    • Martin Storsjö's avatar
      aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux · be4f0200
      Martin Storsjö authored and Anton Mitrofanov committed
      This makes the code much simpler (especially for adding support
      for other instruction set extensions), avoids needing inline
      assembly for this feature, and generally is more of the canonical
      way to do this.
      
      The CPU feature detection was added in
      9c3c7168, using HWCAP_CPUID.
      
      The argument for using that was that HWCAP_CPUID was added much
      earlier in the kernel (in Linux v4.11), while the HWCAP flags for
      individual features always come later. This allows detecting support
      for new CPU extensions before the kernel exposes information about
      them via hwcap flags.
      
      However, in practice there's probably little advantage in this.
      E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
      v5.10 - later than HWCAP_CPUID, but there are probably very few
      practical cases where one would run a kernel older than that on a CPU
      that supports those instructions.
      
      Additionally, we provide our own definitions of the flag values to
      check (as they are fixed constants anyway), with names not conflicting
      with the ones from system headers. This reduces the number of ifdefs
      needed, and allows detecting those features even if building with
      userland headers that are lacking the definitions of those flags.
      
      Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
      do expose support for these features via HWCAP flags, but the
      emulated cpuid registers are missing the bits for exposing e.g. SVE2.
      (This issue is fixed in later versions of QEMU, though.)
      
      Also drop the ifdef check for whether AT_HWCAP is defined; it was
      added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
      which also precedes when aarch64 was commonly used anyway, so
      don't guard the use of that with an ifdef.
      be4f0200
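The approach described in this commit (reading AT_HWCAP/AT_HWCAP2 and testing locally defined bit constants) can be sketched roughly as below. This is a minimal illustration, not the code from the commit. The `MY_`-prefixed names, the `flags_from_hwcaps` and `detect_cpu_flags` helpers, and the feature-flag enum are all hypothetical. The bit positions follow the Linux arm64 hwcap layout. `getauxval` is only called when building for Linux on aarch64, and every other target falls back to reporting no features.

```c
#include <stdint.h>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, AT_HWCAP2 */
#endif

/* Local copies of the Linux arm64 hwcap bits, named so that they cannot
 * clash with the system headers (hypothetical names). */
#define MY_HWCAP_ASIMDDP (1UL << 20)  /* dotprod: SDOT/UDOT */
#define MY_HWCAP_SVE     (1UL << 22)
#define MY_HWCAP2_SVE2   (1UL << 1)
#define MY_HWCAP2_I8MM   (1UL << 13)

/* Hypothetical feature flags returned to the caller. */
enum {
    MY_CPU_DOTPROD = 1 << 0,
    MY_CPU_I8MM    = 1 << 1,
    MY_CPU_SVE     = 1 << 2,
    MY_CPU_SVE2    = 1 << 3
};

/* Pure mapping from raw hwcap words to feature flags, kept separate
 * from the system call so it can be exercised on any host. */
static uint32_t flags_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    uint32_t flags = 0;

    if (hwcap & MY_HWCAP_ASIMDDP)
        flags |= MY_CPU_DOTPROD;
    if (hwcap2 & MY_HWCAP2_I8MM)
        flags |= MY_CPU_I8MM;
    if (hwcap & MY_HWCAP_SVE)
        flags |= MY_CPU_SVE;
    if (hwcap2 & MY_HWCAP2_SVE2)
        flags |= MY_CPU_SVE2;
    return flags;
}

/* Runtime detection: queries the kernel on Linux/aarch64, otherwise
 * reports no optional features. */
static uint32_t detect_cpu_flags(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return flags_from_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Keeping the constants local mirrors the point made above: the check still compiles against userland headers that predate a given flag, and no inline assembly is needed.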
    • Anton Mitrofanov's avatar
      CI: Switch 32/64-bit windows builds to LLVM · 7241d020
      Anton Mitrofanov authored
      Use same Docker images as VLC for contrib compilation.
      7241d020
    • Anton Mitrofanov's avatar
      CI: Add config.log to job artifacts · ea08f586
      Anton Mitrofanov authored
      ea08f586
  19. Feb 19, 2024