  1. Jan 03, 2025
  2. Dec 29, 2024
  3. Nov 04, 2024
  4. Oct 27, 2024
  5. Oct 26, 2024
  6. Oct 22, 2024
  7. Oct 20, 2024
  8. Oct 17, 2024
  9. Oct 07, 2024
  10. Sep 17, 2024
  11. May 13, 2024
    • x86inc: Improve ELF PIC support for external function calls · 4613ac3c
      Henrik Gramner authored
      PLT/GOT indirections are required in some cases. Most commonly when
      calling functions from other shared libraries, but also in some
      scenarios when calling functions with default symbol visibility
      even within the same component on certain elf64 platforms.
      
      On elf64 we can simply use PLT relocations for all calls to external
      functions. Since the linker is able to eliminate unnecessary PLT
      indirections with the final output binary being identical to non-PLT
      relocations there isn't really any downside to doing so. This mimics
      what regular compilers normally do for calls to external functions.
      
      On elf32 with PIC we can use a function pointer from the GOT when
      calling external functions, similar to what regular compilers do when
      using -fno-plt. Since this both introduces overhead and clobbers one
      register, which could potentially have been used for custom calling
      conventions when calling other asm functions within the same library,
      it's only performed for functions declared using 'cextern_naked'.
  12. Mar 21, 2024
  13. Mar 14, 2024
    • x86inc: Improve XMM-spilling functionality on 64-bit Windows · 585e0199
      Henrik Gramner authored and committed
      Prior to this change, dealing with the scenario where the number of
      XMM registers spilled depends on whether a branch is taken was
      complicated to handle well. There were essentially three options:
      
      1) Always spill the largest number of XMM registers. Results in
         unnecessary spills.
      
      2) Do the spilling after the branch. Results in code duplication
         for the shared subset of spills.
      
      3) Do the spilling manually. Optimal, but overly complex and vexing.
      
      This adds an additional optional argument to the WIN64_SPILL_XMM
      and WIN64_PUSH_XMM macros to make it possible to allocate space
      for a certain number of registers but initially only push a subset
      of those, with the option of pushing additional registers later.
    • x86inc: Restore the stack state between stack allocations · 4df71a75
      Henrik Gramner authored and committed
      Allows the use of multiple independent stack allocations within
      a function without having to manually fiddle with stack offsets.
    • x86inc: Fix warnings with old nasm versions · 3d8aff7e
      Henrik Gramner authored and committed
  14. Mar 12, 2024
  15. Feb 28, 2024
    • aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux · be4f0200
      Martin Storsjö authored and Anton Mitrofanov committed
      This makes the code much simpler (especially for adding support
      for other instruction set extensions), avoids needing inline
      assembly for this feature, and generally is more of the canonical
      way to do this.
      
      The CPU feature detection was added in
      9c3c7168, using HWCAP_CPUID.
      
      The argument for using it was that HWCAP_CPUID was added much
      earlier in the kernel (in Linux v4.11), while the HWCAP flags for
      individual features always come later. This allows detecting support
      for new CPU extensions before the kernel exposes information about
      them via hwcap flags.
      
      However, in practice there is probably little advantage in this.
      E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
      v5.10 - later than HWCAP_CPUID, but there are probably very few
      practical cases where one would run a kernel older than that on a CPU
      that supports those instructions.
      
      Additionally, we provide our own definitions of the flag values to
      check (as they are fixed constants anyway), with names not conflicting
      with the ones from system headers. This reduces the number of ifdefs
      needed, and allows detecting those features even if building with
      userland headers that are lacking the definitions of those flags.
      
      Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
      do expose support for these features via HWCAP flags, but the
      emulated cpuid registers are missing the bits for exposing e.g. SVE2.
      (This issue is fixed in later versions of QEMU though.)
      
      Also drop the ifdef check for whether AT_HWCAP is defined; it was
      added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
      which also precedes when aarch64 was commonly used anyway, so
      don't guard the use of that with an ifdef.
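
The approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the macro names `MY_HWCAP_SVE` and `MY_HWCAP2_SVE2`, the `FEAT_*` flags, and both function names are hypothetical. The bit positions are the fixed values the Linux kernel uses for `HWCAP_SVE` and `HWCAP2_SVE2` on aarch64, and `getauxval()` with `AT_HWCAP`/`AT_HWCAP2` is the standard libc interface for reading them.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_HWCAP2 */
#endif

/* Local copies of the kernel's aarch64 hwcap bits. The values are fixed
 * constants; the names are hypothetical and chosen so that they cannot
 * collide with macros from system headers. */
#define MY_HWCAP_SVE   (1UL << 22)
#define MY_HWCAP2_SVE2 (1UL << 1)

/* Hypothetical feature flags used only by this sketch. */
enum { FEAT_SVE = 1u << 0, FEAT_SVE2 = 1u << 1 };

/* Pure mapping from raw hwcap words to feature flags, kept separate
 * from the system call so it can be tested on any platform. */
unsigned features_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap & MY_HWCAP_SVE)
        flags |= FEAT_SVE;
    if (hwcap2 & MY_HWCAP2_SVE2)
        flags |= FEAT_SVE2;
    return flags;
}

/* Query the running kernel; reports no features on other platforms. */
unsigned detect_features(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return features_from_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Because the flag values are defined locally, the only ifdef left is the platform guard around `getauxval()`; no check for whether the system headers define the individual HWCAP macros is needed.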
    • CI: Switch 32/64-bit windows builds to LLVM · 7241d020
      Anton Mitrofanov authored
      Use the same Docker images as VLC for contrib compilation.
    • CI: Add config.log to job artifacts · ea08f586
      Anton Mitrofanov authored
  16. Feb 19, 2024
  17. Jan 13, 2024
  18. Nov 23, 2023
    • Improve pixel-a.S Performance by Using SVE/SVE2 · c1c9931d
      David Chen authored
      Improve the performance of NEON functions of aarch64/pixel-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=ssd
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      ssd_4x4_c: 235
      ssd_4x4_neon: 226
      ssd_4x4_sve: 151
      ssd_4x8_c: 409
      ssd_4x8_neon: 363
      ssd_4x8_sve: 201
      ssd_4x16_c: 781
      ssd_4x16_neon: 653
      ssd_4x16_sve: 313
      ssd_8x4_c: 402
      ssd_8x4_neon: 192
      ssd_8x4_sve: 192
      ssd_8x8_c: 728
      ssd_8x8_neon: 275
      ssd_8x8_sve: 275
      
      Command executed: ./checkasm10 --bench=ssd
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      ssd_4x4_c: 256
      ssd_4x4_neon: 226
      ssd_4x4_sve: 153
      ssd_4x8_c: 460
      ssd_4x8_neon: 369
      ssd_4x8_sve: 215
      ssd_4x16_c: 852
      ssd_4x16_neon: 651
      ssd_4x16_sve: 340
      
      Command executed: ./checkasm8 --bench=ssd
      Testbed: AWS Graviton3
      Results:
      ssd_4x4_c: 295
      ssd_4x4_neon: 288
      ssd_4x4_sve: 228
      ssd_4x8_c: 454
      ssd_4x8_neon: 431
      ssd_4x8_sve: 294
      ssd_4x16_c: 779
      ssd_4x16_neon: 631
      ssd_4x16_sve: 438
      ssd_8x4_c: 463
      ssd_8x4_neon: 247
      ssd_8x4_sve: 246
      ssd_8x8_c: 781
      ssd_8x8_neon: 413
      ssd_8x8_sve: 353
      
      Command executed: ./checkasm10 --bench=ssd
      Testbed: AWS Graviton3
      Results:
      ssd_4x4_c: 322
      ssd_4x4_neon: 335
      ssd_4x4_sve: 240
      ssd_4x8_c: 522
      ssd_4x8_neon: 448
      ssd_4x8_sve: 294
      ssd_4x16_c: 832
      ssd_4x16_neon: 603
      ssd_4x16_sve: 440
      
      Command executed: ./checkasm8 --bench=sa8d
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      sa8d_8x8_c: 2103
      sa8d_8x8_neon: 619
      sa8d_8x8_sve: 617
      
      Command executed: ./checkasm8 --bench=sa8d
      Testbed: AWS Graviton3
      Results:
      sa8d_8x8_c: 2021
      sa8d_8x8_neon: 597
      sa8d_8x8_sve: 580
      
      Command executed: ./checkasm8 --bench=var
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      var_8x8_c: 595
      var_8x8_neon: 262
      var_8x8_sve: 262
      var_8x16_c: 1193
      var_8x16_neon: 435
      var_8x16_sve: 419
      
      Command executed: ./checkasm8 --bench=var
      Testbed: AWS Graviton3
      Results:
      var_8x8_c: 616
      var_8x8_neon: 229
      var_8x8_sve: 222
      var_8x16_c: 1207
      var_8x16_neon: 399
      var_8x16_sve: 389
      
      Command executed: ./checkasm8 --bench=hadamard_ac
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      hadamard_ac_8x8_c: 2330
      hadamard_ac_8x8_neon: 635
      hadamard_ac_8x8_sve: 635
      hadamard_ac_8x16_c: 4500
      hadamard_ac_8x16_neon: 1152
      hadamard_ac_8x16_sve: 1151
      hadamard_ac_16x8_c: 4499
      hadamard_ac_16x8_neon: 1151
      hadamard_ac_16x8_sve: 1150
      hadamard_ac_16x16_c: 8812
      hadamard_ac_16x16_neon: 2187
      hadamard_ac_16x16_sve: 2186
      
      Command executed: ./checkasm8 --bench=hadamard_ac
      Testbed: AWS Graviton3
      Results:
      hadamard_ac_8x8_c: 2266
      hadamard_ac_8x8_neon: 517
      hadamard_ac_8x8_sve: 513
      hadamard_ac_8x16_c: 4444
      hadamard_ac_8x16_neon: 867
      hadamard_ac_8x16_sve: 849
      hadamard_ac_16x8_c: 4443
      hadamard_ac_16x8_neon: 880
      hadamard_ac_16x8_sve: 868
      hadamard_ac_16x16_c: 8595
      hadamard_ac_16x16_neon: 1656
      hadamard_ac_16x16_sve: 1622
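
The checkasm figures above are cycle counts, so lower is better. A small helper like the one below can turn a pair of counts into a speedup factor. It is an illustrative sketch only: the function names `speedup` and `print_speedup` are hypothetical and are not part of checkasm.

```c
#include <stdio.h>

/* Ratio of baseline cycles to optimized cycles. Values above 1.0 mean the
 * optimized version is faster. Returns 0.0 if the optimized count is not
 * positive, to avoid dividing by zero. */
double speedup(int baseline_cycles, int optimized_cycles)
{
    if (optimized_cycles <= 0)
        return 0.0;
    return (double)baseline_cycles / (double)optimized_cycles;
}

/* Print one comparison line for a benchmarked function. */
void print_speedup(const char *name, int neon_cycles, int sve_cycles)
{
    printf("%-16s neon/sve = %.2fx\n", name, speedup(neon_cycles, sve_cycles));
}
```

For example, `print_speedup("ssd_4x16", 653, 313)` uses the 8-bit Yitian 710 figures listed above and reports a factor slightly above 2, while `print_speedup("ssd_8x4", 192, 192)` reports exactly 1.00x, matching the entries where the NEON and SVE results are identical.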
    • Create Common NEON pixel-a Macros and Constants · 0ac52d29
      David Chen authored
      Place NEON pixel-a macros and constants that are intended
      to be used by SVE/SVE2 functions as well in a common file.
    • Improve mc-a.S Performance by Using SVE/SVE2 · 06dcf3f9
      David Chen authored
      Improve the performance of NEON functions of aarch64/mc-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=avg
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      avg_4x2_c: 274
      avg_4x2_neon: 215
      avg_4x2_sve: 171
      avg_4x4_c: 461
      avg_4x4_neon: 343
      avg_4x4_sve: 225
      avg_4x8_c: 806
      avg_4x8_neon: 619
      avg_4x8_sve: 334
      avg_4x16_c: 1523
      avg_4x16_neon: 1168
      avg_4x16_sve: 558
      
      Command executed: ./checkasm8 --bench=avg
      Testbed: AWS Graviton3
      Results:
      avg_4x2_c: 267
      avg_4x2_neon: 213
      avg_4x2_sve: 167
      avg_4x4_c: 467
      avg_4x4_neon: 350
      avg_4x4_sve: 221
      avg_4x8_c: 784
      avg_4x8_neon: 624
      avg_4x8_sve: 302
      avg_4x16_c: 1445
      avg_4x16_neon: 1182
      avg_4x16_sve: 485
    • Create Common NEON mc-a Macros and Functions · 21a788f1
      David Chen authored
      Place NEON mc-a macros and functions that are intended
      to be used by SVE/SVE2 functions as well in a common file.
  19. Nov 20, 2023
    • Improve deblock-a.S Performance by Using SVE/SVE2 · 5ad5e5d8
      David Chen authored
      Improve the performance of NEON functions of aarch64/deblock-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      deblock_chroma[1]_c: 735
      deblock_chroma[1]_neon: 427
      deblock_chroma[1]_sve: 353
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: AWS Graviton3
      Results:
      deblock_chroma[1]_c: 719
      deblock_chroma[1]_neon: 442
      deblock_chroma[1]_sve: 345
    • Create Common NEON deblock-a Macros · 37949a99
      David Chen authored
      Place NEON deblock-a macros that are intended to be
      used by SVE/SVE2 functions as well in a common file.
    • Improve dct-a.S Performance by Using SVE/SVE2 · 5c382660
      David Chen authored
      Improve the performance of NEON functions of aarch64/dct-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      sub4x4_dct_c: 528
      sub4x4_dct_neon: 322
      sub4x4_dct_sve: 247
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: AWS Graviton3
      Results:
      sub4x4_dct_c: 562
      sub4x4_dct_neon: 376
      sub4x4_dct_sve: 255
      
      Command executed: ./checkasm8 --bench=add
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      add4x4_idct_c: 698
      add4x4_idct_neon: 386
      add4x4_idct_sve2: 345
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 582
      zigzag_interleave_8x8_cavlc_frame_neon: 273
      zigzag_interleave_8x8_cavlc_frame_sve: 257
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: AWS Graviton3
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 587
      zigzag_interleave_8x8_cavlc_frame_neon: 257
      zigzag_interleave_8x8_cavlc_frame_sve: 249
  20. Nov 18, 2023