  1. Mar 14, 2024
    • Henrik Gramner's avatar
      x86inc: Improve XMM-spilling functionality on 64-bit Windows · 585e0199
      Henrik Gramner authored and Henrik Gramner committed
      Prior to this change, dealing with the scenario where the number of
      XMM registers spilled depends on whether a branch is taken was
      complicated to handle well. There were essentially three options:
      
      1) Always spill the largest number of XMM registers. Results in
         unnecessary spills.
      
      2) Do the spilling after the branch. Results in code duplication
         for the shared subset of spills.
      
      3) Do the spilling manually. Optimal, but overly complex and vexing.
      
      This adds an additional optional argument to the WIN64_SPILL_XMM
      and WIN64_PUSH_XMM macros to make it possible to allocate space
      for a certain number of registers but initially only push a subset
      of those, with the option of pushing additional registers later.
      585e0199
    • Henrik Gramner's avatar
      x86inc: Restore the stack state between stack allocations · 4df71a75
      Henrik Gramner authored and Henrik Gramner committed
      Allows the use of multiple independent stack allocations within
      a function without having to manually fiddle with stack offsets.
      4df71a75
    • Henrik Gramner's avatar
      x86inc: Fix warnings with old nasm versions · 3d8aff7e
      Henrik Gramner authored and Henrik Gramner committed
      3d8aff7e
  2. Mar 12, 2024
  3. Feb 28, 2024
    • Martin Storsjö's avatar
      aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux · be4f0200
      Martin Storsjö authored and Anton Mitrofanov committed
      This makes the code much simpler (especially for adding support
      for other instruction set extensions), avoids needing inline
      assembly for this feature, and generally is more of the canonical
      way to do this.
      
      The CPU feature detection was added in
      9c3c7168, using HWCAP_CPUID.
      
      The argument for using it was that HWCAP_CPUID was added much
      earlier in the kernel (in Linux v4.11), while the HWCAP flags for
      individual features always come later. This allows detecting support
      for new CPU extensions before the kernel exposes information about
      them via hwcap flags.
      
      In practice, however, this probably offers little advantage.
      E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
      v5.10 - later than HWCAP_CPUID, but there are probably very few
      practical cases where one would run a kernel older than that on a CPU
      that supports those instructions.
      
      Additionally, we provide our own definitions of the flag values to
      check (as they are fixed constants anyway), with names not conflicting
      with the ones from system headers. This reduces the number of ifdefs
      needed, and allows detecting those features even if building with
      userland headers that are lacking the definitions of those flags.
      
      Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
      do expose support for these features via HWCAP flags, but the
      emulated cpuid registers are missing the bits for exposing e.g. SVE2.
      (This issue is fixed in later versions of QEMU though.)
      
      Also drop the ifdef check for whether AT_HWCAP is defined; it was
      added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
      which also predates the common use of aarch64, so don't guard the
      use of that with an ifdef either.
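
      As a rough illustration of the hwcaps approach described above (not the
      code from this commit), the sketch below maps the AT_HWCAP/AT_HWCAP2 words
      to feature flags using privately named copies of the kernel's bit
      constants. getauxval and the bit positions (SVE is bit 22 of AT_HWCAP,
      SVE2 is bit 1 of AT_HWCAP2) come from Linux and glibc; the names
      MY_HWCAP_SVE, MY_HWCAP2_SVE2, CPU_FLAG_SVE, CPU_FLAG_SVE2,
      flags_from_hwcaps and detect_cpu_flags are assumptions made for this
      example.

```c
#include <stdint.h>

#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, AT_HWCAP2 */
#endif

/* Private copies of the arm64 hwcap bits. They are fixed kernel ABI values,
 * and the names are chosen so they cannot clash with HWCAP_SVE / HWCAP2_SVE2
 * from system headers (hypothetical names). */
#define MY_HWCAP_SVE    (1UL << 22)
#define MY_HWCAP2_SVE2  (1UL << 1)

/* Hypothetical feature flags used only in this sketch. */
#define CPU_FLAG_SVE   0x1u
#define CPU_FLAG_SVE2  0x2u

/* Pure helper: translate the raw hwcap words into feature flags. */
uint32_t flags_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    uint32_t flags = 0;
    if (hwcap & MY_HWCAP_SVE)
        flags |= CPU_FLAG_SVE;
    if (hwcap2 & MY_HWCAP2_SVE2)
        flags |= CPU_FLAG_SVE2;
    return flags;
}

/* Ask the kernel for the hwcap words. Only meaningful on AArch64 Linux;
 * on any other target this sketch simply reports no features. */
uint32_t detect_cpu_flags(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return flags_from_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

      Keeping the bit test in a pure function means no ifdefs are needed around
      the constants, and no inline assembly is involved.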
      be4f0200
    • Anton Mitrofanov's avatar
      CI: Switch 32/64-bit windows builds to LLVM · 7241d020
      Anton Mitrofanov authored
      Use same Docker images as VLC for contrib compilation.
      7241d020
    • Anton Mitrofanov's avatar
      CI: Add config.log to job artifacts · ea08f586
      Anton Mitrofanov authored
      ea08f586
  4. Feb 19, 2024
  5. Jan 13, 2024
  6. Nov 23, 2023
    • David Chen's avatar
      Improve pixel-a.S Performance by Using SVE/SVE2 · c1c9931d
      David Chen authored
      Improve the performance of NEON functions of aarch64/pixel-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=ssd
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      ssd_4x4_c: 235
      ssd_4x4_neon: 226
      ssd_4x4_sve: 151
      ssd_4x8_c: 409
      ssd_4x8_neon: 363
      ssd_4x8_sve: 201
      ssd_4x16_c: 781
      ssd_4x16_neon: 653
      ssd_4x16_sve: 313
      ssd_8x4_c: 402
      ssd_8x4_neon: 192
      ssd_8x4_sve: 192
      ssd_8x8_c: 728
      ssd_8x8_neon: 275
      ssd_8x8_sve: 275
      
      Command executed: ./checkasm10 --bench=ssd
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      ssd_4x4_c: 256
      ssd_4x4_neon: 226
      ssd_4x4_sve: 153
      ssd_4x8_c: 460
      ssd_4x8_neon: 369
      ssd_4x8_sve: 215
      ssd_4x16_c: 852
      ssd_4x16_neon: 651
      ssd_4x16_sve: 340
      
      Command executed: ./checkasm8 --bench=ssd
      Testbed: AWS Graviton3
      Results:
      ssd_4x4_c: 295
      ssd_4x4_neon: 288
      ssd_4x4_sve: 228
      ssd_4x8_c: 454
      ssd_4x8_neon: 431
      ssd_4x8_sve: 294
      ssd_4x16_c: 779
      ssd_4x16_neon: 631
      ssd_4x16_sve: 438
      ssd_8x4_c: 463
      ssd_8x4_neon: 247
      ssd_8x4_sve: 246
      ssd_8x8_c: 781
      ssd_8x8_neon: 413
      ssd_8x8_sve: 353
      
      Command executed: ./checkasm10 --bench=ssd
      Testbed: AWS Graviton3
      Results:
      ssd_4x4_c: 322
      ssd_4x4_neon: 335
      ssd_4x4_sve: 240
      ssd_4x8_c: 522
      ssd_4x8_neon: 448
      ssd_4x8_sve: 294
      ssd_4x16_c: 832
      ssd_4x16_neon: 603
      ssd_4x16_sve: 440
      
      Command executed: ./checkasm8 --bench=sa8d
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      sa8d_8x8_c: 2103
      sa8d_8x8_neon: 619
      sa8d_8x8_sve: 617
      
      Command executed: ./checkasm8 --bench=sa8d
      Testbed: AWS Graviton3
      Results:
      sa8d_8x8_c: 2021
      sa8d_8x8_neon: 597
      sa8d_8x8_sve: 580
      
      Command executed: ./checkasm8 --bench=var
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      var_8x8_c: 595
      var_8x8_neon: 262
      var_8x8_sve: 262
      var_8x16_c: 1193
      var_8x16_neon: 435
      var_8x16_sve: 419
      
      Command executed: ./checkasm8 --bench=var
      Testbed: AWS Graviton3
      Results:
      var_8x8_c: 616
      var_8x8_neon: 229
      var_8x8_sve: 222
      var_8x16_c: 1207
      var_8x16_neon: 399
      var_8x16_sve: 389
      
      Command executed: ./checkasm8 --bench=hadamard_ac
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      hadamard_ac_8x8_c: 2330
      hadamard_ac_8x8_neon: 635
      hadamard_ac_8x8_sve: 635
      hadamard_ac_8x16_c: 4500
      hadamard_ac_8x16_neon: 1152
      hadamard_ac_8x16_sve: 1151
      hadamard_ac_16x8_c: 4499
      hadamard_ac_16x8_neon: 1151
      hadamard_ac_16x8_sve: 1150
      hadamard_ac_16x16_c: 8812
      hadamard_ac_16x16_neon: 2187
      hadamard_ac_16x16_sve: 2186
      
      Command executed: ./checkasm8 --bench=hadamard_ac
      Testbed: AWS Graviton3
      Results:
      hadamard_ac_8x8_c: 2266
      hadamard_ac_8x8_neon: 517
      hadamard_ac_8x8_sve: 513
      hadamard_ac_8x16_c: 4444
      hadamard_ac_8x16_neon: 867
      hadamard_ac_8x16_sve: 849
      hadamard_ac_16x8_c: 4443
      hadamard_ac_16x8_neon: 880
      hadamard_ac_16x8_sve: 868
      hadamard_ac_16x16_c: 8595
      hadamard_ac_16x16_neon: 1656
      hadamard_ac_16x16_sve: 1622
      c1c9931d
    • David Chen's avatar
      Create Common NEON pixel-a Macros and Constants · 0ac52d29
      David Chen authored
      Place NEON pixel-a macros and constants that are intended
      to be used by SVE/SVE2 functions as well in a common file.
      0ac52d29
    • David Chen's avatar
      Improve mc-a.S Performance by Using SVE/SVE2 · 06dcf3f9
      David Chen authored
      Improve the performance of NEON functions of aarch64/mc-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=avg
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      avg_4x2_c: 274
      avg_4x2_neon: 215
      avg_4x2_sve: 171
      avg_4x4_c: 461
      avg_4x4_neon: 343
      avg_4x4_sve: 225
      avg_4x8_c: 806
      avg_4x8_neon: 619
      avg_4x8_sve: 334
      avg_4x16_c: 1523
      avg_4x16_neon: 1168
      avg_4x16_sve: 558
      
      Command executed: ./checkasm8 --bench=avg
      Testbed: AWS Graviton3
      Results:
      avg_4x2_c: 267
      avg_4x2_neon: 213
      avg_4x2_sve: 167
      avg_4x4_c: 467
      avg_4x4_neon: 350
      avg_4x4_sve: 221
      avg_4x8_c: 784
      avg_4x8_neon: 624
      avg_4x8_sve: 302
      avg_4x16_c: 1445
      avg_4x16_neon: 1182
      avg_4x16_sve: 485
      06dcf3f9
    • David Chen's avatar
      Create Common NEON mc-a Macros and Functions · 21a788f1
      David Chen authored
      Place NEON mc-a macros and functions that are intended
      to be used by SVE/SVE2 functions as well in a common file.
      21a788f1
  7. Nov 20, 2023
    • David Chen's avatar
      Improve deblock-a.S Performance by Using SVE/SVE2 · 5ad5e5d8
      David Chen authored
      Improve the performance of NEON functions of aarch64/deblock-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      deblock_chroma[1]_c: 735
      deblock_chroma[1]_neon: 427
      deblock_chroma[1]_sve: 353
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: AWS Graviton3
      Results:
      deblock_chroma[1]_c: 719
      deblock_chroma[1]_neon: 442
      deblock_chroma[1]_sve: 345
      5ad5e5d8
    • David Chen's avatar
      Create Common NEON deblock-a Macros · 37949a99
      David Chen authored
      Place NEON deblock-a macros that are intended to be
      used by SVE/SVE2 functions as well in a common file.
      37949a99
    • David Chen's avatar
      Improve dct-a.S Performance by Using SVE/SVE2 · 5c382660
      David Chen authored
      Improve the performance of NEON functions of aarch64/dct-a.S
      by using the SVE/SVE2 instruction set. Below, the specific functions
      are listed together with the improved performance results.
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      sub4x4_dct_c: 528
      sub4x4_dct_neon: 322
      sub4x4_dct_sve: 247
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: AWS Graviton3
      Results:
      sub4x4_dct_c: 562
      sub4x4_dct_neon: 376
      sub4x4_dct_sve: 255
      
      Command executed: ./checkasm8 --bench=add
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      add4x4_idct_c: 698
      add4x4_idct_neon: 386
      add4x4_idct_sve2: 345
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 582
      zigzag_interleave_8x8_cavlc_frame_neon: 273
      zigzag_interleave_8x8_cavlc_frame_sve: 257
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: AWS Graviton3
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 587
      zigzag_interleave_8x8_cavlc_frame_neon: 257
      zigzag_interleave_8x8_cavlc_frame_sve: 249
      5c382660
  8. Nov 18, 2023
  9. Nov 14, 2023
  10. Nov 02, 2023
    • Martin Storsjö's avatar
      aarch64: Make the assembly indentation slightly more consistent · ef572b9f
      Martin Storsjö authored
      The assembly currently uses a mixture of different styles. Don't
      make all of it entirely consistent now, but try to make functions
      more consistent within themselves at least.
      
      In particular, get rid of the convention to have braces hanging
      outside of the alignment line.
      
      Some functions have the whole content indented off by one char
      compared to other functions; adjust those (but retain the functions
      that are self-consistent and match either of the common styles).
      ef572b9f
    • Martin Storsjö's avatar
      arm: Make the assembly indentation slightly more consistent · 3bc7c362
      Martin Storsjö authored
      The assembly currently uses a mixture of different styles. Don't
      make all of it entirely consistent now, but try to make functions
      more consistent within themselves at least.
      
      In particular, get rid of the convention to have braces hanging
      outside of the alignment line.
      3bc7c362
    • Martin Storsjö's avatar
      aarch64: Use rounded right shifts in dequant · dc755eab
      Martin Storsjö authored
      Don't manually add in the rounding constant (via a fused multiply-add
      instruction) when we can just do a plain rounded right shift.
      
                           Cortex A53   A72   A73
      8bpc:
      Before:
      dequant_4x4_cqm_neon:       515   246   267
      dequant_4x4_dc_cqm_neon:    410   265   266
      dequant_4x4_dc_flat_neon:   413   271   271
      dequant_4x4_flat_neon:      519   254   274
      dequant_8x8_cqm_neon:      1555   980  1002
      dequant_8x8_flat_neon:     1562   994  1014
      After:
      dequant_4x4_cqm_neon:       499   246   255
      dequant_4x4_dc_cqm_neon:    376   265   255
      dequant_4x4_dc_flat_neon:   378   271   260
      dequant_4x4_flat_neon:      500   254   262
      dequant_8x8_cqm_neon:      1489   900   925
      dequant_8x8_flat_neon:     1493   915   938
      
      10bpc:
      Before:
      dequant_4x4_cqm_neon:       483   275   275
      dequant_4x4_dc_cqm_neon:    429   256   261
      dequant_4x4_dc_flat_neon:   435   267   267
      dequant_4x4_flat_neon:      487   283   288
      dequant_8x8_cqm_neon:      1511  1112  1076
      dequant_8x8_flat_neon:     1518  1139  1089
      After:
      dequant_4x4_cqm_neon:       472   255   239
      dequant_4x4_dc_cqm_neon:    404   256   232
      dequant_4x4_dc_flat_neon:   406   267   234
      dequant_4x4_flat_neon:      472   255   239
      dequant_8x8_cqm_neon:      1462   922   978
      dequant_8x8_flat_neon:     1462   922   978
      
      This makes it around 3% faster on the Cortex A53, around 8% faster
      for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
      on A72/A73.
      dc755eab
    • Martin Storsjö's avatar
      aarch64: Improve scheduling in sad_x3/sad_x4 · 4664f5aa
      Martin Storsjö authored
                     Cortex A53    A72    A73
      8 bpc:
      Before:
      sad_x3_4x4_neon:      580    303    204
      sad_x3_4x8_neon:     1065    516    323
      sad_x3_8x4_neon:      668    262    282
      sad_x3_8x8_neon:     1238    454    471
      sad_x3_8x16_neon:    2378    842    847
      sad_x3_16x8_neon:    2136    738    776
      sad_x3_16x16_neon:   4162   1378   1463
      After:
      sad_x3_4x4_neon:      477    298    206
      sad_x3_4x8_neon:      842    515    327
      sad_x3_8x4_neon:      603    260    279
      sad_x3_8x8_neon:     1110    451    464
      sad_x3_8x16_neon:    2125    841    843
      sad_x3_16x8_neon:    2124    730    766
      sad_x3_16x16_neon:   4145   1370   1434
      
      10 bpc:
      Before:
      sad_x3_4x4_neon:      632    247    254
      sad_x3_4x8_neon:     1162    419    443
      sad_x3_8x4_neon:      890    358    416
      sad_x3_8x8_neon:     1670    632    759
      sad_x3_8x16_neon:    3230   1179   1458
      sad_x3_16x8_neon:    3070   1209   1403
      sad_x3_16x16_neon:   6030   2333   2699
      
      After:
      sad_x3_4x4_neon:      522    253    255
      sad_x3_4x8_neon:      932    443    431
      sad_x3_8x4_neon:      880    354    406
      sad_x3_8x8_neon:     1660    626    736
      sad_x3_8x16_neon:    3220   1170   1397
      sad_x3_16x8_neon:    3060   1184   1362
      sad_x3_16x16_neon:   6020   2272   2579
      
      Thus, this is around a 20-25% speedup on Cortex A53 for the small
      sizes (much smaller difference for bigger sizes though), while it
      doesn't make much of a difference at all (mostly within measurement
      noise) for the out-of-order cores (A72 and A73).
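
      The commit message does not say which convention its percentages follow,
      so the sketch below shows the two common ones side by side: the share of
      cycles removed, and the gain in throughput. The function names
      cycles_saved_pct and throughput_gain_pct are hypothetical and only serve
      this example.

```c
/* Percentage of cycles removed: 100 * (before - after) / before. */
double cycles_saved_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Percentage gain in throughput: 100 * (before / after - 1). */
double throughput_gain_pct(double before, double after)
{
    return 100.0 * (before / after - 1.0);
}
```

      For sad_x3_4x4_neon at 8 bpc on the Cortex A53 (580 before, 477 after),
      the first measure gives roughly 18% and the second roughly 22%.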
      4664f5aa
  11. Oct 24, 2023
  12. Oct 19, 2023
    • Martin Storsjö's avatar
      Add cpu flags and runtime detection of SVE and SVE2 · 9c3c7168
      Martin Storsjö authored
      We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
      but these might not be available in all userland headers, while
      HWCAP_CPUID is available much earlier.
      
      The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available,
      can only be accessed if SVE is available. If not building all the
      C code with SVE enabled (which could make it impossible to run on
      HW without SVE), binutils refuses to assemble an instruction
      reading ID_AA64ZFR0_EL1 - but if referring to it with the technical
      name S3_0_C0_C4_4, it can be assembled even without any extra
      extensions enabled.
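
      A minimal sketch of the HWCAP_CPUID route described above, under stated
      assumptions rather than as the code from this commit. The field positions
      (ID_AA64PFR0_EL1.SVE in bits [35:32], ID_AA64ZFR0_EL1.SVEver in bits
      [3:0]) and the HWCAP_CPUID bit (bit 11 of AT_HWCAP) follow the Arm
      architecture and the Linux arm64 ABI; the names MY_HWCAP_CPUID,
      READ_SYSREG, has_sve, has_sve2, CPU_FLAG_SVE, CPU_FLAG_SVE2 and
      detect_sve_via_cpuid are hypothetical.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#define MY_HWCAP_CPUID (1UL << 11)   /* private copy of HWCAP_CPUID */
/* Read a system register by name (GCC/Clang inline assembly). */
#define READ_SYSREG(reg, val) __asm__("mrs %0, " #reg : "=r"(val))
#endif

/* Hypothetical feature flags used only in this sketch. */
#define CPU_FLAG_SVE   0x1u
#define CPU_FLAG_SVE2  0x2u

/* ID_AA64PFR0_EL1.SVE, bits [35:32]: nonzero means SVE is implemented. */
int has_sve(uint64_t aa64pfr0)
{
    return ((aa64pfr0 >> 32) & 0xf) != 0;
}

/* ID_AA64ZFR0_EL1.SVEver, bits [3:0]: nonzero means SVE2 is implemented. */
int has_sve2(uint64_t aa64zfr0)
{
    return (aa64zfr0 & 0xf) != 0;
}

/* Returns CPU_FLAG_* bits; reports nothing unless built for AArch64 Linux
 * and the kernel advertises HWCAP_CPUID (emulated ID register reads). */
unsigned detect_sve_via_cpuid(void)
{
    unsigned flags = 0;
#if defined(__linux__) && defined(__aarch64__)
    uint64_t pfr0, zfr0;
    if (!(getauxval(AT_HWCAP) & MY_HWCAP_CPUID))
        return 0;
    READ_SYSREG(ID_AA64PFR0_EL1, pfr0);
    if (has_sve(pfr0)) {
        flags |= CPU_FLAG_SVE;
        /* Only read ZFR0 once SVE is known to exist. The generic encoding
         * S3_0_C0_C4_4 assembles without enabling any extra extensions. */
        READ_SYSREG(S3_0_C0_C4_4, zfr0);
        if (has_sve2(zfr0))
            flags |= CPU_FLAG_SVE2;
    }
#endif
    return flags;
}
```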
      9c3c7168
  13. Oct 18, 2023
    • Martin Storsjö's avatar
      configure: Check for support for AArch64 SVE and SVE2 · db9bc75b
      Martin Storsjö authored
      We don't expect the user to build the whole x264 codebase with
      SVE/SVE2 enabled, as we only enable this feature for the assembly
      files that use it, in order to have binaries that are portable
      and enable the SVE codepaths at runtime if supported.
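
      The runtime selection mentioned above can be sketched with a function
      pointer chosen from detected CPU flags. This is only an illustration: the
      flag values, the ssd_fn type and the functions ssd_c, ssd_neon, ssd_sve
      and select_ssd are hypothetical, and the NEON and SVE variants are plain
      C stand-ins for routines that would live in separately assembled files
      built with the extension enabled.

```c
#include <stdint.h>

/* Hypothetical feature flags used only in this sketch. */
#define CPU_FLAG_SVE   0x1u
#define CPU_FLAG_NEON  0x4u

typedef int (*ssd_fn)(const uint8_t *a, const uint8_t *b, int n);

/* Portable reference: sum of squared differences over n bytes. */
int ssd_c(const uint8_t *a, const uint8_t *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Stand-ins for the assembly versions; here they just forward to C. */
int ssd_neon(const uint8_t *a, const uint8_t *b, int n) { return ssd_c(a, b, n); }
int ssd_sve(const uint8_t *a, const uint8_t *b, int n)  { return ssd_c(a, b, n); }

/* Pick the most capable implementation the running CPU supports. Later
 * checks override earlier ones, so SVE wins over NEON when both are set. */
ssd_fn select_ssd(uint32_t cpu_flags)
{
    ssd_fn fn = ssd_c;
    if (cpu_flags & CPU_FLAG_NEON)
        fn = ssd_neon;
    if (cpu_flags & CPU_FLAG_SVE)
        fn = ssd_sve;
    return fn;
}
```

      Because only the selected routine ever executes, the rest of the binary
      stays runnable on hardware without SVE.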
      db9bc75b
  14. Oct 12, 2023
    • Yin Shiyou's avatar
      loongarch: Improve the performance of pixel series functions · 5f84d403
      Yin Shiyou authored
      
      Performance has improved from 11.27fps to 20.50fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      hadamard_ac_8x8          117             21
      hadamard_ac_8x16         236             42
      hadamard_ac_16x8         235             31
      hadamard_ac_16x16        473             60
      intra_sad_x3_4x4         50              21
      intra_sad_x3_8x8         183             34
      intra_sad_x3_8x8c        181             36
      intra_sad_x3_16x16       643             68
      intra_satd_x3_4x4        83              61
      intra_satd_x3_8x8c       344             81
      intra_satd_x3_16x16      1389            136
      sa8d_8x8                 97              19
      sa8d_16x16               394             68
      satd_4x4                 24              8
      satd_4x8                 51              11
      satd_4x16                103             24
      satd_8x4                 52              9
      satd_8x8                 108             12
      satd_8x16                218             24
      satd_16x8                218             19
      satd_16x16               437             38
      ssd_4x4                  10              5
      ssd_4x8                  24              8
      ssd_4x16                 42              15
      ssd_8x4                  23              5
      ssd_8x8                  37              9
      ssd_8x16                 74              17
      ssd_16x8                 72              11
      ssd_16x16                140             23
      var2_8x8                 91              37
      var2_8x16                176             66
      var_8x8                  50              15
      var_8x16                 65              29
      var_16x16                132             56
      
      Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>
      5f84d403
    • Yin Shiyou's avatar
      loongarch: Improve the performance of dct series functions · fa7f1fce
      Yin Shiyou authored
      
      Performance has improved from 10.53fps to 11.27fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      add4x4_idct              34              9
      add8x8_idct              139             31
      add8x8_idct8             269             39
      add8x8_idct_dc           67              7
      add16x16_idct            564             123
      add16x16_idct_dc         260             22
      dct4x4dc                 18              10
      idct4x4dc                16              9
      sub4x4_dct               25              7
      sub8x8_dct               101             12
      sub8x8_dct8              160             25
      sub16x16_dct             403             52
      sub16x16_dct8            646             68
      zigzag_scan_4x4_frame    4               1
      
      Signed-off-by: zhoupeng <zhoupeng@loongson.cn>
      fa7f1fce
    • Yin Shiyou's avatar
      loongarch: Improve the performance of mc series functions · 981c8f25
      Yin Shiyou authored
      
      Performance has improved from 6.78fps to 10.53fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      avg_4x2                  16              5
      avg_4x4                  30              6
      avg_4x8                  63              10
      avg_4x16                 124             19
      avg_8x4                  60              6
      avg_8x8                  119             10
      avg_8x16                 233             19
      avg_16x8                 229             21
      avg_16x16                451             41
      get_ref_4x4              30              9
      get_ref_4x8              52              11
      get_ref_8x4              45              9
      get_ref_8x8              80              11
      get_ref_8x16             156             16
      get_ref_12x10            137             13
      get_ref_16x8             147             11
      get_ref_16x16            282             16
      get_ref_20x18            278             22
      hpel_filter              5163            686
      lowres_init              5440            286
      mc_chroma_2x2            24              7
      mc_chroma_2x4            42              10
      mc_chroma_4x2            41              7
      mc_chroma_4x4            75              10
      mc_chroma_4x8            144             19
      mc_chroma_8x4            137             15
      mc_chroma_8x8            269             28
      mc_luma_4x4              30              10
      mc_luma_4x8              52              12
      mc_luma_8x4              44              10
      mc_luma_8x8              80              13
      mc_luma_8x16             156             19
      mc_luma_16x8             147             13
      mc_luma_16x16            281             19
      memcpy_aligned           14              9
      memzero_aligned          24              4
      offsetadd_w4             79              18
      offsetadd_w8             142             18
      offsetadd_w16            277             25
      offsetadd_w20            1118            38
      offsetsub_w4             75              18
      offsetsub_w8             140             18
      offsetsub_w16            265             25
      offsetsub_w20            989             39
      weight_w4                111             19
      weight_w8                205             19
      weight_w16               396             29
      weight_w20               1143            45
      deinterleave_chroma_fdec 76              9
      deinterleave_chroma_fenc 86              9
      plane_copy_deinterleave  733             90
      plane_copy_interleave    791             245
      store_interleave_chroma  82              12
      
      Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
      981c8f25
  15. Oct 10, 2023
    • Yin Shiyou's avatar
      loongarch: Improve the performance of quant series functions · 65e7bac5
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 6.34fps to 6.78fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      coeff_last15             3               2
      coeff_last16             3               1
      coeff_last64             42              6
      decimate_score15         8               12
      decimate_score16         8               11
      decimate_score64         61              43
      dequant_4x4_cqm          16              5
      dequant_4x4_dc_cqm       13              5
      dequant_4x4_dc_flat      13              5
      dequant_4x4_flat         16              5
      dequant_8x8_cqm          71              9
      dequant_8x8_flat         71              9
      
      Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
      65e7bac5
    • Yin Shiyou's avatar
      loongarch: Improve the performance of predict series functions · d8ed272a
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 6.32fps to 6.34fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      intra_predict_4x4_dc     3               2
      intra_predict_4x4_dc8    1               1
      intra_predict_4x4_dcl    2               1
      intra_predict_4x4_dct    2               1
      intra_predict_4x4_ddl    7               2
      intra_predict_4x4_h      2               1
      intra_predict_4x4_v      1               1
      intra_predict_8x8_dc     8               2
      intra_predict_8x8_dc8    1               1
      intra_predict_8x8_dcl    5               2
      intra_predict_8x8_dct    5               2
      intra_predict_8x8_ddl    27              3
      intra_predict_8x8_ddr    26              3
      intra_predict_8x8_h      4               2
      intra_predict_8x8_v      3               1
      intra_predict_8x8_vl     29              3
      intra_predict_8x8_vr     31              4
      intra_predict_8x8c_dc    8               5
      intra_predict_8x8c_dc8   1               1
      intra_predict_8x8c_dcl   5               3
      intra_predict_8x8c_dct   5               3
      intra_predict_8x8c_h     4               2
      intra_predict_8x8c_p     58              30
      intra_predict_8x8c_v     4               1
      intra_predict_16x16_dc   32              8
      intra_predict_16x16_dc8  9               4
      intra_predict_16x16_dcl  26              6
      intra_predict_16x16_dct  26              6
      intra_predict_16x16_h    23              7
      intra_predict_16x16_p    182             44
      intra_predict_16x16_v    22              4
      
      Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
      d8ed272a
    • Yin Shiyou's avatar
      loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions · 00b8e3b9
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 4.92fps to 6.32fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      sad_4x4                 13               3
      sad_4x8                 26               7
      sad_4x16                57               13
      sad_8x4                 24               3
      sad_8x8                 54               8
      sad_8x16                108              13
      sad_16x8                95               8
      sad_16x16               189              13
      sad_x3_4x4              37               6
      sad_x3_4x8              71               13
      sad_x3_8x4              70               8
      sad_x3_8x8              162              14
      sad_x3_8x16             323              25
      sad_x3_16x8             279              15
      sad_x3_16x16            555              27
      sad_x4_4x4              49               8
      sad_x4_4x8              95               17
      sad_x4_8x4              94               8
      sad_x4_8x8              214              16
      sad_x4_8x16             429              33
      sad_x4_16x8             372              18
      sad_x4_16x16            740              34
      
      Signed-off-by: wanglu <wanglu@loongson.cn>
      00b8e3b9
    • Yin Shiyou's avatar
      loongarch: Improve the performance of deblock series functions. · d7d283f6
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 4.76fps to 4.92fps.
      Tested with following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      deblock_luma[0]         79               39
      deblock_luma[1]         91               18
      deblock_luma_intra[0]   63               44
      deblock_luma_intra[1]   71               18
      deblock_strength        104              33
      
      Signed-off-by: Hao Chen <chenhao@loongson.cn>
      d7d283f6