  1. Nov 20, 2023
    • Improve deblock-a.S Performance by Using SVE/SVE2 · 5ad5e5d8
      David Chen authored
      Improve the performance of the NEON functions in aarch64/deblock-a.S
      by using the SVE/SVE2 instruction set. The affected functions are
      listed below together with their benchmark results.
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      deblock_chroma[1]_c: 735
      deblock_chroma[1]_neon: 427
      deblock_chroma[1]_sve: 353
      
      Command executed: ./checkasm8 --bench=deblock
      Testbed: AWS Graviton3
      Results:
      deblock_chroma[1]_c: 719
      deblock_chroma[1]_neon: 442
      deblock_chroma[1]_sve: 345
    • Create Common NEON deblock-a Macros · 37949a99
      David Chen authored
      Place NEON deblock-a macros that are intended to be
      used by SVE/SVE2 functions as well in a common file.
    • Improve dct-a.S Performance by Using SVE/SVE2 · 5c382660
      David Chen authored
      Improve the performance of the NEON functions in aarch64/dct-a.S
      by using the SVE/SVE2 instruction set. The affected functions are
      listed below together with their benchmark results.
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      sub4x4_dct_c: 528
      sub4x4_dct_neon: 322
      sub4x4_dct_sve: 247
      
      Command executed: ./checkasm8 --bench=sub
      Testbed: AWS Graviton3
      Results:
      sub4x4_dct_c: 562
      sub4x4_dct_neon: 376
      sub4x4_dct_sve: 255
      
      Command executed: ./checkasm8 --bench=add
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      add4x4_idct_c: 698
      add4x4_idct_neon: 386
      add4x4_idct_sve2: 345
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: Alibaba g8y instance based on Yitian 710 CPU
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 582
      zigzag_interleave_8x8_cavlc_frame_neon: 273
      zigzag_interleave_8x8_cavlc_frame_sve: 257
      
      Command executed: ./checkasm8 --bench=zigzag
      Testbed: AWS Graviton3
      Results:
      zigzag_interleave_8x8_cavlc_frame_c: 587
      zigzag_interleave_8x8_cavlc_frame_neon: 257
      zigzag_interleave_8x8_cavlc_frame_sve: 249
  2. Nov 18, 2023
  3. Nov 14, 2023
  4. Nov 02, 2023
    • aarch64: Make the assembly indentation slightly more consistent · ef572b9f
      Martin Storsjö authored
      The assembly currently uses a mixture of different styles. Don't
      make all of it entirely consistent now, but try to make functions
      more consistent within themselves at least.
      
      In particular, get rid of the convention to have braces hanging
      outside of the alignment line.
      
      Some functions have the whole content indented off by one char
      compared to other functions; adjust those (but retain the functions
      that are self-consistent and match either of the common styles).
    • arm: Make the assembly indentation slightly more consistent · 3bc7c362
      Martin Storsjö authored
      The assembly currently uses a mixture of different styles. Don't
      make all of it entirely consistent now, but try to make functions
      more consistent within themselves at least.
      
      In particular, get rid of the convention to have braces hanging
      outside of the alignment line.
    • aarch64: Use rounded right shifts in dequant · dc755eab
      Martin Storsjö authored
      Don't manually add in the rounding constant (via a fused multiply-add
      instruction) when we can just do a plain rounded right shift.
      
                           Cortex A53   A72   A73
      8bpc:
      Before:
      dequant_4x4_cqm_neon:       515   246   267
      dequant_4x4_dc_cqm_neon:    410   265   266
      dequant_4x4_dc_flat_neon:   413   271   271
      dequant_4x4_flat_neon:      519   254   274
      dequant_8x8_cqm_neon:      1555   980  1002
      dequant_8x8_flat_neon:     1562   994  1014
      After:
      dequant_4x4_cqm_neon:       499   246   255
      dequant_4x4_dc_cqm_neon:    376   265   255
      dequant_4x4_dc_flat_neon:   378   271   260
      dequant_4x4_flat_neon:      500   254   262
      dequant_8x8_cqm_neon:      1489   900   925
      dequant_8x8_flat_neon:     1493   915   938
      
      10bpc:
      Before:
      dequant_4x4_cqm_neon:       483   275   275
      dequant_4x4_dc_cqm_neon:    429   256   261
      dequant_4x4_dc_flat_neon:   435   267   267
      dequant_4x4_flat_neon:      487   283   288
      dequant_8x8_cqm_neon:      1511  1112  1076
      dequant_8x8_flat_neon:     1518  1139  1089
      After:
      dequant_4x4_cqm_neon:       472   255   239
      dequant_4x4_dc_cqm_neon:    404   256   232
      dequant_4x4_dc_flat_neon:   406   267   234
      dequant_4x4_flat_neon:      472   255   239
      dequant_8x8_cqm_neon:      1462   922   978
      dequant_8x8_flat_neon:     1462   922   978
      
      This makes it around 3% faster on the Cortex A53, around 8% faster
      for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
      on A72/A73.
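
The arithmetic behind this change can be sketched in plain C. This is only an illustration of the identity involved, not the x264 assembly: the function names are hypothetical, and the 64-bit reference stands in for the behavior of a rounding right shift instruction, which adds half of the divisor internally before shifting. It assumes `1 <= shift <= 31`, inputs small enough that the explicit addition does not overflow, and a compiler that shifts negative values arithmetically.

```c
#include <assert.h>
#include <stdint.h>

/* Old approach: add the rounding constant explicitly, then do a plain
 * right shift. Assumes 1 <= shift <= 31 and that x + rounding constant
 * does not overflow int32_t. */
static int32_t shift_with_manual_rounding(int32_t x, int shift)
{
    return (x + (1 << (shift - 1))) >> shift;
}

/* Model of a rounding right shift: the half-divisor is added at wider
 * precision as part of the shift itself, so the caller adds nothing. */
static int32_t rounding_shift_right(int32_t x, int shift)
{
    int64_t wide = (int64_t)x + ((int64_t)1 << (shift - 1));
    return (int32_t)(wide >> shift);
}

/* Returns 1 if both formulations agree for every x in [lo, hi]. */
static int rounding_shifts_agree(int32_t lo, int32_t hi, int shift)
{
    assert(shift >= 1 && shift <= 31);
    for (int64_t x = lo; x <= hi; x++) {
        if (shift_with_manual_rounding((int32_t)x, shift) !=
            rounding_shift_right((int32_t)x, shift))
            return 0;
    }
    return 1;
}
```

Because the two functions agree over the tested range, the separate addition of the rounding constant is redundant whenever a rounding shift is available, which is the saving the commit message describes.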
    • aarch64: Improve scheduling in sad_x3/sad_x4 · 4664f5aa
      Martin Storsjö authored
                     Cortex A53    A72    A73
      8 bpc:
      Before:
      sad_x3_4x4_neon:      580    303    204
      sad_x3_4x8_neon:     1065    516    323
      sad_x3_8x4_neon:      668    262    282
      sad_x3_8x8_neon:     1238    454    471
      sad_x3_8x16_neon:    2378    842    847
      sad_x3_16x8_neon:    2136    738    776
      sad_x3_16x16_neon:   4162   1378   1463
      After:
      sad_x3_4x4_neon:      477    298    206
      sad_x3_4x8_neon:      842    515    327
      sad_x3_8x4_neon:      603    260    279
      sad_x3_8x8_neon:     1110    451    464
      sad_x3_8x16_neon:    2125    841    843
      sad_x3_16x8_neon:    2124    730    766
      sad_x3_16x16_neon:   4145   1370   1434
      
      10 bpc:
      Before:
      sad_x3_4x4_neon:      632    247    254
      sad_x3_4x8_neon:     1162    419    443
      sad_x3_8x4_neon:      890    358    416
      sad_x3_8x8_neon:     1670    632    759
      sad_x3_8x16_neon:    3230   1179   1458
      sad_x3_16x8_neon:    3070   1209   1403
      sad_x3_16x16_neon:   6030   2333   2699
      
      After:
      sad_x3_4x4_neon:      522    253    255
      sad_x3_4x8_neon:      932    443    431
      sad_x3_8x4_neon:      880    354    406
      sad_x3_8x8_neon:     1660    626    736
      sad_x3_8x16_neon:    3220   1170   1397
      sad_x3_16x8_neon:    3060   1184   1362
      sad_x3_16x16_neon:   6020   2272   2579
      
      Thus, this is around a 20-25% speedup on Cortex A53 for the small
      sizes (much smaller difference for bigger sizes though), while it
      doesn't make much of a difference at all (mostly within measurement
      noise) for the out-of-order cores (A72 and A73).
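
The percentages above can be checked against the table with a little arithmetic. The helpers below are hypothetical names introduced for illustration; they assume the benchmark figures are costs where lower is better. For the 8 bpc Cortex A53 column, sad_x3_4x4 goes from 580 to 477 (a ratio of about 1.22) and sad_x3_4x8 from 1065 to 842 (about 1.26), while sad_x3_16x16 goes from 4162 to 4145, which is well under 1% saved.

```c
#include <assert.h>

/* Speedup factor: how many times faster the new code is. */
static double speedup(double before, double after)
{
    assert(before > 0 && after > 0);
    return before / after;
}

/* Share of the original cost that was removed, in percent. */
static double cost_saved_percent(double before, double after)
{
    assert(before > 0 && after > 0);
    return 100.0 * (before - after) / before;
}
```

Note that the two measures differ: a speedup factor of 1.22 corresponds to roughly 18% of the original cost being saved, so it helps to state which one a quoted percentage refers to.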
  5. Oct 24, 2023
  6. Oct 19, 2023
    • Add cpu flags and runtime detection of SVE and SVE2 · 9c3c7168
      Martin Storsjö authored
      We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
      but these might not be available in all userland headers, while
      HWCAP_CPUID is available much earlier.
      
      The register ID_AA64ZFR0_EL1, which indicates whether SVE2 is available,
      can only be accessed if SVE is available. Unless all the C code is
      built with SVE enabled (which could make it impossible to run on
      hardware without SVE), binutils refuses to assemble an instruction
      reading ID_AA64ZFR0_EL1. Referring to the register by its technical
      name S3_0_C0_C4_4 lets it be assembled without any extra
      extensions enabled.
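
A minimal sketch of this detection order in C follows. It is not the code from the commit: the function names and the `FLAG_SVE`/`FLAG_SVE2` values are hypothetical, and the fallback definition of `HWCAP_CPUID` as bit 11 assumes the Linux arm64 ABI. The field positions (SVE in bits [35:32] of ID_AA64PFR0_EL1, SVEver in bits [3:0] of ID_AA64ZFR0_EL1) follow the Arm architecture register descriptions. On other platforms the sketch simply reports no SVE support.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_SVE  0x1u  /* hypothetical flag values for this sketch */
#define FLAG_SVE2 0x2u

/* ID_AA64PFR0_EL1.SVE, bits [35:32]: nonzero means SVE is implemented. */
static int has_sve_from_pfr0(uint64_t pfr0)
{
    return ((pfr0 >> 32) & 0xf) != 0;
}

/* ID_AA64ZFR0_EL1.SVEver, bits [3:0]: nonzero means SVE2 is implemented. */
static int has_sve2_from_zfr0(uint64_t zfr0)
{
    return (zfr0 & 0xf) != 0;
}

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1 << 11)  /* assumed value from the Linux arm64 ABI */
#endif

static unsigned detect_sve_flags(void)
{
    unsigned flags = 0;
    uint64_t pfr0, zfr0;

    /* Without HWCAP_CPUID, userland reads of the ID registers are not
     * emulated by the kernel, so report nothing. */
    if (!(getauxval(AT_HWCAP) & HWCAP_CPUID))
        return 0;

    __asm__ volatile("mrs %0, ID_AA64PFR0_EL1" : "=r"(pfr0));
    if (has_sve_from_pfr0(pfr0)) {
        flags |= FLAG_SVE;
        /* Read ID_AA64ZFR0_EL1 only after SVE is confirmed, using the
         * generic name so the assembler needs no SVE extension enabled. */
        __asm__ volatile("mrs %0, S3_0_C0_C4_4" : "=r"(zfr0));
        if (has_sve2_from_zfr0(zfr0))
            flags |= FLAG_SVE2;
    }
    return flags;
}
#else
static unsigned detect_sve_flags(void)
{
    return 0;  /* not AArch64 Linux: no SVE code paths */
}
#endif

/* Small self-check of the field decoding, independent of the host CPU. */
static void check_field_decoding(void)
{
    assert(has_sve_from_pfr0((uint64_t)1 << 32));
    assert(!has_sve_from_pfr0(0x11));
    assert(has_sve2_from_zfr0(0x1));
    assert(!has_sve2_from_zfr0(0x10));
}
```

Keeping the field decoding in separate pure functions lets it be checked on any machine, while the register reads themselves only compile on AArch64 Linux.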
  7. Oct 18, 2023
    • configure: Check for support for AArch64 SVE and SVE2 · db9bc75b
      Martin Storsjö authored
      We don't expect the user to build the whole x264 codebase with
      SVE/SVE2 enabled, as we only enable this feature for the assembly
      files that use it, in order to have binaries that are portable
      and enable the SVE codepaths at runtime if supported.
  8. Oct 12, 2023
    • loongarch: Improve the performance of pixel series functions · 5f84d403
      Yin Shiyou authored
      
      Performance has improved from 11.27fps to 20.50fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      hadamard_ac_8x8          117             21
      hadamard_ac_8x16         236             42
      hadamard_ac_16x8         235             31
      hadamard_ac_16x16        473             60
      intra_sad_x3_4x4         50              21
      intra_sad_x3_8x8         183             34
      intra_sad_x3_8x8c        181             36
      intra_sad_x3_16x16       643             68
      intra_satd_x3_4x4        83              61
      intra_satd_x3_8x8c       344             81
      intra_satd_x3_16x16      1389            136
      sa8d_8x8                 97              19
      sa8d_16x16               394             68
      satd_4x4                 24              8
      satd_4x8                 51              11
      satd_4x16                103             24
      satd_8x4                 52              9
      satd_8x8                 108             12
      satd_8x16                218             24
      satd_16x8                218             19
      satd_16x16               437             38
      ssd_4x4                  10              5
      ssd_4x8                  24              8
      ssd_4x16                 42              15
      ssd_8x4                  23              5
      ssd_8x8                  37              9
      ssd_8x16                 74              17
      ssd_16x8                 72              11
      ssd_16x16                140             23
      var2_8x8                 91              37
      var2_8x16                176             66
      var_8x8                  50              15
      var_8x16                 65              29
      var_16x16                132             56
      
      Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>
    • loongarch: Improve the performance of dct series functions · fa7f1fce
      Yin Shiyou authored
      
      Performance has improved from 10.53fps to 11.27fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      add4x4_idct              34              9
      add8x8_idct              139             31
      add8x8_idct8             269             39
      add8x8_idct_dc           67              7
      add16x16_idct            564             123
      add16x16_idct_dc         260             22
      dct4x4dc                 18              10
      idct4x4dc                16              9
      sub4x4_dct               25              7
      sub8x8_dct               101             12
      sub8x8_dct8              160             25
      sub16x16_dct             403             52
      sub16x16_dct8            646             68
      zigzag_scan_4x4_frame    4               1
      
      Signed-off-by: zhoupeng <zhoupeng@loongson.cn>
    • loongarch: Improve the performance of mc series functions · 981c8f25
      Yin Shiyou authored
      
      Performance has improved from 6.78fps to 10.53fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      avg_4x2                  16              5
      avg_4x4                  30              6
      avg_4x8                  63              10
      avg_4x16                 124             19
      avg_8x4                  60              6
      avg_8x8                  119             10
      avg_8x16                 233             19
      avg_16x8                 229             21
      avg_16x16                451             41
      get_ref_4x4              30              9
      get_ref_4x8              52              11
      get_ref_8x4              45              9
      get_ref_8x8              80              11
      get_ref_8x16             156             16
      get_ref_12x10            137             13
      get_ref_16x8             147             11
      get_ref_16x16            282             16
      get_ref_20x18            278             22
      hpel_filter              5163            686
      lowres_init              5440            286
      mc_chroma_2x2            24              7
      mc_chroma_2x4            42              10
      mc_chroma_4x2            41              7
      mc_chroma_4x4            75              10
      mc_chroma_4x8            144             19
      mc_chroma_8x4            137             15
      mc_chroma_8x8            269             28
      mc_luma_4x4              30              10
      mc_luma_4x8              52              12
      mc_luma_8x4              44              10
      mc_luma_8x8              80              13
      mc_luma_8x16             156             19
      mc_luma_16x8             147             13
      mc_luma_16x16            281             19
      memcpy_aligned           14              9
      memzero_aligned          24              4
      offsetadd_w4             79              18
      offsetadd_w8             142             18
      offsetadd_w16            277             25
      offsetadd_w20            1118            38
      offsetsub_w4             75              18
      offsetsub_w8             140             18
      offsetsub_w16            265             25
      offsetsub_w20            989             39
      weight_w4                111             19
      weight_w8                205             19
      weight_w16               396             29
      weight_w20               1143            45
      deinterleave_chroma_fdec 76              9
      deinterleave_chroma_fenc 86              9
      plane_copy_deinterleave  733             90
      plane_copy_interleave    791             245
      store_interleave_chroma  82              12
      
      Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
  9. Oct 10, 2023
    • loongarch: Improve the performance of quant series functions · 65e7bac5
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 6.34fps to 6.78fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      coeff_last15             3               2
      coeff_last16             3               1
      coeff_last64             42              6
      decimate_score15         8               12
      decimate_score16         8               11
      decimate_score64         61              43
      dequant_4x4_cqm          16              5
      dequant_4x4_dc_cqm       13              5
      dequant_4x4_dc_flat      13              5
      dequant_4x4_flat         16              5
      dequant_8x8_cqm          71              9
      dequant_8x8_flat         71              9
      
      Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
    • loongarch: Improve the performance of predict series functions · d8ed272a
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 6.32fps to 6.34fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      intra_predict_4x4_dc     3               2
      intra_predict_4x4_dc8    1               1
      intra_predict_4x4_dcl    2               1
      intra_predict_4x4_dct    2               1
      intra_predict_4x4_ddl    7               2
      intra_predict_4x4_h      2               1
      intra_predict_4x4_v      1               1
      intra_predict_8x8_dc     8               2
      intra_predict_8x8_dc8    1               1
      intra_predict_8x8_dcl    5               2
      intra_predict_8x8_dct    5               2
      intra_predict_8x8_ddl    27              3
      intra_predict_8x8_ddr    26              3
      intra_predict_8x8_h      4               2
      intra_predict_8x8_v      3               1
      intra_predict_8x8_vl     29              3
      intra_predict_8x8_vr     31              4
      intra_predict_8x8c_dc    8               5
      intra_predict_8x8c_dc8   1               1
      intra_predict_8x8c_dcl   5               3
      intra_predict_8x8c_dct   5               3
      intra_predict_8x8c_h     4               2
      intra_predict_8x8c_p     58              30
      intra_predict_8x8c_v     4               1
      intra_predict_16x16_dc   32              8
      intra_predict_16x16_dc8  9               4
      intra_predict_16x16_dcl  26              6
      intra_predict_16x16_dct  26              6
      intra_predict_16x16_h    23              7
      intra_predict_16x16_p    182             44
      intra_predict_16x16_v    22              4
      
      Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
    • loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions · 00b8e3b9
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 4.92fps to 6.32fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      sad_4x4                 13               3
      sad_4x8                 26               7
      sad_4x16                57               13
      sad_8x4                 24               3
      sad_8x8                 54               8
      sad_8x16                108              13
      sad_16x8                95               8
      sad_16x16               189              13
      sad_x3_4x4              37               6
      sad_x3_4x8              71               13
      sad_x3_8x4              70               8
      sad_x3_8x8              162              14
      sad_x3_8x16             323              25
      sad_x3_16x8             279              15
      sad_x3_16x16            555              27
      sad_x4_4x4              49               8
      sad_x4_4x8              95               17
      sad_x4_8x4              94               8
      sad_x4_8x8              214              16
      sad_x4_8x16             429              33
      sad_x4_16x8             372              18
      sad_x4_16x16            740              34
      
      Signed-off-by: wanglu <wanglu@loongson.cn>
    • loongarch: Improve the performance of deblock series functions. · d7d283f6
      Yin Shiyou authored and Yin Shiyou committed
      
      Performance has improved from 4.76fps to 4.92fps.
      Tested with the following command:
      ./configure && make -j5
      ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
      
      functions           performance     performance
                              (c)            (asm)
      deblock_luma[0]         79               39
      deblock_luma[1]         91               18
      deblock_luma_intra[0]   63               44
      deblock_luma_intra[1]   71               18
      deblock_strength        104              33
      
      Signed-off-by: Hao Chen <chenhao@loongson.cn>
    • loongarch: Add loongson_asm.S and loongson_utils.S · 25ffd616
      Yin Shiyou authored and Yin Shiyou committed
      
      Common macros and functions for loongson optimization.
      
      Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
    • loongarch: Init LSX/LASX support · 1ecc51ee
      Yin Shiyou authored and Yin Shiyou committed
      
      LSX/LASX is the LOONGARCH 128-bit/256-bit SIMD Architecture.
      
      Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
      Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
  10. Oct 01, 2023