Commits · master · gu xiwei / x264

Mar 14, 2024

x86inc: Improve XMM-spilling functionality on 64-bit Windows · 585e0199

Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

Prior to this change, handling the scenario where the number of
XMM registers spilled depends on whether a branch is taken was
complicated. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in
   unnecessary spills.

2) Do the spilling after the branch. Results in code duplication
   for the shared subset of spills.

3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds an additional optional argument to the WIN64_SPILL_XMM
and WIN64_PUSH_XMM macros to make it possible to allocate space
for a certain number of registers but initially only push a subset
of those, with the option of pushing additional registers later.

585e0199

x86inc: Restore the stack state between stack allocations · 4df71a75

Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

Allows the use of multiple independent stack allocations within
a function without having to manually fiddle with stack offsets.

4df71a75

x86inc: Fix warnings with old nasm versions · 3d8aff7e

Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

3d8aff7e

Mar 12, 2024

ppc: Fix incompatible pointer type errors · de1bea53

Anton Mitrofanov authored 1 year ago

Use correct return type for pixel_sad_x3/x4 functions.
Bug report by Dominik 'Rathann' Mierzejewski.
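
As an illustration of this class of error: recent compilers such as GCC 14 reject, by default, assigning a function to a function pointer whose return type differs. The sketch below shows a matching declaration in C. The names `sad_x3_fn`, `sad_x3_sketch` and the simplified signature are hypothetical and are not x264's actual declarations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a sad_x3-style function pointer:
 * the three results are written to scores[], so the return type is void. */
typedef void (*sad_x3_fn)(const uint8_t *enc, const uint8_t *ref0,
                          const uint8_t *ref1, const uint8_t *ref2,
                          int n, int scores[3]);

/* Sum of absolute differences over n bytes. */
static int sad(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* If this were declared as returning int instead of void, the pointer
 * initialization below would have an incompatible pointer type. */
static void sad_x3_sketch(const uint8_t *enc, const uint8_t *ref0,
                          const uint8_t *ref1, const uint8_t *ref2,
                          int n, int scores[3])
{
    scores[0] = sad(enc, ref0, n);
    scores[1] = sad(enc, ref1, n);
    scores[2] = sad(enc, ref2, n);
}

/* The types match exactly, so no cast is needed. */
static const sad_x3_fn sad_x3_ptr = sad_x3_sketch;
```

Keeping the implementation's prototype identical to the function pointer type avoids the need for casts, which would otherwise hide this kind of mismatch.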

de1bea53

Feb 28, 2024

aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux · be4f0200

Martin Storsjö authored 1 year ago and Anton Mitrofanov committed 1 year ago

This makes the code much simpler (especially for adding support
for other instruction set extensions), avoids needing inline
assembly for this feature, and generally is more of the canonical
way to do this.

The CPU feature detection was added in
9c3c7168, using HWCAP_CPUID.

The argument for using that was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features always come later. This allows detecting support
for new CPU extensions before the kernel exposes information about
them via hwcap flags.

However, in practice there is probably little advantage in this.
E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
v5.10 - later than HWCAP_CPUID, but there are probably very few
practical cases where one would run a kernel older than that on a CPU
that supports those instructions.

Additionally, we provide our own definitions of the flag values to
check (as they are fixed constants anyway), with names not conflicting
with the ones from system headers. This reduces the number of ifdefs
needed, and allows detecting those features even if building with
userland headers that are lacking the definitions of those flags.

Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose support for these features via HWCAP flags, but the
emulated cpuid registers are missing the bits for exposing e.g. SVE2.
(This issue is fixed in later versions of QEMU though.)

Also drop the ifdef check for whether AT_HWCAP is defined; it was
added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
which also precedes when aarch64 was commonly used anyway, so
don't guard the use of that with an ifdef.
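
A minimal sketch of the hwcaps-based approach described above, assuming Linux on aarch64. The bit positions are the kernel's fixed values for HWCAP_SVE and HWCAP2_SVE2; every `SKETCH_`/`sketch_` name is hypothetical and chosen only so it cannot clash with system headers. This is not the code from the commit.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_HWCAP2 */
#endif

/* Own copies of the kernel's aarch64 hwcap bits (fixed constants), so the
 * code builds even with userland headers that lack these definitions. */
#define SKETCH_HWCAP_AARCH64_SVE   (1UL << 22)  /* same value as HWCAP_SVE */
#define SKETCH_HWCAP2_AARCH64_SVE2 (1UL << 1)   /* same value as HWCAP2_SVE2 */

/* Hypothetical feature flags handed back to the caller. */
#define SKETCH_CPU_SVE  0x1u
#define SKETCH_CPU_SVE2 0x2u

/* Pure helper: map the raw AT_HWCAP / AT_HWCAP2 words to feature flags. */
unsigned sketch_flags_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap & SKETCH_HWCAP_AARCH64_SVE)
        flags |= SKETCH_CPU_SVE;
    if (hwcap2 & SKETCH_HWCAP2_AARCH64_SVE2)
        flags |= SKETCH_CPU_SVE2;
    return flags;
}

/* Query the kernel on Linux/aarch64; report no features elsewhere. */
unsigned sketch_cpu_detect(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return sketch_flags_from_hwcaps(getauxval(AT_HWCAP),
                                    getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Separating the bit decoding from the `getauxval()` call keeps the decoding testable on any host, and supporting a further extension only needs one more constant and one more `if`.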

be4f0200

CI: Switch 32/64-bit windows builds to LLVM · 7241d020

Anton Mitrofanov authored 1 year ago

Use same Docker images as VLC for contrib compilation.

7241d020

CI: Add config.log to job artifacts · ea08f586

Anton Mitrofanov authored 1 year ago

ea08f586

Feb 19, 2024

x86inc: Add support for ELF CET properties · 12426f5f

Henrik Gramner authored 1 year ago

Automatically flag x86-64 asm object files as SHSTK-compatible.

Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology
(CET) which is a feature aimed at defending against ROP attacks by
verifying that 'call' and 'ret' instructions are correctly matched.

For well-written code this works transparently without any code changes,
as return addresses popped from the shadow stack should match return
addresses popped from the normal stack for performance reasons anyway.
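
The call/ret matching can be pictured with a toy model in C. This is only a conceptual sketch: real SHSTK is enforced by the CPU, and the type `sketch_cpu` and the functions below are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STACK_DEPTH 64

/* Toy model, not real CET: two parallel stacks of return addresses. */
typedef struct {
    uintptr_t normal[SKETCH_STACK_DEPTH];  /* writable by the program */
    uintptr_t shadow[SKETCH_STACK_DEPTH];  /* protected copy kept by the CPU */
    size_t depth;
} sketch_cpu;

/* 'call': push the return address onto both stacks. */
void sketch_call(sketch_cpu *cpu, uintptr_t return_addr)
{
    assert(cpu->depth < SKETCH_STACK_DEPTH);
    cpu->normal[cpu->depth] = return_addr;
    cpu->shadow[cpu->depth] = return_addr;
    cpu->depth++;
}

/* 'ret': pop both stacks and compare the two addresses.
 * Returns 1 if they match, 0 for the mismatch that real hardware
 * would report as a control-protection fault. */
int sketch_ret(sketch_cpu *cpu, uintptr_t *target)
{
    assert(cpu->depth > 0);
    cpu->depth--;
    *target = cpu->normal[cpu->depth];
    return cpu->normal[cpu->depth] == cpu->shadow[cpu->depth];
}
```

In this model, code whose `call` and `ret` are properly paired never sees a mismatch, while overwriting an entry in `normal` (as a ROP attack would) is detected on the next `sketch_ret`.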

12426f5f

x86inc.asm: Add the crc32 SSE4.2 GPR instruction · 6fc4480c

Henrik Gramner authored 1 year ago

6fc4480c

x86inc: Add a cpu flag for the Ice Lake AVX-512 subset · 87476b4c

Henrik Gramner authored 1 year ago

87476b4c

x86inc: Add CLMUL cpu flag · a6b56179

Henrik Gramner authored 1 year ago

Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.

a6b56179

x86inc: Add template defines for EVEX broadcasts · 5207a74e

Henrik Gramner authored 1 year ago

Broadcasting a memory operand is a binary flag: you either broadcast
or you don't, and there's only a single possible element size for
any given instruction.

The instruction syntax, however, requires the broadcast semantics
to be explicitly defined, which is an issue when using macros to
template code for multiple register widths.

Add some helper defines to alleviate the issue.

5207a74e

x86inc: Properly sort instructions in alphabetical order · 436be41f

Henrik Gramner authored 1 year ago

436be41f

Jan 13, 2024

Bump dates to 2024 · 4815ccad

Anton Mitrofanov authored 1 year ago

4815ccad

Nov 23, 2023

Improve pixel-a.S Performance by Using SVE/SVE2 · c1c9931d

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/pixel-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 235
ssd_4x4_neon: 226
ssd_4x4_sve: 151
ssd_4x8_c: 409
ssd_4x8_neon: 363
ssd_4x8_sve: 201
ssd_4x16_c: 781
ssd_4x16_neon: 653
ssd_4x16_sve: 313
ssd_8x4_c: 402
ssd_8x4_neon: 192
ssd_8x4_sve: 192
ssd_8x8_c: 728
ssd_8x8_neon: 275
ssd_8x8_sve: 275

Command executed: ./checkasm10 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 256
ssd_4x4_neon: 226
ssd_4x4_sve: 153
ssd_4x8_c: 460
ssd_4x8_neon: 369
ssd_4x8_sve: 215
ssd_4x16_c: 852
ssd_4x16_neon: 651
ssd_4x16_sve: 340

Command executed: ./checkasm8 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 295
ssd_4x4_neon: 288
ssd_4x4_sve: 228
ssd_4x8_c: 454
ssd_4x8_neon: 431
ssd_4x8_sve: 294
ssd_4x16_c: 779
ssd_4x16_neon: 631
ssd_4x16_sve: 438
ssd_8x4_c: 463
ssd_8x4_neon: 247
ssd_8x4_sve: 246
ssd_8x8_c: 781
ssd_8x8_neon: 413
ssd_8x8_sve: 353

Command executed: ./checkasm10 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 322
ssd_4x4_neon: 335
ssd_4x4_sve: 240
ssd_4x8_c: 522
ssd_4x8_neon: 448
ssd_4x8_sve: 294
ssd_4x16_c: 832
ssd_4x16_neon: 603
ssd_4x16_sve: 440

Command executed: ./checkasm8 --bench=sa8d
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sa8d_8x8_c: 2103
sa8d_8x8_neon: 619
sa8d_8x8_sve: 617

Command executed: ./checkasm8 --bench=sa8d
Testbed: AWS Graviton3
Results:
sa8d_8x8_c: 2021
sa8d_8x8_neon: 597
sa8d_8x8_sve: 580

Command executed: ./checkasm8 --bench=var
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
var_8x8_c: 595
var_8x8_neon: 262
var_8x8_sve: 262
var_8x16_c: 1193
var_8x16_neon: 435
var_8x16_sve: 419

Command executed: ./checkasm8 --bench=var
Testbed: AWS Graviton3
Results:
var_8x8_c: 616
var_8x8_neon: 229
var_8x8_sve: 222
var_8x16_c: 1207
var_8x16_neon: 399
var_8x16_sve: 389

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
hadamard_ac_8x8_c: 2330
hadamard_ac_8x8_neon: 635
hadamard_ac_8x8_sve: 635
hadamard_ac_8x16_c: 4500
hadamard_ac_8x16_neon: 1152
hadamard_ac_8x16_sve: 1151
hadamard_ac_16x8_c: 4499
hadamard_ac_16x8_neon: 1151
hadamard_ac_16x8_sve: 1150
hadamard_ac_16x16_c: 8812
hadamard_ac_16x16_neon: 2187
hadamard_ac_16x16_sve: 2186

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: AWS Graviton3
Results:
hadamard_ac_8x8_c: 2266
hadamard_ac_8x8_neon: 517
hadamard_ac_8x8_sve: 513
hadamard_ac_8x16_c: 4444
hadamard_ac_8x16_neon: 867
hadamard_ac_8x16_sve: 849
hadamard_ac_16x8_c: 4443
hadamard_ac_16x8_neon: 880
hadamard_ac_16x8_sve: 868
hadamard_ac_16x16_c: 8595
hadamard_ac_16x16_neon: 1656
hadamard_ac_16x16_sve: 1622

c1c9931d

Create Common NEON pixel-a Macros and Constants · 0ac52d29

David Chen authored 1 year ago

Place NEON pixel-a macros and constants that are intended
to be used by SVE/SVE2 functions as well in a common file.

0ac52d29

Improve mc-a.S Performance by Using SVE/SVE2 · 06dcf3f9

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/mc-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=avg
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
avg_4x2_c: 274
avg_4x2_neon: 215
avg_4x2_sve: 171
avg_4x4_c: 461
avg_4x4_neon: 343
avg_4x4_sve: 225
avg_4x8_c: 806
avg_4x8_neon: 619
avg_4x8_sve: 334
avg_4x16_c: 1523
avg_4x16_neon: 1168
avg_4x16_sve: 558

Command executed: ./checkasm8 --bench=avg
Testbed: AWS Graviton3
Results:
avg_4x2_c: 267
avg_4x2_neon: 213
avg_4x2_sve: 167
avg_4x4_c: 467
avg_4x4_neon: 350
avg_4x4_sve: 221
avg_4x8_c: 784
avg_4x8_neon: 624
avg_4x8_sve: 302
avg_4x16_c: 1445
avg_4x16_neon: 1182
avg_4x16_sve: 485

06dcf3f9

Create Common NEON mc-a Macros and Functions · 21a788f1

David Chen authored 1 year ago

Place NEON mc-a macros and functions that are intended
to be used by SVE/SVE2 functions as well in a common file.

21a788f1

Nov 20, 2023

Improve deblock-a.S Performance by Using SVE/SVE2 · 5ad5e5d8

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/deblock-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
deblock_chroma[1]_c: 735
deblock_chroma[1]_neon: 427
deblock_chroma[1]_sve: 353

Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
Results:
deblock_chroma[1]_c: 719
deblock_chroma[1]_neon: 442
deblock_chroma[1]_sve: 345

5ad5e5d8

Create Common NEON deblock-a Macros · 37949a99

David Chen authored 1 year ago

Place NEON deblock-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.

37949a99

Improve dct-a.S Performance by Using SVE/SVE2 · 5c382660

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/dct-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sub4x4_dct_c: 528
sub4x4_dct_neon: 322
sub4x4_dct_sve: 247

Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
Results:
sub4x4_dct_c: 562
sub4x4_dct_neon: 376
sub4x4_dct_sve: 255

Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
add4x4_idct_c: 698
add4x4_idct_neon: 386
add4x4_idct_sve2: 345

Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
zigzag_interleave_8x8_cavlc_frame_c: 582
zigzag_interleave_8x8_cavlc_frame_neon: 273
zigzag_interleave_8x8_cavlc_frame_sve: 257

Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
Results:
zigzag_interleave_8x8_cavlc_frame_c: 587
zigzag_interleave_8x8_cavlc_frame_neon: 257
zigzag_interleave_8x8_cavlc_frame_sve: 249

5c382660

Nov 18, 2023

Create Common NEON dct-a Macros · b6190c6f

David Chen authored 1 year ago

Place NEON dct-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.

b6190c6f

Nov 14, 2023

ci: Test the aarch64 build in QEMU with varying SVE sizes · c1962404

Martin Storsjö authored 1 year ago

The sve-default-vector-length property sets the maximum vector
length in bytes; the default is 64, i.e. handling up to 512
bit vectors. In order to be able to test 1024 and 2048 bit vectors,
this has to be raised separately from setting the sve<n>=on
property.
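
The bytes-versus-bits relationship can be spelled out with a small helper. This is a sketch in C with hypothetical function names; the only facts assumed beyond the text are that SVE vector lengths are multiples of 128 bits and at most 2048 bits.

```c
#include <assert.h>

/* SVE vector lengths are multiples of 128 bits, up to 2048 bits. */
int sketch_sve_bits_valid(int bits)
{
    return bits >= 128 && bits <= 2048 && bits % 128 == 0;
}

/* Value to use for sve-default-vector-length (which is in bytes) so
 * that vectors of the given width in bits can be exercised. */
int sketch_sve_default_vector_length(int bits)
{
    assert(sketch_sve_bits_valid(bits));
    return bits / 8;
}
```

With this helper, the default of 64 corresponds to 512-bit vectors, while testing 1024-bit and 2048-bit vectors needs the property raised to 128 and 256 respectively.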

c1962404

ci: Update the build-debian-amd64 job to a new base image · 9b3e653b

Martin Storsjö authored 1 year ago

In the new version, there's no longer any "wine64" executable,
but both i386 and x86_64 are handled with the same "wine" frontend.

9b3e653b

checkasm: Print the actual SVE vector length · 611b87b7

Martin Storsjö authored 1 year ago

611b87b7

Nov 02, 2023

aarch64: Consistently use lowercase vector element specifiers · a354f11f

Martin Storsjö authored 1 year ago

a354f11f

aarch64: Make the assembly indentation slightly more consistent · ef572b9f

Martin Storsjö authored 1 year ago

The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.

Some functions have the whole content indented off by one char
compared to other functions; adjust those (but retain the functions
that are self-consistent and match either of the common styles).

ef572b9f

arm: Make the assembly indentation slightly more consistent · 3bc7c362

Martin Storsjö authored 1 year ago

The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.

3bc7c362

aarch64: Use rounded right shifts in dequant · dc755eab

Martin Storsjö authored 2 years ago

Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.

                     Cortex A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938

10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
on A72/A73.
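
The equivalence this change relies on can be checked with scalar C. This is a sketch, not the NEON code from the commit: the function names are hypothetical, it uses a plain add rather than a fused multiply-add, and it assumes that `>>` on negative values is an arithmetic shift, as on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: add the rounding constant 2^(shift-1) explicitly, then do a
 * plain right shift. The 64-bit intermediate avoids overflow in the add. */
int32_t sketch_shift_add_round(int32_t x, int shift)
{
    assert(shift >= 1 && shift <= 30);
    return (int32_t)(((int64_t)x + ((int64_t)1 << (shift - 1))) >> shift);
}

/* Variant 2: a rounded right shift expressed without the addition of a
 * constant: the truncating shift plus the most significant discarded bit.
 * Assumes arithmetic right shift for negative x. */
int32_t sketch_shift_rounded(int32_t x, int shift)
{
    assert(shift >= 1 && shift <= 30);
    return (x >> shift) + ((x >> (shift - 1)) & 1);
}
```

Both variants produce the same result for every input, which is why the explicit rounding constant can be dropped when the instruction set offers a rounding shift.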

dc755eab

aarch64: Improve scheduling in sad_x3/sad_x4 · 4664f5aa

Martin Storsjö authored 2 years ago

               Cortex A53    A72    A73
8 bpc:
Before:
sad_x3_4x4_neon:      580    303    204
sad_x3_4x8_neon:     1065    516    323
sad_x3_8x4_neon:      668    262    282
sad_x3_8x8_neon:     1238    454    471
sad_x3_8x16_neon:    2378    842    847
sad_x3_16x8_neon:    2136    738    776
sad_x3_16x16_neon:   4162   1378   1463
After:
sad_x3_4x4_neon:      477    298    206
sad_x3_4x8_neon:      842    515    327
sad_x3_8x4_neon:      603    260    279
sad_x3_8x8_neon:     1110    451    464
sad_x3_8x16_neon:    2125    841    843
sad_x3_16x8_neon:    2124    730    766
sad_x3_16x16_neon:   4145   1370   1434

10 bpc:
Before:
sad_x3_4x4_neon:      632    247    254
sad_x3_4x8_neon:     1162    419    443
sad_x3_8x4_neon:      890    358    416
sad_x3_8x8_neon:     1670    632    759
sad_x3_8x16_neon:    3230   1179   1458
sad_x3_16x8_neon:    3070   1209   1403
sad_x3_16x16_neon:   6030   2333   2699

After:
sad_x3_4x4_neon:      522    253    255
sad_x3_4x8_neon:      932    443    431
sad_x3_8x4_neon:      880    354    406
sad_x3_8x8_neon:     1660    626    736
sad_x3_8x16_neon:    3220   1170   1397
sad_x3_16x8_neon:    3060   1184   1362
sad_x3_16x16_neon:   6020   2272   2579

Thus, this is around a 20-25% speedup on Cortex A53 for the small
sizes (much smaller difference for bigger sizes though), while it
doesn't make much of a difference at all (mostly within measurement
noise) for the out-of-order cores (A72 and A73).

4664f5aa

Oct 24, 2023

Fix VBV with sliced threads · d46938de

Anton Mitrofanov authored 1 year ago

d46938de

Oct 19, 2023

Add cpu flags and runtime detection of SVE and SVE2 · 9c3c7168

Martin Storsjö authored 1 year ago

We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
but these might not be available in all userland headers, while
HWCAP_CPUID is available much earlier.

The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available,
can only be accessed if SVE is available. If not building all the
C code with SVE enabled (which could make it impossible to run on
HW without SVE), binutils refuses to assemble an instruction
reading ID_AA64ZFR0_EL1 - but if referring to it with the technical
name S3_0_C0_C4_4, it can be assembled even without any extra
extensions enabled.
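
A sketch of the register-based check in C, under stated assumptions: the SVE field is taken to be bits [35:32] of ID_AA64PFR0_EL1 and the SVE version field bits [3:0] of ID_AA64ZFR0_EL1, and the register reads are only valid in userspace when the kernel advertises HWCAP_CPUID. All `sketch_` names are hypothetical; this is not the code from the commit.

```c
#include <stdint.h>

/* Hypothetical result values for this sketch. */
enum { SKETCH_NO_SVE = 0, SKETCH_SVE = 1, SKETCH_SVE2 = 2 };

/* Decode SVE support from raw ID register values:
 *  - ID_AA64PFR0_EL1 bits [35:32] are non-zero if SVE is implemented.
 *  - ID_AA64ZFR0_EL1 bits [3:0] are non-zero if SVE2 is implemented;
 *    this register is only consulted when SVE itself is present. */
int sketch_sve_level(uint64_t id_aa64pfr0, uint64_t id_aa64zfr0)
{
    if (((id_aa64pfr0 >> 32) & 0xf) == 0)
        return SKETCH_NO_SVE;
    return (id_aa64zfr0 & 0xf) ? SKETCH_SVE2 : SKETCH_SVE;
}

#if defined(__aarch64__) && defined(__GNUC__)
/* Only call this after confirming that the kernel reports HWCAP_CPUID. */
int sketch_detect_sve_level(void)
{
    uint64_t pfr0, zfr0 = 0;
    __asm__ ("mrs %0, ID_AA64PFR0_EL1" : "=r"(pfr0));
    /* Read ID_AA64ZFR0_EL1 via its generic name, so the assembler accepts
     * it without the SVE extension enabled, and only if SVE is present. */
    if ((pfr0 >> 32) & 0xf)
        __asm__ ("mrs %0, S3_0_C0_C4_4" : "=r"(zfr0));
    return sketch_sve_level(pfr0, zfr0);
}
#endif
```

The decoding is kept separate from the `mrs` reads so that it can be exercised on any host with synthetic register values.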

9c3c7168

Oct 18, 2023

configure: Check for support for AArch64 SVE and SVE2 · db9bc75b

Martin Storsjö authored 1 year ago

We don't expect the user to build the whole x264 codebase with
SVE/SVE2 enabled, as we only enable this feature for the assembly
files that use it, in order to have binaries that are portable
and enable the SVE codepaths at runtime if supported.
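
The runtime selection this relies on can be sketched with function pointers in C. The flag values, the `ssd_fn` type and all function names below are hypothetical; the `_neon` and `_sve` functions are plain C stand-ins for routines that would really live in assembly files built with the matching extension enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cpu feature flags. */
#define DISPATCH_CPU_NEON 0x1u
#define DISPATCH_CPU_SVE  0x2u

typedef int (*ssd_fn)(const uint8_t *a, const uint8_t *b, int n);

/* Portable reference: sum of squared differences over n bytes. */
static int ssd_c(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Stand-ins for the assembly versions; only these would be built with
 * the NEON or SVE extension enabled, never the whole codebase. */
static int ssd_neon(const uint8_t *a, const uint8_t *b, int n)
{
    return ssd_c(a, b, n);
}

static int ssd_sve(const uint8_t *a, const uint8_t *b, int n)
{
    return ssd_c(a, b, n);
}

/* Pick the best implementation for the flags detected at runtime;
 * later checks override earlier ones. */
static ssd_fn dispatch_pick_ssd(unsigned cpu_flags)
{
    ssd_fn fn = ssd_c;
    if (cpu_flags & DISPATCH_CPU_NEON)
        fn = ssd_neon;
    if (cpu_flags & DISPATCH_CPU_SVE)
        fn = ssd_sve;
    return fn;
}
```

Because the SVE routine is only reached through the pointer chosen at runtime, the same binary still runs on hardware without SVE.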

db9bc75b

Oct 12, 2023

loongarch: Improve the performance of pixel series functions · 5f84d403

Yin Shiyou authored 1 year ago


Performance has improved from 11.27fps to 20.50fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
hadamard_ac_8x8          117             21
hadamard_ac_8x16         236             42
hadamard_ac_16x8         235             31
hadamard_ac_16x16        473             60
intra_sad_x3_4x4         50              21
intra_sad_x3_8x8         183             34
intra_sad_x3_8x8c        181             36
intra_sad_x3_16x16       643             68
intra_satd_x3_4x4        83              61
intra_satd_x3_8x8c       344             81
intra_satd_x3_16x16      1389            136
sa8d_8x8                 97              19
sa8d_16x16               394             68
satd_4x4                 24              8
satd_4x8                 51              11
satd_4x16                103             24
satd_8x4                 52              9
satd_8x8                 108             12
satd_8x16                218             24
satd_16x8                218             19
satd_16x16               437             38
ssd_4x4                  10              5
ssd_4x8                  24              8
ssd_4x16                 42              15
ssd_8x4                  23              5
ssd_8x8                  37              9
ssd_8x16                 74              17
ssd_16x8                 72              11
ssd_16x16                140             23
var2_8x8                 91              37
var2_8x16                176             66
var_8x8                  50              15
var_8x16                 65              29
var_16x16                132             56

Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>

5f84d403

loongarch: Improve the performance of dct series functions · fa7f1fce

Yin Shiyou authored 1 year ago


Performance has improved from 10.53fps to 11.27fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
add4x4_idct              34              9
add8x8_idct              139             31
add8x8_idct8             269             39
add8x8_idct_dc           67              7
add16x16_idct            564             123
add16x16_idct_dc         260             22
dct4x4dc                 18              10
idct4x4dc                16              9
sub4x4_dct               25              7
sub8x8_dct               101             12
sub8x8_dct8              160             25
sub16x16_dct             403             52
sub16x16_dct8            646             68
zigzag_scan_4x4_frame    4               1

Signed-off-by: zhoupeng <zhoupeng@loongson.cn>

fa7f1fce

loongarch: Improve the performance of mc series functions · 981c8f25

Yin Shiyou authored 1 year ago


Performance has improved from 6.78fps to 10.53fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
avg_4x2                  16              5
avg_4x4                  30              6
avg_4x8                  63              10
avg_4x16                 124             19
avg_8x4                  60              6
avg_8x8                  119             10
avg_8x16                 233             19
avg_16x8                 229             21
avg_16x16                451             41
get_ref_4x4              30              9
get_ref_4x8              52              11
get_ref_8x4              45              9
get_ref_8x8              80              11
get_ref_8x16             156             16
get_ref_12x10            137             13
get_ref_16x8             147             11
get_ref_16x16            282             16
get_ref_20x18            278             22
hpel_filter              5163            686
lowres_init              5440            286
mc_chroma_2x2            24              7
mc_chroma_2x4            42              10
mc_chroma_4x2            41              7
mc_chroma_4x4            75              10
mc_chroma_4x8            144             19
mc_chroma_8x4            137             15
mc_chroma_8x8            269             28
mc_luma_4x4              30              10
mc_luma_4x8              52              12
mc_luma_8x4              44              10
mc_luma_8x8              80              13
mc_luma_8x16             156             19
mc_luma_16x8             147             13
mc_luma_16x16            281             19
memcpy_aligned           14              9
memzero_aligned          24              4
offsetadd_w4             79              18
offsetadd_w8             142             18
offsetadd_w16            277             25
offsetadd_w20            1118            38
offsetsub_w4             75              18
offsetsub_w8             140             18
offsetsub_w16            265             25
offsetsub_w20            989             39
weight_w4                111             19
weight_w8                205             19
weight_w16               396             29
weight_w20               1143            45
deinterleave_chroma_fdec 76              9
deinterleave_chroma_fenc 86              9
plane_copy_deinterleave  733             90
plane_copy_interleave    791             245
store_interleave_chroma  82              12

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>

981c8f25

Oct 10, 2023

loongarch: Improve the performance of quant series functions · 65e7bac5

Yin Shiyou authored 1 year ago and Yin Shiyou committed 1 year ago


Performance has improved from 6.34fps to 6.78fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
coeff_last15             3               2
coeff_last16             3               1
coeff_last64             42              6
decimate_score15         8               12
decimate_score16         8               11
decimate_score64         61              43
dequant_4x4_cqm          16              5
dequant_4x4_dc_cqm       13              5
dequant_4x4_dc_flat      13              5
dequant_4x4_flat         16              5
dequant_8x8_cqm          71              9
dequant_8x8_flat         71              9

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>

65e7bac5

loongarch: Improve the performance of predict series functions · d8ed272a

Yin Shiyou authored 1 year ago and Yin Shiyou committed 1 year ago


Performance has improved from 6.32fps to 6.34fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
intra_predict_4x4_dc     3               2
intra_predict_4x4_dc8    1               1
intra_predict_4x4_dcl    2               1
intra_predict_4x4_dct    2               1
intra_predict_4x4_ddl    7               2
intra_predict_4x4_h      2               1
intra_predict_4x4_v      1               1
intra_predict_8x8_dc     8               2
intra_predict_8x8_dc8    1               1
intra_predict_8x8_dcl    5               2
intra_predict_8x8_dct    5               2
intra_predict_8x8_ddl    27              3
intra_predict_8x8_ddr    26              3
intra_predict_8x8_h      4               2
intra_predict_8x8_v      3               1
intra_predict_8x8_vl     29              3
intra_predict_8x8_vr     31              4
intra_predict_8x8c_dc    8               5
intra_predict_8x8c_dc8   1               1
intra_predict_8x8c_dcl   5               3
intra_predict_8x8c_dct   5               3
intra_predict_8x8c_h     4               2
intra_predict_8x8c_p     58              30
intra_predict_8x8c_v     4               1
intra_predict_16x16_dc   32              8
intra_predict_16x16_dc8  9               4
intra_predict_16x16_dcl  26              6
intra_predict_16x16_dct  26              6
intra_predict_16x16_h    23              7
intra_predict_16x16_p    182             44
intra_predict_16x16_v    22              4

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>

d8ed272a

loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions · 00b8e3b9

Yin Shiyou authored 1 year ago and Yin Shiyou committed 1 year ago


Performance has improved from 4.92fps to 6.32fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
sad_4x4                 13               3
sad_4x8                 26               7
sad_4x16                57               13
sad_8x4                 24               3
sad_8x8                 54               8
sad_8x16                108              13
sad_16x8                95               8
sad_16x16               189              13
sad_x3_4x4              37               6
sad_x3_4x8              71               13
sad_x3_8x4              70               8
sad_x3_8x8              162              14
sad_x3_8x16             323              25
sad_x3_16x8             279              15
sad_x3_16x16            555              27
sad_x4_4x4              49               8
sad_x4_4x8              95               17
sad_x4_8x4              94               8
sad_x4_8x8              214              16
sad_x4_8x16             429              33
sad_x4_16x8             372              18
sad_x4_16x16            740              34

Signed-off-by: wanglu <wanglu@loongson.cn>

00b8e3b9

loongarch: Improve the performance of deblock series functions. · d7d283f6

Yin Shiyou authored 1 year ago and Yin Shiyou committed 1 year ago


Performance has improved from 4.76fps to 4.92fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
deblock_luma[0]         79               39
deblock_luma[1]         91               18
deblock_luma_intra[0]   63               44
deblock_luma_intra[1]   71               18
deblock_strength        104              33

Signed-off-by: Hao Chen <chenhao@loongson.cn>

d7d283f6