Commits · master · Lu Wang / x264

Jan 03, 2023

loongarch: Improve the performance of pixel series functions · 3bfa1c2f

guxiwei authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 11.27 fps to 20.50 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
hadamard_ac_8x8          117             21
hadamard_ac_8x16         236             42
hadamard_ac_16x8         235             31
hadamard_ac_16x16        473             60
intra_sad_x3_4x4         50              21
intra_sad_x3_8x8         183             34
intra_sad_x3_8x8c        181             36
intra_sad_x3_16x16       643             68
intra_satd_x3_4x4        83              61
intra_satd_x3_8x8c       344             81
intra_satd_x3_16x16      1389            136
sa8d_8x8                 97              19
sa8d_16x16               394             68
satd_4x4                 24              8
satd_4x8                 51              11
satd_4x16                103             24
satd_8x4                 52              9
satd_8x8                 108             12
satd_8x16                218             24
satd_16x8                218             19
satd_16x16               437             38
ssd_4x4                  10              5
ssd_4x8                  24              8
ssd_4x16                 42              15
ssd_8x4                  23              5
ssd_8x8                  37              9
ssd_8x16                 74              17
ssd_16x8                 72              11
ssd_16x16                140             23
var2_8x8                 91              37
var2_8x16                176             66
var_8x8                  50              15
var_8x16                 65              29
var_16x16                132             56

Signed-off-by: gxw <guxiwei-hf@loongson.cn>

3bfa1c2f

loongarch: Improve the performance of dct series functions · a064e87b

guxiwei authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 10.53 fps to 11.27 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
add4x4_idct              34              9
add8x8_idct              139             31
add8x8_idct8             269             39
add8x8_idct_dc           67              7
add16x16_idct            564             123
add16x16_idct_dc         260             22
dct4x4dc                 18              10
idct4x4dc                16              9
sub4x4_dct               25              7
sub8x8_dct               101             12
sub8x8_dct8              160             25
sub16x16_dct             403             52
sub16x16_dct8            646             68
zigzag_scan_4x4_frame    4               1

Signed-off-by: gxw <guxiwei-hf@loongson.cn>

a064e87b

loongarch: Improve the performance of mc series functions · 46e520a3

Hecai Yuan authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 6.78 fps to 10.53 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
avg_4x2                  16              5
avg_4x4                  30              6
avg_4x8                  63              10
avg_4x16                 124             19
avg_8x4                  60              6
avg_8x8                  119             10
avg_8x16                 233             19
avg_16x8                 229             21
avg_16x16                451             41
get_ref_4x4              30              9
get_ref_4x8              52              11
get_ref_8x4              45              9
get_ref_8x8              80              11
get_ref_8x16             156             16
get_ref_12x10            137             13
get_ref_16x8             147             11
get_ref_16x16            282             16
get_ref_20x18            278             22
hpel_filter              5163            686
lowres_init              5440            286
mc_chroma_2x2            24              7
mc_chroma_2x4            42              10
mc_chroma_4x2            41              7
mc_chroma_4x4            75              10
mc_chroma_4x8            144             19
mc_chroma_8x4            137             15
mc_chroma_8x8            269             28
mc_luma_4x4              30              10
mc_luma_4x8              52              12
mc_luma_8x4              44              10
mc_luma_8x8              80              13
mc_luma_8x16             156             19
mc_luma_16x8             147             13
mc_luma_16x16            281             19
memcpy_aligned           14              9
memzero_aligned          24              4
offsetadd_w4             79              18
offsetadd_w8             142             18
offsetadd_w16            277             25
offsetadd_w20            1118            38
offsetsub_w4             75              18
offsetsub_w8             140             18
offsetsub_w16            265             25
offsetsub_w20            989             39
weight_w4                111             19
weight_w8                205             19
weight_w16               396             29
weight_w20               1143            45
deinterleave_chroma_fdec 76              9
deinterleave_chroma_fenc 86              9
plane_copy_deinterleave  733             90
plane_copy_interleave    791             245
store_interleave_chroma  82              12

Signed-off-by: yuanhecai <yuanhecai@loongson.cn>

46e520a3

loongarch: Improve the performance of quant series functions · cfea0ec1

Hecai Yuan authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 6.34 fps to 6.78 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
coeff_last15             3               2
coeff_last16             3               1
coeff_last64             42              6
decimate_score15         8               12
decimate_score16         8               11
decimate_score64         61              43
dequant_4x4_cqm          16              5
dequant_4x4_dc_cqm       13              5
dequant_4x4_dc_flat      13              5
dequant_4x4_flat         16              5
dequant_8x8_cqm          71              9
dequant_8x8_flat         71              9

Signed-off-by: yuanhecai <yuanhecai@loongson.cn>

cfea0ec1

loongarch: Improve the performance of predict series functions · 72a635d6

Lu Wang authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 6.32 fps to 6.34 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
intra_predict_4x4_dc     3               2
intra_predict_4x4_dc8    1               1
intra_predict_4x4_dcl    2               1
intra_predict_4x4_dct    2               1
intra_predict_4x4_ddl    7               2
intra_predict_4x4_h      2               1
intra_predict_4x4_v      1               1
intra_predict_8x8_dc     8               2
intra_predict_8x8_dc8    1               1
intra_predict_8x8_dcl    5               2
intra_predict_8x8_dct    5               2
intra_predict_8x8_ddl    27              3
intra_predict_8x8_ddr    26              3
intra_predict_8x8_h      4               2
intra_predict_8x8_v      3               1
intra_predict_8x8_vl     29              3
intra_predict_8x8_vr     31              4
intra_predict_8x8c_dc    8               5
intra_predict_8x8c_dc8   1               1
intra_predict_8x8c_dcl   5               3
intra_predict_8x8c_dct   5               3
intra_predict_8x8c_h     4               2
intra_predict_8x8c_p     58              30
intra_predict_8x8c_v     4               1
intra_predict_16x16_dc   32              8
intra_predict_16x16_dc8  9               4
intra_predict_16x16_dcl  26              6
intra_predict_16x16_dct  26              6
intra_predict_16x16_h    23              7
intra_predict_16x16_p    182             44
intra_predict_16x16_v    22              4

Signed-off-by: wanglu <wanglu@loongson.cn>

72a635d6

loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions · 8018af72

Lu Wang authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 4.92 fps to 6.32 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
sad_4x4                 13               3
sad_4x8                 26               7
sad_4x16                57               13
sad_8x4                 24               3
sad_8x8                 54               8
sad_8x16                108              13
sad_16x8                95               8
sad_16x16               189              13
sad_x3_4x4              37               6
sad_x3_4x8              71               13
sad_x3_8x4              70               8
sad_x3_8x8              162              14
sad_x3_8x16             323              25
sad_x3_16x8             279              15
sad_x3_16x16            555              27
sad_x4_4x4              49               8
sad_x4_4x8              95               17
sad_x4_8x4              94               8
sad_x4_8x8              214              16
sad_x4_8x16             429              33
sad_x4_16x8             372              18
sad_x4_16x16            740              34

Signed-off-by: wanglu <wanglu@loongson.cn>

8018af72

loongarch: Improve the performance of deblock series functions · b5110b16

guxiwei authored 2 years ago and Lu Wang committed 2 years ago

Performance improved from 4.76 fps to 4.92 fps, measured with the
following commands:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
deblock_luma[0]         79               39
deblock_luma[1]         91               18
deblock_luma_intra[0]   63               44
deblock_luma_intra[1]   71               18
deblock_strength        104              33

Signed-off-by: gxw <guxiwei-hf@loongson.cn>

b5110b16
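
The fps figures quoted in the loongarch commit messages above chain together: each commit starts from the figure the previous one ended on, going from 4.76 fps before the deblock commit to 20.50 fps after the pixel commit. The sketch below only does that arithmetic. The array and function names are illustrative and not part of x264; the values are copied from the commit messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* fps before the series and after each commit, oldest first, as quoted in
 * the commit messages: deblock, sad, predict, quant, mc, dct, pixel. */
const double fps_steps[] = {
    4.76, 4.92, 6.32, 6.34, 6.78, 10.53, 11.27, 20.50
};
#define FPS_STEP_COUNT (sizeof(fps_steps) / sizeof(fps_steps[0]))

/* Speedup factor between two measurements (after / before). */
double speedup(double before, double after)
{
    assert(before > 0.0);
    return after / before;
}

/* Print the per-commit and cumulative speedup for the whole series. */
void print_speedups(void)
{
    for (size_t i = 1; i < FPS_STEP_COUNT; i++)
        printf("step %zu: %.2f -> %.2f fps (x%.2f, cumulative x%.2f)\n",
               i, fps_steps[i - 1], fps_steps[i],
               speedup(fps_steps[i - 1], fps_steps[i]),
               speedup(fps_steps[0], fps_steps[i]));
}
```

Calling `print_speedups()` lists each step; the overall ratio 20.50 / 4.76 comes to roughly 4.3x for this particular command line and input.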

loongarch: Add asm.S file · 596eaa09
guxiwei authored 2 years ago and Lu Wang committed 2 years ago
```
Signed-off-by: gxw <guxiwei-hf@loongson.cn>
```
596eaa09
loongarch: Add checkasm support · 3dfcf99b
guxiwei authored 3 years ago and Lu Wang committed 2 years ago
```
Signed-off-by: gxw <guxiwei-hf@loongson.cn>
```
3dfcf99b
loongarch: Initial LSX/LASX support · 78c79319
guxiwei authored 3 years ago and Lu Wang committed 2 years ago
```
LSX/LASX is the LOONGARCH 128-bit/256-bit SIMD Architecture.

Signed-off-by: gxw <guxiwei-hf@loongson.cn>
```
78c79319

Dec 17, 2022
- Add Risc-V 64 bit · 941cae6d
  Roger Hardiman authored 2 years ago
  
  941cae6d
Oct 28, 2022

aarch64: pixel: add 10bits sad functions · 416e3eb2

Hubert Mazur authored 2 years ago


Provide routines for sad functions for high bit depth, i.e. 10 bits.
Benchmarks run on AWS Graviton 2 instances.

sad_4x4_c: 583
sad_4x4_neon: 273
sad_4x8_c: 1179
sad_4x8_neon: 366
sad_4x16_c: 2121
sad_4x16_neon: 550
sad_8x4_c: 924
sad_8x4_neon: 213
sad_8x8_c: 1711
sad_8x8_neon: 316
sad_8x16_c: 3505
sad_8x16_neon: 497
sad_16x8_c: 3070
sad_16x8_neon: 635
sad_16x16_c: 6113
sad_16x16_neon: 1118

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>

416e3eb2
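
For orientation, a sum of absolute differences over high-bit-depth samples can be sketched in plain C as below. This is a minimal reference under stated assumptions, not the C or NEON code from the commit: the function name `sad_ref_u16`, the use of `uint16_t` samples, and strides counted in samples rather than bytes are all choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference SAD for high-bit-depth pixels stored as uint16_t.
 * Strides are counted in samples, not bytes. */
int sad_ref_u16(const uint16_t *a, ptrdiff_t stride_a,
                const uint16_t *b, ptrdiff_t stride_b,
                int width, int height)
{
    assert(width > 0 && height > 0);
    int sum = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            sum += abs((int)a[x] - (int)b[x]);  /* widen before subtracting */
        a += stride_a;
        b += stride_b;
    }
    return sum;
}
```

With 10-bit samples the largest per-sample difference is 1023, so even a 16x16 block stays well within the range of `int`.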

Oct 05, 2022
- ffms: Fix crash if stream properties changes · b093bbe7
  Anton Mitrofanov authored 2 years ago
  
  b093bbe7
Oct 01, 2022
- cli: Use space instead of newline as autocomplete delimiter · ed0f7a63
  Henrik Gramner authored 2 years ago
```
On most systems any whitespace is fine, but MSYS2 wants ASCII 0x20.
```
  ed0f7a63
Sep 19, 2022

Makefile: Add missing dependency of '.depend' on 'oclobj.h' · e067ab0b

Sergei Trofimovich authored 2 years ago

Without the change parallel build occasionally fails as:

    $ make --shuffle
    ...
    gcc ... -c common/opencl.c -o common/opencl-8.o ...
    common/opencl.c:116:10: fatal error: common/oclobj.h: No such file or directory
      116 | #include "common/oclobj.h"
          |          ^~~~~~~~~~~~~~~~~

Most easily reproduced with `make --shuffle` mode:
   https://savannah.gnu.org/bugs/index.php?62100

This happens because `common/oclobj.h` is an autogenerated file.
Normally `.depend` would contain this autogenerated dependency.
But nothing forces `common/oclobj.h` to be generated.

The change moves dependency of $(GENERATED) from final binaries
to `.depend` itself:

    .depend: $(GENERATED)

e067ab0b

Sep 05, 2022
- Fix memory overread in mbtree · 7628a569
  Anton Mitrofanov authored 2 years ago and Anton Mitrofanov committed 2 years ago
  
  7628a569
Sep 01, 2022
- CI: Fix vlc-contrib linking on macOS · 8bdd8b89
  Anton Mitrofanov authored 2 years ago
```
Use pkg-config from the custom PATH.
```
  8bdd8b89
Aug 31, 2022
- CI: Migrate build runners to macOS Monterey · f7074e12
  Anton Mitrofanov authored 2 years ago
  
  f7074e12
Jun 01, 2022
- CI: Fix vlc-contrib processing on macos · baee400f
  Anton Mitrofanov authored 2 years ago
```
Use perl for in-place editing because sed doesn't work with symlinks.
```
  baee400f
Feb 22, 2022
- configure: Allow AviSynth+ on *BSD and Haiku · bfc87b7a
  Stephen Hutchinson authored 3 years ago and Anton Mitrofanov committed 3 years ago
  
  bfc87b7a
- Fix build on MIPS with AviSynth+ support · 95634be6
  Anton Mitrofanov authored 3 years ago
  
  95634be6
Feb 21, 2022

Replace AvxSynth with AviSynth+ on POSIX systems · 35fe20d1
Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago

35fe20d1
Check user-entered fps parameter range · f53fbffd
Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago

f53fbffd
Fix -Wchar-subscripts and -Wstrict-aliasing warnings · ff8a127e
Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago

ff8a127e

x86inc: Add REPX macro to repeat instructions/operations · 6d10612a

Henrik Gramner authored 3 years ago

When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw       m0, m4
    paddw       m1, m4
    paddw       m2, m4
    paddw       m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5

6d10612a

x86inc: Fix edge case in forced VEX-encoding · f52e5e11

Henrik Gramner authored 3 years ago

Correctly handle emulation of 4-operand instructions (e.g. 'shufps')
where src1 is a memory operand.

f52e5e11

Feb 19, 2022

x86inc: Enable 4-operand emulation for variable blend instructions · 3e2a0d4c

Henrik Gramner authored 3 years ago

With legacy encoding the last operand (the index) must be xmm0,
but aside from that emulating non-destructive forms works
the same as any other instruction.

3e2a0d4c

Feb 05, 2022
- Fix build on OpenBSD and Android · 5585eafe
  Anton Mitrofanov authored 3 years ago
  
  5585eafe
Jan 26, 2022
- Fix implicit integer sign change and truncation (part 2) · 0bb85e8b
  Anton Mitrofanov authored 3 years ago
  
  0bb85e8b
- checkasm: Print all errors to stderr · 4127923a
  Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago
  
  4127923a
- Fix integer overflow with special CQM and 10-bit encoding · a6cbc988
  Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago
  
  a6cbc988
Jan 24, 2022
- Bump dates to 2022 · ab393c85
  Anton Mitrofanov authored 3 years ago
  
  ab393c85
Dec 30, 2021

configure: Always make shared imply PIC · 19856cc4

Jessica Clarke authored 3 years ago and Anton Mitrofanov committed 3 years ago

Building a shared library without -fPIC does not make sense. On most
architectures, especially recent ones, doing so will give link-time
errors due to relocations in read-only sections like .text. On some
legacy architectures, including i386, it is allowed by default, but will
warn, and is highly discouraged due to the overheads it adds at library
load time. Most architectures were already listed here as having shared
imply PIC, but not all, such as i386 which ends up with unwanted text
relocations, as well as architectures not known to the build system
currently like RISC-V, which does not permit text relocations by
default. There is no good reason to want shared without PIC on any
architecture, so just remove the architecture list.

19856cc4

Dec 12, 2021

Remove thread priority tweaking · 8a43cc14

Henrik Gramner authored 3 years ago

Back in 2009, when this was added, it improved scheduling of lookahead
threads on the operating systems prevalent at the time.

According to more recent testing by Intel however, lowering thread
priorities does not improve performance on modern operating systems.
And more importantly, doing so on systems with heterogeneous CPU
topologies may actually result in a severe performance reduction.

Removing this code altogether eliminates the issue with performance
degradation on such systems, while having no noticeable impact on
regular systems with homogeneous CPU topologies.

8a43cc14

Dec 07, 2021
- Makefile: Do not create multiple directories in one go · d9a19f0d
  Claes Nästén authored 3 years ago
```
/usr/ucb/bin/install on Solaris does not support creating multiple
directories in one go, so issue multiple install commands instead.
```
  d9a19f0d
- Remove redundant dot in help · ab5c502b
  Anton Mitrofanov authored 3 years ago
  
  ab5c502b
- lavf: Remove use of deprecated av_init_packet · a66ca247
  Anton Mitrofanov authored 3 years ago and Anton Mitrofanov committed 3 years ago
  
  a66ca247
Dec 06, 2021
- CI: Update docker images · fda9b831
  Anton Mitrofanov authored 3 years ago
  
  fda9b831
Sep 29, 2021

lookahead: Keep b_exit_thread under ifbuf.mutex · 66a5bc1b

Jean-Baptiste Kempf authored 3 years ago and Anton Mitrofanov committed 3 years ago

The lookahead_thread main loop checks b_exit_thread and exits if it is set.
That flag is set by x264_lookahead_delete, which uses ifbuf.mutex to guard accessing it.
However, the read in the while-loop condition of lookahead_thread is not guarded,
and so TSAN sometimes reports a data race.

66a5bc1b
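
The pattern described in this commit message, reading the exit flag under the same mutex that guards the write, can be sketched with pthreads as follows. This is an illustration only and not the x264 patch: `worker_t`, `should_exit`, and the other names are hypothetical, and the loop spins where a real lookahead thread would block on condition variables.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the lookahead state: the exit flag is only
 * ever read or written while holding the mutex. */
typedef struct {
    pthread_mutex_t mutex;
    bool exit_thread;
    int iterations;
} worker_t;

void worker_init(worker_t *w)
{
    int err = pthread_mutex_init(&w->mutex, NULL);
    assert(err == 0);
    (void)err;
    w->exit_thread = false;
    w->iterations = 0;
}

/* Guarded read: this is the access that was unprotected in the race. */
bool should_exit(worker_t *w)
{
    pthread_mutex_lock(&w->mutex);
    bool exit_now = w->exit_thread;
    pthread_mutex_unlock(&w->mutex);
    return exit_now;
}

/* Thread entry point: loop until the guarded flag is set. */
void *worker_main(void *arg)
{
    worker_t *w = arg;
    while (!should_exit(w)) {
        pthread_mutex_lock(&w->mutex);
        w->iterations++;            /* placeholder for real work */
        pthread_mutex_unlock(&w->mutex);
    }
    return NULL;
}

/* Guarded write, performed by the thread that requests shutdown. */
void worker_request_exit(worker_t *w)
{
    pthread_mutex_lock(&w->mutex);
    w->exit_thread = true;
    pthread_mutex_unlock(&w->mutex);
}
```

Because both the read and the write happen under the lock, a tool such as TSAN has no unsynchronized access to report on the flag.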

checkasm: Correctly parse the seed argument as unsigned · d2907f67
Martin Storsjö authored 3 years ago and Anton Mitrofanov committed 3 years ago
```
This fixes rerunning checkasm with an earlier printed seed, when
it's outside of the signed range.
```
d2907f67
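
The failure mode described here is the usual one for seeds above `INT_MAX`: a signed parse cannot round-trip them. A minimal sketch of an unsigned parse follows. `parse_seed` is a hypothetical helper written for this note and is not the checkasm code; only the standard `strtoul` call is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal seed into an unsigned int.
 * Returns 0 on success, -1 on malformed or out-of-range input. */
int parse_seed(const char *arg, unsigned *seed)
{
    assert(arg && seed);
    if (arg[0] == '-')              /* strtoul would silently negate */
        return -1;
    char *end;
    errno = 0;
    unsigned long v = strtoul(arg, &end, 10);
    if (end == arg || *end != '\0' || errno == ERANGE || v > UINT_MAX)
        return -1;
    *seed = (unsigned)v;
    return 0;
}
```

A seed such as 3000000000 lies outside the signed 32-bit range but parses cleanly this way, so a previously printed seed can be fed back in unchanged.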