Commits · 5ad5e5d8f119127146c733e50ee87f95339eb2ae · Brad Smith / x264

Nov 20, 2023

Improve deblock-a.S Performance by Using SVE/SVE2 · 5ad5e5d8

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/deblock-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
deblock_chroma[1]_c: 735
deblock_chroma[1]_neon: 427
deblock_chroma[1]_sve: 353

Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
Results:
deblock_chroma[1]_c: 719
deblock_chroma[1]_neon: 442
deblock_chroma[1]_sve: 345

5ad5e5d8

Create Common NEON deblock-a Macros · 37949a99

David Chen authored 1 year ago

Place NEON deblock-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.

37949a99

Improve dct-a.S Performance by Using SVE/SVE2 · 5c382660

David Chen authored 1 year ago

Improve the performance of NEON functions of aarch64/dct-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sub4x4_dct_c: 528
sub4x4_dct_neon: 322
sub4x4_dct_sve: 247

Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
Results:
sub4x4_dct_c: 562
sub4x4_dct_neon: 376
sub4x4_dct_sve: 255

Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
add4x4_idct_c: 698
add4x4_idct_neon: 386
add4x4_idct_sve2: 345

Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
zigzag_interleave_8x8_cavlc_frame_c: 582
zigzag_interleave_8x8_cavlc_frame_neon: 273
zigzag_interleave_8x8_cavlc_frame_sve: 257

Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
Results:
zigzag_interleave_8x8_cavlc_frame_c: 587
zigzag_interleave_8x8_cavlc_frame_neon: 257
zigzag_interleave_8x8_cavlc_frame_sve: 249

5c382660

Nov 18, 2023

Create Common NEON dct-a Macros · b6190c6f

David Chen authored 1 year ago

Place NEON dct-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.

b6190c6f

Nov 14, 2023

ci: Test the aarch64 build in QEMU with varying SVE sizes · c1962404

Martin Storsjö authored 1 year ago

The sve-default-vector-length property sets the maximum vector
length in bytes; the default is 64, i.e. handling up to 512
bit vectors. In order to be able to test 1024 and 2048 bit vectors,
this has to be raised separately from setting the sve<n>=on
property.
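The byte-to-bit relationship described above can be checked from inside a guest. The following is a minimal sketch, assuming a Linux system where the `PR_SVE_GET_VL` prctl is available; the helper names `sve_bytes_to_bits` and `sve_vector_length_bytes` are hypothetical and are not taken from x264 or its CI scripts.

```c
#include <assert.h>
#if defined(__linux__)
#include <sys/prctl.h>
#endif

/* Fallback values for older userland headers; they match linux/prctl.h. */
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff
#endif

/* Convert a vector length in bytes to bits (64 bytes -> 512 bits). */
int sve_bytes_to_bits(int bytes)
{
    assert(bytes >= 0);
    return bytes * 8;
}

/* Return the calling thread's SVE vector length in bytes,
 * or -1 when the kernel or CPU does not report one. */
int sve_vector_length_bytes(void)
{
#if defined(__linux__)
    int ret = prctl(PR_SVE_GET_VL, 0, 0, 0, 0);
    if (ret < 0)
        return -1;
    /* The low 16 bits hold the length; higher bits carry flags. */
    return ret & PR_SVE_VL_LEN_MASK;
#else
    return -1;
#endif
}
```

With the default of 64 bytes, `sve_bytes_to_bits(64)` gives 512; testing 1024 and 2048 bit vectors would correspond to lengths of 128 and 256 bytes.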

c1962404

ci: Update the build-debian-amd64 job to a new base image · 9b3e653b

Martin Storsjö authored 1 year ago

In the new version, there's no longer any "wine64" executable,
but both i386 and x86_64 are handled with the same "wine" frontend.

9b3e653b

checkasm: Print the actual SVE vector length · 611b87b7

Martin Storsjö authored 1 year ago

611b87b7

Nov 02, 2023

aarch64: Consistently use lowercase vector element specifiers · a354f11f

Martin Storsjö authored 1 year ago

a354f11f

aarch64: Make the assembly indentation slightly more consistent · ef572b9f

Martin Storsjö authored 1 year ago

The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.

Some functions have the whole content indented off by one char
compared to other functions; adjust those (but retain the functions
that are self-consistent and match either of the common styles).

ef572b9f

arm: Make the assembly indentation slightly more consistent · 3bc7c362

Martin Storsjö authored 1 year ago

The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.

3bc7c362

aarch64: Use rounded right shifts in dequant · dc755eab

Martin Storsjö authored 2 years ago

Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.
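The equivalence this change relies on can be illustrated with a scalar model. This is only a sketch, not the NEON code from the commit: both function names are hypothetical, and it assumes the compiler implements right shifts of negative integers as arithmetic shifts, which is the common behaviour but is implementation-defined in C. AArch64 NEON offers rounding shift instructions such as `srshr` that perform the rounding in a single step.

```c
#include <assert.h>
#include <stdint.h>

/* Manual rounding: add half of the divisor first, then shift.
 * This models adding the rounding constant as a separate step. */
int32_t shift_with_added_bias(int32_t x, int shift)
{
    assert(shift >= 1 && shift <= 30);
    return (int32_t)(((int64_t)x + ((int64_t)1 << (shift - 1))) >> shift);
}

/* Rounded right shift without a separate bias: the last bit shifted
 * out decides whether the truncated result is incremented. */
int32_t rounded_shift(int32_t x, int shift)
{
    assert(shift >= 1 && shift <= 30);
    return (x >> shift) + ((x >> (shift - 1)) & 1);
}
```

Both functions return the same value for every input in their valid range, which is why the explicit rounding constant can be dropped when a rounding shift is available.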

                     Cortex A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938

10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
on A72/A73.

dc755eab

aarch64: Improve scheduling in sad_x3/sad_x4 · 4664f5aa

Martin Storsjö authored 2 years ago

               Cortex A53    A72    A73
8 bpc:
Before:
sad_x3_4x4_neon:      580    303    204
sad_x3_4x8_neon:     1065    516    323
sad_x3_8x4_neon:      668    262    282
sad_x3_8x8_neon:     1238    454    471
sad_x3_8x16_neon:    2378    842    847
sad_x3_16x8_neon:    2136    738    776
sad_x3_16x16_neon:   4162   1378   1463
After:
sad_x3_4x4_neon:      477    298    206
sad_x3_4x8_neon:      842    515    327
sad_x3_8x4_neon:      603    260    279
sad_x3_8x8_neon:     1110    451    464
sad_x3_8x16_neon:    2125    841    843
sad_x3_16x8_neon:    2124    730    766
sad_x3_16x16_neon:   4145   1370   1434

10 bpc:
Before:
sad_x3_4x4_neon:      632    247    254
sad_x3_4x8_neon:     1162    419    443
sad_x3_8x4_neon:      890    358    416
sad_x3_8x8_neon:     1670    632    759
sad_x3_8x16_neon:    3230   1179   1458
sad_x3_16x8_neon:    3070   1209   1403
sad_x3_16x16_neon:   6030   2333   2699

After:
sad_x3_4x4_neon:      522    253    255
sad_x3_4x8_neon:      932    443    431
sad_x3_8x4_neon:      880    354    406
sad_x3_8x8_neon:     1660    626    736
sad_x3_8x16_neon:    3220   1170   1397
sad_x3_16x8_neon:    3060   1184   1362
sad_x3_16x16_neon:   6020   2272   2579

Thus, this is around a 20-25% speedup on Cortex A53 for the small
sizes (much smaller difference for bigger sizes though), while it
doesn't make much of a difference at all (mostly within measurement
noise) for the out-of-order cores (A72 and A73).

4664f5aa

Oct 24, 2023

Fix VBV with sliced threads · d46938de

Anton Mitrofanov authored 1 year ago

d46938de

Oct 19, 2023

Add cpu flags and runtime detection of SVE and SVE2 · 9c3c7168

Martin Storsjö authored 1 year ago

We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
but these might not be available in all userland headers, while
HWCAP_CPUID is available much earlier.

The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available,
can only be accessed if SVE is available. If not building all the
C code with SVE enabled (which could make it impossible to run
on HW without SVE), binutils refuses to assemble an instruction
reading ID_AA64ZFR0_EL1 - but if referring to it with the technical
name S3_0_C0_C4_4, it can be assembled even without any extra
extensions enabled.
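The detection sequence described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code from this commit: the flag names `DEMO_CPU_SVE` and `DEMO_CPU_SVE2` and all function names are hypothetical, and the register reads are only compiled on AArch64 Linux. The field positions used (bits [35:32] of ID_AA64PFR0_EL1 for SVE, bits [3:0] of ID_AA64ZFR0_EL1 for the SVE version) follow the Arm architecture reference.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1 << 11)   /* fallback for older headers */
#endif
#endif

/* Hypothetical flag names used only in this sketch. */
#define DEMO_CPU_SVE  (1u << 0)
#define DEMO_CPU_SVE2 (1u << 1)

/* ID_AA64PFR0_EL1 bits [35:32] are nonzero when SVE is implemented. */
int pfr0_has_sve(uint64_t pfr0) { return ((pfr0 >> 32) & 0xf) != 0; }

/* ID_AA64ZFR0_EL1 bits [3:0] are nonzero when SVE2 is implemented. */
int zfr0_has_sve2(uint64_t zfr0) { return (zfr0 & 0xf) != 0; }

unsigned detect_sve_flags(void)
{
    unsigned flags = 0;
#if defined(__aarch64__) && defined(__linux__)
    /* HWCAP_CPUID tells us the kernel emulates reads of the ID registers. */
    if (!(getauxval(AT_HWCAP) & HWCAP_CPUID))
        return 0;
    uint64_t pfr0, zfr0;
    __asm__ volatile("mrs %0, ID_AA64PFR0_EL1" : "=r"(pfr0));
    if (pfr0_has_sve(pfr0)) {
        flags |= DEMO_CPU_SVE;
        /* ID_AA64ZFR0_EL1 by its encoding, so that no SVE extension
         * has to be enabled in the assembler; only read when SVE exists. */
        __asm__ volatile("mrs %0, S3_0_C0_C4_4" : "=r"(zfr0));
        if (zfr0_has_sve2(zfr0))
            flags |= DEMO_CPU_SVE2;
    }
#endif
    /* SVE2 is never reported without SVE. */
    assert(!(flags & DEMO_CPU_SVE2) || (flags & DEMO_CPU_SVE));
    return flags;
}
```

On other platforms `detect_sve_flags()` simply returns 0, so the two decoding helpers can still be exercised with hand-built register values.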

9c3c7168

Oct 18, 2023

configure: Check for support for AArch64 SVE and SVE2 · db9bc75b

Martin Storsjö authored 1 year ago

We don't expect the user to build the whole x264 codebase with
SVE/SVE2 enabled, as we only enable this feature for the assembly
files that use it, in order to have binaries that are portable
and enable the SVE codepaths at runtime if supported.
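The portability argument above rests on runtime dispatch: the binary always contains a baseline routine, and the SVE variant is only selected when the CPU reports support. The sketch below shows the general idea with hypothetical names (`DISPATCH_NEON`, `DISPATCH_SVE`, `select_sum`); the NEON and SVE functions are plain C stand-ins for routines that would live in separately assembled files, and none of this is x264's actual dispatch code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU flags for this sketch. */
#define DISPATCH_NEON (1u << 0)
#define DISPATCH_SVE  (1u << 1)

typedef int (*sum_fn)(const unsigned char *src, size_t n);

/* Portable reference implementation, always available. */
int sum_c(const unsigned char *src, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += src[i];
    return s;
}

/* Stand-ins for assembly routines; only the file holding the SVE
 * routine would be built with the SVE extension enabled. */
int sum_neon(const unsigned char *src, size_t n) { return sum_c(src, n); }
int sum_sve(const unsigned char *src, size_t n)  { return sum_c(src, n); }

/* Pick the best implementation the detected flags allow. */
sum_fn select_sum(unsigned cpu_flags)
{
    sum_fn fn = sum_c;
    if (cpu_flags & DISPATCH_NEON)
        fn = sum_neon;
    if (cpu_flags & DISPATCH_SVE)
        fn = sum_sve;
    assert(fn != NULL);
    return fn;
}
```

Because the selection happens at runtime, the same binary runs on hardware without SVE and simply never calls the SVE routine there.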

db9bc75b

Oct 12, 2023

loongarch: Improve the performance of pixel series functions · 5f84d403

Yin Shiyou authored 1 year ago


Performance has improved from 11.27fps to 20.50fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
hadamard_ac_8x8          117             21
hadamard_ac_8x16         236             42
hadamard_ac_16x8         235             31
hadamard_ac_16x16        473             60
intra_sad_x3_4x4         50              21
intra_sad_x3_8x8         183             34
intra_sad_x3_8x8c        181             36
intra_sad_x3_16x16       643             68
intra_satd_x3_4x4        83              61
intra_satd_x3_8x8c       344             81
intra_satd_x3_16x16      1389            136
sa8d_8x8                 97              19
sa8d_16x16               394             68
satd_4x4                 24              8
satd_4x8                 51              11
satd_4x16                103             24
satd_8x4                 52              9
satd_8x8                 108             12
satd_8x16                218             24
satd_16x8                218             19
satd_16x16               437             38
ssd_4x4                  10              5
ssd_4x8                  24              8
ssd_4x16                 42              15
ssd_8x4                  23              5
ssd_8x8                  37              9
ssd_8x16                 74              17
ssd_16x8                 72              11
ssd_16x16                140             23
var2_8x8                 91              37
var2_8x16                176             66
var_8x8                  50              15
var_8x16                 65              29
var_16x16                132             56

Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>

5f84d403

loongarch: Improve the performance of dct series functions · fa7f1fce

Yin Shiyou authored 1 year ago


Performance has improved from 10.53fps to 11.27fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
add4x4_idct              34              9
add8x8_idct              139             31
add8x8_idct8             269             39
add8x8_idct_dc           67              7
add16x16_idct            564             123
add16x16_idct_dc         260             22
dct4x4dc                 18              10
idct4x4dc                16              9
sub4x4_dct               25              7
sub8x8_dct               101             12
sub8x8_dct8              160             25
sub16x16_dct             403             52
sub16x16_dct8            646             68
zigzag_scan_4x4_frame    4               1

Signed-off-by: zhoupeng <zhoupeng@loongson.cn>

fa7f1fce

loongarch: Improve the performance of mc series functions · 981c8f25

Yin Shiyou authored 1 year ago


Performance has improved from 6.78fps to 10.53fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
avg_4x2                  16              5
avg_4x4                  30              6
avg_4x8                  63              10
avg_4x16                 124             19
avg_8x4                  60              6
avg_8x8                  119             10
avg_8x16                 233             19
avg_16x8                 229             21
avg_16x16                451             41
get_ref_4x4              30              9
get_ref_4x8              52              11
get_ref_8x4              45              9
get_ref_8x8              80              11
get_ref_8x16             156             16
get_ref_12x10            137             13
get_ref_16x8             147             11
get_ref_16x16            282             16
get_ref_20x18            278             22
hpel_filter              5163            686
lowres_init              5440            286
mc_chroma_2x2            24              7
mc_chroma_2x4            42              10
mc_chroma_4x2            41              7
mc_chroma_4x4            75              10
mc_chroma_4x8            144             19
mc_chroma_8x4            137             15
mc_chroma_8x8            269             28
mc_luma_4x4              30              10
mc_luma_4x8              52              12
mc_luma_8x4              44              10
mc_luma_8x8              80              13
mc_luma_8x16             156             19
mc_luma_16x8             147             13
mc_luma_16x16            281             19
memcpy_aligned           14              9
memzero_aligned          24              4
offsetadd_w4             79              18
offsetadd_w8             142             18
offsetadd_w16            277             25
offsetadd_w20            1118            38
offsetsub_w4             75              18
offsetsub_w8             140             18
offsetsub_w16            265             25
offsetsub_w20            989             39
weight_w4                111             19
weight_w8                205             19
weight_w16               396             29
weight_w20               1143            45
deinterleave_chroma_fdec 76              9
deinterleave_chroma_fenc 86              9
plane_copy_deinterleave  733             90
plane_copy_interleave    791             245
store_interleave_chroma  82              12

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>

981c8f25

Oct 10, 2023

loongarch: Improve the performance of quant series functions · 65e7bac5

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


Performance has improved from 6.34fps to 6.78fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
coeff_last15             3               2
coeff_last16             3               1
coeff_last64             42              6
decimate_score15         8               12
decimate_score16         8               11
decimate_score64         61              43
dequant_4x4_cqm          16              5
dequant_4x4_dc_cqm       13              5
dequant_4x4_dc_flat      13              5
dequant_4x4_flat         16              5
dequant_8x8_cqm          71              9
dequant_8x8_flat         71              9

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>

65e7bac5

loongarch: Improve the performance of predict series functions · d8ed272a

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


Performance has improved from 6.32fps to 6.34fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
intra_predict_4x4_dc     3               2
intra_predict_4x4_dc8    1               1
intra_predict_4x4_dcl    2               1
intra_predict_4x4_dct    2               1
intra_predict_4x4_ddl    7               2
intra_predict_4x4_h      2               1
intra_predict_4x4_v      1               1
intra_predict_8x8_dc     8               2
intra_predict_8x8_dc8    1               1
intra_predict_8x8_dcl    5               2
intra_predict_8x8_dct    5               2
intra_predict_8x8_ddl    27              3
intra_predict_8x8_ddr    26              3
intra_predict_8x8_h      4               2
intra_predict_8x8_v      3               1
intra_predict_8x8_vl     29              3
intra_predict_8x8_vr     31              4
intra_predict_8x8c_dc    8               5
intra_predict_8x8c_dc8   1               1
intra_predict_8x8c_dcl   5               3
intra_predict_8x8c_dct   5               3
intra_predict_8x8c_h     4               2
intra_predict_8x8c_p     58              30
intra_predict_8x8c_v     4               1
intra_predict_16x16_dc   32              8
intra_predict_16x16_dc8  9               4
intra_predict_16x16_dcl  26              6
intra_predict_16x16_dct  26              6
intra_predict_16x16_h    23              7
intra_predict_16x16_p    182             44
intra_predict_16x16_v    22              4

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>

d8ed272a

loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions · 00b8e3b9

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


Performance has improved from 4.92fps to 6.32fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
sad_4x4                 13               3
sad_4x8                 26               7
sad_4x16                57               13
sad_8x4                 24               3
sad_8x8                 54               8
sad_8x16                108              13
sad_16x8                95               8
sad_16x16               189              13
sad_x3_4x4              37               6
sad_x3_4x8              71               13
sad_x3_8x4              70               8
sad_x3_8x8              162              14
sad_x3_8x16             323              25
sad_x3_16x8             279              15
sad_x3_16x16            555              27
sad_x4_4x4              49               8
sad_x4_4x8              95               17
sad_x4_8x4              94               8
sad_x4_8x8              214              16
sad_x4_8x16             429              33
sad_x4_16x8             372              18
sad_x4_16x16            740              34

Signed-off-by: wanglu <wanglu@loongson.cn>

00b8e3b9

loongarch: Improve the performance of deblock series functions. · d7d283f6

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


Performance has improved from 4.76fps to 4.92fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
deblock_luma[0]         79               39
deblock_luma[1]         91               18
deblock_luma_intra[0]   63               44
deblock_luma_intra[1]   71               18
deblock_strength        104              33

Signed-off-by: Hao Chen <chenhao@loongson.cn>

d7d283f6

loongarch: Add loongson_asm.S and loongson_utils.S · 25ffd616

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


Common macros and functions for loongson optimization.

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>

25ffd616

loongarch: Init LSX/LASX support · 1ecc51ee

Yin Shiyou authored 1 year ago and

Yin Shiyou committed 1 year ago


LSX and LASX are the LoongArch 128-bit and 256-bit SIMD extensions, respectively.

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>

1ecc51ee

Oct 01, 2023

pixel: Add neon ssim_end implementation for 10 bit · 5a9dfdde

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for ssim_end function
for 10 bit depth. The implementation is based on the
previous one for 8 bit depth with a few differences like
IEEE-754 constant values and scheduling. The conversion
to floating point number must be done at the beginning
to prevent range overflows.
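The overflow concern can be made concrete with a small scalar sketch. It assumes, purely for illustration, that the pixel sums cover a 64-pixel window of 10-bit samples; that window size and all names below are assumptions of this example, not details stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define PIXEL_MAX_10BIT 1023
#define PIXEL_MAX_8BIT  255
#define SSIM_PIXELS     64      /* assumed window size for this sketch */

/* Largest possible pixel sum over the assumed window. */
int32_t max_pixel_sum(int pixel_max)
{
    assert(pixel_max > 0);
    return SSIM_PIXELS * pixel_max;
}

/* Returns 1 when squaring the sum would not fit in a signed 32-bit int. */
int square_overflows_int32(int32_t s)
{
    return (int64_t)s * s > INT32_MAX;
}

/* Convert to float first and multiply afterwards, so the product
 * is computed in floating point and cannot wrap around. */
float squared_sum_float(int32_t s)
{
    float fs = (float)s;
    return fs * fs;
}
```

Under these assumptions the 8-bit sum of at most 16320 can still be squared in 32-bit integer arithmetic, while the 10-bit sum of 65472 cannot, which is consistent with converting to floating point before the products are formed.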

Benchmarks are shown below.

ssim_end_c: 715
ssim_end_neon: 380

Signed-off-by: Hubert Mazur <hum@semihalf.com>

5a9dfdde

pixel: Add neon ssim_core implementation for 10 bit · 67ad1cb6

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for ssim_core function
for 10 bit depth. Benchmarks are shown below.

ssim_core_c: 1315
ssim_core_neon: 470

Signed-off-by: Hubert Mazur <hum@semihalf.com>

67ad1cb6

pixel: Add neon hadamard implementations for 10 bit · 0e6165de

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for hadamard_ac functions
for 10 bit depth. Benchmarks are shown below.

hadamard_ac_8x8_c: 2995
hadamard_ac_8x8_neon: 682
hadamard_ac_8x16_c: 5959
hadamard_ac_8x16_neon: 1207
hadamard_ac_16x8_c: 5963
hadamard_ac_16x8_neon: 1212
hadamard_ac_16x16_c: 11851
hadamard_ac_16x16_neon: 2260

Signed-off-by: Hubert Mazur <hum@semihalf.com>

0e6165de

pixel: Add neon sa8d implementations for 10 bit · 8743a46d

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for sa8d 8x8 and 16x16 functions
for 10 bit depth. Benchmarks are shown below.

sa8d_8x8_c: 2914
sa8d_8x8_neon: 608
sa8d_16x16_c: 11469
sa8d_16x16_neon: 2030

Signed-off-by: Hubert Mazur <hum@semihalf.com>

8743a46d

pixel: Add neon satd implementations for 10 bit · 820fb5a7

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for satd 16x8 and 16x16 functions
for 10 bit depth. Benchmarks are shown below.

satd_16x8_c: 4268
satd_16x8_neon: 1493
satd_16x16_c: 8382
satd_16x16_neon: 2908

Signed-off-by: Hubert Mazur <hum@semihalf.com>

820fb5a7

Add neon pixel_var2 implementation for 10 bit · 9927ac9a

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for pixel_var2 function
for 10 bit depth. Benchmarks are shown below.

var2_8x8_c: 1988
var2_8x8_neon: 505
var2_8x16_c: 3800
var2_8x16_neon: 862

Signed-off-by: Hubert Mazur <hum@semihalf.com>

9927ac9a

Add neon pixel_var implementation for 10 bit · 7ae00538

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for pixel_var function
for 10 bit depth. Benchmarks are shown below.

var_8x8_c: 757
var_8x8_neon: 342
var_8x16_c: 1431
var_8x16_neon: 582
var_16x16_c: 2721
var_16x16_neon: 767

Signed-off-by: Hubert Mazur <hum@semihalf.com>

7ae00538

pixel: Add neon ssd_nv12 implementation for 10 bit · a87a9f89

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for ssd_nv12 function
for 10 bit depth. Benchmarks are shown below.

ssd_nv12_c: 181441
ssd_nv12_neon: 29037

Signed-off-by: Hubert Mazur <hum@semihalf.com>

a87a9f89

pixel: Add neon satd implementations for 10 bit · 1b59a1f3

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for satd 8x8 and 8x16 functions
for 10 bit depth. Benchmarks are shown below.

satd_8x8_c: 2143
satd_8x8_neon: 812
satd_8x16_c: 4228
satd_8x16_neon: 1504

Signed-off-by: Hubert Mazur <hum@semihalf.com>

1b59a1f3

pixel: Add neon satd implementations for 10 bit · 1754f6b2

Grzegorz Bernacki authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for satd functions for 10 bit
depth. Benchmarks are shown below.

satd_4x4_c: 858
satd_4x4_neon: 712
satd_4x8_c: 1834
satd_4x8_neon: 812
satd_4x16_c: 3677
satd_4x16_neon: 1149
satd_8x4_c: 1290
satd_8x4_neon: 427

Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Hubert Mazur <hum@semihalf.com>

1754f6b2

pixel: Add neon ssd implementations for 10 bit · 8fd1e5f2

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for ssd functions for 10 bit
depth. Benchmarks are shown below.

ssd_4x4_c: 1466
ssd_4x4_neon: 240
ssd_4x8_c: 1918
ssd_4x8_neon: 482
ssd_4x16_c: 5258
ssd_4x16_neon: 1025
ssd_8x4_c: 1291
ssd_8x4_neon: 235
ssd_8x8_c: 2431
ssd_8x8_neon: 425
ssd_8x16_c: 4635
ssd_8x16_neon: 910
ssd_16x8_c: 4198
ssd_16x8_neon: 897
ssd_16x16_c: 8549
ssd_16x16_neon: 1907

Signed-off-by: Hubert Mazur <hum@semihalf.com>

8fd1e5f2

pixel: Add neon asd8 implementations for 10 bit · 90b3391e

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for asd8 function for 10 bit
depth. Benchmarks are shown below.

asd8_c: 4400
asd8_neon: 857

Signed-off-by: Hubert Mazur <hum@semihalf.com>

90b3391e

pixel: Add neon vsad implementations for 10 bit · 8a90ffa7

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for vsad function for 10 bit
depth. Benchmarks are shown below.

vsad_c: 3599
vsad_neon: 392

Signed-off-by: Hubert Mazur <hum@semihalf.com>

8a90ffa7

pixel: Add neon sad_x3 implementations for 10 bit · 3afe3c82

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementations for sad_x3 functions for 10 bit
depth. Benchmarks are shown below.

sad_x3_4x4_c: 710
sad_x3_4x4_neon: 286
sad_x3_4x8_c: 1422
sad_x3_4x8_neon: 430
sad_x3_8x4_c: 1350
sad_x3_8x4_neon: 269
sad_x3_8x8_c: 2851
sad_x3_8x8_neon: 440
sad_x3_8x16_c: 5597
sad_x3_8x16_neon: 734
sad_x3_16x8_c: 5414
sad_x3_16x8_neon: 722
sad_x3_16x16_c: 10729
sad_x3_16x16_neon: 1288

Signed-off-by: Hubert Mazur <hum@semihalf.com>

3afe3c82

quant: Add implementation for denoise_dct function · 7882a368

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementation for denoise_dct function for high bit
depth. Benchmarks are shown below.

denoise_dct_c: 2149
denoise_dct_neon: 585

Signed-off-by: Hubert Mazur <hum@semihalf.com>

7882a368

quant: Add neon implementations of coeff_level_run · 01e05671

Hubert Mazur authored 2 years ago and

Anton Mitrofanov committed 1 year ago


Provide arm64 neon implementations for coeff_level_run functions for high bit
depth. Benchmarks are shown below.

coeff_level_run4_c: 135
coeff_level_run4_neon: 155
coeff_level_run8_c: 181
coeff_level_run8_neon: 182
coeff_level_run15_c: 296
coeff_level_run15_neon: 275
coeff_level_run16_c: 305
coeff_level_run16_neon: 264

Signed-off-by: Hubert Mazur <hum@semihalf.com>

01e05671