loongarch: support LoongArch LSX and LASX optimization.

This patch optimizes functions for the LoongArch platform. The functions are implemented in pure assembly rather than C intrinsics. Encoding speed improved from 4.76 fps to 20.50 fps, measured with the following commands:

./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
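
As a quick check on the figures quoted above, the ratio of the two frame rates gives the overall speedup. The snippet below is only a sketch of that arithmetic; the helper name `speedup` is illustrative and is not part of x264.

```python
def speedup(baseline_fps: float, optimized_fps: float) -> float:
    """Return how many times faster the optimized build encodes."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return optimized_fps / baseline_fps


# Frame rates quoted in this merge request description.
ratio = speedup(4.76, 20.50)
print(f"{ratio:.2f}x faster")  # prints "4.31x faster"
```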

The following benchmark was taken by running 'checkasm' on a LoongArch platform (lower numbers are faster):

make checkasm
./checkasm8 --bench

Function	C	Assembly
deblock_luma[0]	79	39
deblock_luma[1]	91	18
deblock_luma_intra[0]	63	44
deblock_luma_intra[1]	71	18
deblock_strength	104	33
sad_4x4	13	3
sad_4x8	26	7
sad_4x16	57	13
sad_8x4	24	3
sad_8x8	54	8
sad_8x16	108	13
sad_16x8	95	8
sad_16x16	189	13
sad_x3_4x4	37	6
sad_x3_4x8	71	13
sad_x3_8x4	70	8
sad_x3_8x8	162	14
sad_x3_8x16	323	25
sad_x3_16x8	279	15
sad_x3_16x16	555	27
sad_x4_4x4	49	8
sad_x4_4x8	95	17
sad_x4_8x4	94	8
sad_x4_8x8	214	16
sad_x4_8x16	429	33
sad_x4_16x8	372	18
sad_x4_16x16	740	34
intra_predict_4x4_dc	3	2
intra_predict_4x4_dc8	1	1
intra_predict_4x4_dcl	2	1
intra_predict_4x4_dct	2	1
intra_predict_4x4_ddl	7	2
intra_predict_4x4_h	2	1
intra_predict_4x4_v	1	1
intra_predict_8x8_dc	8	2
intra_predict_8x8_dc8	1	1
intra_predict_8x8_dcl	5	2
intra_predict_8x8_dct	5	2
intra_predict_8x8_ddl	27	3
intra_predict_8x8_ddr	26	3
intra_predict_8x8_h	4	2
intra_predict_8x8_v	3	1
intra_predict_8x8_vl	29	3
intra_predict_8x8_vr	31	4
intra_predict_8x8c_dc	8	5
intra_predict_8x8c_dc8	1	1
intra_predict_8x8c_dcl	5	3
intra_predict_8x8c_dct	5	3
intra_predict_8x8c_h	4	2
intra_predict_8x8c_p	58	30
intra_predict_8x8c_v	4	1
intra_predict_16x16_dc	32	8
intra_predict_16x16_dc8	9	4
intra_predict_16x16_dcl	26	6
intra_predict_16x16_dct	26	6
intra_predict_16x16_h	23	7
intra_predict_16x16_p	182	44
intra_predict_16x16_v	22	4
coeff_last15	3	2
coeff_last16	3	1
coeff_last64	42	6
decimate_score15	8	12
decimate_score16	8	11
decimate_score64	61	43
dequant_4x4_cqm	16	5
dequant_4x4_dc_cqm	13	5
dequant_4x4_dc_flat	13	5
dequant_4x4_flat	16	5
dequant_8x8_cqm	71	9
dequant_8x8_flat	71	9
avg_4x2	16	5
avg_4x4	30	6
avg_4x8	63	10
avg_4x16	124	19
avg_8x4	60	6
avg_8x8	119	10
avg_8x16	233	19
avg_16x8	229	21
avg_16x16	451	41
get_ref_4x4	30	9
get_ref_4x8	52	11
get_ref_8x4	45	9
get_ref_8x8	80	11
get_ref_8x16	156	16
get_ref_12x10	137	13
get_ref_16x8	147	11
get_ref_16x16	282	16
get_ref_20x18	278	22
hpel_filter	5163	686
lowres_init	5440	286
mc_chroma_2x2	24	7
mc_chroma_2x4	42	10
mc_chroma_4x2	41	7
mc_chroma_4x4	75	10
mc_chroma_4x8	144	19
mc_chroma_8x4	137	15
mc_chroma_8x8	269	28
mc_luma_4x4	30	10
mc_luma_4x8	52	12
mc_luma_8x4	44	10
mc_luma_8x8	80	13
mc_luma_8x16	156	19
mc_luma_16x8	147	13
mc_luma_16x16	281	19
memcpy_aligned	14	9
memzero_aligned	24	4
offsetadd_w4	79	18
offsetadd_w8	142	18
offsetadd_w16	277	25
offsetadd_w20	1118	38
offsetsub_w4	75	18
offsetsub_w8	140	18
offsetsub_w16	265	25
offsetsub_w20	989	39
weight_w4	111	19
weight_w8	205	19
weight_w16	396	29
weight_w20	1143	45
deinterleave_chroma_fdec	76	9
deinterleave_chroma_fenc	86	9
plane_copy_deinterleave	733	90
plane_copy_interleave	791	245
store_interleave_chroma	82	12
add4x4_idct	34	9
add8x8_idct	139	31
add8x8_idct8	269	39
add8x8_idct_dc	67	7
add16x16_idct	564	123
add16x16_idct_dc	260	22
dct4x4dc	18	10
idct4x4dc	16	9
sub4x4_dct	25	7
sub8x8_dct	101	12
sub8x8_dct8	160	25
sub16x16_dct	403	52
sub16x16_dct8	646	68
zigzag_scan_4x4_frame	4	1
hadamard_ac_8x8	117	21
hadamard_ac_8x16	236	42
hadamard_ac_16x8	235	31
hadamard_ac_16x16	473	60
intra_sad_x3_4x4	50	21
intra_sad_x3_8x8	183	34
intra_sad_x3_8x8c	181	36
intra_sad_x3_16x16	643	68
intra_satd_x3_4x4	83	61
intra_satd_x3_8x8c	344	81
intra_satd_x3_16x16	1389	136
sa8d_8x8	97	19
sa8d_16x16	394	68
satd_4x4	24	8
satd_4x8	51	11
satd_4x16	103	24
satd_8x4	52	9
satd_8x8	108	12
satd_8x16	218	24
satd_16x8	218	19
satd_16x16	437	38
ssd_4x4	10	5
ssd_4x8	24	8
ssd_4x16	42	15
ssd_8x4	23	5
ssd_8x8	37	9
ssd_8x16	74	17
ssd_16x8	72	11
ssd_16x16	140	23
var2_8x8	91	37
var2_8x16	176	66
var_8x8	50	15
var_8x16	65	29
var_16x16	132	56
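
To read the table more easily, the per-function speedup can be computed as the C count divided by the assembly count, and rows where the assembly count is higher can be flagged. The sketch below does this for a handful of rows copied from the table above; the helper names (`parse_rows`, `speedups`, `regressions`) are illustrative and not part of x264 or checkasm. In this sample, as in the full table, only decimate_score15 and decimate_score16 show a higher assembly count than C.

```python
# A few rows copied from the checkasm table above: name, C, assembly.
SAMPLE = """\
Function c Assembly
sad_16x16 189 13
decimate_score15 8 12
decimate_score16 8 11
hpel_filter 5163 686
lowres_init 5440 286
"""


def parse_rows(text):
    """Parse 'name c asm' lines, skipping the header and malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        rows.append((parts[0], int(parts[1]), int(parts[2])))
    return rows


def speedups(rows):
    """Map each function name to its C / assembly ratio."""
    return {name: c / asm for name, c, asm in rows if asm > 0}


def regressions(rows):
    """Return names whose assembly count is higher than the C count."""
    return [name for name, c, asm in rows if asm > c]


rows = parse_rows(SAMPLE)
for name, ratio in sorted(speedups(rows).items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {ratio:6.2f}x")
print("slower than C:", regressions(rows))
```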

@yinshiyou @BugMaster Hello, this patch adds LoongArch platform support and optimizes a number of functions. We use pure assembly rather than C intrinsics to get better performance. Encoding speed has improved from 4.76 fps to 20.50 fps. Reviews are welcome.

added 14 commits

04458c72...eaa68fad - 3 commits from branch videolan:master
ff8306c6 - loongarch: Init LSX/LASX support
1513135a - loongarch: Add checkasm support
38c9fc03 - loongarch: Add asm.S file
211d6246 - loongarch: Add loongsonutil.S file
d074b1f0 - loongarch: Improve the performance of deblock series functions
ac0281e3 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
8465f013 - loongarch: Improve the performance of predict series functions
8d504279 - loongarch: Improve the performance of quant series functions
5b8cbae7 - loongarch: Improve the performance of mc series functions
35e09bff - loongarch: Improve the performance of dct series functions
2eadbd09 - loongarch: Improve the performance of pixel series functions

Can anyone help review this PR?

added 11 commits

6d536dcb - loongarch: Init LSX/LASX support
dcaf97b0 - loongarch: Add checkasm support
9ab927db - loongarch: Add asm.S file
ec64e83b - loongarch: Add loongsonutil.S file
65e43789 - loongarch: Improve the performance of deblock series functions
b080a945 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
cb5b01bd - loongarch: Improve the performance of predict series functions
33d6558a - loongarch: Improve the performance of quant series functions
36edaf43 - loongarch: Improve the performance of mc series functions
2abfbfc0 - loongarch: Improve the performance of dct series functions
9696d3f2 - loongarch: Improve the performance of pixel series functions

resolved all threads

added 11 commits

061aa0a1 - loongarch: Init LSX/LASX support
0eed024c - loongarch: Add checkasm support
d8fd7dd1 - loongarch: Add asm.S file
a1ec289e - loongarch: Add loongsonutil.S file
c9861da2 - loongarch: Improve the performance of deblock series functions
ce2087ce - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
ebe13a0d - loongarch: Improve the performance of predict series functions
eabd60ed - loongarch: Improve the performance of quant series functions
d0a962f0 - loongarch: Improve the performance of mc series functions
47a3b1f8 - loongarch: Improve the performance of dct series functions
63e0690f - loongarch: Improve the performance of pixel series functions

I found that some LASX optimizations are slower than their LSX counterparts according to the checkasm test. I suggest removing these LASX optimizations: avg_16x16_lasx; decimate_score15_lasx; decimate_score16_lasx; decimate_score64_lasx; intra_predict_8x8c_dc8_lsx; intra_sad_x3_8x8_lasx; intra_sad_x3_8x8c_lasx; intra_sad_x3_16x16_lasx; intra_satd_x3_8x8c_lasx; memcpy_aligned_lasx; sad_4x8_lasx; sad_4x16_lasx; quant_4x4_lasx; quant_4x4_dc_lasx; quant_8x8_lasx; sad_8x8_lasx; sad_8x16_lasx; sad_16x8_lasx; sad_16x16_lasx; sad_aligned_4x8_lasx; sad_aligned_4x16_lasx; sad_aligned_8x8_lasx; sad_aligned_8x16_lasx; sad_aligned_16x8_lasx; sad_aligned_16x16_lasx; sad_x3_4x8_lasx; sad_x3_8x4_lasx; sad_x3_8x8_lasx; sad_x3_8x16_lasx; sad_x4_4x8_lasx; sad_x4_8x16_lasx; ssd_4x4_lasx; ssd_4x8_lasx; ssd_4x16_lasx; ssd_8x4_lasx; var_8x8_lasx; var_8x16_lasx; var_16x16_lasx.
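
The selection rule described in this comment (keep a LASX version only when it beats the LSX one in checkasm) can be sketched as follows. The cycle counts and the names `func_a`, `func_b`, `func_c` are made-up placeholders, not measurements from this MR, and dropping LASX on a tie is an assumption of this sketch rather than something stated in the review.

```python
# Placeholder cycle counts for illustration only. These are NOT
# measurements from this merge request, and the names are hypothetical.
EXAMPLE_CYCLES = {
    "func_a": {"lsx": 8, "lasx": 10},   # LASX slower
    "func_b": {"lsx": 30, "lasx": 19},  # LASX faster
    "func_c": {"lsx": 5, "lasx": 5},    # tie
}

# Earlier entries win ties, so the narrower LSX version is preferred.
PREFERENCE = ("lsx", "lasx")


def pick_variant(cycles):
    """Return the variant with the lowest count, preferring LSX on ties."""
    candidates = [v for v in PREFERENCE if v in cycles]
    if not candidates:
        raise ValueError("no known variant measured")
    return min(candidates, key=lambda v: (cycles[v], PREFERENCE.index(v)))


def lasx_to_drop(table):
    """List functions whose LASX version does not beat the LSX version."""
    return sorted(
        name
        for name, cycles in table.items()
        if "lasx" in cycles and pick_variant(cycles) != "lasx"
    )


print(lasx_to_drop(EXAMPLE_CYCLES))  # prints "['func_a', 'func_c']"
```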

added 6 commits

39951e92 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
f7e14b93 - loongarch: Improve the performance of predict series functions
ba09be07 - loongarch: Improve the performance of quant series functions
4b9719d6 - loongarch: Improve the performance of mc series functions
2413c413 - loongarch: Improve the performance of dct series functions
330b4a2d - loongarch: Improve the performance of pixel series functions

resolved all threads

LGTM.

@gramner Could you please help review this PR?

@jbk Could you help review this PR? Any suggestions would be appreciated.

@BugMaster Hello Anton, I have sent the CLA. Is there anything else I need to do?

Hi, I received it and forwarded it to check that everything is fine. Please wait for now; I am going to do a full code review next week.

Thank you very much! As mentioned in your previous email, we signed the CLA and represent a company, so should I open a new MR and change the author of each patch to one of the people named in the CLA?

This MR will be replaced by another MR: !135 (merged).

closed

Closed by Anton Mitrofanov (Oct 12, 2023 8:21pm UTC)