loongarch: support LoongArch LSX and LASX optimization.
LSX and LASX are the LoongArch 128-bit and 256-bit SIMD instruction set extensions, respectively. All patches have been tested on the Loongson 3A5000 platform. Encoding performance has improved by about 336%.
This benchmark was taken by running 'checkasm' on a LoongArch platform:
- make checkasm
- ./checkasm8 --bench
| Function | C | Assembly |
|---|---|---|
| deblock_luma[0] | 79 | 39 |
| deblock_luma[1] | 91 | 18 |
| deblock_luma_intra[0] | 63 | 44 |
| deblock_luma_intra[1] | 71 | 18 |
| deblock_strength | 104 | 33 |
| sad_4x4 | 13 | 3 |
| sad_4x8 | 26 | 7 |
| sad_4x16 | 57 | 13 |
| sad_8x4 | 24 | 3 |
| sad_8x8 | 54 | 8 |
| sad_8x16 | 108 | 13 |
| sad_16x8 | 95 | 8 |
| sad_16x16 | 189 | 13 |
| sad_x3_4x4 | 37 | 6 |
| sad_x3_4x8 | 71 | 13 |
| sad_x3_8x4 | 70 | 8 |
| sad_x3_8x8 | 162 | 14 |
| sad_x3_8x16 | 323 | 25 |
| sad_x3_16x8 | 279 | 15 |
| sad_x3_16x16 | 555 | 27 |
| sad_x4_4x4 | 49 | 8 |
| sad_x4_4x8 | 95 | 17 |
| sad_x4_8x4 | 94 | 8 |
| sad_x4_8x8 | 214 | 16 |
| sad_x4_8x16 | 429 | 33 |
| sad_x4_16x8 | 372 | 18 |
| sad_x4_16x16 | 740 | 34 |
| intra_predict_4x4_dc | 3 | 2 |
| intra_predict_4x4_dc8 | 1 | 1 |
| intra_predict_4x4_dcl | 2 | 1 |
| intra_predict_4x4_dct | 2 | 1 |
| intra_predict_4x4_ddl | 7 | 2 |
| intra_predict_4x4_h | 2 | 1 |
| intra_predict_4x4_v | 1 | 1 |
| intra_predict_8x8_dc | 8 | 2 |
| intra_predict_8x8_dc8 | 1 | 1 |
| intra_predict_8x8_dcl | 5 | 2 |
| intra_predict_8x8_dct | 5 | 2 |
| intra_predict_8x8_ddl | 27 | 3 |
| intra_predict_8x8_ddr | 26 | 3 |
| intra_predict_8x8_h | 4 | 2 |
| intra_predict_8x8_v | 3 | 1 |
| intra_predict_8x8_vl | 29 | 3 |
| intra_predict_8x8_vr | 31 | 4 |
| intra_predict_8x8c_dc | 8 | 5 |
| intra_predict_8x8c_dc8 | 1 | 1 |
| intra_predict_8x8c_dcl | 5 | 3 |
| intra_predict_8x8c_dct | 5 | 3 |
| intra_predict_8x8c_h | 4 | 2 |
| intra_predict_8x8c_p | 58 | 30 |
| intra_predict_8x8c_v | 4 | 1 |
| intra_predict_16x16_dc | 32 | 8 |
| intra_predict_16x16_dc8 | 9 | 4 |
| intra_predict_16x16_dcl | 26 | 6 |
| intra_predict_16x16_dct | 26 | 6 |
| intra_predict_16x16_h | 23 | 7 |
| intra_predict_16x16_p | 182 | 44 |
| intra_predict_16x16_v | 22 | 4 |
| coeff_last15 | 3 | 2 |
| coeff_last16 | 3 | 1 |
| coeff_last64 | 42 | 6 |
| decimate_score15 | 8 | 12 |
| decimate_score16 | 8 | 11 |
| decimate_score64 | 61 | 43 |
| dequant_4x4_cqm | 16 | 5 |
| dequant_4x4_dc_cqm | 13 | 5 |
| dequant_4x4_dc_flat | 13 | 5 |
| dequant_4x4_flat | 16 | 5 |
| dequant_8x8_cqm | 71 | 9 |
| dequant_8x8_flat | 71 | 9 |
| avg_4x2 | 16 | 5 |
| avg_4x4 | 30 | 6 |
| avg_4x8 | 63 | 10 |
| avg_4x16 | 124 | 19 |
| avg_8x4 | 60 | 6 |
| avg_8x8 | 119 | 10 |
| avg_8x16 | 233 | 19 |
| avg_16x8 | 229 | 21 |
| avg_16x16 | 451 | 41 |
| get_ref_4x4 | 30 | 9 |
| get_ref_4x8 | 52 | 11 |
| get_ref_8x4 | 45 | 9 |
| get_ref_8x8 | 80 | 11 |
| get_ref_8x16 | 156 | 16 |
| get_ref_12x10 | 137 | 13 |
| get_ref_16x8 | 147 | 11 |
| get_ref_16x16 | 282 | 16 |
| get_ref_20x18 | 278 | 22 |
| hpel_filter | 5163 | 686 |
| lowres_init | 5440 | 286 |
| mc_chroma_2x2 | 24 | 7 |
| mc_chroma_2x4 | 42 | 10 |
| mc_chroma_4x2 | 41 | 7 |
| mc_chroma_4x4 | 75 | 10 |
| mc_chroma_4x8 | 144 | 19 |
| mc_chroma_8x4 | 137 | 15 |
| mc_chroma_8x8 | 269 | 28 |
| mc_luma_4x4 | 30 | 10 |
| mc_luma_4x8 | 52 | 12 |
| mc_luma_8x4 | 44 | 10 |
| mc_luma_8x8 | 80 | 13 |
| mc_luma_8x16 | 156 | 19 |
| mc_luma_16x8 | 147 | 13 |
| mc_luma_16x16 | 281 | 19 |
| memcpy_aligned | 14 | 9 |
| memzero_aligned | 24 | 4 |
| offsetadd_w4 | 79 | 18 |
| offsetadd_w8 | 142 | 18 |
| offsetadd_w16 | 277 | 25 |
| offsetadd_w20 | 1118 | 38 |
| offsetsub_w4 | 75 | 18 |
| offsetsub_w8 | 140 | 18 |
| offsetsub_w16 | 265 | 25 |
| offsetsub_w20 | 989 | 39 |
| weight_w4 | 111 | 19 |
| weight_w8 | 205 | 19 |
| weight_w16 | 396 | 29 |
| weight_w20 | 1143 | 45 |
| deinterleave_chroma_fdec | 76 | 9 |
| deinterleave_chroma_fenc | 86 | 9 |
| plane_copy_deinterleave | 733 | 90 |
| plane_copy_interleave | 791 | 245 |
| store_interleave_chroma | 82 | 12 |
| add4x4_idct | 34 | 9 |
| add8x8_idct | 139 | 31 |
| add8x8_idct8 | 269 | 39 |
| add8x8_idct_dc | 67 | 7 |
| add16x16_idct | 564 | 123 |
| add16x16_idct_dc | 260 | 22 |
| dct4x4dc | 18 | 10 |
| idct4x4dc | 16 | 9 |
| sub4x4_dct | 25 | 7 |
| sub8x8_dct | 101 | 12 |
| sub8x8_dct8 | 160 | 25 |
| sub16x16_dct | 403 | 52 |
| sub16x16_dct8 | 646 | 68 |
| zigzag_scan_4x4_frame | 4 | 1 |
| hadamard_ac_8x8 | 117 | 21 |
| hadamard_ac_8x16 | 236 | 42 |
| hadamard_ac_16x8 | 235 | 31 |
| hadamard_ac_16x16 | 473 | 60 |
| intra_sad_x3_4x4 | 50 | 21 |
| intra_sad_x3_8x8 | 183 | 34 |
| intra_sad_x3_8x8c | 181 | 36 |
| intra_sad_x3_16x16 | 643 | 68 |
| intra_satd_x3_4x4 | 83 | 61 |
| intra_satd_x3_8x8c | 344 | 81 |
| intra_satd_x3_16x16 | 1389 | 136 |
| sa8d_8x8 | 97 | 19 |
| sa8d_16x16 | 394 | 68 |
| satd_4x4 | 24 | 8 |
| satd_4x8 | 51 | 11 |
| satd_4x16 | 103 | 24 |
| satd_8x4 | 52 | 9 |
| satd_8x8 | 108 | 12 |
| satd_8x16 | 218 | 24 |
| satd_16x8 | 218 | 19 |
| satd_16x16 | 437 | 38 |
| ssd_4x4 | 10 | 5 |
| ssd_4x8 | 24 | 8 |
| ssd_4x16 | 42 | 15 |
| ssd_8x4 | 23 | 5 |
| ssd_8x8 | 37 | 9 |
| ssd_8x16 | 74 | 17 |
| ssd_16x8 | 72 | 11 |
| ssd_16x16 | 140 | 23 |
| var2_8x8 | 91 | 37 |
| var2_8x16 | 176 | 66 |
| var_8x8 | 50 | 15 |
| var_8x16 | 65 | 29 |
| var_16x16 | 132 | 56 |

@yinshiyou @BugMaster Hello, this patch adds loongarch platform support and optimizes some functions. We use pure assembly rather than C intrinsics to get better performance. Performance has improved from 4.76fps to 20.50fps. Reviews are welcome.
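As a reading aid for the checkasm numbers in this description, here is a minimal sketch of turning a pair of timings into a speedup factor. The four rows are copied from the benchmark above; the helper name `speedup` and the choice of rows are illustrative assumptions, not part of the patch, and the timings are treated as opaque checkasm units where lower is better.

```python
# A few (C, assembly) timings copied from the checkasm benchmark in this MR.
# Units are whatever checkasm reports; only the ratio matters here.
timings = {
    "sad_16x16": (189, 13),
    "hpel_filter": (5163, 686),
    "lowres_init": (5440, 286),
    "decimate_score15": (8, 12),
}


def speedup(c_time, asm_time):
    """Return how many times faster the assembly version is than the C one.

    A value below 1.0 means the assembly version measured slower than C.
    (Hypothetical helper, written for this example only.)
    """
    return c_time / asm_time


for name, (c_time, asm_time) in timings.items():
    ratio = speedup(c_time, asm_time)
    note = "" if ratio >= 1.0 else "  (slower than C)"
    print(f"{name:20s} {ratio:6.2f}x{note}")

# End-to-end encode throughput quoted in the description (frames per second).
fps_before, fps_after = 4.76, 20.50
fps_ratio = fps_after / fps_before
print(f"encode throughput: {fps_ratio:.2f}x")
```

Rows whose ratio falls below 1.0, such as `decimate_score15` here, are the ones where the assembly version measured slower than the C version.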
added 14 commits
- 04458c72...eaa68fad - 3 commits from branch videolan:master
- ff8306c6 - loongarch: Init LSX/LASX support
- 1513135a - loongarch: Add checkasm support
- 38c9fc03 - loongarch: Add asm.S file
- 211d6246 - loongarch: Add loongsonutil.S file
- d074b1f0 - loongarch: Improve the performance of deblock series functions
- ac0281e3 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
- 8465f013 - loongarch: Improve the performance of predict series functions
- 8d504279 - loongarch: Improve the performance of quant series functions
- 5b8cbae7 - loongarch: Improve the performance of mc series functions
- 35e09bff - loongarch: Improve the performance of dct series functions
- 2eadbd09 - loongarch: Improve the performance of pixel series functions
- Resolved by Lu Wang
- Resolved by Lu Wang
added 11 commits
- 6d536dcb - loongarch: Init LSX/LASX support
- dcaf97b0 - loongarch: Add checkasm support
- 9ab927db - loongarch: Add asm.S file
- ec64e83b - loongarch: Add loongsonutil.S file
- 65e43789 - loongarch: Improve the performance of deblock series functions
- b080a945 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
- cb5b01bd - loongarch: Improve the performance of predict series functions
- 33d6558a - loongarch: Improve the performance of quant series functions
- 36edaf43 - loongarch: Improve the performance of mc series functions
- 2abfbfc0 - loongarch: Improve the performance of dct series functions
- 9696d3f2 - loongarch: Improve the performance of pixel series functions
added 11 commits
- 061aa0a1 - loongarch: Init LSX/LASX support
- 0eed024c - loongarch: Add checkasm support
- d8fd7dd1 - loongarch: Add asm.S file
- a1ec289e - loongarch: Add loongsonutil.S file
- c9861da2 - loongarch: Improve the performance of deblock series functions
- ce2087ce - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
- ebe13a0d - loongarch: Improve the performance of predict series functions
- eabd60ed - loongarch: Improve the performance of quant series functions
- d0a962f0 - loongarch: Improve the performance of mc series functions
- 47a3b1f8 - loongarch: Improve the performance of dct series functions
- 63e0690f - loongarch: Improve the performance of pixel series functions
- Resolved by Lu Wang
I found that some LASX optimizations are slower than their LSX counterparts according to the checkasm test. I suggest removing these LASX optimizations: avg_16x16_lasx; decimate_score15_lasx; decimate_score16_lasx; decimate_score64_lasx; intra_predict_8x8c_dc8_lsx; intra_sad_x3_8x8_lasx; intra_sad_x3_8x8c_lasx; intra_sad_x3_16x16_lasx; intra_satd_x3_8x8c_lasx; memcpy_aligned_lasx; sad_4x8_lasx; sad_4x16_lasx; quant_4x4_lasx; quant_4x4_dc_lasx; quant_8x8_lasx; sad_8x8_lasx; sad_8x16_lasx; sad_16x8_lasx; sad_16x16_lasx; sad_aligned_4x8_lasx; sad_aligned_4x16_lasx; sad_aligned_8x8_lasx; sad_aligned_8x16_lasx; sad_aligned_16x8_lasx; sad_aligned_16x16_lasx; sad_x3_4x8_lasx; sad_x3_8x4_lasx; sad_x3_8x8_lasx; sad_x3_8x16_lasx; sad_x4_4x8_lasx; sad_x4_8x16_lasx; ssd_4x4_lasx; ssd_4x8_lasx; ssd_4x16_lasx; ssd_8x4_lasx; var_8x8_lasx; var_8x16_lasx; var_16x16_lasx.
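The comparison behind a list like this can be sketched as a filter over per-function timings. Everything in the example is an assumption made for illustration: the function names `func_a` to `func_c`, the numbers, and the helper `lasx_slower` are placeholders, not measurements or code from this MR.

```python
# HYPOTHETICAL (LSX, LASX) timings in checkasm units, lower is better.
# These are placeholder values, not results from this merge request.
sample = {
    "func_a": (13, 15),    # LASX slower than LSX
    "func_b": (900, 686),  # LASX faster than LSX
    "func_c": (15, 15),    # tie
}


def lasx_slower(timings):
    """Return the sorted names whose LASX timing is strictly worse than LSX."""
    return sorted(
        name for name, (lsx, lasx) in timings.items() if lasx > lsx
    )


print(lasx_slower(sample))
```

Ties are left out on purpose in this sketch, since the comment above only mentions variants that are slower; whether a tied LASX variant is worth keeping is a separate judgment call.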
added 6 commits
- 39951e92 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
- f7e14b93 - loongarch: Improve the performance of predict series functions
- ba09be07 - loongarch: Improve the performance of quant series functions
- 4b9719d6 - loongarch: Improve the performance of mc series functions
- 2413c413 - loongarch: Improve the performance of dct series functions
- 330b4a2d - loongarch: Improve the performance of pixel series functions
@gramner Could you please help review this PR?
@jbk Could you help review this PR? Any suggestions would be appreciated.
@BugMaster Hello Anton, I have sent out the CLA. Is there anything else I need to do?
Thank you very much! As mentioned in your previous email, we signed the CLA and represent a company, so should I open a new MR and change the author of each patch to a person named in the CLA?
Edited by Yin Shiyou
This MR will be replaced by another MR. !135 (merged)