
loongarch: support LoongArch LSX and LASX optimization.

Closed Lu Wang requested to merge wangluls/x264:LOONGARCH-V4 into master
1 unresolved thread

LSX and LASX are the LoongArch 128-bit and 256-bit SIMD instruction set extensions. All patches have been tested on the Loongson 3A5000 platform. Encoding performance has been sped up by about 336%.

Merge request reports

Pipeline #319118 passed

Pipeline passed for 330b4a2d on wangluls:LOONGARCH-V4

Closed by Anton Mitrofanov (Oct 12, 2023 8:21pm UTC)

Activity

  • This patch optimizes functions on the LoongArch platform. The functions are implemented in pure assembly rather than C intrinsics. Performance improved from 4.76 fps to 20.50 fps, measured with the following commands:

    1. ./configure && make -j5
    2. ./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
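
The two frame rates quoted above can be turned into a speedup factor and a percentage gain. The following is only a small worked sketch: the variable and function names are illustrative and not part of x264, and the only inputs are the two fps figures from the comment.

```python
# Frame rates quoted in the comment above (frames per second).
baseline_fps = 4.76    # before the patch
optimized_fps = 20.50  # with the LSX/LASX assembly


def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is than `before`."""
    return after / before


def percent_gain(before: float, after: float) -> float:
    """Return the relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


ratio = speedup(baseline_fps, optimized_fps)
gain = percent_gain(baseline_fps, optimized_fps)
print(f"speedup: {ratio:.2f}x, gain: {gain:.1f}%")  # speedup: 4.31x, gain: 330.7%
```

These two figures give a gain of roughly 331%, which is close to the "about 336%" quoted in the merge request description.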
  • This benchmark was taken by running 'checkasm' on the LoongArch platform:

    1. make checkasm
    2. ./checkasm8 --bench
    Function C Assembly
    deblock_luma[0] 79 39
    deblock_luma[1] 91 18
    deblock_luma_intra[0] 63 44
    deblock_luma_intra[1] 71 18
    deblock_strength 104 33
    sad_4x4 13 3
    sad_4x8 26 7
    sad_4x16 57 13
    sad_8x4 24 3
    sad_8x8 54 8
    sad_8x16 108 13
    sad_16x8 95 8
    sad_16x16 189 13
    sad_x3_4x4 37 6
    sad_x3_4x8 71 13
    sad_x3_8x4 70 8
    sad_x3_8x8 162 14
    sad_x3_8x16 323 25
    sad_x3_16x8 279 15
    sad_x3_16x16 555 27
    sad_x4_4x4 49 8
    sad_x4_4x8 95 17
    sad_x4_8x4 94 8
    sad_x4_8x8 214 16
    sad_x4_8x16 429 33
    sad_x4_16x8 372 18
    sad_x4_16x16 740 34
    intra_predict_4x4_dc 3 2
    intra_predict_4x4_dc8 1 1
    intra_predict_4x4_dcl 2 1
    intra_predict_4x4_dct 2 1
    intra_predict_4x4_ddl 7 2
    intra_predict_4x4_h 2 1
    intra_predict_4x4_v 1 1
    intra_predict_8x8_dc 8 2
    intra_predict_8x8_dc8 1 1
    intra_predict_8x8_dcl 5 2
    intra_predict_8x8_dct 5 2
    intra_predict_8x8_ddl 27 3
    intra_predict_8x8_ddr 26 3
    intra_predict_8x8_h 4 2
    intra_predict_8x8_v 3 1
    intra_predict_8x8_vl 29 3
    intra_predict_8x8_vr 31 4
    intra_predict_8x8c_dc 8 5
    intra_predict_8x8c_dc8 1 1
    intra_predict_8x8c_dcl 5 3
    intra_predict_8x8c_dct 5 3
    intra_predict_8x8c_h 4 2
    intra_predict_8x8c_p 58 30
    intra_predict_8x8c_v 4 1
    intra_predict_16x16_dc 32 8
    intra_predict_16x16_dc8 9 4
    intra_predict_16x16_dcl 26 6
    intra_predict_16x16_dct 26 6
    intra_predict_16x16_h 23 7
    intra_predict_16x16_p 182 44
    intra_predict_16x16_v 22 4
    coeff_last15 3 2
    coeff_last16 3 1
    coeff_last64 42 6
    decimate_score15 8 12
    decimate_score16 8 11
    decimate_score64 61 43
    dequant_4x4_cqm 16 5
    dequant_4x4_dc_cqm 13 5
    dequant_4x4_dc_flat 13 5
    dequant_4x4_flat 16 5
    dequant_8x8_cqm 71 9
    dequant_8x8_flat 71 9
    avg_4x2 16 5
    avg_4x4 30 6
    avg_4x8 63 10
    avg_4x16 124 19
    avg_8x4 60 6
    avg_8x8 119 10
    avg_8x16 233 19
    avg_16x8 229 21
    avg_16x16 451 41
    get_ref_4x4 30 9
    get_ref_4x8 52 11
    get_ref_8x4 45 9
    get_ref_8x8 80 11
    get_ref_8x16 156 16
    get_ref_12x10 137 13
    get_ref_16x8 147 11
    get_ref_16x16 282 16
    get_ref_20x18 278 22
    hpel_filter 5163 686
    lowres_init 5440 286
    mc_chroma_2x2 24 7
    mc_chroma_2x4 42 10
    mc_chroma_4x2 41 7
    mc_chroma_4x4 75 10
    mc_chroma_4x8 144 19
    mc_chroma_8x4 137 15
    mc_chroma_8x8 269 28
    mc_luma_4x4 30 10
    mc_luma_4x8 52 12
    mc_luma_8x4 44 10
    mc_luma_8x8 80 13
    mc_luma_8x16 156 19
    mc_luma_16x8 147 13
    mc_luma_16x16 281 19
    memcpy_aligned 14 9
    memzero_aligned 24 4
    offsetadd_w4 79 18
    offsetadd_w8 142 18
    offsetadd_w16 277 25
    offsetadd_w20 1118 38
    offsetsub_w4 75 18
    offsetsub_w8 140 18
    offsetsub_w16 265 25
    offsetsub_w20 989 39
    weight_w4 111 19
    weight_w8 205 19
    weight_w16 396 29
    weight_w20 1143 45
    deinterleave_chroma_fdec 76 9
    deinterleave_chroma_fenc 86 9
    plane_copy_deinterleave 733 90
    plane_copy_interleave 791 245
    store_interleave_chroma 82 12
    add4x4_idct 34 9
    add8x8_idct 139 31
    add8x8_idct8 269 39
    add8x8_idct_dc 67 7
    add16x16_idct 564 123
    add16x16_idct_dc 260 22
    dct4x4dc 18 10
    idct4x4dc 16 9
    sub4x4_dct 25 7
    sub8x8_dct 101 12
    sub8x8_dct8 160 25
    sub16x16_dct 403 52
    sub16x16_dct8 646 68
    zigzag_scan_4x4_frame 4 1
    hadamard_ac_8x8 117 21
    hadamard_ac_8x16 236 42
    hadamard_ac_16x8 235 31
    hadamard_ac_16x16 473 60
    intra_sad_x3_4x4 50 21
    intra_sad_x3_8x8 183 34
    intra_sad_x3_8x8c 181 36
    intra_sad_x3_16x16 643 68
    intra_satd_x3_4x4 83 61
    intra_satd_x3_8x8c 344 81
    intra_satd_x3_16x16 1389 136
    sa8d_8x8 97 19
    sa8d_16x16 394 68
    satd_4x4 24 8
    satd_4x8 51 11
    satd_4x16 103 24
    satd_8x4 52 9
    satd_8x8 108 12
    satd_8x16 218 24
    satd_16x8 218 19
    satd_16x16 437 38
    ssd_4x4 10 5
    ssd_4x8 24 8
    ssd_4x16 42 15
    ssd_8x4 23 5
    ssd_8x8 37 9
    ssd_8x16 74 17
    ssd_16x8 72 11
    ssd_16x16 140 23
    var2_8x8 91 37
    var2_8x16 176 66
    var_8x8 50 15
    var_8x16 65 29
    var_16x16 132 56
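
To read a table like this at a glance, it can help to compute the C-to-assembly ratio per function and to flag any row where the assembly is slower than the C code. The sketch below is a hypothetical helper, not part of x264 or checkasm. It assumes each row has the form `name c_time asm_time` as pasted above, and it uses a handful of rows copied from the table as sample input.

```python
# A few rows copied from the table above: function name, C timing, assembly timing.
SAMPLE = """\
sad_16x16 189 13
decimate_score15 8 12
decimate_score16 8 11
hpel_filter 5163 686
lowres_init 5440 286
"""


def parse_rows(text):
    """Parse 'name c_time asm_time' lines; skip headers and malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, c_time, asm_time = parts
        if not (c_time.isdigit() and asm_time.isdigit()):
            continue  # e.g. the "Function C Assembly" header row
        rows.append((name, int(c_time), int(asm_time)))
    return rows


def speedups(rows):
    """Map each function name to its C / assembly timing ratio."""
    return {name: c / asm for name, c, asm in rows if asm > 0}


def regressions(rows):
    """Return the functions whose assembly timing is higher than the C timing."""
    return [name for name, c, asm in rows if asm > c]


rows = parse_rows(SAMPLE)
for name, ratio in sorted(speedups(rows).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {ratio:6.2f}x")
print("slower than C:", regressions(rows))
```

On the sample rows this reports `decimate_score15` and `decimate_score16` as slower in assembly than in C, which matches the figures in the table.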
  • @yinshiyou @BugMaster Hello, this patch adds LoongArch platform support and optimizes some functions. We use pure assembly rather than C intrinsics to get better performance. Performance has improved from 4.76 fps to 20.50 fps. Reviews are welcome.

  • Lu Wang added 14 commits

    • 04458c72...eaa68fad - 3 commits from branch videolan:master
    • ff8306c6 - loongarch: Init LSX/LASX support
    • 1513135a - loongarch: Add checkasm support
    • 38c9fc03 - loongarch: Add asm.S file
    • 211d6246 - loongarch: Add loongsonutil.S file
    • d074b1f0 - loongarch: Improve the performance of deblock series functions
    • ac0281e3 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
    • 8465f013 - loongarch: Improve the performance of predict series functions
    • 8d504279 - loongarch: Improve the performance of quant series functions
    • 5b8cbae7 - loongarch: Improve the performance of mc series functions
    • 35e09bff - loongarch: Improve the performance of dct series functions
    • 2eadbd09 - loongarch: Improve the performance of pixel series functions

  • Can anyone help review this PR?

  • Yin Shiyou
  • Lu Wang added 11 commits

    • 6d536dcb - loongarch: Init LSX/LASX support
    • dcaf97b0 - loongarch: Add checkasm support
    • 9ab927db - loongarch: Add asm.S file
    • ec64e83b - loongarch: Add loongsonutil.S file
    • 65e43789 - loongarch: Improve the performance of deblock series functions
    • b080a945 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
    • cb5b01bd - loongarch: Improve the performance of predict series functions
    • 33d6558a - loongarch: Improve the performance of quant series functions
    • 36edaf43 - loongarch: Improve the performance of mc series functions
    • 2abfbfc0 - loongarch: Improve the performance of dct series functions
    • 9696d3f2 - loongarch: Improve the performance of pixel series functions

  • Lu Wang resolved all threads

  • Lu Wang added 11 commits

    • 061aa0a1 - loongarch: Init LSX/LASX support
    • 0eed024c - loongarch: Add checkasm support
    • d8fd7dd1 - loongarch: Add asm.S file
    • a1ec289e - loongarch: Add loongsonutil.S file
    • c9861da2 - loongarch: Improve the performance of deblock series functions
    • ce2087ce - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
    • ebe13a0d - loongarch: Improve the performance of predict series functions
    • eabd60ed - loongarch: Improve the performance of quant series functions
    • d0a962f0 - loongarch: Improve the performance of mc series functions
    • 47a3b1f8 - loongarch: Improve the performance of dct series functions
    • 63e0690f - loongarch: Improve the performance of pixel series functions

    • Resolved by Lu Wang

      I found that some LASX optimizations are slower than their LSX counterparts according to the checkasm test. It's suggested to remove these LASX optimizations: avg_16x16_lasx; decimate_score15_lasx; decimate_score16_lasx; decimate_score64_lasx; intra_predict_8x8c_dc8_lsx; intra_sad_x3_8x8_lasx; intra_sad_x3_8x8c_lasx; intra_sad_x3_16x16_lasx; intra_satd_x3_8x8c_lasx; memcpy_aligned_lasx; sad_4x8_lasx; sad_4x16_lasx; quant_4x4_lasx; quant_4x4_dc_lasx; quant_8x8_lasx; sad_8x8_lasx; sad_8x16_lasx; sad_16x8_lasx; sad_16x16_lasx; sad_aligned_4x8_lasx; sad_aligned_4x16_lasx; sad_aligned_8x8_lasx; sad_aligned_8x16_lasx; sad_aligned_16x8_lasx; sad_aligned_16x16_lasx; sad_x3_4x8_lasx; sad_x3_8x4_lasx; sad_x3_8x8_lasx; sad_x3_8x16_lasx; sad_x4_4x8_lasx; sad_x4_8x16_lasx; ssd_4x4_lasx; ssd_4x8_lasx; ssd_4x16_lasx; ssd_8x4_lasx; var_8x8_lasx; var_8x16_lasx; var_16x16_lasx.
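
A list like the one above could be derived mechanically by comparing the LSX and LASX timings of each function and keeping the names where LASX is slower. The sketch below only illustrates that selection rule. The thread does not publish per-variant timings, so the numbers are HYPOTHETICAL, and the `timings` layout and the `lasx_to_drop` helper are assumptions made for this example.

```python
# HYPOTHETICAL timings: the thread does not publish per-variant numbers.
# Layout assumed for illustration: function name -> {"lsx": timing, "lasx": timing}.
timings = {
    "sad_16x16": {"lsx": 13, "lasx": 15},
    "satd_16x16": {"lsx": 52, "lasx": 38},
    "var_16x16": {"lsx": 56, "lasx": 60},
}


def lasx_to_drop(timings):
    """Return the *_lasx variants that are slower than their LSX counterpart."""
    drop = []
    for name, t in sorted(timings.items()):
        # Only compare functions that have both variants measured.
        if "lsx" in t and "lasx" in t and t["lasx"] > t["lsx"]:
            drop.append(f"{name}_lasx")
    return drop


print(lasx_to_drop(timings))  # ['sad_16x16_lasx', 'var_16x16_lasx']
```

Functions with equal timings are kept here; whether a tie should favor the 128-bit or the 256-bit variant is a design choice this sketch does not settle.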

  • Lu Wang added 6 commits

    • 39951e92 - loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
    • f7e14b93 - loongarch: Improve the performance of predict series functions
    • ba09be07 - loongarch: Improve the performance of quant series functions
    • 4b9719d6 - loongarch: Improve the performance of mc series functions
    • 2413c413 - loongarch: Improve the performance of dct series functions
    • 330b4a2d - loongarch: Improve the performance of pixel series functions

  • Lu Wang resolved all threads

  • LGTM.

  • @gramner Could you please help review this PR?

  • @jbk Could you help review this PR? Any suggestions will be appreciated.

  • This MR will be replaced by another MR: !135 (merged).
