- Mar 14, 2024
-
-
Prior to this change, handling the scenario where the number of XMM registers spilled depends on whether a branch is taken was complicated. There were essentially three options:
1) Always spill the largest number of XMM registers. Results in unnecessary spills.
2) Do the spilling after the branch. Results in code duplication for the shared subset of spills.
3) Do the spilling manually. Optimal, but overly complex and vexing.
This adds an additional optional argument to the WIN64_SPILL_XMM and WIN64_PUSH_XMM macros to make it possible to allocate space for a certain number of registers but initially only push a subset of those, with the option of pushing additional registers later.
-
Allows the use of multiple independent stack allocations within a function without having to manually fiddle with stack offsets.
-
-
- Mar 12, 2024
-
-
Anton Mitrofanov authored
Use correct return type for pixel_sad_x3/x4 functions. Bug report by Dominik 'Rathann' Mierzejewski.
-
- Feb 28, 2024
-
-
This makes the code much simpler (especially for adding support for other instruction set extensions), avoids needing inline assembly for this feature, and is generally the more canonical way to do this.
The CPU feature detection was added in 9c3c7168, using HWCAP_CPUID. The argument for using that was that HWCAP_CPUID was added much earlier in the kernel (in Linux v4.11), while the HWCAP flags for individual features always come later. This allows detecting support for new CPU extensions before the kernel exposes information about them via hwcap flags.
However, in practice there's probably little advantage in this. E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in v5.10 - later than HWCAP_CPUID, but there are probably very few practical cases where one would run a kernel older than that on a CPU that supports those instructions.
Additionally, we provide our own definitions of the flag values to check (as they are fixed constants anyway), with names not conflicting with the ones from system headers. This reduces the number of ifdefs needed, and allows detecting those features even if building with userland headers that are lacking the definitions of those flags.
Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04, do expose support for these features via HWCAP flags, but the emulated cpuid registers are missing the bits for exposing e.g. SVE2. (This issue is fixed in later versions of QEMU though.)
Also drop the ifdef check for whether AT_HWCAP is defined; it was added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18, which also precedes when aarch64 was commonly used anyway, so don't guard the use of that with an ifdef.
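The approach of checking hwcap bits against locally defined constants can be sketched in C roughly as follows. This is a minimal illustration, not the actual x264 code: the names MY_HWCAP_SVE, MY_HWCAP2_SVE2, MY_CPU_SVE, MY_CPU_SVE2, flags_from_hwcap and detect_cpu_flags are hypothetical. The bit positions are the ones the Linux arm64 uapi headers assign to HWCAP_SVE and HWCAP2_SVE2.

```c
#include <stdint.h>

/* Local copies of the kernel's fixed hwcap bit values. The MY_ prefix
 * (hypothetical) keeps them from clashing with system header names. */
#define MY_HWCAP_SVE   (1UL << 22)  /* bit in AT_HWCAP  */
#define MY_HWCAP2_SVE2 (1UL << 1)   /* bit in AT_HWCAP2 */

/* Hypothetical internal CPU flags for this sketch. */
enum { MY_CPU_SVE = 1 << 0, MY_CPU_SVE2 = 1 << 1 };

/* Pure mapping from raw hwcap words to internal flags. */
unsigned flags_from_hwcap(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap & MY_HWCAP_SVE)
        flags |= MY_CPU_SVE;
    if (hwcap2 & MY_HWCAP2_SVE2)
        flags |= MY_CPU_SVE2;
    return flags;
}

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>

/* Only meaningful on aarch64 Linux, where these bits have this meaning. */
unsigned detect_cpu_flags(void)
{
    return flags_from_hwcap(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
}
#endif
```

Keeping the mapping separate from the getauxval() call means the bit logic can be exercised on any host, while the actual query is compiled only where the bits are defined to mean SVE/SVE2.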
-
Anton Mitrofanov authored
Use same Docker images as VLC for contrib compilation.
-
Anton Mitrofanov authored
-
- Feb 19, 2024
-
-
Henrik Gramner authored
Automatically flag x86-64 asm object files as SHSTK-compatible. Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology (CET) which is a feature aimed at defending against ROP attacks by verifying that 'call' and 'ret' instructions are correctly matched. For well-written code this works transparently without any code changes, as return addresses popped from the shadow stack should match return addresses popped from the normal stack for performance reasons anyway.
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
-
Henrik Gramner authored
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
-
Henrik Gramner authored
-
- Jan 13, 2024
-
-
Anton Mitrofanov authored
-
- Nov 23, 2023
-
-
David Chen authored
Improve the performance of NEON functions of aarch64/pixel-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 235  ssd_4x4_neon: 226  ssd_4x4_sve: 151
ssd_4x8_c: 409  ssd_4x8_neon: 363  ssd_4x8_sve: 201
ssd_4x16_c: 781  ssd_4x16_neon: 653  ssd_4x16_sve: 313
ssd_8x4_c: 402  ssd_8x4_neon: 192  ssd_8x4_sve: 192
ssd_8x8_c: 728  ssd_8x8_neon: 275  ssd_8x8_sve: 275
Command executed: ./checkasm10 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 256  ssd_4x4_neon: 226  ssd_4x4_sve: 153
ssd_4x8_c: 460  ssd_4x8_neon: 369  ssd_4x8_sve: 215
ssd_4x16_c: 852  ssd_4x16_neon: 651  ssd_4x16_sve: 340
Command executed: ./checkasm8 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 295  ssd_4x4_neon: 288  ssd_4x4_sve: 228
ssd_4x8_c: 454  ssd_4x8_neon: 431  ssd_4x8_sve: 294
ssd_4x16_c: 779  ssd_4x16_neon: 631  ssd_4x16_sve: 438
ssd_8x4_c: 463  ssd_8x4_neon: 247  ssd_8x4_sve: 246
ssd_8x8_c: 781  ssd_8x8_neon: 413  ssd_8x8_sve: 353
Command executed: ./checkasm10 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 322  ssd_4x4_neon: 335  ssd_4x4_sve: 240
ssd_4x8_c: 522  ssd_4x8_neon: 448  ssd_4x8_sve: 294
ssd_4x16_c: 832  ssd_4x16_neon: 603  ssd_4x16_sve: 440
Command executed: ./checkasm8 --bench=sa8d
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sa8d_8x8_c: 2103  sa8d_8x8_neon: 619  sa8d_8x8_sve: 617
Command executed: ./checkasm8 --bench=sa8d
Testbed: AWS Graviton3
Results:
sa8d_8x8_c: 2021  sa8d_8x8_neon: 597  sa8d_8x8_sve: 580
Command executed: ./checkasm8 --bench=var
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
var_8x8_c: 595  var_8x8_neon: 262  var_8x8_sve: 262
var_8x16_c: 1193  var_8x16_neon: 435  var_8x16_sve: 419
Command executed: ./checkasm8 --bench=var
Testbed: AWS Graviton3
Results:
var_8x8_c: 616  var_8x8_neon: 229  var_8x8_sve: 222
var_8x16_c: 1207  var_8x16_neon: 399  var_8x16_sve: 389
Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
hadamard_ac_8x8_c: 2330  hadamard_ac_8x8_neon: 635  hadamard_ac_8x8_sve: 635
hadamard_ac_8x16_c: 4500  hadamard_ac_8x16_neon: 1152  hadamard_ac_8x16_sve: 1151
hadamard_ac_16x8_c: 4499  hadamard_ac_16x8_neon: 1151  hadamard_ac_16x8_sve: 1150
hadamard_ac_16x16_c: 8812  hadamard_ac_16x16_neon: 2187  hadamard_ac_16x16_sve: 2186
Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: AWS Graviton3
Results:
hadamard_ac_8x8_c: 2266  hadamard_ac_8x8_neon: 517  hadamard_ac_8x8_sve: 513
hadamard_ac_8x16_c: 4444  hadamard_ac_8x16_neon: 867  hadamard_ac_8x16_sve: 849
hadamard_ac_16x8_c: 4443  hadamard_ac_16x8_neon: 880  hadamard_ac_16x8_sve: 868
hadamard_ac_16x16_c: 8595  hadamard_ac_16x16_neon: 1656  hadamard_ac_16x16_sve: 1622
-
David Chen authored
Place NEON pixel-a macros and constants that are intended to be used by SVE/SVE2 functions as well in a common file.
-
David Chen authored
Improve the performance of NEON functions of aarch64/mc-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=avg
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
avg_4x2_c: 274  avg_4x2_neon: 215  avg_4x2_sve: 171
avg_4x4_c: 461  avg_4x4_neon: 343  avg_4x4_sve: 225
avg_4x8_c: 806  avg_4x8_neon: 619  avg_4x8_sve: 334
avg_4x16_c: 1523  avg_4x16_neon: 1168  avg_4x16_sve: 558
Command executed: ./checkasm8 --bench=avg
Testbed: AWS Graviton3
Results:
avg_4x2_c: 267  avg_4x2_neon: 213  avg_4x2_sve: 167
avg_4x4_c: 467  avg_4x4_neon: 350  avg_4x4_sve: 221
avg_4x8_c: 784  avg_4x8_neon: 624  avg_4x8_sve: 302
avg_4x16_c: 1445  avg_4x16_neon: 1182  avg_4x16_sve: 485
-
David Chen authored
Place NEON mc-a macros and functions that are intended to be used by SVE/SVE2 functions as well in a common file.
-
- Nov 20, 2023
-
-
David Chen authored
Improve the performance of NEON functions of aarch64/deblock-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
deblock_chroma[1]_c: 735  deblock_chroma[1]_neon: 427  deblock_chroma[1]_sve: 353
Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
Results:
deblock_chroma[1]_c: 719  deblock_chroma[1]_neon: 442  deblock_chroma[1]_sve: 345
-
David Chen authored
Place NEON deblock-a macros that are intended to be used by SVE/SVE2 functions as well in a common file.
-
David Chen authored
Improve the performance of NEON functions of aarch64/dct-a.S by using the SVE/SVE2 instruction set. Below, the specific functions are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sub4x4_dct_c: 528  sub4x4_dct_neon: 322  sub4x4_dct_sve: 247
Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
Results:
sub4x4_dct_c: 562  sub4x4_dct_neon: 376  sub4x4_dct_sve: 255
Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
add4x4_idct_c: 698  add4x4_idct_neon: 386  add4x4_idct_sve2: 345
Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
zigzag_interleave_8x8_cavlc_frame_c: 582  zigzag_interleave_8x8_cavlc_frame_neon: 273  zigzag_interleave_8x8_cavlc_frame_sve: 257
Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
Results:
zigzag_interleave_8x8_cavlc_frame_c: 587  zigzag_interleave_8x8_cavlc_frame_neon: 257  zigzag_interleave_8x8_cavlc_frame_sve: 249
-
- Nov 18, 2023
-
-
David Chen authored
Place NEON dct-a macros that are intended to be used by SVE/SVE2 functions as well in a common file.
-
- Nov 14, 2023
-
-
Martin Storsjö authored
The sve-default-vector-length property sets the maximum vector length in bytes; the default is 64, i.e. handling up to 512 bit vectors. In order to be able to test 1024 and 2048 bit vectors, this has to be raised separately from setting the sve<n>=on property.
-
Martin Storsjö authored
In the new version, there's no longer any "wine64" executable, but both i386 and x86_64 are handled with the same "wine" frontend.
-
Martin Storsjö authored
-
- Nov 02, 2023
-
-
Martin Storsjö authored
-
Martin Storsjö authored
The assembly currently uses a mixture of different styles. Don't make all of it entirely consistent now, but try to make functions more consistent within themselves at least. In particular, get rid of the convention to have braces hanging outside of the alignment line. Some functions have the whole content indented off by one char compared to other functions; adjust those (but retain the functions that are self-consistent and match either of the common styles).
-
Martin Storsjö authored
The assembly currently uses a mixture of different styles. Don't make all of it entirely consistent now, but try to make functions more consistent within themselves at least. In particular, get rid of the convention to have braces hanging outside of the alignment line.
-
Martin Storsjö authored
Don't manually add in the rounding constant (via a fused multiply-add instruction) when we can just do a plain rounded right shift.
Cortex                      A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938
10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978
This makes it around 3% faster on the Cortex A53, around 8% faster for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc on A72/A73.
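The arithmetic identity behind this change can be sketched in scalar C. This is an illustration only, not the NEON code from the commit: the function names round_shift_manual and round_shift_single are hypothetical, and the second function merely models what a rounding right shift instruction computes in one step. Both assume shift >= 1 and an arithmetic right shift for negative values, which is what common compilers provide.

```c
#include <stdint.h>

/* Old style: add the rounding constant 1 << (shift - 1) explicitly,
 * then do a plain (truncating) right shift. Requires shift >= 1. */
int32_t round_shift_manual(int32_t x, int shift)
{
    int32_t f = 1 << (shift - 1);
    return (x + f) >> shift;
}

/* Scalar model of a rounding right shift done as a single operation.
 * The sum is formed in 64 bits so the added constant cannot overflow. */
int32_t round_shift_single(int32_t x, int shift)
{
    int64_t wide = (int64_t)x + ((int64_t)1 << (shift - 1));
    return (int32_t)(wide >> shift);
}
```

As long as x + f does not overflow, both functions return the same value, which is why the separate addition of the rounding constant can be dropped when the shift itself rounds.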
-
Martin Storsjö authored
Cortex               A53   A72   A73
8 bpc:
Before:
sad_x3_4x4_neon:     580   303   204
sad_x3_4x8_neon:    1065   516   323
sad_x3_8x4_neon:     668   262   282
sad_x3_8x8_neon:    1238   454   471
sad_x3_8x16_neon:   2378   842   847
sad_x3_16x8_neon:   2136   738   776
sad_x3_16x16_neon:  4162  1378  1463
After:
sad_x3_4x4_neon:     477   298   206
sad_x3_4x8_neon:     842   515   327
sad_x3_8x4_neon:     603   260   279
sad_x3_8x8_neon:    1110   451   464
sad_x3_8x16_neon:   2125   841   843
sad_x3_16x8_neon:   2124   730   766
sad_x3_16x16_neon:  4145  1370  1434
10 bpc:
Before:
sad_x3_4x4_neon:     632   247   254
sad_x3_4x8_neon:    1162   419   443
sad_x3_8x4_neon:     890   358   416
sad_x3_8x8_neon:    1670   632   759
sad_x3_8x16_neon:   3230  1179  1458
sad_x3_16x8_neon:   3070  1209  1403
sad_x3_16x16_neon:  6030  2333  2699
After:
sad_x3_4x4_neon:     522   253   255
sad_x3_4x8_neon:     932   443   431
sad_x3_8x4_neon:     880   354   406
sad_x3_8x8_neon:    1660   626   736
sad_x3_8x16_neon:   3220  1170  1397
sad_x3_16x8_neon:   3060  1184  1362
sad_x3_16x16_neon:  6020  2272  2579
Thus, this is around a 20-25% speedup on Cortex A53 for the small sizes (much smaller difference for bigger sizes though), while it doesn't make much of a difference at all (mostly within measurement noise) for the out-of-order cores (A72 and A73).
-
- Oct 24, 2023
-
-
Anton Mitrofanov authored
-
- Oct 19, 2023
-
-
Martin Storsjö authored
We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this, but these might not be available in all userland headers, while HWCAP_CPUID is available much earlier. The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available, can only be accessed if SVE is available. If not building all the C code with SVE enabled (which could make it impossible to run on HW without SVE), binutils refuses to assemble an instruction reading ID_AA64ZFR0_EL1 - but if referring to it with the technical name S3_0_C0_C4_4, it can be assembled even without any extra extensions enabled.
-
- Oct 18, 2023
-
-
Martin Storsjö authored
We don't expect the user to build the whole x264 codebase with SVE/SVE2 enabled, as we only enable this feature for the assembly files that use it, in order to have binaries that are portable and enable the SVE codepaths at runtime if supported.
-
- Oct 12, 2023
-
-
Yin Shiyou authored
Performance has improved from 11.27fps to 20.50fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
hadamard_ac_8x8           117              21
hadamard_ac_8x16          236              42
hadamard_ac_16x8          235              31
hadamard_ac_16x16         473              60
intra_sad_x3_4x4          50               21
intra_sad_x3_8x8          183              34
intra_sad_x3_8x8c         181              36
intra_sad_x3_16x16        643              68
intra_satd_x3_4x4         83               61
intra_satd_x3_8x8c        344              81
intra_satd_x3_16x16       1389             136
sa8d_8x8                  97               19
sa8d_16x16                394              68
satd_4x4                  24               8
satd_4x8                  51               11
satd_4x16                 103              24
satd_8x4                  52               9
satd_8x8                  108              12
satd_8x16                 218              24
satd_16x8                 218              19
satd_16x16                437              38
ssd_4x4                   10               5
ssd_4x8                   24               8
ssd_4x16                  42               15
ssd_8x4                   23               5
ssd_8x8                   37               9
ssd_8x16                  74               17
ssd_16x8                  72               11
ssd_16x16                 140              23
var2_8x8                  91               37
var2_8x16                 176              66
var_8x8                   50               15
var_8x16                  65               29
var_16x16                 132              56
Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>
-
Yin Shiyou authored
Performance has improved from 10.53fps to 11.27fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
add4x4_idct               34               9
add8x8_idct               139              31
add8x8_idct8              269              39
add8x8_idct_dc            67               7
add16x16_idct             564              123
add16x16_idct_dc          260              22
dct4x4dc                  18               10
idct4x4dc                 16               9
sub4x4_dct                25               7
sub8x8_dct                101              12
sub8x8_dct8               160              25
sub16x16_dct              403              52
sub16x16_dct8             646              68
zigzag_scan_4x4_frame     4                1
Signed-off-by: zhoupeng <zhoupeng@loongson.cn>
-
Yin Shiyou authored
Performance has improved from 6.78fps to 10.53fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
avg_4x2                   16               5
avg_4x4                   30               6
avg_4x8                   63               10
avg_4x16                  124              19
avg_8x4                   60               6
avg_8x8                   119              10
avg_8x16                  233              19
avg_16x8                  229              21
avg_16x16                 451              41
get_ref_4x4               30               9
get_ref_4x8               52               11
get_ref_8x4               45               9
get_ref_8x8               80               11
get_ref_8x16              156              16
get_ref_12x10             137              13
get_ref_16x8              147              11
get_ref_16x16             282              16
get_ref_20x18             278              22
hpel_filter               5163             686
lowres_init               5440             286
mc_chroma_2x2             24               7
mc_chroma_2x4             42               10
mc_chroma_4x2             41               7
mc_chroma_4x4             75               10
mc_chroma_4x8             144              19
mc_chroma_8x4             137              15
mc_chroma_8x8             269              28
mc_luma_4x4               30               10
mc_luma_4x8               52               12
mc_luma_8x4               44               10
mc_luma_8x8               80               13
mc_luma_8x16              156              19
mc_luma_16x8              147              13
mc_luma_16x16             281              19
memcpy_aligned            14               9
memzero_aligned           24               4
offsetadd_w4              79               18
offsetadd_w8              142              18
offsetadd_w16             277              25
offsetadd_w20             1118             38
offsetsub_w4              75               18
offsetsub_w8              140              18
offsetsub_w16             265              25
offsetsub_w20             989              39
weight_w4                 111              19
weight_w8                 205              19
weight_w16                396              29
weight_w20                1143             45
deinterleave_chroma_fdec  76               9
deinterleave_chroma_fenc  86               9
plane_copy_deinterleave   733              90
plane_copy_interleave     791              245
store_interleave_chroma   82               12
Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
-
- Oct 10, 2023
-
-
Performance has improved from 6.34fps to 6.78fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
coeff_last15              3                2
coeff_last16              3                1
coeff_last64              42               6
decimate_score15          8                12
decimate_score16          8                11
decimate_score64          61               43
dequant_4x4_cqm           16               5
dequant_4x4_dc_cqm        13               5
dequant_4x4_dc_flat       13               5
dequant_4x4_flat          16               5
dequant_8x8_cqm           71               9
dequant_8x8_flat          71               9
Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
-
Performance has improved from 6.32fps to 6.34fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
intra_predict_4x4_dc      3                2
intra_predict_4x4_dc8     1                1
intra_predict_4x4_dcl     2                1
intra_predict_4x4_dct     2                1
intra_predict_4x4_ddl     7                2
intra_predict_4x4_h       2                1
intra_predict_4x4_v       1                1
intra_predict_8x8_dc      8                2
intra_predict_8x8_dc8     1                1
intra_predict_8x8_dcl     5                2
intra_predict_8x8_dct     5                2
intra_predict_8x8_ddl     27               3
intra_predict_8x8_ddr     26               3
intra_predict_8x8_h       4                2
intra_predict_8x8_v       3                1
intra_predict_8x8_vl      29               3
intra_predict_8x8_vr      31               4
intra_predict_8x8c_dc     8                5
intra_predict_8x8c_dc8    1                1
intra_predict_8x8c_dcl    5                3
intra_predict_8x8c_dct    5                3
intra_predict_8x8c_h      4                2
intra_predict_8x8c_p      58               30
intra_predict_8x8c_v      4                1
intra_predict_16x16_dc    32               8
intra_predict_16x16_dc8   9                4
intra_predict_16x16_dcl   26               6
intra_predict_16x16_dct   26               6
intra_predict_16x16_h     23               7
intra_predict_16x16_p     182              44
intra_predict_16x16_v     22               4
Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
-
Performance has improved from 4.92fps to 6.32fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
sad_4x4                   13               3
sad_4x8                   26               7
sad_4x16                  57               13
sad_8x4                   24               3
sad_8x8                   54               8
sad_8x16                  108              13
sad_16x8                  95               8
sad_16x16                 189              13
sad_x3_4x4                37               6
sad_x3_4x8                71               13
sad_x3_8x4                70               8
sad_x3_8x8                162              14
sad_x3_8x16               323              25
sad_x3_16x8               279              15
sad_x3_16x16              555              27
sad_x4_4x4                49               8
sad_x4_4x8                95               17
sad_x4_8x4                94               8
sad_x4_8x8                214              16
sad_x4_8x16               429              33
sad_x4_16x8               372              18
sad_x4_16x16              740              34
Signed-off-by: wanglu <wanglu@loongson.cn>
-
Performance has improved from 4.76fps to 4.92fps. Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions                 performance (c)  performance (asm)
deblock_luma[0]           79               39
deblock_luma[1]           91               18
deblock_luma_intra[0]     63               44
deblock_luma_intra[1]     71               18
deblock_strength          104              33
Signed-off-by: Hao Chen <chenhao@loongson.cn>
-