Optimize missing PowerPC functions
This patch optimizes several PowerPC functions that had no hardware-specific optimizations before. The functions take advantage of Power ISA 2.07, so the minimum processor these optimizations target is POWER8. They are implemented in pure assembly rather than C intrinsics for the following reasons:
- It follows the same scheme as the other architectures, so the code is easy to extend and modify and stays on par with them.
- It gives more flexibility and control over the instructions used, which makes close-to-optimal performance achievable by interleaving independent instructions to saturate the execution units.
- It avoids differences between compiler variants, which yield different results in various circumstances.
- The endianness work is isolated in macros defined in 'asm.S', so there is no need to worry about endianness while implementing functions.
- It is easy to upgrade to new ISA versions for further optimizations.
This benchmark was taken by running 'checkasm' on POWER8 in little-endian and big-endian modes.
Little-endian benchmark
Function | C | Assembly | C intrinsics |
---|---|---|---|
asd8 | 503 | 134 | |
avg_4x2 | 92 | 60 | |
avg_4x4 | 152 | 79 | |
avg_4x8 | 273 | 121 | |
avg_4x16 | 622 | 240 | |
avg_8x4 | 265 | 77 | |
avg_8x8 | 596 | 120 | |
avg_8x16 | 1194 | 211 | |
avg_16x8 | 1138 | 189 | |
avg_16x16 | 2318 | 336 | |
coeff_last4 | 24 | 20 | |
coeff_last8 | 29 | 21 | |
coeff_last15 | 39 | 27 | |
coeff_last16 | 37 | 25 | |
coeff_last64 | 228 | 60 | |
coeff_level_run4 | 39 | 35 | |
coeff_level_run8 | 49 | 47 | |
coeff_level_run15 | 70 | 69 | |
coeff_level_run16 | 74 | 67 | |
dct4x4dc | 123 | 50 | |
deblock_chroma[1] | 246 | 85 | |
deblock_chroma_420_intra_mbaff | 103 | 93 | |
deblock_chroma_420_mbaff | 153 | 106 | |
deblock_chroma_422_intra_mbaff | 210 | 124 | |
deblock_chroma_422_mbaff | 255 | 126 | |
deblock_chroma_intra[1] | 273 | 65 | |
deblock_h_chroma_420 | 254 | 126 | |
deblock_h_chroma_420_intra | 205 | 124 | |
deblock_h_chroma_422 | 551 | 233 | |
deblock_h_chroma_422_intra | 412 | 217 | |
deblock_strength | 502 | 124 | |
decimate_score15 | 66 | 33 | |
decimate_score16 | 69 | 35 | |
decimate_score64 | 376 | 75 | |
denoise_dct | 617 | 101 | |
idct4x4dc | 117 | 46 | |
integral_init4h | 608 | 183 | |
integral_init4v | 781 | 175 | |
integral_init8h | 579 | 273 | |
integral_init8v | 287 | 86 | |
intra_predict_4x4_ddl | 44 | 27 | |
intra_predict_4x4_ddr | 51 | 36 | |
intra_predict_8x8_dc | 48 | 38 | |
intra_predict_8x8_ddl | 123 | 51 | |
intra_predict_8x8_ddr | 126 | 49 | |
intra_predict_8x8_h | 44 | 31 | |
intra_predict_8x8_hd | 122 | 52 | |
intra_predict_8x8_hu | 71 | 55 | |
intra_predict_8x8_v | 25 | 23 | |
intra_predict_8x8_vl | 113 | 46 | |
intra_predict_8x8_vr | 119 | 52 | |
intra_predict_8x8c_h | 43 | 42 | |
intra_predict_8x8c_v | 25 | 23 | |
intra_predict_8x16c_dc | 96 | 86 | |
intra_predict_8x16c_dcl | 68 | 68 | |
intra_predict_8x16c_h | 74 | 67 | |
intra_predict_8x16c_v | 42 | 26 | |
mbtree_propagate_cost | 3368 | 849 | |
mbtree_propagate_list | 7973 | 5689 | |
offsetadd_w4 | 469 | 120 | |
offsetadd_w8 | 808 | 121 | |
offsetadd_w16 | 1485 | 138 | |
offsetadd_w20 | 5266 | 176 | |
offsetsub_w4 | 430 | 120 | |
offsetsub_w8 | 739 | 122 | |
offsetsub_w16 | 1455 | 109 | |
offsetsub_w20 | 5462 | 164 | |
sa8d_8x8 | 833 | 134 | 163 |
sa8d_16x16 | 3249 | 413 | 665 |
sa8d_satd_16x16 | 578 | | |
ssd_4x4 | 74 | 48 | |
ssd_4x8 | 133 | 65 | |
ssd_8x4 | 130 | 44 | |
ssd_8x8 | 250 | 64 | 106 |
ssd_8x16 | 531 | 107 | |
ssd_16x8 | 493 | 86 | |
ssd_16x16 | 966 | 128 | 233 |
ssd_nv12 | 102376 | 2872 | |
ssim_end | 222 | 88 | |
sub8x16_dct_dc | 338 | 128 | |
var2_8x8 | 513 | 124 | |
var2_8x16 | 932 | 222 | |
vsad | 1140 | 151 | |
zigzag_sub_4x4_field | 87 | 58 | |
zigzag_sub_4x4_frame | 87 | 58 | |
zigzag_sub_4x4ac_field | 82 | 61 | |
zigzag_sub_4x4ac_frame | 84 | 65 |
Big-endian benchmark
Function | C | Assembly | C intrinsics |
---|---|---|---|
asd8 | 501 | 99 | |
avg_4x2 | 96 | 57 | |
avg_4x4 | 151 | 74 | |
avg_4x8 | 267 | 119 | |
avg_4x16 | 632 | 241 | |
avg_8x4 | 266 | 71 | |
avg_8x8 | 584 | 114 | |
avg_8x16 | 1197 | 189 | |
avg_16x8 | 1105 | 211 | |
avg_16x16 | 2283 | 340 | |
coeff_last15 | 39 | 27 | |
coeff_last16 | 40 | 28 | |
coeff_last64 | 214 | 60 | |
coeff_level_run8 | 50 | 48 | |
coeff_level_run15 | 69 | 65 | |
coeff_level_run16 | 73 | 61 | |
dct4x4dc | 126 | 43 | |
deblock_chroma[1] | 258 | 73 | |
deblock_chroma_420_intra_mbaff | 103 | 89 | |
deblock_chroma_420_mbaff | 155 | 100 | |
deblock_chroma_422_intra_mbaff | 209 | 125 | |
deblock_chroma_422_mbaff | 260 | 117 | |
deblock_chroma_intra[1] | 282 | 61 | |
deblock_h_chroma_420 | 254 | 117 | |
deblock_h_chroma_420_intra | 204 | 125 | |
deblock_h_chroma_422 | 549 | 193 | |
deblock_h_chroma_422_intra | 422 | 207 | |
deblock_strength | 515 | 107 | |
decimate_score15 | 67 | 35 | |
decimate_score16 | 68 | 36 | |
decimate_score64 | 283 | 63 | |
denoise_dct | 513 | 94 | |
idct4x4dc | 119 | 40 | |
integral_init4h | 688 | 156 | |
integral_init4v | 779 | 108 | |
integral_init8h | 583 | 250 | |
integral_init8v | 291 | 102 | |
intra_predict_4x4_ddl | 49 | 32 | |
intra_predict_4x4_ddr | 53 | 38 | |
intra_predict_8x8_dc | 58 | 38 | |
intra_predict_8x8_ddl | 128 | 34 | |
intra_predict_8x8_ddr | 131 | 35 | |
intra_predict_8x8_h | 48 | 29 | |
intra_predict_8x8_hd | 117 | 35 | |
intra_predict_8x8_hu | 79 | 45 | |
intra_predict_8x8_v | 32 | 27 | |
intra_predict_8x8_vl | 117 | 34 | |
intra_predict_8x8_vr | 122 | 44 | |
intra_predict_8x8c_dcl | 40 | 36 | |
intra_predict_8x8c_dct | 39 | 29 | |
intra_predict_8x8c_h | 47 | 34 | |
intra_predict_8x8c_v | 31 | 26 | |
intra_predict_8x16c_dc | 101 | 62 | |
intra_predict_8x16c_dcl | 71 | 53 | |
intra_predict_8x16c_dct | 48 | 33 | |
intra_predict_8x16c_h | 80 | 49 | |
intra_predict_8x16c_v | 42 | 32 | |
mbtree_propagate_cost | 5870 | 839 | |
mbtree_propagate_list | 7510 | 5448 | |
offsetadd_w4 | 431 | 123 | |
offsetadd_w8 | 765 | 121 | |
offsetadd_w16 | 1468 | 91 | |
offsetadd_w20 | 6426 | 139 | |
offsetsub_w4 | 439 | 121 | |
offsetsub_w8 | 763 | 112 | |
offsetsub_w16 | 1521 | 94 | |
offsetsub_w20 | 6434 | 137 | |
sa8d_8x8 | 825 | 125 | 482 |
sa8d_16x16 | 3212 | 393 | 1917 |
sa8d_satd_16x16 | 561 | | |
ssd_4x4 | 77 | 47 | |
ssd_4x8 | 130 | 63 | |
ssd_8x4 | 134 | 39 | |
ssd_8x8 | 257 | 57 | 640 |
ssd_8x16 | 537 | 96 | |
ssd_16x8 | 495 | 70 | |
ssd_16x16 | 966 | 120 | 791 |
ssd_nv12 | 101814 | 2370 | |
ssim_end | 311 | 85 | |
sub8x16_dct_dc | 340 | 114 | |
var2_8x8 | 514 | 117 | |
var2_8x16 | 934 | 199 | |
vsad | 1139 | 138 | |
zigzag_sub_4x4_field | 86 | 59 | |
zigzag_sub_4x4_frame | 87 | 60 | |
zigzag_sub_4x4ac_field | 87 | 65 | |
zigzag_sub_4x4ac_frame | 87 | 61 |
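
To compare the two tables, it can help to turn each row into a speedup ratio (C timing divided by assembly timing). The following is a minimal sketch, not part of the patch: the names `speedup`, `bench_row`, `be_gain_is_larger`, and `sample_rows` are hypothetical, and the numbers are copied from the ssd_nv12 and avg_16x8 rows above.

```c
#include <assert.h>

/* Speedup of the assembly version over the C reference:
 * C timing divided by assembly timing, as reported by checkasm. */
double speedup(int c_time, int asm_time)
{
    assert(asm_time > 0);
    return (double)c_time / (double)asm_time;
}

/* One benchmark row with timings for both endian modes (hypothetical layout). */
struct bench_row {
    const char *name;
    int le_c, le_asm; /* little-endian: C, assembly */
    int be_c, be_asm; /* big-endian: C, assembly */
};

/* Returns nonzero when the big-endian speedup exceeds the little-endian one. */
int be_gain_is_larger(const struct bench_row *r)
{
    return speedup(r->be_c, r->be_asm) > speedup(r->le_c, r->le_asm);
}

/* Two rows copied from the tables above. */
const struct bench_row sample_rows[] = {
    { "ssd_nv12",  102376, 2872, 101814, 2370 },
    { "avg_16x8",    1138,  189,   1105,  211 },
};
```

For ssd_nv12 this gives roughly 35.6x on little-endian and 43.0x on big-endian. Not every row follows that trend: avg_16x8 comes out at about 6.0x on little-endian and 5.2x on big-endian.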
As the benchmark shows, the performance gain on big-endian is larger than on little-endian. Part of this gap is because vector load/store operations in little-endian mode need a permute for each operation. However, Power ISA 3.00 introduces additional load/store instructions that eliminate the permute on little-endian, along with useful instructions such as 'Vector Absolute Difference', 'Vector Insert/Extract Element', and 'Count Trailing Zeros'.
This patch also adds a new CPU detection method for Linux and FreeBSD that is easy to extend for newer ISA versions.
checkasm passes the tests of the implemented functions on both POWER8 and POWER9 in little-endian and big-endian modes.
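
One common way to do this kind of detection on Linux and FreeBSD is to read the ELF auxiliary vector and test the ISA-level bits in `AT_HWCAP2`. The sketch below illustrates that approach and is not the code from this patch: the `sketch_*` and `SKETCH_*` names are hypothetical, the bit values mirror the Linux `PPC_FEATURE2_ARCH_2_07` and `PPC_FEATURE2_ARCH_3_00` definitions, and it assumes glibc's `getauxval` on Linux and `elf_aux_info` (FreeBSD 12 or later) on FreeBSD.

```c
#if defined(__linux__) || defined(__FreeBSD__)
#include <sys/auxv.h> /* getauxval (Linux), elf_aux_info (FreeBSD) */
#endif

#ifndef AT_HWCAP2
#define AT_HWCAP2 26 /* auxiliary vector tag for the second hwcap word */
#endif

/* Assumed to match PPC_FEATURE2_ARCH_2_07 / PPC_FEATURE2_ARCH_3_00. */
#define SKETCH_PPC_ARCH_2_07 0x80000000UL
#define SKETCH_PPC_ARCH_3_00 0x00800000UL

/* Hypothetical CPU flags; a newer ISA level would add one more bit here. */
enum {
    SKETCH_CPU_POWER8 = 1 << 0, /* Power ISA 2.07 */
    SKETCH_CPU_POWER9 = 1 << 1  /* Power ISA 3.00 */
};

/* Pure mapping from the hwcap2 word to flags, kept separate so that
 * supporting a new ISA version is a one-line change. */
unsigned sketch_flags_from_hwcap2(unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap2 & SKETCH_PPC_ARCH_2_07)
        flags |= SKETCH_CPU_POWER8;
    if (hwcap2 & SKETCH_PPC_ARCH_3_00)
        flags |= SKETCH_CPU_POWER9;
    return flags;
}

/* Reads AT_HWCAP2 from the OS; returns 0 when it is unavailable. */
unsigned long sketch_read_hwcap2(void)
{
#if defined(__linux__)
    return getauxval(AT_HWCAP2);
#elif defined(__FreeBSD__)
    unsigned long value = 0;
    if (elf_aux_info(AT_HWCAP2, &value, sizeof(value)) != 0)
        return 0;
    return value;
#else
    return 0;
#endif
}

/* Only interprets the bits on PowerPC, where they have this meaning. */
unsigned sketch_cpu_detect(void)
{
#if defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC__)
    return sketch_flags_from_hwcap2(sketch_read_hwcap2());
#else
    return 0;
#endif
}
```

Keeping the bit-to-flag mapping in its own function means the OS-specific part stays untouched when a later ISA level is added.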