Optimize missing PowerPC functions
This patch optimizes several functions for PowerPC that previously had no hardware-specific optimizations. The functions take advantage of Power ISA 2.07, so the minimum processor these optimizations target is POWER8. They are implemented in pure assembly rather than C intrinsics for the following reasons:
- Follow the same scheme as the other architectures, so the code is easy to extend and modify and stays on par with them
- Get more flexibility and control over the instructions used, and so achieve close to optimal performance by interleaving independent instructions to saturate the execution units
- Avoid behavioral differences between compiler variants, which yield different results in various circumstances
- Isolate the endianness work in macros defined in 'asm.S', so there is no need to worry about endianness while implementing functions
- Make it easy to upgrade to new ISA versions for further optimizations
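To illustrate the idea of confining endianness handling to macros, here is a minimal sketch in C. It is only an analogy: the actual macros in 'asm.S' are not shown in this description, and the names `IF_LE`, `IF_BE`, `bswap32`, and `load_u32_be` are hypothetical. The sketch also assumes a GCC or Clang style `__BYTE_ORDER__` predefine and falls back to little-endian when it is absent.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: expand their argument only for one byte order.
 * Assumption: the compiler predefines __BYTE_ORDER__ (GCC/Clang do);
 * otherwise little-endian is assumed. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#  define IF_LE(...)
#  define IF_BE(...) __VA_ARGS__
#else
#  define IF_LE(...) __VA_ARGS__
#  define IF_BE(...)
#endif

/* Portable 32-bit byte swap. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Reads four bytes as a big-endian value on either host.
 * The function body never tests the byte order itself; the
 * only endian-specific step is wrapped in IF_LE(). */
uint32_t load_u32_be(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* native-order load */
    IF_LE(v = bswap32(v);)     /* extra fix-up step on little-endian only */
    return v;
}
```

The caller sees one function with the same result on both byte orders, which is the property the assembly macros are meant to provide to the functions built on top of them.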
The benchmarks below were taken by running 'checkasm' on POWER8 in little-endian and big-endian modes.
Little-endian benchmark
Function | C | Assembly | C intrinsics |
---|---|---|---|
asd8 | 503 | 134 | |
avg_4x2 | 92 | 60 | |
avg_4x4 | 152 | 79 | |
avg_4x8 | 273 | 121 | |
avg_4x16 | 622 | 240 | |
avg_8x4 | 265 | 77 | |
avg_8x8 | 596 | 120 | |
avg_8x16 | 1194 | 211 | |
avg_16x8 | 1138 | 189 | |
avg_16x16 | 2318 | 336 | |
coeff_last4 | 24 | 20 | |
coeff_last8 | 29 | 21 | |
coeff_last15 | 39 | 27 | |
coeff_last16 | 37 | 25 | |
coeff_last64 | 228 | 60 | |
coeff_level_run4 | 39 | 35 | |
coeff_level_run8 | 49 | 47 | |
coeff_level_run15 | 70 | 69 | |
coeff_level_run16 | 74 | 67 | |
dct4x4dc | 123 | 50 | |
deblock_chroma[1] | 246 | 85 | |
deblock_chroma_420_intra_mbaff | 103 | 93 | |
deblock_chroma_420_mbaff | 153 | 106 | |
deblock_chroma_422_intra_mbaff | 210 | 124 | |
deblock_chroma_422_mbaff | 255 | 126 | |
deblock_chroma_intra[1] | 273 | 65 | |
deblock_h_chroma_420 | 254 | 126 | |
deblock_h_chroma_420_intra | 205 | 124 | |
deblock_h_chroma_422 | 551 | 233 | |
deblock_h_chroma_422_intra | 412 | 217 | |
deblock_strength | 502 | 124 | |
decimate_score15 | 66 | 33 | |
decimate_score16 | 69 | 35 | |
decimate_score64 | 376 | 75 | |
denoise_dct | 617 | 101 | |
idct4x4dc | 117 | 46 | |
integral_init4h | 608 | 183 | |
integral_init4v | 781 | 175 | |
integral_init8h | 579 | 273 | |
integral_init8v | 287 | 86 | |
intra_predict_4x4_ddl | 44 | 27 | |
intra_predict_4x4_ddr | 51 | 36 | |
intra_predict_8x8_dc | 48 | 38 | |
intra_predict_8x8_ddl | 123 | 51 | |
intra_predict_8x8_ddr | 126 | 49 | |
intra_predict_8x8_h | 44 | 31 | |
intra_predict_8x8_hd | 122 | 52 | |
intra_predict_8x8_hu | 71 | 55 | |
intra_predict_8x8_v | 25 | 23 | |
intra_predict_8x8_vl | 113 | 46 | |
intra_predict_8x8_vr | 119 | 52 | |
intra_predict_8x8c_h | 43 | 42 | |
intra_predict_8x8c_v | 25 | 23 | |
intra_predict_8x16c_dc | 96 | 86 | |
intra_predict_8x16c_dcl | 68 | 68 | |
intra_predict_8x16c_h | 74 | 67 | |
intra_predict_8x16c_v | 42 | 26 | |
mbtree_propagate_cost | 3368 | 849 | |
mbtree_propagate_list | 7973 | 5689 | |
offsetadd_w4 | 469 | 120 | |
offsetadd_w8 | 808 | 121 | |
offsetadd_w16 | 1485 | 138 | |
offsetadd_w20 | 5266 | 176 | |
offsetsub_w4 | 430 | 120 | |
offsetsub_w8 | 739 | 122 | |
offsetsub_w16 | 1455 | 109 | |
offsetsub_w20 | 5462 | 164 | |
sa8d_8x8 | 833 | 134 | 163 |
sa8d_16x16 | 3249 | 413 | 665 |
sa8d_satd_16x16 | 578 | ||
ssd_4x4 | 74 | 48 | |
ssd_4x8 | 133 | 65 | |
ssd_8x4 | 130 | 44 | |
ssd_8x8 | 250 | 64 | 106 |
ssd_8x16 | 531 | 107 | |
ssd_16x8 | 493 | 86 | |
ssd_16x16 | 966 | 128 | 233 |
ssd_nv12 | 102376 | 2872 | |
ssim_end | 222 | 88 | |
sub8x16_dct_dc | 338 | 128 | |
var2_8x8 | 513 | 124 | |
var2_8x16 | 932 | 222 | |
vsad | 1140 | 151 | |
zigzag_sub_4x4_field | 87 | 58 | |
zigzag_sub_4x4_frame | 87 | 58 | |
zigzag_sub_4x4ac_field | 82 | 61 | |
zigzag_sub_4x4ac_frame | 84 | 65 | |
Big-endian benchmark
Function | C | Assembly | C intrinsics |
---|---|---|---|
asd8 | 501 | 99 | |
avg_4x2 | 96 | 57 | |
avg_4x4 | 151 | 74 | |
avg_4x8 | 267 | 119 | |
avg_4x16 | 632 | 241 | |
avg_8x4 | 266 | 71 | |
avg_8x8 | 584 | 114 | |
avg_8x16 | 1197 | 189 | |
avg_16x8 | 1105 | 211 | |
avg_16x16 | 2283 | 340 | |
coeff_last15 | 39 | 27 | |
coeff_last16 | 40 | 28 | |
coeff_last64 | 214 | 60 | |
coeff_level_run8 | 50 | 48 | |
coeff_level_run15 | 69 | 65 | |
coeff_level_run16 | 73 | 61 | |
dct4x4dc | 126 | 43 | |
deblock_chroma[1] | 258 | 73 | |
deblock_chroma_420_intra_mbaff | 103 | 89 | |
deblock_chroma_420_mbaff | 155 | 100 | |
deblock_chroma_422_intra_mbaff | 209 | 125 | |
deblock_chroma_422_mbaff | 260 | 117 | |
deblock_chroma_intra[1] | 282 | 61 | |
deblock_h_chroma_420 | 254 | 117 | |
deblock_h_chroma_420_intra | 204 | 125 | |
deblock_h_chroma_422 | 549 | 193 | |
deblock_h_chroma_422_intra | 422 | 207 | |
deblock_strength | 515 | 107 | |
decimate_score15 | 67 | 35 | |
decimate_score16 | 68 | 36 | |
decimate_score64 | 283 | 63 | |
denoise_dct | 513 | 94 | |
idct4x4dc | 119 | 40 | |
integral_init4h | 688 | 156 | |
integral_init4v | 779 | 108 | |
integral_init8h | 583 | 250 | |
integral_init8v | 291 | 102 | |
intra_predict_4x4_ddl | 49 | 32 | |
intra_predict_4x4_ddr | 53 | 38 | |
intra_predict_8x8_dc | 58 | 38 | |
intra_predict_8x8_ddl | 128 | 34 | |
intra_predict_8x8_ddr | 131 | 35 | |
intra_predict_8x8_h | 48 | 29 | |
intra_predict_8x8_hd | 117 | 35 | |
intra_predict_8x8_hu | 79 | 45 | |
intra_predict_8x8_v | 32 | 27 | |
intra_predict_8x8_vl | 117 | 34 | |
intra_predict_8x8_vr | 122 | 44 | |
intra_predict_8x8c_dcl | 40 | 36 | |
intra_predict_8x8c_dct | 39 | 29 | |
intra_predict_8x8c_h | 47 | 34 | |
intra_predict_8x8c_v | 31 | 26 | |
intra_predict_8x16c_dc | 101 | 62 | |
intra_predict_8x16c_dcl | 71 | 53 | |
intra_predict_8x16c_dct | 48 | 33 | |
intra_predict_8x16c_h | 80 | 49 | |
intra_predict_8x16c_v | 42 | 32 | |
mbtree_propagate_cost | 5870 | 839 | |
mbtree_propagate_list | 7510 | 5448 | |
offsetadd_w4 | 431 | 123 | |
offsetadd_w8 | 765 | 121 | |
offsetadd_w16 | 1468 | 91 | |
offsetadd_w20 | 6426 | 139 | |
offsetsub_w4 | 439 | 121 | |
offsetsub_w8 | 763 | 112 | |
offsetsub_w16 | 1521 | 94 | |
offsetsub_w20 | 6434 | 137 | |
sa8d_8x8 | 825 | 125 | 482 |
sa8d_16x16 | 3212 | 393 | 1917 |
sa8d_satd_16x16 | 561 | ||
ssd_4x4 | 77 | 47 | |
ssd_4x8 | 130 | 63 | |
ssd_8x4 | 134 | 39 | |
ssd_8x8 | 257 | 57 | 640 |
ssd_8x16 | 537 | 96 | |
ssd_16x8 | 495 | 70 | |
ssd_16x16 | 966 | 120 | 791 |
ssd_nv12 | 101814 | 2370 | |
ssim_end | 311 | 85 | |
sub8x16_dct_dc | 340 | 114 | |
var2_8x8 | 514 | 117 | |
var2_8x16 | 934 | 199 | |
vsad | 1139 | 138 | |
zigzag_sub_4x4_field | 86 | 59 | |
zigzag_sub_4x4_frame | 87 | 60 | |
zigzag_sub_4x4ac_field | 87 | 65 | |
zigzag_sub_4x4ac_frame | 87 | 61 | |
As the benchmarks show, the performance gain over C is larger in big-endian mode than in little-endian mode. Part of this gap comes from the fact that vector load/store operations in little-endian mode need a permute for each operation. However, Power ISA 3.00 introduces additional load/store instructions that eliminate the permute on little-endian, along with useful instructions such as 'Vector Absolute Difference', 'Vector Insert/Extract Element', and 'Count Trailing Zeros'.
This patch also adds a new CPU detection method for Linux and FreeBSD that is easy to extend for newer ISA versions.
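One common way to detect the ISA level on both Linux and FreeBSD is to read the `AT_HWCAP2` auxiliary vector entry, and the sketch below shows that approach. It is not necessarily the method this patch uses. The flag names `CPU_PPC_ISA_2_07` and `CPU_PPC_ISA_3_00` and both function names are hypothetical. The fallback `#define` values mirror the PowerPC hwcap bits from the Linux kernel headers and only take effect when the system headers do not provide them.

```c
#include <stdint.h>

#if defined(__linux__) || defined(__FreeBSD__)
#include <sys/auxv.h>   /* getauxval() on Linux, elf_aux_info() on FreeBSD */
#endif

/* Fallbacks so the sketch also compiles where the PowerPC
 * definitions are not exposed by the system headers. */
#ifndef AT_HWCAP2
#define AT_HWCAP2 26
#endif
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000UL
#endif
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00 0x00800000UL
#endif

/* Hypothetical capability flags, one bit per ISA level. */
enum {
    CPU_PPC_ISA_2_07 = 1 << 0,   /* POWER8 and later */
    CPU_PPC_ISA_3_00 = 1 << 1    /* POWER9 and later */
};

/* Pure mapping from the raw hwcap2 word to capability flags.
 * Supporting a newer ISA version means adding one more test here. */
uint32_t ppc_flags_from_hwcap2(unsigned long hwcap2)
{
    uint32_t flags = 0;
    if (hwcap2 & PPC_FEATURE2_ARCH_2_07) flags |= CPU_PPC_ISA_2_07;
    if (hwcap2 & PPC_FEATURE2_ARCH_3_00) flags |= CPU_PPC_ISA_3_00;
    return flags;
}

/* Queries the OS for hwcap2; returns 0 on non-PowerPC or unsupported hosts. */
uint32_t ppc_cpu_detect(void)
{
    unsigned long hwcap2 = 0;
#if defined(__powerpc__) || defined(__powerpc64__)
#  if defined(__linux__)
    hwcap2 = getauxval(AT_HWCAP2);
#  elif defined(__FreeBSD__)
    /* Leaves hwcap2 at 0 if the entry cannot be read. */
    elf_aux_info(AT_HWCAP2, &hwcap2, sizeof(hwcap2));
#  endif
#endif
    return ppc_flags_from_hwcap2(hwcap2);
}
```

Keeping the bit-to-flag mapping separate from the OS query means the OS-specific part stays two lines long, and the mapping can be exercised on any host.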
checkasm passes the tests for the implemented functions on both POWER8 and POWER9 in little-endian and big-endian modes.