
Optimize missing PowerPC functions

Mamone Tarsha requested to merge mamonet/x264:master into master

This patch optimizes several functions for PowerPC that previously had no hardware-specific optimizations. The functions are implemented using Power ISA 2.07, so the minimum processor these optimizations target is POWER8. They are written in pure assembly rather than C intrinsics for the following reasons:

  • It follows the same scheme as the other architectures, so the code is easy to extend and modify and can stay on par with them.
  • It gives more flexibility and control over the instructions used, which allows close-to-optimal performance by interleaving independent instructions to saturate the execution units.
  • It avoids differences in behavior between compiler variants, which yield different results in various circumstances.
  • The endianness handling is isolated in macros defined in 'asm.S', so there is no need to worry about endianness while implementing functions.
  • It is easy to upgrade to newer ISA versions for further optimizations.
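
The point about interleaving independent instructions can be illustrated in plain C. The sketch below is not code from this patch; the function names are made up for illustration. It computes a sum of squared differences twice: once with a single accumulator, where every addition depends on the previous one, and once with two independent accumulators, so that consecutive operations do not wait on each other. Hand-written assembly applies the same idea at the instruction level.

```c
#include <stddef.h>
#include <stdint.h>

/* Single dependency chain: each addition must wait for the previous one. */
uint32_t ssd_single(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = a[i] - b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}

/* Two independent chains (s0 and s1): the work for element i and for
 * element i + 1 has no data dependency, so it can be issued back to back. */
uint32_t ssd_interleaved(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int d0 = a[i] - b[i];
        int d1 = a[i + 1] - b[i + 1];
        s0 += (uint32_t)(d0 * d0);
        s1 += (uint32_t)(d1 * d1);
    }
    if (i < n) {                      /* odd-length tail */
        int d = a[i] - b[i];
        s0 += (uint32_t)(d * d);
    }
    return s0 + s1;                   /* merge the chains once at the end */
}
```

Both functions return the same value; only the dependency structure differs. Whether the second form is actually faster depends on the compiler and the core, which is part of the reason for writing such loops in assembly.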

The following benchmarks were taken by running 'checkasm' on POWER8 in little-endian and big-endian modes.

Little-endian benchmark


Function | C | Assembly | C intrinsics
asd8 503 134
avg_4x2 92 60
avg_4x4 152 79
avg_4x8 273 121
avg_4x16 622 240
avg_8x4 265 77
avg_8x8 596 120
avg_8x16 1194 211
avg_16x8 1138 189
avg_16x16 2318 336
coeff_last4 24 20
coeff_last8 29 21
coeff_last15 39 27
coeff_last16 37 25
coeff_last64 228 60
coeff_level_run4 39 35
coeff_level_run8 49 47
coeff_level_run15 70 69
coeff_level_run16 74 67
dct4x4dc 123 50
deblock_chroma[1] 246 85
deblock_chroma_420_intra_mbaff 103 93
deblock_chroma_420_mbaff 153 106
deblock_chroma_422_intra_mbaff 210 124
deblock_chroma_422_mbaff 255 126
deblock_chroma_intra[1] 273 65
deblock_h_chroma_420 254 126
deblock_h_chroma_420_intra 205 124
deblock_h_chroma_422 551 233
deblock_h_chroma_422_intra 412 217
deblock_strength 502 124
decimate_score15 66 33
decimate_score16 69 35
decimate_score64 376 75
denoise_dct 617 101
idct4x4dc 117 46
integral_init4h 608 183
integral_init4v 781 175
integral_init8h 579 273
integral_init8v 287 86
intra_predict_4x4_ddl 44 27
intra_predict_4x4_ddr 51 36
intra_predict_8x8_dc 48 38
intra_predict_8x8_ddl 123 51
intra_predict_8x8_ddr 126 49
intra_predict_8x8_h 44 31
intra_predict_8x8_hd 122 52
intra_predict_8x8_hu 71 55
intra_predict_8x8_v 25 23
intra_predict_8x8_vl 113 46
intra_predict_8x8_vr 119 52
intra_predict_8x8c_h 43 42
intra_predict_8x8c_v 25 23
intra_predict_8x16c_dc 96 86
intra_predict_8x16c_dcl 68 68
intra_predict_8x16c_h 74 67
intra_predict_8x16c_v 42 26
mbtree_propagate_cost 3368 849
mbtree_propagate_list 7973 5689
offsetadd_w4 469 120
offsetadd_w8 808 121
offsetadd_w16 1485 138
offsetadd_w20 5266 176
offsetsub_w4 430 120
offsetsub_w8 739 122
offsetsub_w16 1455 109
offsetsub_w20 5462 164
sa8d_8x8 833 134 163
sa8d_16x16 3249 413 665
sa8d_satd_16x16 578
ssd_4x4 74 48
ssd_4x8 133 65
ssd_8x4 130 44
ssd_8x8 250 64 106
ssd_8x16 531 107
ssd_16x8 493 86
ssd_16x16 966 128 233
ssd_nv12 102376 2872
ssim_end 222 88
sub8x16_dct_dc 338 128
var2_8x8 513 124
var2_8x16 932 222
vsad 1140 151
zigzag_sub_4x4_field 87 58
zigzag_sub_4x4_frame 87 58
zigzag_sub_4x4ac_field 82 61
zigzag_sub_4x4ac_frame 84 65

Big-endian benchmark


Function | C | Assembly | C intrinsics
asd8 501 99
avg_4x2 96 57
avg_4x4 151 74
avg_4x8 267 119
avg_4x16 632 241
avg_8x4 266 71
avg_8x8 584 114
avg_8x16 1197 189
avg_16x8 1105 211
avg_16x16 2283 340
coeff_last15 39 27
coeff_last16 40 28
coeff_last64 214 60
coeff_level_run8 50 48
coeff_level_run15 69 65
coeff_level_run16 73 61
dct4x4dc 126 43
deblock_chroma[1] 258 73
deblock_chroma_420_intra_mbaff 103 89
deblock_chroma_420_mbaff 155 100
deblock_chroma_422_intra_mbaff 209 125
deblock_chroma_422_mbaff 260 117
deblock_chroma_intra[1] 282 61
deblock_h_chroma_420 254 117
deblock_h_chroma_420_intra 204 125
deblock_h_chroma_422 549 193
deblock_h_chroma_422_intra 422 207
deblock_strength 515 107
decimate_score15 67 35
decimate_score16 68 36
decimate_score64 283 63
denoise_dct 513 94
idct4x4dc 119 40
integral_init4h 688 156
integral_init4v 779 108
integral_init8h 583 250
integral_init8v 291 102
intra_predict_4x4_ddl 49 32
intra_predict_4x4_ddr 53 38
intra_predict_8x8_dc 58 38
intra_predict_8x8_ddl 128 34
intra_predict_8x8_ddr 131 35
intra_predict_8x8_h 48 29
intra_predict_8x8_hd 117 35
intra_predict_8x8_hu 79 45
intra_predict_8x8_v 32 27
intra_predict_8x8_vl 117 34
intra_predict_8x8_vr 122 44
intra_predict_8x8c_dcl 40 36
intra_predict_8x8c_dct 39 29
intra_predict_8x8c_h 47 34
intra_predict_8x8c_v 31 26
intra_predict_8x16c_dc 101 62
intra_predict_8x16c_dcl 71 53
intra_predict_8x16c_dct 48 33
intra_predict_8x16c_h 80 49
intra_predict_8x16c_v 42 32
mbtree_propagate_cost 5870 839
mbtree_propagate_list 7510 5448
offsetadd_w4 431 123
offsetadd_w8 765 121
offsetadd_w16 1468 91
offsetadd_w20 6426 139
offsetsub_w4 439 121
offsetsub_w8 763 112
offsetsub_w16 1521 94
offsetsub_w20 6434 137
sa8d_8x8 825 125 482
sa8d_16x16 3212 393 1917
sa8d_satd_16x16 561
ssd_4x4 77 47
ssd_4x8 130 63
ssd_8x4 134 39
ssd_8x8 257 57 640
ssd_8x16 537 96
ssd_16x8 495 70
ssd_16x16 966 120 791
ssd_nv12 101814 2370
ssim_end 311 85
sub8x16_dct_dc 340 114
var2_8x8 514 117
var2_8x16 934 199
vsad 1139 138
zigzag_sub_4x4_field 86 59
zigzag_sub_4x4_frame 87 60
zigzag_sub_4x4ac_field 87 65
zigzag_sub_4x4ac_frame 87 61
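
The speedup implied by the two tables above is the ratio of the C column to the Assembly column. The following is a minimal sketch of that arithmetic for three rows copied from the tables, assuming that lower numbers mean faster code; the struct and function names are illustrative only.

```c
#include <stddef.h>
#include <stdio.h>

/* One function's timings in both endian modes, copied from the tables. */
struct bench_row {
    const char *name;
    double c_le, asm_le;   /* little-endian: C, Assembly */
    double c_be, asm_be;   /* big-endian:    C, Assembly */
};

const struct bench_row rows[] = {
    { "dct4x4dc",              123,    50,   126,    43   },
    { "intra_predict_8x8_ddl", 123,    51,   128,    34   },
    { "ssd_nv12",              102376, 2872, 101814, 2370 },
};

/* Speedup of the assembly version over the C version. */
double speedup(double c_time, double asm_time)
{
    return c_time / asm_time;
}

/* Print the little-endian and big-endian speedups side by side. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-24s LE %.2fx  BE %.2fx\n", rows[i].name,
               speedup(rows[i].c_le, rows[i].asm_le),
               speedup(rows[i].c_be, rows[i].asm_be));
}
```

For ssd_nv12, for example, this gives roughly 35.6x in little-endian mode and 43.0x in big-endian mode.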

As the benchmarks show, the speedup is larger in big-endian mode than in little-endian mode. Part of this gap is because vector load/store operations in little-endian mode need an extra permute for each operation. However, Power ISA 3.00 introduces additional load/store instructions that eliminate the permute on little-endian, along with useful instructions such as 'Vector Absolute Difference', 'Vector Insert/Extract Element', and 'Count Trailing Zeros'.

This patch also adds a new CPU detection method for Linux and FreeBSD that is easy to extend for newer ISA versions.
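
One common way to detect the ISA level on Linux and FreeBSD is to read the AT_HWCAP2 entry of the ELF auxiliary vector. The sketch below assumes that approach; it is not taken from this patch, and the flag names and table layout are hypothetical. The two bit values are local copies of PPC_FEATURE2_ARCH_2_07 and PPC_FEATURE2_ARCH_3_00 from the platform headers. Supporting a newer ISA version only requires one more row in the table.

```c
#include <stddef.h>
#if defined(__linux__) || defined(__FreeBSD__)
#include <sys/auxv.h>   /* getauxval() on Linux, elf_aux_info() on FreeBSD */
#endif

/* Local copies of the PowerPC AT_HWCAP2 bits (see lead-in). */
#define HWCAP2_ARCH_2_07 0x80000000UL   /* Power ISA 2.07 (POWER8) */
#define HWCAP2_ARCH_3_00 0x00800000UL   /* Power ISA 3.00 (POWER9) */

/* Hypothetical flags for this sketch, not x264's real CPU flags. */
enum { ISA_2_07 = 1u << 0, ISA_3_00 = 1u << 1 };

/* Extending to a newer ISA means adding one row here. */
static const struct { unsigned long hwcap2_bit; unsigned flag; } isa_table[] = {
    { HWCAP2_ARCH_2_07, ISA_2_07 },
    { HWCAP2_ARCH_3_00, ISA_3_00 },
};

/* Pure mapping from a raw AT_HWCAP2 value to the sketch's flags. */
unsigned isa_flags_from_hwcap2(unsigned long hwcap2)
{
    unsigned flags = 0;
    for (size_t i = 0; i < sizeof(isa_table) / sizeof(isa_table[0]); i++)
        if (hwcap2 & isa_table[i].hwcap2_bit)
            flags |= isa_table[i].flag;
    return flags;
}

/* Read AT_HWCAP2 from the OS; returns 0 on non-PowerPC or unsupported hosts,
 * because the bits only have the meaning above on PowerPC. */
unsigned long read_hwcap2(void)
{
#if (defined(__powerpc__) || defined(__powerpc64__)) && defined(__linux__)
    return getauxval(AT_HWCAP2);
#elif (defined(__powerpc__) || defined(__powerpc64__)) && defined(__FreeBSD__)
    unsigned long hwcap2 = 0;
    if (elf_aux_info(AT_HWCAP2, &hwcap2, sizeof(hwcap2)) != 0)
        return 0;
    return hwcap2;
#else
    return 0;
#endif
}
```

A caller would use `isa_flags_from_hwcap2(read_hwcap2())` once at startup and select the function implementations accordingly.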

checkasm passes for all of the implemented functions on both POWER8 and POWER9, in little-endian and big-endian modes.
