Optimize missing PowerPC functions (!43) · Merge requests · VideoLAN / x264

This patch optimizes several PowerPC functions that previously had no hardware-specific implementation. The functions take advantage of Power ISA 2.07, so the minimum processor these optimizations target is POWER8. They are implemented in pure assembly rather than C intrinsics for the following reasons:

Following the same scheme as the other architectures makes the code easy to extend and modify, so PowerPC stays on par with them.
Assembly gives more flexibility and control over instruction selection, which allows close-to-optimal performance by interleaving independent instructions to saturate the execution units.
It avoids differences between compiler variants, which can yield different results in various circumstances.
Endianness handling is isolated in macros defined in 'asm.S', so there is no need to worry about endianness while implementing functions.
The code is easy to upgrade to new ISA versions for further optimizations.
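
The endianness point can be illustrated with a rough C analogy. This is only a sketch of the general idea, not the contents of 'asm.S': the names `SKETCH_BIG_ENDIAN`, `load_pixels4`, `PIXEL` and `last_nonzero4` are hypothetical, and the byte-order test assumes the GCC/Clang predefined macros.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time byte-order test. Assumes the GCC/Clang predefined macros;
 * other compilers fall through to the little-endian branch in this sketch. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define SKETCH_BIG_ENDIAN 1
#else
#define SKETCH_BIG_ENDIAN 0
#endif

/* Hypothetical helper: load four pixels so that pixel 0 always lands in the
 * low byte of the result, whatever the host byte order is. All endianness
 * handling lives here. */
static inline uint32_t load_pixels4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if SKETCH_BIG_ENDIAN
    /* Byte-swap so the layout matches the little-endian case. */
    v = (v >> 24) | ((v >> 8) & 0x0000ff00u) |
        ((v << 8) & 0x00ff0000u) | (v << 24);
#endif
    return v;
}

/* Extract pixel i (0..3) from a value returned by load_pixels4(). */
#define PIXEL(v, i) (((v) >> (8 * (i))) & 0xffu)

/* Example function body written once, with no endianness checks of its own:
 * returns the index of the last nonzero pixel, or -1 if all four are zero. */
static inline int last_nonzero4(const uint8_t *p)
{
    uint32_t v = load_pixels4(p);
    for (int i = 3; i >= 0; i--)
        if (PIXEL(v, i) != 0)
            return i;
    return -1;
}
```

The function body only uses `load_pixels4` and `PIXEL`, so the same source gives the same answer on either byte order. The assembly macros serve the same purpose at the instruction level.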

The following benchmarks were taken by running 'checkasm' on POWER8 in little-endian and big-endian modes.

Little-endian benchmark

Function	C	Assembly	C intrinsics
asd8	503	134
avg_4x2	92	60
avg_4x4	152	79
avg_4x8	273	121
avg_4x16	622	240
avg_8x4	265	77
avg_8x8	596	120
avg_8x16	1194	211
avg_16x8	1138	189
avg_16x16	2318	336
coeff_last4	24	20
coeff_last8	29	21
coeff_last15	39	27
coeff_last16	37	25
coeff_last64	228	60
coeff_level_run4	39	35
coeff_level_run8	49	47
coeff_level_run15	70	69
coeff_level_run16	74	67
dct4x4dc	123	50
deblock_chroma[1]	246	85
deblock_chroma_420_intra_mbaff	103	93
deblock_chroma_420_mbaff	153	106
deblock_chroma_422_intra_mbaff	210	124
deblock_chroma_422_mbaff	255	126
deblock_chroma_intra[1]	273	65
deblock_h_chroma_420	254	126
deblock_h_chroma_420_intra	205	124
deblock_h_chroma_422	551	233
deblock_h_chroma_422_intra	412	217
deblock_strength	502	124
decimate_score15	66	33
decimate_score16	69	35
decimate_score64	376	75
denoise_dct	617	101
idct4x4dc	117	46
integral_init4h	608	183
integral_init4v	781	175
integral_init8h	579	273
integral_init8v	287	86
intra_predict_4x4_ddl	44	27
intra_predict_4x4_ddr	51	36
intra_predict_8x8_dc	48	38
intra_predict_8x8_ddl	123	51
intra_predict_8x8_ddr	126	49
intra_predict_8x8_h	44	31
intra_predict_8x8_hd	122	52
intra_predict_8x8_hu	71	55
intra_predict_8x8_v	25	23
intra_predict_8x8_vl	113	46
intra_predict_8x8_vr	119	52
intra_predict_8x8c_h	43	42
intra_predict_8x8c_v	25	23
intra_predict_8x16c_dc	96	86
intra_predict_8x16c_dcl	68	68
intra_predict_8x16c_h	74	67
intra_predict_8x16c_v	42	26
mbtree_propagate_cost	3368	849
mbtree_propagate_list	7973	5689
offsetadd_w4	469	120
offsetadd_w8	808	121
offsetadd_w16	1485	138
offsetadd_w20	5266	176
offsetsub_w4	430	120
offsetsub_w8	739	122
offsetsub_w16	1455	109
offsetsub_w20	5462	164
sa8d_8x8	833	134	163
sa8d_16x16	3249	413	665
sa8d_satd_16x16		578
ssd_4x4	74	48
ssd_4x8	133	65
ssd_8x4	130	44
ssd_8x8	250	64	106
ssd_8x16	531	107
ssd_16x8	493	86
ssd_16x16	966	128	233
ssd_nv12	102376	2872
ssim_end	222	88
sub8x16_dct_dc	338	128
var2_8x8	513	124
var2_8x16	932	222
vsad	1140	151
zigzag_sub_4x4_field	87	58
zigzag_sub_4x4_frame	87	58
zigzag_sub_4x4ac_field	82	61
zigzag_sub_4x4ac_frame	84	65

Big-endian benchmark

Function	C	Assembly	C intrinsics
asd8	501	99
avg_4x2	96	57
avg_4x4	151	74
avg_4x8	267	119
avg_4x16	632	241
avg_8x4	266	71
avg_8x8	584	114
avg_8x16	1197	189
avg_16x8	1105	211
avg_16x16	2283	340
coeff_last15	39	27
coeff_last16	40	28
coeff_last64	214	60
coeff_level_run8	50	48
coeff_level_run15	69	65
coeff_level_run16	73	61
dct4x4dc	126	43
deblock_chroma[1]	258	73
deblock_chroma_420_intra_mbaff	103	89
deblock_chroma_420_mbaff	155	100
deblock_chroma_422_intra_mbaff	209	125
deblock_chroma_422_mbaff	260	117
deblock_chroma_intra[1]	282	61
deblock_h_chroma_420	254	117
deblock_h_chroma_420_intra	204	125
deblock_h_chroma_422	549	193
deblock_h_chroma_422_intra	422	207
deblock_strength	515	107
decimate_score15	67	35
decimate_score16	68	36
decimate_score64	283	63
denoise_dct	513	94
idct4x4dc	119	40
integral_init4h	688	156
integral_init4v	779	108
integral_init8h	583	250
integral_init8v	291	102
intra_predict_4x4_ddl	49	32
intra_predict_4x4_ddr	53	38
intra_predict_8x8_dc	58	38
intra_predict_8x8_ddl	128	34
intra_predict_8x8_ddr	131	35
intra_predict_8x8_h	48	29
intra_predict_8x8_hd	117	35
intra_predict_8x8_hu	79	45
intra_predict_8x8_v	32	27
intra_predict_8x8_vl	117	34
intra_predict_8x8_vr	122	44
intra_predict_8x8c_dcl	40	36
intra_predict_8x8c_dct	39	29
intra_predict_8x8c_h	47	34
intra_predict_8x8c_v	31	26
intra_predict_8x16c_dc	101	62
intra_predict_8x16c_dcl	71	53
intra_predict_8x16c_dct	48	33
intra_predict_8x16c_h	80	49
intra_predict_8x16c_v	42	32
mbtree_propagate_cost	5870	839
mbtree_propagate_list	7510	5448
offsetadd_w4	431	123
offsetadd_w8	765	121
offsetadd_w16	1468	91
offsetadd_w20	6426	139
offsetsub_w4	439	121
offsetsub_w8	763	112
offsetsub_w16	1521	94
offsetsub_w20	6434	137
sa8d_8x8	825	125	482
sa8d_16x16	3212	393	1917
sa8d_satd_16x16		561
ssd_4x4	77	47
ssd_4x8	130	63
ssd_8x4	134	39
ssd_8x8	257	57	640
ssd_8x16	537	96
ssd_16x8	495	70
ssd_16x16	966	120	791
ssd_nv12	101814	2370
ssim_end	311	85
sub8x16_dct_dc	340	114
var2_8x8	514	117
var2_8x16	934	199
vsad	1139	138
zigzag_sub_4x4_field	86	59
zigzag_sub_4x4_frame	87	60
zigzag_sub_4x4ac_field	87	65
zigzag_sub_4x4ac_frame	87	61
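
To compare the two tables, it can help to express each row as a speedup, i.e. the C timing divided by the assembly timing. The sketch below does this for a few rows copied from the tables above. The struct and function names are made up for illustration and are not part of checkasm.

```c
#include <assert.h>
#include <stdio.h>

/* One benchmark row: checkasm timings for the C and assembly versions. */
struct bench_row {
    const char *name;
    int c_le, asm_le;   /* little-endian table */
    int c_be, asm_be;   /* big-endian table */
};

/* A few rows copied from the two tables above. */
const struct bench_row sample_rows[] = {
    { "integral_init4v",          781,  175,    779,  108 },
    { "intra_predict_8x8_ddl",    123,   51,    128,   34 },
    { "offsetadd_w16",           1485,  138,   1468,   91 },
    { "ssd_nv12",              102376, 2872, 101814, 2370 },
};

/* Speedup of assembly over C: how many times faster the assembly version is. */
double speedup(int c_time, int asm_time)
{
    assert(asm_time > 0);
    return (double)c_time / (double)asm_time;
}

/* Print the little-endian and big-endian speedup for each sample row. */
void print_speedups(void)
{
    size_t n = sizeof(sample_rows) / sizeof(sample_rows[0]);
    for (size_t i = 0; i < n; i++) {
        const struct bench_row *r = &sample_rows[i];
        printf("%-24s LE %6.2fx  BE %6.2fx\n", r->name,
               speedup(r->c_le, r->asm_le),
               speedup(r->c_be, r->asm_be));
    }
}
```

Calling `print_speedups()` from a `main` prints one line per function. The rows were picked for illustration; not every function follows the same pattern (for example, avg_16x8 is slightly faster in the little-endian table).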

As the benchmarks show, the performance gain over C is larger in big-endian mode than in little-endian mode. Part of this gap arises because vector load/store operations in little-endian mode need a permute for each operation. However, Power ISA 3.00 introduces additional load/store instructions that eliminate the permute on little-endian, along with useful instructions such as 'Vector Absolute Difference', 'Vector Insert/Extract Element', and 'Count Trailing Zeros'. This patch also adds a new CPU detection method for Linux and FreeBSD that is easy to extend for newer ISA versions.
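
As a rough illustration of how ISA-level detection can work on these systems, the sketch below reads the `AT_HWCAP2` auxiliary vector entry (via `getauxval` on Linux, `elf_aux_info` on FreeBSD) and decodes the ISA 2.07 and 3.00 bits. This is not the patch's actual detection code: the enum and function names are hypothetical, and the fallback bit values are taken from the Linux powerpc headers so that the sketch also compiles on other hosts.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__powerpc__) || defined(__powerpc64__)
# if defined(__linux__) || defined(__FreeBSD__)
#  include <sys/auxv.h>   /* getauxval() on Linux, elf_aux_info() on FreeBSD */
# endif
#endif

/* AT_HWCAP2 feature bits as defined by the Linux powerpc headers.
 * Defined here only as a fallback when the system headers lack them. */
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000UL
#endif
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00 0x00800000UL
#endif

/* Hypothetical ISA levels; a newer ISA would be added as a further value. */
enum ppc_isa { PPC_ISA_BASE = 0, PPC_ISA_2_07 = 1, PPC_ISA_3_00 = 2 };

/* Decode the highest supported ISA level from an AT_HWCAP2 value.
 * Newer levels are checked first because a newer CPU reports older bits too. */
enum ppc_isa ppc_isa_from_hwcap2(unsigned long hwcap2)
{
    if (hwcap2 & PPC_FEATURE2_ARCH_3_00) return PPC_ISA_3_00;
    if (hwcap2 & PPC_FEATURE2_ARCH_2_07) return PPC_ISA_2_07;
    return PPC_ISA_BASE;
}

/* Read AT_HWCAP2 from the OS; returns 0 on hosts this sketch does not cover. */
unsigned long read_hwcap2(void)
{
#if (defined(__powerpc__) || defined(__powerpc64__)) && defined(__linux__)
    return getauxval(AT_HWCAP2);
#elif (defined(__powerpc__) || defined(__powerpc64__)) && defined(__FreeBSD__)
    unsigned long v = 0;
    if (elf_aux_info(AT_HWCAP2, &v, sizeof(v)) != 0)
        v = 0;
    return v;
#else
    return 0;
#endif
}

/* Human-readable name for an ISA level. */
const char *ppc_isa_name(enum ppc_isa level)
{
    static const char *const names[] = {
        "pre-2.07", "2.07 (POWER8)", "3.00 (POWER9)"
    };
    assert((unsigned)level < sizeof(names) / sizeof(names[0]));
    return names[level];
}
```

A caller would use `ppc_isa_name(ppc_isa_from_hwcap2(read_hwcap2()))` to report the detected level. Keeping the decoding separate from the OS query means that supporting a newer ISA only needs one more bit test and one more enum value.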

checkasm passes the tests for the implemented functions on both POWER8 and POWER9 in little-endian and big-endian modes.
