AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters
The reduction parts of the horizontal HBD MC filters use
SRSHL+SQXTUN+SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional rounding
value (34 for 10-bit and 40 for 12-bit).

Relative runtime of micro benchmarks after this patch on some Cortex CPU
cores:

regular:      X1     A78     A76     A55
mc  w2:   0.847x  0.864x  0.822x  0.859x
mc  w4:   0.889x  0.994x  0.868x  0.917x
mc  w8:   0.857x  0.911x  0.915x  0.978x
mc w16:   0.890x  0.982x  0.868x  0.974x
mc w32:   0.904x  0.991x  0.873x  0.967x
mc w64:   0.919x  1.003x  0.860x  0.970x
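Why one saturating shift can stand in for the three-instruction sequence may be easier to see in a scalar model. The sketch below is not the assembly from this patch. It assumes the first rounding shift is by 6 - intermediate_bits and the second by intermediate_bits, with intermediate_bits = 14 - bitdepth, and that the result is finally clamped to the pixel maximum; none of this is stated in the commit message. Under those assumptions the two rounding terms fold into the 34 and 40 quoted above. All function names are hypothetical.

```python
def combined_rounding(bitdepth):
    """Fold both rounding terms into one constant (assumed shift amounts)."""
    intermediate_bits = 14 - bitdepth        # assumption: 4 for 10-bit, 2 for 12-bit
    sh1 = 6 - intermediate_bits              # assumption: first shift amount
    # The second rounding term is scaled up by the first shift.
    return (1 << (sh1 - 1)) + ((1 << (intermediate_bits - 1)) << sh1)


def two_step(total, bitdepth):
    """Scalar model of the SRSHL + SQXTUN + SRSHL reduction."""
    intermediate_bits = 14 - bitdepth
    sh1 = 6 - intermediate_bits
    x = (total + (1 << (sh1 - 1))) >> sh1                          # rounding shift
    x = min(max(x, 0), 0xFFFF)                                     # saturate to u16
    x = (x + (1 << (intermediate_bits - 1))) >> intermediate_bits  # rounding shift
    return min(x, (1 << bitdepth) - 1)                             # assumed pixel clamp


def single_step(total, bitdepth):
    """Scalar model of adding the rounding value, then one SQSHRUN by 6."""
    x = (total + combined_rounding(bitdepth)) >> 6                 # plain shift
    x = min(max(x, 0), 0xFFFF)                                     # saturate to u16
    return min(x, (1 << bitdepth) - 1)                             # assumed pixel clamp
```

In this model the two functions agree for every filter sum, including negative sums and sums large enough to saturate, because the intermediate saturation only triggers for values that the final clamp would cut off anyway.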
parent
d2687884