aarch64: Use rounded right shifts in dequant
Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.

                             Cortex A53    A72    A73
8bpc:
Before:
dequant_4x4_cqm_neon:               515    246    267
dequant_4x4_dc_cqm_neon:            410    265    266
dequant_4x4_dc_flat_neon:           413    271    271
dequant_4x4_flat_neon:              519    254    274
dequant_8x8_cqm_neon:              1555    980   1002
dequant_8x8_flat_neon:             1562    994   1014
After:
dequant_4x4_cqm_neon:               499    246    255
dequant_4x4_dc_cqm_neon:            376    265    255
dequant_4x4_dc_flat_neon:           378    271    260
dequant_4x4_flat_neon:              500    254    262
dequant_8x8_cqm_neon:              1489    900    925
dequant_8x8_flat_neon:             1493    915    938

10bpc:
Before:
dequant_4x4_cqm_neon:               483    275    275
dequant_4x4_dc_cqm_neon:            429    256    261
dequant_4x4_dc_flat_neon:           435    267    267
dequant_4x4_flat_neon:              487    283    288
dequant_8x8_cqm_neon:              1511   1112   1076
dequant_8x8_flat_neon:             1518   1139   1089
After:
dequant_4x4_cqm_neon:               472    255    239
dequant_4x4_dc_cqm_neon:            404    256    232
dequant_4x4_dc_flat_neon:           406    267    234
dequant_4x4_flat_neon:              472    255    239
dequant_8x8_cqm_neon:              1462    922    978
dequant_8x8_flat_neon:             1462    922    978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpp on
A72/A73.
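
For illustration only, here is a scalar C sketch of the equivalence the change relies on. It is not the assembly from this commit: the function names are hypothetical, the value ranges in the check are arbitrary, and it assumes that `>>` on a negative signed integer is an arithmetic shift (implementation-defined in C, but true of the usual AArch64 compilers). The first helper adds the rounding constant by hand before shifting; the second models what a rounded right shift (for example the AArch64 SRSHR/SRSHL family) computes in one step.

```c
#include <stdint.h>

/* Hypothetical model of the old approach: multiply, add the rounding
 * constant 1 << (shift - 1) explicitly, then do a plain arithmetic
 * right shift. Requires shift >= 1. */
static int32_t shift_with_manual_round(int32_t coef, int32_t mf, int shift)
{
    int32_t f = (int32_t)1 << (shift - 1);
    return (coef * mf + f) >> shift;
}

/* Hypothetical scalar model of a rounded right shift: the truncated
 * shift plus the last bit shifted out. Requires shift >= 1. */
static int32_t rounded_right_shift(int32_t x, int shift)
{
    return (x >> shift) + ((x >> (shift - 1)) & 1);
}

/* Counts inputs where the two formulations disagree, over an
 * arbitrary range chosen so that coef * mf cannot overflow. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int shift = 1; shift <= 6; shift++)
        for (int32_t mf = 1; mf <= 64; mf++)
            for (int32_t coef = -2048; coef < 2048; coef++)
                if (shift_with_manual_round(coef, mf, shift) !=
                    rounded_right_shift(coef * mf, shift))
                    mismatches++;
    return mismatches;
}
```

Since both formulations agree, the separate constant (and the multiply-add that folds it in) can be replaced by a plain multiply followed by the rounding shift.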