Commit dc755eab authored by Martin Storsjö

aarch64: Use rounded right shifts in dequant

Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.
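
As a rough scalar illustration of why the two forms give the same result
(this is a C model, not the NEON assembly from the patch; the function
names, the `mf` parameter and the sample values are hypothetical):

```c
#include <stdint.h>

/* Old form: add the rounding constant by hand, then do a plain
 * (truncating) arithmetic right shift. In the assembly, the add was
 * folded into the multiply via a multiply-accumulate instruction.
 * Assumes shift >= 1 and that >> on negative values is arithmetic,
 * as it is on mainstream compilers. */
static int32_t dequant_manual_round(int32_t coef, int32_t mf, int shift)
{
    int32_t f = (int32_t)1 << (shift - 1);   /* rounding constant */
    return (coef * mf + f) >> shift;
}

/* Scalar model of a rounding right shift: the hardware instruction
 * adds 1 << (shift - 1) internally before shifting, so no separate
 * rounding constant needs to be set up. */
static int32_t rounding_shift_right(int32_t v, int shift)
{
    return (int32_t)(((int64_t)v + ((int64_t)1 << (shift - 1))) >> shift);
}

/* New form: plain multiply followed by a rounding right shift. */
static int32_t dequant_rounded_shift(int32_t coef, int32_t mf, int shift)
{
    return rounding_shift_right(coef * mf, shift);
}
```

Both helpers return the same value as long as the intermediate product
plus the rounding constant fits in 32 bits; the second form simply
leaves the rounding to the shift instruction.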

                     Cortex A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938

10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
on A72/A73.
parent 4664f5aa
Pipeline #401959 passed with stages
in 3 minutes and 36 seconds