aarch64: Use rounded right shifts in dequant (dc755eab) · Commits · VideoLAN / x264

Commit dc755eab authored 1 year ago by Martin Storsjö

aarch64: Use rounded right shifts in dequant

Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.

                     Cortex A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938

10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpp
on A72/A73.

parent 4664f5aa

No related branches found

No related tags found

Pipeline #401959 passed with stages

in 3 minutes and 36 seconds

Hide whitespace changes

Inline Side-by-side

Showing with 33 additions and 69 deletions

Please register or to comment