arm/mc: Add 8 bit neon asm for avg, w_avg and mask

Martin Storsjö requested to merge mstorsjo/dav1d:arm_mc_asm into master

checkasm --bench numbers from a Snapdragon 835:

nop: 23.0
avg_w4_8bpc_c: 382.8
avg_w4_8bpc_neon: 35.7
avg_w8_8bpc_c: 590.5
avg_w8_8bpc_neon: 64.9
avg_w16_8bpc_c: 1310.1
avg_w16_8bpc_neon: 160.9
avg_w32_8bpc_c: 4108.3
avg_w32_8bpc_neon: 587.5
avg_w64_8bpc_c: 8406.4
avg_w64_8bpc_neon: 1368.7
avg_w128_8bpc_c: 19688.7
avg_w128_8bpc_neon: 3413.8
w_avg_w4_8bpc_c: 446.6
w_avg_w4_8bpc_neon: 52.8
w_avg_w8_8bpc_c: 736.0
w_avg_w8_8bpc_neon: 109.4
w_avg_w16_8bpc_c: 1840.4
w_avg_w16_8bpc_neon: 282.1
w_avg_w32_8bpc_c: 5984.5
w_avg_w32_8bpc_neon: 1079.6
w_avg_w64_8bpc_c: 12746.9
w_avg_w64_8bpc_neon: 2540.9
w_avg_w128_8bpc_c: 30252.6
w_avg_w128_8bpc_neon: 6372.0
mask_w4_8bpc_c: 489.9
mask_w4_8bpc_neon: 60.0
mask_w8_8bpc_c: 1109.5
mask_w8_8bpc_neon: 125.8
mask_w16_8bpc_c: 2713.3
mask_w16_8bpc_neon: 361.9
mask_w32_8bpc_c: 8999.7
mask_w32_8bpc_neon: 1371.3
mask_w64_8bpc_c: 19593.9
mask_w64_8bpc_neon: 3289.6
mask_w128_8bpc_c: 46820.6
mask_w128_8bpc_neon: 8773.1

This is a straight port of the corresponding aarch64 asm by @janne. Up to width 32 it is essentially 1:1 with the aarch64 version. For widths 64 and 128, there aren't quite enough registers to do things exactly the same way (and one vst1 can only store 32 bytes, as opposed to 64 bytes for a ld1). I tested pushing q4-q5 and using more registers, but didn't get any significant gain from it.

Compared to the aarch64 version, I had to make the following changes:

  • I had to move the jump table closer, since the ADR range in ARM mode is very limited.
  • I use 4-byte jump table entries, because ldrh r4, [r12, r4, lsl #2] isn't allowed to have the lsl #2 part. In Thumb mode we could use 2-byte entries, but that would complicate the code.
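
For readers unfamiliar with the width-based jump table, here is a rough C sketch of the dispatch idea. It is not the code in this MR: the handler names are hypothetical, the table holds function pointers rather than the code offsets the asm stores, and the index computation (count leading zeros of the width, minus 24) is an assumption about how the table is ordered. `__builtin_clz` is a GCC/Clang builtin.

```c
#include <stdint.h>

/* Hypothetical handlers standing in for the per-width asm loops.
 * Each returns the width it handles so the dispatch can be checked. */
static int handle_w4(void)   { return 4; }
static int handle_w8(void)   { return 8; }
static int handle_w16(void)  { return 16; }
static int handle_w32(void)  { return 32; }
static int handle_w64(void)  { return 64; }
static int handle_w128(void) { return 128; }

typedef int (*handler_fn)(void);

/* One entry per supported width, ordered from widest to narrowest
 * (assumed layout) so that clz(w) - 24 selects the matching entry. */
static const handler_fn jump_table[6] = {
    handle_w128, handle_w64, handle_w32,
    handle_w16,  handle_w8,  handle_w4,
};

/* Map a power-of-two width in [4, 128] to a table index.
 * For a 32-bit value, clz(128) == 24 and clz(4) == 29. */
static unsigned width_to_index(uint32_t w)
{
    return (unsigned)__builtin_clz(w) - 24u;
}

/* Look up the handler for this width and call it. */
static int dispatch(uint32_t w)
{
    return jump_table[width_to_index(w)]();
}
```

In the asm, each entry is an offset to a label rather than a pointer, so the entry size (2 or 4 bytes) determines how the index has to be scaled in the load, which is where the addressing-mode restriction mentioned above comes in.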

Once !202 (merged) is resolved one way or another, I'll make the corresponding change here as well, if we decide to use a different kind of symbol for the jump tables.

Edited by Martin Storsjö