src/arm/64/cdef.S · efd9e5518e0ed5114f8b4579debd7ee6dbede21f · VideoLAN / dav1d

arm: cdef: Do an 8 bit implementation for cases with all edges present · b33f46e8

Martin Storsjö authored Feb 13, 2020

This increases the code size by around 3 KB on arm64.

Before:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4

After:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully
edged case is only one of those three, although it is the most common
case in actual video. The difference is much bigger when benchmarking
only that particular case.)
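
To illustrate why an averaged figure understates the gain, here is a rough sketch in Python. It is not a measurement. It assumes, purely for illustration, that the three benchmarked cases are weighted equally, that they cost the same before the change, and that the two cases without all edges are unaffected. The helper name `fully_edged_speedup` is hypothetical; the two input numbers are the ARM64 Cortex A53 figures for cdef_filter_8x8_8bpc_neon from the tables above.

```python
def fully_edged_speedup(before_avg, after_avg, n_cases=3):
    """Estimate the speedup of the single case that changed.

    Assumptions (illustrative only, not taken from the commit):
    all n_cases cost before_avg each before the change, they are
    weighted equally, and only one of them got faster.
    """
    # The drop in the average, multiplied back out, is attributed
    # entirely to the one case that changed.
    saved_in_changed_case = (before_avg - after_avg) * n_cases
    after_case = before_avg - saved_in_changed_case
    if after_case <= 0:
        raise ValueError("assumptions do not hold for these inputs")
    return before_avg / after_case


# cdef_filter_8x8_8bpc_neon, ARM64, Cortex A53 (before, after)
avg_speedup = 1454.6 / 1160.0
case_speedup = fully_edged_speedup(1454.6, 1160.0)

print(f"speedup seen in the averaged benchmark: {avg_speedup:.2f}x")
print(f"estimated speedup of the fully edged case: {case_speedup:.2f}x")
```

Under these assumptions the estimate for the fully edged case is noticeably larger than the averaged ratio, which matches the direction of the remark above; the real per-case figure would have to be measured.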

This apparently makes the code slightly slower for the w=4 cases on
Cortex A8, while it is a significant speedup on all other cores.
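
The claim about Cortex A8 can be checked directly against the ARM32 numbers. The following Python sketch copies the figures from the two tables above and lists every function/core pair where the "After" number is higher (slower) than the "Before" number. The names `CORES`, `BEFORE`, `AFTER` and `regressions` are made up for this example.

```python
# ARM32 checkasm figures copied from the tables above (lower is faster).
CORES = ["A7", "A8", "A9", "A53", "A72", "A73"]

BEFORE = {
    "cdef_filter_4x4_8bpc_neon": [807.1, 517.0, 617.7, 506.6, 429.9, 357.8],
    "cdef_filter_4x8_8bpc_neon": [1407.9, 899.3, 1054.6, 862.3, 726.5, 628.1],
    "cdef_filter_8x8_8bpc_neon": [2394.9, 1456.8, 1676.8, 1461.2, 1084.4, 1101.2],
}

AFTER = {
    "cdef_filter_4x4_8bpc_neon": [669.3, 541.3, 524.4, 424.9, 322.7, 298.1],
    "cdef_filter_4x8_8bpc_neon": [1159.1, 922.9, 881.1, 709.2, 538.3, 514.1],
    "cdef_filter_8x8_8bpc_neon": [1888.8, 1285.4, 1358.5, 1152.9, 839.3, 871.2],
}


def regressions(before, after):
    """Return (function, core) pairs that got slower after the change."""
    slower = []
    for name, before_row in before.items():
        for core, b, a in zip(CORES, before_row, after[name]):
            if a > b:
                slower.append((name, core))
    return slower


for name, core in regressions(BEFORE, AFTER):
    print(f"{name} is slower on Cortex {core}")
```

With the numbers as given, only the 4x4 and 4x8 functions on Cortex A8 are reported; the 8x8 function is faster on A8 as well.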
