
arm: cdef: Do an 8 bit implementation for cases with all edges present

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-cdef-edges into master

This increases the code size by around 3.4 KB.

Before:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4

After:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully edged case is only one of those three, although it is the most common case in actual video. The difference is much bigger when benchmarking only that case.)
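As a rough illustration of the averaging effect, the sketch below estimates how many cycles the fully edged case alone would have saved. It rests on two assumptions that are not stated in this merge request: that checkasm reports a plain equal-weight mean over the three cases, and that the other two cases are unchanged by this patch. The function name is hypothetical, and the inputs are the ARM64 Cortex A53 8x8 numbers from the tables above.

```python
def cycles_saved_in_changed_case(avg_before, avg_after, n_cases=3):
    """Cycles saved in the single changed case.

    Assumption (not from the MR): the reported figure is an equal-weight
    mean over n_cases cases, and only one of those cases changed.
    """
    return n_cases * (avg_before - avg_after)


# ARM64, Cortex A53, cdef_filter_8x8_8bpc_neon (before, after) from the tables.
saved = cycles_saved_in_changed_case(1454.6, 1160.0)
print(f"estimated cycles saved in the fully edged case: {saved:.1f}")
```

Under those assumptions, the drop seen in the averaged number is only one third of the drop in the fully edged case itself.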

This appears to make the code slightly slower for the w=4 cases on Cortex A8, while it is a significant speedup on all other cores.
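To make the per-core effect easier to read off, here is a small sketch that computes the before/after ratio for the ARM32 numbers quoted above and lists the entries that got slower. The variable and function names are illustrative only; the data is copied from the tables in this description.

```python
CORES = ["A7", "A8", "A9", "A53", "A72", "A73"]

# ARM32 checkasm numbers copied from the Before/After tables.
BEFORE = {
    "4x4": [807.1, 517.0, 617.7, 506.6, 429.9, 357.8],
    "4x8": [1407.9, 899.3, 1054.6, 862.3, 726.5, 628.1],
    "8x8": [2394.9, 1456.8, 1676.8, 1461.2, 1084.4, 1101.2],
}
AFTER = {
    "4x4": [669.3, 541.3, 524.4, 424.9, 322.7, 298.1],
    "4x8": [1159.1, 922.9, 881.1, 709.2, 538.3, 514.1],
    "8x8": [1888.8, 1285.4, 1358.5, 1152.9, 839.3, 871.2],
}


def speedups(before, after):
    """Return {(block, core): before/after}; values above 1.0 mean faster."""
    result = {}
    for block, old_values in before.items():
        for core, old, new in zip(CORES, old_values, after[block]):
            result[(block, core)] = old / new
    return result


ratios = speedups(BEFORE, AFTER)
# Entries where the new code is slower than the old code.
regressions = [key for key, ratio in ratios.items() if ratio < 1.0]

for (block, core), ratio in ratios.items():
    print(f"cdef_filter_{block}_8bpc_neon on Cortex {core}: {ratio:.3f}x")
print("slower after the change:", regressions)
```

The only ratios below 1.0 are the 4x4 and 4x8 functions on Cortex A8, which matches the note above; the 8x8 function is still faster on that core.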

Edited by Jean-Baptiste Kempf
