arm64: cdef: NEON optimized cdef filter function
Speedup vs C code:
                           Cortex A53   A72   A73
cdef_filter_4x4_8bpc_neon:       4.62  4.48  4.76
cdef_filter_4x8_8bpc_neon:       4.82  4.80  5.08
cdef_filter_8x8_8bpc_neon:       5.29  5.33  5.79
I'll add the cdef_dir function in a separate MR afterwards, but this is the main cdef function showing up in profiles.
Edited by Martin Storsjö