arm64: cdef: NEON optimized cdef filter function

Speedup vs C code:
                             Cortex A53    A72    A73
cdef_filter_4x4_8bpc_neon:         4.62   4.48   4.76
cdef_filter_4x8_8bpc_neon:         4.82   4.80   5.08
cdef_filter_8x8_8bpc_neon:         5.29   5.33   5.79