arm: looprestoration: NEON optimized wiener filter

The relative speedup compared to C code is around 4-8x:

                    Cortex A7     A8     A9    A53    A72    A73
wiener_luma_8bpc_neon:   4.00   7.54   4.74   6.84   4.91   8.01