arm64: looprestoration: Add a NEON implementation of SGR

Relative speedup vs (autovectorized) C code:
                      Cortex A53    A72    A73
selfguided_3x3_8bpc_neon:   2.91   2.12   2.68
selfguided_5x5_8bpc_neon:   3.18   2.65   3.39
selfguided_mix_8bpc_neon:   3.04   2.29   2.98

The relative speedup vs non-vectorized C code is around 2.6-4.6x.