arm64: looprestoration: Add a NEON implementation of SGR

Relative speedup vs (autovectorized) C code:
                      Cortex A53    A72    A73
selfguided_3x3_8bpc_neon:   2.91   2.12   2.68
selfguided_5x5_8bpc_neon:   3.18   2.65   3.39
selfguided_mix_8bpc_neon:   3.04   2.29   2.98

The relative speedup vs non-vectorized C code is around 2.6-4.6x.