arm: looprestoration: NEON-optimized Wiener filter
The relative speedup compared to C code is around 4-8x:
                Cortex     A7     A8     A9    A53    A72    A73
wiener_luma_8bpc_neon:   4.00   7.54   4.74   6.84   4.91   8.01
Edited by Martin Storsjö