arm32: looprestoration: NEON implementation of wiener filter for 16 bpc

Checkasm benchmarks:
                           Cortex A7        A8       A53       A72       A73
wiener_chroma_10bpc_neon:   385312.5  165772.7  184308.2  122311.2  126050.2
wiener_chroma_12bpc_neon:   385296.7  165538.0  184438.2  122290.5  126205.3
wiener_luma_10bpc_neon:     385318.5  165985.3  184147.4  122311.1  126168.4
wiener_luma_12bpc_neon:     385316.3  165819.1  184484.7  122304.4  125982.4

The corresponding numbers for arm64, for comparison:
                          Cortex A53       A72       A73
wiener_chroma_10bpc_neon:   176319.7  125992.1  128162.4
wiener_chroma_12bpc_neon:   176386.2  125986.4  128343.8
wiener_luma_10bpc_neon:     176174.0  126001.7  128227.8
wiener_luma_12bpc_neon:     176176.5  125992.1  128204.8

The arm32 version actually seems to run marginally faster than the
arm64 one on A72 and A73 (by about 3% and a bit under 2%, respectively),
while being roughly 4.5% slower on A53. I believe this is because the
arm64 code is tuned for A53 (which makes it a bit slower on other
cores), whereas the arm32 code can't be tuned in exactly the same way
because fewer registers are available.