arm32: looprestoration: NEON implementation of SGR for 10 bpc
Along with the usual minor fixups, this also includes a fairly notable speedup (2-8% overall) for the existing arm64 looprestoration 10 bpc code.
Checkasm numbers:
                             Cortex A7          A8         A53         A72         A73
selfguided_3x3_10bpc_neon:    919127.6    717942.8    565717.8    404748.0    372179.8
selfguided_5x5_10bpc_neon:    640310.8    511873.4    370653.3    273593.7    256403.2
selfguided_mix_10bpc_neon:   1533887.0   1252389.5    922111.1    659033.4    613410.6

Corresponding numbers for arm64, for comparison:
                            Cortex A53         A72         A73
selfguided_3x3_10bpc_neon:    500706.0    367199.2    345261.2
selfguided_5x5_10bpc_neon:    361403.3    270550.0    249955.3
selfguided_mix_10bpc_neon:    846172.4    623590.3    578404.8