arm32: ipred: NEON implementation of ipred functions for 16 bpc
This also includes the usual assorted cleanups and fixes noticed while working on the code.
A sample of checkasm benchmark results:
                                Cortex A7      A8     A53     A72     A73
cfl_ac_420_w4_16bpc_neon:           258.2   130.0   187.8    88.1    99.9
cfl_ac_420_w8_16bpc_neon:           396.3   192.3   278.0   134.1   148.1
cfl_ac_420_w16_16bpc_neon:          705.9   341.5   508.4   231.2   263.0
intra_pred_filter_w32_10bpc_neon:  3450.6  3279.7  1505.6  1716.8  1631.0
intra_pred_filter_w32_12bpc_neon:  5075.2  2467.3  2027.9  1605.7  1556.0
intra_pred_paeth_w64_16bpc_neon:   7850.6  4682.9  4538.4  4640.4  4952.4
intra_pred_smooth_w64_16bpc_neon:  6807.7  4044.0  4001.4  3001.9  3131.5
Corresponding numbers for arm64:
                               Cortex A53     A72     A73
cfl_ac_420_w4_16bpc_neon:           154.8    87.1    81.6
cfl_ac_420_w8_16bpc_neon:           235.6   124.8   133.0
cfl_ac_420_w16_16bpc_neon:          428.8   206.5   234.9
intra_pred_filter_w32_10bpc_neon:  1333.2  1485.9  1468.3
intra_pred_filter_w32_12bpc_neon:  1839.1  1429.0  1439.7
intra_pred_paeth_w64_16bpc_neon:   3691.1  3091.8  3289.7
intra_pred_smooth_w64_16bpc_neon:  3776.8  3124.4  2827.1