arm64: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc

Examples of checkasm benchmarks:
                                Cortex A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:      96.8    49.6    62.5
mc_8tap_regular_w8_h_16bpc_neon:     570.3   388.0   467.2
mc_8tap_regular_w8_hv_16bpc_neon:   1035.8   776.7   891.3
mc_8tap_regular_w8_v_16bpc_neon:     400.6   285.0   278.3
mc_bilinear_w8_0_16bpc_neon:          90.0    44.8    57.8
mc_bilinear_w8_h_16bpc_neon:         191.2   158.7   156.4
mc_bilinear_w8_hv_16bpc_neon:        295.9   234.6   244.9
mc_bilinear_w8_v_16bpc_neon:         147.2    98.7    89.2
mct_8tap_regular_w8_0_16bpc_neon:    139.4    78.7    84.9
mct_8tap_regular_w8_h_16bpc_neon:    612.5   396.8   479.1
mct_8tap_regular_w8_hv_16bpc_neon:  1112.4   814.6   963.2
mct_8tap_regular_w8_v_16bpc_neon:    461.8   370.8   353.4
mct_bilinear_w8_0_16bpc_neon:        135.6    76.2    80.5
mct_bilinear_w8_h_16bpc_neon:        211.3   159.4   141.7
mct_bilinear_w8_hv_16bpc_neon:       325.7   237.2   227.2
mct_bilinear_w8_v_16bpc_neon:        180.7   135.9   129.5

For comparison, the corresponding numbers for 8 bpc:
mc_8tap_regular_w8_0_8bpc_neon:       78.6    41.0    39.5
mc_8tap_regular_w8_h_8bpc_neon:      371.2   299.6   348.3
mc_8tap_regular_w8_hv_8bpc_neon:     817.1   675.0   726.5
mc_8tap_regular_w8_v_8bpc_neon:      243.7   260.4   253.0
mc_bilinear_w8_0_8bpc_neon:           74.8    35.4    36.1
mc_bilinear_w8_h_8bpc_neon:          179.9    69.9    79.2
mc_bilinear_w8_hv_8bpc_neon:         210.8   132.4   144.8
mc_bilinear_w8_v_8bpc_neon:          141.6    64.9    65.4
mct_8tap_regular_w8_0_8bpc_neon:     101.7    54.4    59.5
mct_8tap_regular_w8_h_8bpc_neon:     391.3   329.1   358.3
mct_8tap_regular_w8_hv_8bpc_neon:    880.4   754.9   829.4
mct_8tap_regular_w8_v_8bpc_neon:     270.8   300.8   277.4
mct_bilinear_w8_0_8bpc_neon:          97.6    54.0    55.4
mct_bilinear_w8_h_8bpc_neon:         173.3    73.5    79.5
mct_bilinear_w8_hv_8bpc_neon:        228.3   163.0   174.0
mct_bilinear_w8_v_8bpc_neon:         128.9    72.5    63.3