arm64: mc: NEON implementation of blend for 16bpc
The branch also includes several cleanups to the 8bpc code (primarily the arm64 version) that were noticed while working on the 16bpc version.
Checkasm numbers:        Cortex A53       A72       A73
blend_h_w2_16bpc_neon:        109.3      83.1      56.7
blend_h_w4_16bpc_neon:        114.1      61.4      62.3
blend_h_w8_16bpc_neon:        133.3      80.8      81.1
blend_h_w16_16bpc_neon:       215.6     132.7     149.5
blend_h_w32_16bpc_neon:       390.4     254.2     235.8
blend_h_w64_16bpc_neon:       719.1     456.3     453.8
blend_h_w128_16bpc_neon:     1646.1    1112.3    1065.9
blend_v_w2_16bpc_neon:        185.9     175.9     180.0
blend_v_w4_16bpc_neon:        338.0     183.4     232.1
blend_v_w8_16bpc_neon:        426.5     213.8     250.6
blend_v_w16_16bpc_neon:       678.2     357.8     382.6
blend_v_w32_16bpc_neon:      1098.3     686.2     695.6
blend_w4_16bpc_neon:           75.7      31.5      32.0
blend_w8_16bpc_neon:          134.0      75.0      75.8
blend_w16_16bpc_neon:         467.9     267.3     310.0
blend_w32_16bpc_neon:        1201.9     658.7     779.7
Corresponding numbers for 8bpc for comparison:
blend_h_w2_8bpc_neon:         104.1      55.9      60.8
blend_h_w4_8bpc_neon:         108.9      58.7      48.2
blend_h_w8_8bpc_neon:          99.3      64.4      67.4
blend_h_w16_8bpc_neon:        145.2      93.4      85.1
blend_h_w32_8bpc_neon:        262.2     157.5     148.6
blend_h_w64_8bpc_neon:        466.7     278.9     256.6
blend_h_w128_8bpc_neon:      1054.2     624.7     571.0
blend_v_w2_8bpc_neon:         170.5     106.6     113.4
blend_v_w4_8bpc_neon:         333.0     189.9     225.9
blend_v_w8_8bpc_neon:         314.9     199.0     203.5
blend_v_w16_8bpc_neon:        476.9     300.8     241.1
blend_v_w32_8bpc_neon:        766.9     430.4     415.1
blend_w4_8bpc_neon:            66.7      35.4      26.0
blend_w8_8bpc_neon:           110.7      47.9      48.1
blend_w16_8bpc_neon:          299.4     161.8     162.3
blend_w32_8bpc_neon:          725.8     417.0     432.8
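
To make the 16bpc vs 8bpc comparison easier to read, the ratio between the two tables can be computed per function and per core. The following is only a small helper sketch, not part of the patch: the `slowdown` function and the dictionary names are hypothetical, and only a few rows are copied from the tables above as a sample.

```python
# Hypothetical helper, not part of the patch: ratio of 16bpc to 8bpc
# checkasm timings. Values are copied from the tables above, in the
# order (Cortex A53, A72, A73). Only a sample of rows is included.
CORES = ("A53", "A72", "A73")

TIMINGS_16BPC = {
    "blend_w4": (75.7, 31.5, 32.0),
    "blend_w32": (1201.9, 658.7, 779.7),
    "blend_h_w128": (1646.1, 1112.3, 1065.9),
    "blend_v_w32": (1098.3, 686.2, 695.6),
}

TIMINGS_8BPC = {
    "blend_w4": (66.7, 35.4, 26.0),
    "blend_w32": (725.8, 417.0, 432.8),
    "blend_h_w128": (1054.2, 624.7, 571.0),
    "blend_v_w32": (766.9, 430.4, 415.1),
}


def slowdown(name):
    """Return the 16bpc/8bpc timing ratio per core, rounded to 2 decimals."""
    return tuple(
        round(t16 / t8, 2)
        for t16, t8 in zip(TIMINGS_16BPC[name], TIMINGS_8BPC[name])
    )


if __name__ == "__main__":
    for name in TIMINGS_16BPC:
        cells = ", ".join(
            f"{core}: {ratio:.2f}x" for core, ratio in zip(CORES, slowdown(name))
        )
        print(f"{name}: {cells}")
```

A ratio below 2.0 means the 16bpc function costs less than twice its 8bpc counterpart, even though each pixel is twice as wide.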