arm64: mc: NEON implementation of w_mask for 16 bpc

Checkasm numbers:            Cortex A53         A72         A73
w_mask_420_w4_16bpc_neon:         173.6       123.5       120.3
w_mask_420_w8_16bpc_neon:         484.2       344.1       329.5
w_mask_420_w16_16bpc_neon:       1411.2      1027.4      1035.1
w_mask_420_w32_16bpc_neon:       5561.5      4093.2      3980.1
w_mask_420_w64_16bpc_neon:      13809.6      9856.5      9581.0
w_mask_420_w128_16bpc_neon:     35614.7     25553.8     24284.4
w_mask_422_w4_16bpc_neon:         159.4       112.2       114.2
w_mask_422_w8_16bpc_neon:         453.4       326.1       326.7
w_mask_422_w16_16bpc_neon:       1394.6      1062.3      1050.2
w_mask_422_w32_16bpc_neon:       5485.8      4219.6      4027.3
w_mask_422_w64_16bpc_neon:      13701.2     10079.6      9692.6
w_mask_422_w128_16bpc_neon:     35455.3     25892.5     24625.9
w_mask_444_w4_16bpc_neon:         153.0       112.3       112.7
w_mask_444_w8_16bpc_neon:         437.2       331.8       325.8
w_mask_444_w16_16bpc_neon:       1395.1      1069.1      1041.7
w_mask_444_w32_16bpc_neon:       5370.1      4213.5      4138.1
w_mask_444_w64_16bpc_neon:      13482.6     10190.5     10004.6
w_mask_444_w128_16bpc_neon:     35583.7     26911.2     25638.8
Corresponding numbers for 8 bpc for comparison:

w_mask_420_w4_8bpc_neon:          126.6        79.1        87.7
w_mask_420_w8_8bpc_neon:          343.9       195.0       211.5
w_mask_420_w16_8bpc_neon:         886.3       540.3       577.7
w_mask_420_w32_8bpc_neon:        3558.6      2152.4      2216.7
w_mask_420_w64_8bpc_neon:        8894.9      5161.2      5297.0
w_mask_420_w128_8bpc_neon:      22520.1     13514.5     13887.2
w_mask_422_w4_8bpc_neon:          112.9        68.2        77.0
w_mask_422_w8_8bpc_neon:          314.4       175.5       208.7
w_mask_422_w16_8bpc_neon:         835.5       565.0       608.3
w_mask_422_w32_8bpc_neon:        3381.3      2231.8      2287.6
w_mask_422_w64_8bpc_neon:        8499.4      5343.6      5460.8
w_mask_422_w128_8bpc_neon:      21823.3     14206.5     14249.1
w_mask_444_w4_8bpc_neon:          104.6        65.8        72.7
w_mask_444_w8_8bpc_neon:          290.4       173.7       196.6
w_mask_444_w16_8bpc_neon:         831.4       586.7       591.7
w_mask_444_w32_8bpc_neon:        3320.8      2300.6      2251.0
w_mask_444_w64_8bpc_neon:        8300.0      5480.5      5346.8
w_mask_444_w128_8bpc_neon:      21633.8     15981.3     14384.8