arm64: loopfilter: NEON implementation of loopfilter for 16 bpc

Checkasm runtimes:          Cortex A53     A72     A73
lpf_h_sb_uv_w4_16bpc_neon:       919.0   795.0   714.9
lpf_h_sb_uv_w6_16bpc_neon:      1267.7  1116.2  1081.9
lpf_h_sb_y_w4_16bpc_neon:       1500.2  1543.9  1778.5
lpf_h_sb_y_w8_16bpc_neon:       2216.1  2183.0  2568.1
lpf_h_sb_y_w16_16bpc_neon:      2641.8  2630.4  2639.4
lpf_v_sb_uv_w4_16bpc_neon:       836.5   572.7   667.3
lpf_v_sb_uv_w6_16bpc_neon:      1130.8   709.1   955.5
lpf_v_sb_y_w4_16bpc_neon:       1271.6  1434.4  1272.1
lpf_v_sb_y_w8_16bpc_neon:       1818.0  1759.1  1664.6
lpf_v_sb_y_w16_16bpc_neon:      1998.6  2115.8  1586.6

Corresponding numbers for 8 bpc for comparison:
lpf_h_sb_uv_w4_8bpc_neon:        799.4   632.8   695.4
lpf_h_sb_uv_w6_8bpc_neon:       1067.3   613.6   767.5
lpf_h_sb_y_w4_8bpc_neon:        1490.5  1179.1  1018.9
lpf_h_sb_y_w8_8bpc_neon:        1892.9  1382.0  1172.0
lpf_h_sb_y_w16_8bpc_neon:       2117.4  1625.4  1739.0
lpf_v_sb_uv_w4_8bpc_neon:        447.1   447.7   446.0
lpf_v_sb_uv_w6_8bpc_neon:        522.1   529.0   513.1
lpf_v_sb_y_w4_8bpc_neon:        1043.7   785.0   775.9
lpf_v_sb_y_w8_8bpc_neon:        1500.4  1115.9   881.2
lpf_v_sb_y_w16_8bpc_neon:       1493.5  1371.4  1248.5