AArch64: Specialise HBD Neon convolutions for 6-tap filters
This is a follow-up to !1595 (merged).
The 8-tap sub-pel filters used for motion vector interpolation are: regular, smooth, sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap).
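The effect of the zero padding can be shown with a small scalar sketch in C. This is not code from the patch: the kernel values and the names `kernel8`, `filter_8tap` and `filter_6tap` are illustrative assumptions. It only shows that when the two outer taps are zero, a 6-tap sum over the inner coefficients gives the same result as the full 8-tap sum.

```c
#include <stdint.h>

/* Illustrative kernel (an assumption, not taken from the patch):
 * the two outer taps are zero, so only six taps contribute. */
static const int16_t kernel8[8] = { 0, 2, -14, 76, 76, -14, 2, 0 };

/* Full 8-tap sum: src points at the output position,
 * and the taps cover src[-3] .. src[4]. */
static int32_t filter_8tap(const uint16_t *src, const int16_t *f)
{
    int32_t sum = 0;
    for (int k = 0; k < 8; k++)
        sum += f[k] * src[k - 3];
    return sum;
}

/* 6-tap sum: skips f[0] and f[7], so only src[-2] .. src[3] are read.
 * This is only valid when f[0] and f[7] are both zero. */
static int32_t filter_6tap(const uint16_t *src, const int16_t *f)
{
    int32_t sum = 0;
    for (int k = 1; k < 7; k++)
        sum += f[k] * src[k - 3];
    return sum;
}
```

The 6-tap variant performs two fewer multiply-accumulates per output sample and reads two fewer input samples, which is the kind of redundant work the specialised paths avoid.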
This patch specialises the high bit-depth versions of the put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work multiplying by and adding zero. Wherever sharp filtering is used, the 8-tap path is always selected.
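The path selection described above can be sketched in C as follows. The enum and the function `taps_needed` are hypothetical names introduced for illustration, and the sketch assumes that a sharp filter in either direction forces the 8-tap path; the actual dispatch in the assembly may be structured differently.

```c
/* Hypothetical names for the three sub-pel filter families. */
enum filter_type { FILTER_REGULAR, FILTER_SMOOTH, FILTER_SHARP };

/* Returns the number of taps the convolution has to evaluate.
 * Assumption: if either the horizontal or the vertical filter is
 * sharp, the full 8-tap path is used; regular and smooth kernels
 * are zero-padded, so the 6-tap path is sufficient for them. */
static int taps_needed(enum filter_type h, enum filter_type v)
{
    return (h == FILTER_SHARP || v == FILTER_SHARP) ? 8 : 6;
}
```

For example, `taps_needed(FILTER_REGULAR, FILTER_SMOOTH)` selects the 6-tap path, while any combination that includes `FILTER_SHARP` stays on the 8-tap path.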
Benchmarks show a 0.5-10.8% FPS uplift, depending heavily on the input video source. The binary size increase is about 8.5 KiB.
This change set also contains a 6-tap HV convolution optimization similar to !1619 (merged).
Decode benchmarks with different input videos:
AWS Graviton 2: 23.77 fps -> 24.28 fps (+2.14 %)
AWS Graviton 3: 42.04 fps -> 42.35 fps (+0.73 %)
AWS Graviton 2: 45.05 fps -> 46.75 fps (+3.77 %)
AWS Graviton 3: 86.50 fps -> 89.03 fps (+2.92 %)
AWS Graviton 2: 47.23 fps -> 49.51 fps (+4.82 %)
AWS Graviton 3: 86.77 fps -> 90.08 fps (+3.81 %)
AWS Graviton 2: 63.21 fps -> 69.76 fps (+10.4 %)
AWS Graviton 3: 134.92 fps -> 149.51 fps (+10.8 %)
Bosphorus 10-bit was encoded by aomenc (3.7.1+):
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_1080p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
Micro benchmark results on AWS Graviton 3:
Functions: 8-tap -> 6-tap ( change )
mc_8tap_*_w2_h_16bpc_neon: 27.9 -> 27.8 ( -0.36 % )
mc_8tap_*_w4_h_16bpc_neon: 33.7 -> 33.6 ( -0.30 % )
mc_8tap_*_w8_h_16bpc_neon: 75.2 -> 64.9 ( -13.70 % )
mc_8tap_*_w16_h_16bpc_neon: 205.2 -> 177.2 ( -13.65 % )
mc_8tap_*_w32_h_16bpc_neon: 634.6 -> 544.5 ( -14.20 % )
mc_8tap_*_w64_h_16bpc_neon: 2237.9 -> 1913.2 ( -14.51 % )
mc_8tap_*_w128_h_16bpc_neon: 6312.1 -> 5379.9 ( -14.77 % )
mct_8tap_*_w4_h_16bpc_neon: 22.4 -> 22.0 ( -1.79 % )
mct_8tap_*_w8_h_16bpc_neon: 80.7 -> 70.4 ( -12.76 % )
mct_8tap_*_w16_h_16bpc_neon: 248.2 -> 215.6 ( -13.13 % )
mct_8tap_*_w32_h_16bpc_neon: 963.3 -> 828.9 ( -13.95 % )
mct_8tap_*_w64_h_16bpc_neon: 2296.2 -> 1965.1 ( -14.42 % )
mct_8tap_*_w128_h_16bpc_neon: 5678.5 -> 4849.0 ( -14.61 % )
mc_8tap_*_w2_v_16bpc_neon: 30.8 -> 26.5 ( -13.96 % )
mc_8tap_*_w4_v_16bpc_neon: 41.6 -> 33.9 ( -18.51 % )
mc_8tap_*_w8_v_16bpc_neon: 68.7 -> 52.8 ( -23.14 % )
mc_8tap_*_w16_v_16bpc_neon: 193.6 -> 146.6 ( -24.28 % )
mc_8tap_*_w32_v_16bpc_neon: 614.9 -> 461.8 ( -24.90 % )
mc_8tap_*_w64_v_16bpc_neon: 2181.6 -> 1629.4 ( -25.31 % )
mc_8tap_*_w128_v_16bpc_neon: 6173.4 -> 4609.8 ( -25.33 % )
mct_8tap_*_w4_v_16bpc_neon: 31.0 -> 25.9 ( -16.45 % )
mct_8tap_*_w8_v_16bpc_neon: 80.8 -> 62.7 ( -22.40 % )
mct_8tap_*_w16_v_16bpc_neon: 265.4 -> 201.2 ( -24.19 % )
mct_8tap_*_w32_v_16bpc_neon: 1059.8 -> 792.2 ( -25.25 % )
mct_8tap_*_w64_v_16bpc_neon: 2557.6 -> 1913.9 ( -25.17 % )
mct_8tap_*_w128_v_16bpc_neon: 6368.9 -> 4758.0 ( -25.29 % )
mc_8tap_*_w2_hv_16bpc_neon: 54.0 -> 51.9 ( -3.89 % )
mc_8tap_*_w4_hv_16bpc_neon: 74.7 -> 64.6 ( -13.52 % )
mc_8tap_*_w8_hv_16bpc_neon: 149.9 -> 121.6 ( -18.88 % )
mc_8tap_*_w16_hv_16bpc_neon: 404.0 -> 329.3 ( -18.49 % )
mc_8tap_*_w32_hv_16bpc_neon: 1214.1 -> 995.5 ( -18.01 % )
mc_8tap_*_w64_hv_16bpc_neon: 4191.1 -> 3444.9 ( -17.80 % )
mc_8tap_*_w128_hv_16bpc_neon: 11693.9 -> 9641.2 ( -17.55 % )
mct_8tap_*_w4_hv_16bpc_neon: 53.7 -> 45.7 ( -14.90 % )
mct_8tap_*_w8_hv_16bpc_neon: 161.5 -> 126.2 ( -21.86 % )
mct_8tap_*_w16_hv_16bpc_neon: 496.5 -> 390.8 ( -21.29 % )
mct_8tap_*_w32_hv_16bpc_neon: 1889.4 -> 1482.1 ( -21.56 % )
mct_8tap_*_w64_hv_16bpc_neon: 4503.5 -> 3545.1 ( -21.28 % )
mct_8tap_*_w128_hv_16bpc_neon: 11112.0 -> 8779.9 ( -20.99 % )