x86: Add 6-tap variants of 8bpc mc AVX-512 (Ice Lake) functions
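Because the horizontal filter uses the VNNI vpdpbusd
instruction (which multiplies and accumulates 4 pixels
per 32-bit lane per instruction), an 8-tap filter takes
two such instructions and a 6-tap filter would take two
as well, so there's nothing to gain from going down to
6-tap. For the vertical filter 6-tap is still beneficial.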
For the 2D (hv) case, the benefit of 8-tap h + 6-tap v
over dual 8-tap is less significant than in the AVX2
case, where 6-tap is beneficial in both directions.
Note that this limitation is only applicable to 8bpc.
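
As a rough illustration of the reasoning above, the following C sketch
models a single 32-bit lane of vpdpbusd in scalar code. It is not the
actual assembly: dpbusd_lane, dpbusd_steps and filter_h are hypothetical
helper names, and the 6-tap filter is assumed to be stored as 8
coefficients with zero outer taps.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpdpbusd: four unsigned bytes are
 * multiplied by four signed bytes and the products are added to a 32-bit
 * accumulator (no saturation). */
static int32_t dpbusd_lane(int32_t acc, const uint8_t px[4],
                           const int8_t coef[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)px[i] * (int32_t)coef[i];
    return acc;
}

/* Number of 4-wide dot-product steps needed to cover `taps` filter taps. */
static int dpbusd_steps(int taps)
{
    return (taps + 3) / 4;
}

/* Hypothetical horizontal filter for one output pixel. The coefficient
 * array always has 8 entries; a 6-tap filter is assumed to have zeros in
 * coef[0] and coef[7], so it still needs both dot-product steps. */
static int32_t filter_h(const uint8_t *src, const int8_t coef[8])
{
    int32_t acc = 0;
    acc = dpbusd_lane(acc, src + 0, coef + 0);
    acc = dpbusd_lane(acc, src + 4, coef + 4);
    return acc;
}
```

Here dpbusd_steps(8) and dpbusd_steps(6) both evaluate to 2, which is
why dropping to 6 taps does not shorten the horizontal pass in this
model.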
Zen 4               8-tap    6-tap
mc_8tap_w2_v:        17.8     15.3
mc_8tap_w2_hv:       26.0     23.0
mc_8tap_w4_v:        16.7     14.2
mc_8tap_w4_hv:       27.3     24.2
mc_8tap_w8_v:        18.4     16.1
mc_8tap_w8_hv:       48.8     43.6
mc_8tap_w16_v:       43.2     39.5
mc_8tap_w16_hv:      91.6     81.8
mc_8tap_w32_v:      113.0    104.1
mc_8tap_w32_hv:     273.0    247.5
mc_8tap_w64_v:      281.2    217.4
mc_8tap_w64_hv:     931.2    858.8
mc_8tap_w128_v:     796.3    616.1
mc_8tap_w128_hv:   2593.8   2380.5