
AArch64: Specialise HBD Neon convolutions for 6-tap filters

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_hbd_6tap into master

This is a follow-up to !1595 (merged).

The 8-tap sub-pel filters used for motion vector interpolation are regular, smooth and sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap).

This patch specialises the high bit-depth versions of the put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work multiplying by and adding zero. Wherever sharp filtering is used, the 8-tap path is always selected.
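The idea can be illustrated with a minimal scalar sketch in C. It is not the Neon assembly from this patch: the helper names (`conv_8tap`, `conv_6tap`, `is_6tap`, `conv_dispatch`) and both coefficient sets are hypothetical and chosen only for illustration, and the per-kernel zero check is a simplification of selecting the path by filter type.

```c
#include <stdint.h>

/* Hypothetical kernels, NOT the real dav1d tables. Both sum to 64.
 * The "regular-like" one is zero-padded at both ends, the "sharp-like"
 * one has a non-zero first tap and therefore needs all 8 taps. */
static const int8_t kRegularLike[8] = {  0, 1, -5, 61, 9, -3, 1, 0 };
static const int8_t kSharpLike[8]   = { -1, 3, -7, 63, 8, -3, 1, 0 };

/* Reference 8-tap dot product; src points at the first of 8 samples. */
static int32_t conv_8tap(const uint16_t *src, const int8_t f[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)f[i] * src[i];
    return sum;
}

/* 6-tap variant: skips f[0] and f[7], so it is only valid when both
 * outer coefficients are zero. Two multiply-accumulates are saved. */
static int32_t conv_6tap(const uint16_t *src, const int8_t f[8])
{
    int32_t sum = 0;
    for (int i = 1; i < 7; i++)
        sum += (int32_t)f[i] * src[i];
    return sum;
}

/* A kernel is effectively 6-tap when its outer taps are zero. */
static int is_6tap(const int8_t f[8])
{
    return f[0] == 0 && f[7] == 0;
}

/* Take the cheaper path when possible, else fall back to 8 taps. */
static int32_t conv_dispatch(const uint16_t *src, const int8_t f[8])
{
    return is_6tap(f) ? conv_6tap(src, f) : conv_8tap(src, f);
}
```

For a zero-padded kernel both paths produce the same sum, so the 6-tap path only removes work. A kernel with a non-zero outer tap must stay on the 8-tap path.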

Benchmarks show an FPS uplift of 0.5-10.8 %, depending heavily on the input video source. The binary size increases by about 8.5 KiB.

This change set also contains a 6-tap HV convolution optimization similar to !1619 (merged).


Decode benchmarks with different input videos:

Chimera AV1 10-bit - 1080p:

AWS Graviton 2:  23.77 fps  ->  24.28 fps (+2.14 %)
AWS Graviton 3:  42.04 fps  ->  42.35 fps (+0.73 %)

Georgia HDR - 1080p:

AWS Graviton 2:  45.05 fps  ->  46.75 fps (+3.77 %)
AWS Graviton 3:  86.50 fps  ->  89.03 fps (+2.92 %)

RED HDR Reel - 1080p:

AWS Graviton 2:  47.23 fps  ->  49.51 fps (+4.82 %)
AWS Graviton 3:  86.77 fps  ->  90.08 fps (+3.81 %)

Bosphorus 10-bit - 1080p:

AWS Graviton 2:  63.21 fps  ->  69.76 fps (+10.4 %)
AWS Graviton 3: 134.92 fps  -> 149.51 fps (+10.8 %)

Bosphorus 10-bit was encoded with aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_1080p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m

Micro benchmark results on AWS Graviton 3:
Functions:                        8-tap       6-tap
mc_8tap_*_w2_h_16bpc_neon:         27.9  ->    27.8   (  -0.36 % )
mc_8tap_*_w4_h_16bpc_neon:         33.7  ->    33.6   (  -0.30 % )
mc_8tap_*_w8_h_16bpc_neon:         75.2  ->    64.9   ( -13.70 % )
mc_8tap_*_w16_h_16bpc_neon:       205.2  ->   177.2   ( -13.65 % )
mc_8tap_*_w32_h_16bpc_neon:       634.6  ->   544.5   ( -14.20 % )
mc_8tap_*_w64_h_16bpc_neon:      2237.9  ->  1913.2   ( -14.51 % )
mc_8tap_*_w128_h_16bpc_neon:     6312.1  ->  5379.9   ( -14.77 % )

mct_8tap_*_w4_h_16bpc_neon:        22.4  ->    22.0   (  -1.79 % )
mct_8tap_*_w8_h_16bpc_neon:        80.7  ->    70.4   ( -12.76 % )
mct_8tap_*_w16_h_16bpc_neon:      248.2  ->   215.6   ( -13.13 % )
mct_8tap_*_w32_h_16bpc_neon:      963.3  ->   828.9   ( -13.95 % )
mct_8tap_*_w64_h_16bpc_neon:     2296.2  ->  1965.1   ( -14.42 % )
mct_8tap_*_w128_h_16bpc_neon:    5678.5  ->  4849.0   ( -14.61 % )

mc_8tap_*_w2_v_16bpc_neon:         30.8  ->    26.5   ( -13.96 % )
mc_8tap_*_w4_v_16bpc_neon:         41.6  ->    33.9   ( -18.51 % )
mc_8tap_*_w8_v_16bpc_neon:         68.7  ->    52.8   ( -23.14 % )
mc_8tap_*_w16_v_16bpc_neon:       193.6  ->   146.6   ( -24.28 % )
mc_8tap_*_w32_v_16bpc_neon:       614.9  ->   461.8   ( -24.90 % )
mc_8tap_*_w64_v_16bpc_neon:      2181.6  ->  1629.4   ( -25.31 % )
mc_8tap_*_w128_v_16bpc_neon:     6173.4  ->  4609.8   ( -25.33 % )

mct_8tap_*_w4_v_16bpc_neon:        31.0  ->    25.9   ( -16.45 % )
mct_8tap_*_w8_v_16bpc_neon:        80.8  ->    62.7   ( -22.40 % )
mct_8tap_*_w16_v_16bpc_neon:      265.4  ->   201.2   ( -24.19 % )
mct_8tap_*_w32_v_16bpc_neon:     1059.8  ->   792.2   ( -25.25 % )
mct_8tap_*_w64_v_16bpc_neon:     2557.6  ->  1913.9   ( -25.17 % )
mct_8tap_*_w128_v_16bpc_neon:    6368.9  ->  4758.0   ( -25.29 % )

mc_8tap_*_w2_hv_16bpc_neon:        54.0  ->    51.9   (  -3.89 % )
mc_8tap_*_w4_hv_16bpc_neon:        74.7  ->    64.6   ( -13.52 % )
mc_8tap_*_w8_hv_16bpc_neon:       149.9  ->   121.6   ( -18.88 % )
mc_8tap_*_w16_hv_16bpc_neon:      404.0  ->   329.3   ( -18.49 % )
mc_8tap_*_w32_hv_16bpc_neon:     1214.1  ->   995.5   ( -18.01 % )
mc_8tap_*_w64_hv_16bpc_neon:     4191.1  ->  3444.9   ( -17.80 % )
mc_8tap_*_w128_hv_16bpc_neon:   11693.9  ->  9641.2   ( -17.55 % )

mct_8tap_*_w4_hv_16bpc_neon:       53.7  ->    45.7   ( -14.90 % )
mct_8tap_*_w8_hv_16bpc_neon:      161.5  ->   126.2   ( -21.86 % )
mct_8tap_*_w16_hv_16bpc_neon:     496.5  ->   390.8   ( -21.29 % )
mct_8tap_*_w32_hv_16bpc_neon:    1889.4  ->  1482.1   ( -21.56 % )
mct_8tap_*_w64_hv_16bpc_neon:    4503.5  ->  3545.1   ( -21.28 % )
mct_8tap_*_w128_hv_16bpc_neon:  11112.0  ->  8779.9   ( -20.99 % )
Edited by Arpad Panyik
