arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths (!1452)

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-10bpc-clip into master Sep 15, 2022

This fixes conformance with the Argon test samples, in particular with these two:

profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu
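For illustration, the kind of clipping named in the title can be sketched in C roughly as follows. This is only a sketch and not the NEON code from this merge request: the helper names (`clip_i32`, `row_clip_max_for`, `row_clip_min_for`, `clip_row`) are hypothetical, and the bounds are assumed to be the signed range of `max(bitdepth + 8, 16)` bits, which is the intermediate range the AV1 specification describes for the row transforms.

```c
#include <stdint.h>

/* Hypothetical helper: clamp v to the inclusive range [lo, hi]. */
static inline int32_t clip_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Assumption: the row clip bounds are the signed range of
 * max(bitdepth + 8, 16) bits, e.g. 18 bits for 10 bpc. */
static inline int32_t row_clip_max_for(int bitdepth) {
    int bits = bitdepth + 8 < 16 ? 16 : bitdepth + 8;
    return (int32_t)((1u << (bits - 1)) - 1u);
}

static inline int32_t row_clip_min_for(int bitdepth) {
    return -row_clip_max_for(bitdepth) - 1;
}

/* Clamp n coefficients in place before they enter the row transform. */
static void clip_row(int32_t *coef, int n, int bitdepth) {
    const int32_t lo = row_clip_min_for(bitdepth);
    const int32_t hi = row_clip_max_for(bitdepth);
    for (int i = 0; i < n; i++)
        coef[i] = clip_i32(coef[i], lo, hi);
}
```

Under these assumptions, 10 bpc input is clamped to [-131072, 131071]. In a SIMD implementation this would typically cost one extra min and one extra max per vector of coefficients, which is consistent with a measurable per-transform cost.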

This slows the affected transforms down noticeably; some examples:

Before:                                 Cortex A53       A72       A73    Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       365.7     291.4     299.2    0.5
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    1864.8    1408.2    1458.2    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   31019.8   25440.7   24892.5   42.8
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       401.7     322.5     343.4    0.6
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    2154.4    1614.3    1704.9    2.7
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   38220.0   28423.7   28172.6   51.6

Thus, the transforms alone become around 10-20% slower.

Measured on full decoding, 10 bpc Chimera becomes about 2% slower on an Apple M1 (from 164 to 160 fps).
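The percentages quoted above can be rechecked from the raw numbers with a small helper such as the one below. The function names are hypothetical and exist only for this illustration.

```c
/* Percentage increase in run time going from `before` to `after`. */
static double slowdown_percent(double before, double after) {
    return (after - before) / before * 100.0;
}

/* Percentage drop in throughput (e.g. fps) going from `before` to `after`. */
static double throughput_drop_percent(double before, double after) {
    return (before - after) / before * 100.0;
}
```

For example, `slowdown_percent(365.7, 401.7)` (8x8 on the Cortex A53) gives roughly 9.8%, `slowdown_percent(31019.8, 38220.0)` (64x64 on the Cortex A53) gives roughly 23.2%, and `throughput_drop_percent(164, 160)` gives roughly 2.4%.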
