arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths

This fixes conformance with the argon test samples, in particular with these samples:
profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu

This gives a pretty notable slowdown to these transforms - some examples:
Before:                                 Cortex A53       A72       A73    Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       365.7     291.4     299.2    0.5
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    1864.8    1408.2    1458.2    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   31019.8   25440.7   24892.5   42.8
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       401.7     322.5     343.4    0.6
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    2154.4    1614.3    1704.9    2.7
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   38220.0   28423.7   28172.6   51.6

Thus, for the transforms alone, it makes them around 10-20% slower.
Measured on actual full decoding, it makes decoding of 10 bpc Chimera 2% slower on an Apple M1 (from 164 to 160 fps).
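
For readers unfamiliar with the change, the following is a rough scalar C sketch of what clipping intermediate coefficients to a row_clip_min/max range can look like. It is not the NEON code from this commit. The helper names clip_i32 and clip_row are hypothetical, and the range used here (a signed value of bitdepth + 8 bits) is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: clamp v to the inclusive range [lo, hi]. */
static inline int32_t clip_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical sketch: clamp n intermediate coefficients in place.
 * Assumption: the allowed range is that of a signed integer with
 * (bitdepth + 8) bits, e.g. [-131072, 131071] for 10 bpc.
 */
static void clip_row(int32_t *coef, int n, int bitdepth) {
    const int32_t row_clip_max = ((int32_t)1 << (bitdepth + 7)) - 1;
    const int32_t row_clip_min = -row_clip_max - 1;

    for (int i = 0; i < n; i++)
        coef[i] = clip_i32(coef[i], row_clip_min, row_clip_max);
}
```

Under these assumptions, calling clip_row on a row with bitdepth 10 leaves in-range values untouched and saturates anything outside [-131072, 131071]. A vectorized version would typically do the same with per-lane min/max operations, which is extra work per coefficient and is consistent with the slowdown measured above.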