aarch64: Improve performance of `subWxH_dct` kernels. Remove the SVE `sub4x4_dct` kernel, since the NEON version is now faster.
BEFORE => AFTER = IMPROVEMENT
--------------------------------------------------------------------------
sub4x4_dct_c: 67 => sub4x4_dct_c: 66 =
sub4x4_dct_neon: 51 => sub4x4_dct_neon: 15 = 51/15 = 3.4x
sub4x4_dct_sve: 19 => sub4x4_dct_sve: 19 = now redundant
sub8x8_dct_c: 321 => sub8x8_dct_c: 317 =
sub8x8_dct_neon: 69 => sub8x8_dct_neon: 63 = 69/63 = 1.10x
sub8x8_dct8_c: 540 => sub8x8_dct8_c: 534 =
sub8x8_dct8_neon: 110 => sub8x8_dct8_neon: 105 = 110/105 = 1.05x
sub8x8_dct_dc_c: 130 => sub8x8_dct_dc_c: 130 =
sub8x8_dct_dc_neon: 22 => sub8x8_dct_dc_neon: 18 = 22/18 = 1.22x
sub8x16_dct_dc_c: 283 => sub8x16_dct_dc_c: 280 =
sub8x16_dct_dc_neon: 51 => sub8x16_dct_dc_neon: 47 = 51/47 = 1.09x
sub16x16_dct_c: 1352 => sub16x16_dct_c: 1345 =
sub16x16_dct_neon: 318 => sub16x16_dct_neon: 283 = 318/283 = 1.12x
sub16x16_dct8_c: 2273 => sub16x16_dct8_c: 2279 =
sub16x16_dct8_neon: 499 => sub16x16_dct8_neon: 479 = 499/479 = 1.04x
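
The IMPROVEMENT column is the BEFORE figure divided by the AFTER figure. As a minimal sketch for re-checking that arithmetic, the snippet below recomputes the ratios for the NEON rows. The dictionary name `neon_timings` and the helper `speedup` are hypothetical names chosen for illustration; the numbers are copied from the BEFORE and AFTER columns of the table, whose unit is not stated here.

```python
# (before, after) figures for the NEON kernels, copied from the table above.
# The unit is whatever the benchmark reported; only the ratio matters here.
neon_timings = {
    "sub4x4_dct": (51, 15),
    "sub8x8_dct": (69, 63),
    "sub8x8_dct8": (110, 105),
    "sub8x8_dct_dc": (22, 18),
    "sub8x16_dct_dc": (51, 47),
    "sub16x16_dct": (318, 283),
    "sub16x16_dct8": (499, 479),
}


def speedup(before, after):
    """Return how many times faster the AFTER figure is than the BEFORE figure."""
    return before / after


for name, (before, after) in neon_timings.items():
    print(f"{name}_neon: {before}/{after} = {speedup(before, after):.2f}x")
```

The same helper can compare the new NEON `sub4x4_dct` figure (15) with the SVE one (19): `speedup(19, 15)` is roughly 1.27, which is consistent with dropping the SVE kernel.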
Edited by Matthias Langer