arm64: mc: Implement 8tap and bilin functions

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-mc into master

These functions have been tuned against Cortex A53 and Snapdragon 835. The bilin functions have mainly been written with code size in mind, as they aren't used much in practice.

Relative speedups for the actual filtering functions (those that don't just do a plain copy) are around 4-10x, with some over 17x.
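For readers unfamiliar with what the 8tap functions compute, here is a minimal scalar sketch of a horizontal 8-tap filter over one row of 8-bit pixels. It is not the code from this merge request or dav1d's C reference. The names `filter_row_8tap_h`, `clip_pixel` and `example_taps` are hypothetical, and the taps are illustrative values assumed to sum to 64.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/*
 * Scalar horizontal 8-tap filter over one row (hypothetical helper).
 * src must have 3 readable pixels before src[0] and 4 after src[w - 1].
 * Assumption: the taps sum to 64, hence the rounding shift by 6.
 * Negative sums rely on an arithmetic right shift and are then clamped to 0.
 */
static void filter_row_8tap_h(uint8_t *dst, const uint8_t *src, int w,
                              const int8_t taps[8])
{
    assert(w > 0);
    for (int x = 0; x < w; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += taps[k] * src[x + k - 3];
        dst[x] = clip_pixel((sum + 32) >> 6);
    }
}

/* Illustrative half-sample taps (sum = 64); not taken from this merge request. */
static const int8_t example_taps[8] = { 0, 1, -7, 38, 38, -7, 1, 0 };
```

Because the taps sum to 64, a flat row passes through unchanged. A vectorized version typically handles several output pixels per iteration with widening multiply-accumulate instructions instead of the inner scalar loop.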

Relative speedups measured with checkasm:

                                Cortex A53   Snapdragon 835
mc_8tap_regular_w2_0_8bpc_neon:       7.21   5.49
mc_8tap_regular_w2_h_8bpc_neon:       5.84   5.39
mc_8tap_regular_w2_hv_8bpc_neon:      5.85   5.56
mc_8tap_regular_w2_v_8bpc_neon:      10.97   7.83
mc_8tap_regular_w4_0_8bpc_neon:       7.03   6.02
mc_8tap_regular_w4_h_8bpc_neon:       8.99   7.22
mc_8tap_regular_w4_hv_8bpc_neon:      7.79   7.59
mc_8tap_regular_w4_v_8bpc_neon:      13.76  10.40
mc_8tap_regular_w8_0_8bpc_neon:       7.22   6.01
mc_8tap_regular_w8_h_8bpc_neon:      10.79   7.51
mc_8tap_regular_w8_hv_8bpc_neon:     10.18   6.93
mc_8tap_regular_w8_v_8bpc_neon:      16.90  11.39
mc_8tap_regular_w16_0_8bpc_neon:      6.84   4.01
mc_8tap_regular_w16_h_8bpc_neon:     12.52   8.08
mc_8tap_regular_w16_hv_8bpc_neon:    10.15   6.71
mc_8tap_regular_w16_v_8bpc_neon:     17.24  10.66
mc_8tap_regular_w32_0_8bpc_neon:      3.96   3.20
mc_8tap_regular_w32_h_8bpc_neon:     10.15   6.39
mc_8tap_regular_w32_hv_8bpc_neon:     6.59   4.61
mc_8tap_regular_w32_v_8bpc_neon:     12.00   8.22
mc_8tap_regular_w64_0_8bpc_neon:      2.90   1.64
mc_8tap_regular_w64_h_8bpc_neon:      7.73   4.71
mc_8tap_regular_w64_hv_8bpc_neon:     4.43   3.20
mc_8tap_regular_w64_v_8bpc_neon:      8.50   5.91
mc_8tap_regular_w128_0_8bpc_neon:     2.08   1.25
mc_8tap_regular_w128_h_8bpc_neon:     6.42   3.84
mc_8tap_regular_w128_hv_8bpc_neon:    3.32   2.49
mc_8tap_regular_w128_v_8bpc_neon:     6.52   4.72
mc_bilinear_w2_0_8bpc_neon:           7.12   5.45
mc_bilinear_w2_h_8bpc_neon:           4.19   3.62
mc_bilinear_w2_hv_8bpc_neon:          4.69   5.54
mc_bilinear_w2_v_8bpc_neon:           6.77   5.79
mc_bilinear_w4_0_8bpc_neon:           6.97   6.05
mc_bilinear_w4_h_8bpc_neon:           7.02   9.34
mc_bilinear_w4_hv_8bpc_neon:          7.33   8.29
mc_bilinear_w4_v_8bpc_neon:           8.20  10.77
mc_bilinear_w8_0_8bpc_neon:           7.11   5.66
mc_bilinear_w8_h_8bpc_neon:          11.78  15.69
mc_bilinear_w8_hv_8bpc_neon:         13.87  12.92
mc_bilinear_w8_v_8bpc_neon:          14.56  19.11
mc_bilinear_w16_0_8bpc_neon:          6.81   4.48
mc_bilinear_w16_h_8bpc_neon:         17.65  14.17
mc_bilinear_w16_hv_8bpc_neon:        13.87  12.15
mc_bilinear_w16_v_8bpc_neon:         25.07  24.13
mc_bilinear_w32_0_8bpc_neon:          3.90   3.17
mc_bilinear_w32_h_8bpc_neon:         11.83   9.92
mc_bilinear_w32_hv_8bpc_neon:        11.69   9.74
mc_bilinear_w32_v_8bpc_neon:         15.21  15.81
mc_bilinear_w64_0_8bpc_neon:          2.88   1.71
mc_bilinear_w64_h_8bpc_neon:          8.22   7.34
mc_bilinear_w64_hv_8bpc_neon:         7.59   6.60
mc_bilinear_w64_v_8bpc_neon:         10.23  11.18
mc_bilinear_w128_0_8bpc_neon:         2.13   1.25
mc_bilinear_w128_h_8bpc_neon:         6.10   5.94
mc_bilinear_w128_hv_8bpc_neon:        4.75   5.07
mc_bilinear_w128_v_8bpc_neon:         6.24   8.85
mct_8tap_regular_w4_0_8bpc_neon:      6.72   6.83
mct_8tap_regular_w4_h_8bpc_neon:      8.10   6.30
mct_8tap_regular_w4_hv_8bpc_neon:     7.57   7.50
mct_8tap_regular_w4_v_8bpc_neon:     11.58   8.96
mct_8tap_regular_w8_0_8bpc_neon:     10.60  11.12
mct_8tap_regular_w8_h_8bpc_neon:      9.51   6.37
mct_8tap_regular_w8_hv_8bpc_neon:     9.36   6.77
mct_8tap_regular_w8_v_8bpc_neon:     15.78  10.03
mct_8tap_regular_w16_0_8bpc_neon:    16.90  11.46
mct_8tap_regular_w16_h_8bpc_neon:    10.51   6.89
mct_8tap_regular_w16_hv_8bpc_neon:    9.46   6.53
mct_8tap_regular_w16_v_8bpc_neon:    15.52   9.45
mct_8tap_regular_w32_0_8bpc_neon:     3.49   3.11
mct_8tap_regular_w32_h_8bpc_neon:     5.33   3.42
mct_8tap_regular_w32_hv_8bpc_neon:    6.41   4.43
mct_8tap_regular_w32_v_8bpc_neon:     5.82   4.58
mct_8tap_regular_w64_0_8bpc_neon:     2.74   2.36
mct_8tap_regular_w64_h_8bpc_neon:     4.86   2.97
mct_8tap_regular_w64_hv_8bpc_neon:    4.32   3.14
mct_8tap_regular_w64_v_8bpc_neon:     4.98   3.85
mct_8tap_regular_w128_0_8bpc_neon:    2.16   1.50
mct_8tap_regular_w128_h_8bpc_neon:    4.61   2.75
mct_8tap_regular_w128_hv_8bpc_neon:   3.26   2.43
mct_8tap_regular_w128_v_8bpc_neon:    4.56   3.48
mct_bilinear_w4_0_8bpc_neon:          6.46   6.82
mct_bilinear_w4_h_8bpc_neon:          5.80   7.20
mct_bilinear_w4_hv_8bpc_neon:         6.17   6.60
mct_bilinear_w4_v_8bpc_neon:          5.94   7.79
mct_bilinear_w8_0_8bpc_neon:         10.53  11.26
mct_bilinear_w8_h_8bpc_neon:          9.78  11.95
mct_bilinear_w8_hv_8bpc_neon:        11.64   9.54
mct_bilinear_w8_v_8bpc_neon:         10.39  13.52
mct_bilinear_w16_0_8bpc_neon:        16.69  11.58
mct_bilinear_w16_h_8bpc_neon:        12.70   9.12
mct_bilinear_w16_hv_8bpc_neon:       10.22   8.10
mct_bilinear_w16_v_8bpc_neon:        16.07  15.51
mct_bilinear_w32_0_8bpc_neon:         3.46   3.04
mct_bilinear_w32_h_8bpc_neon:         3.02   2.65
mct_bilinear_w32_hv_8bpc_neon:        8.34   5.93
mct_bilinear_w32_v_8bpc_neon:         4.31   4.63
mct_bilinear_w64_0_8bpc_neon:         2.73   2.30
mct_bilinear_w64_h_8bpc_neon:         2.68   2.39
mct_bilinear_w64_hv_8bpc_neon:        5.85   4.24
mct_bilinear_w64_v_8bpc_neon:         3.53   3.79
mct_bilinear_w128_0_8bpc_neon:        2.18   1.36
mct_bilinear_w128_h_8bpc_neon:        2.36   2.22
mct_bilinear_w128_hv_8bpc_neon:       4.12   3.44
mct_bilinear_w128_v_8bpc_neon:        2.52   3.34
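
The figures above can be read as ratios, assuming each one is the time of the C reference divided by the time of the NEON version for the same function. This is a minimal sketch of that arithmetic. The helper name `relative_speedup` is hypothetical and is not part of checkasm.

```c
#include <assert.h>

/*
 * Ratio of reference time to optimized time (hypothetical helper).
 * Both inputs must use the same unit, e.g. cycles per call.
 * A result above 1.0 means the optimized version is faster.
 */
static double relative_speedup(double c_time, double neon_time)
{
    assert(c_time > 0.0 && neon_time > 0.0);
    return c_time / neon_time;
}
```

With made-up timings of 500 and 100 units, `relative_speedup(500.0, 100.0)` gives 5.0, which would appear in the table as 5.00.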

These functions have been tested when built with GCC 5.3 for Linux, with Xcode 10 for iOS, and with llvm-mingw (clang 8) and MSVC 2017/armasm64 with gas-preprocessor (where building requires unsubmitted hacks for meson) for Windows.
