arm64: filmgrain16: Add NEON implementation of gen_grain for 16 bpc

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-gengrain-16bpc into master

Relative speedup over C code:

                             Cortex A53    A72    A73   Apple M1
gen_grain_uv_ar0_16bpc_420_neon:   2.90   4.13   5.43   5.80
gen_grain_uv_ar0_16bpc_422_neon:   3.23   4.51   5.52   5.83
gen_grain_uv_ar0_16bpc_444_neon:   4.01   4.97   6.08   5.87
gen_grain_uv_ar1_16bpc_420_neon:   2.94   2.80   3.56   3.48
gen_grain_uv_ar1_16bpc_422_neon:   3.14   3.07   3.68   3.47
gen_grain_uv_ar1_16bpc_444_neon:   3.54   3.51   3.93   2.61
gen_grain_uv_ar2_16bpc_420_neon:   3.92   3.69   4.40   3.98
gen_grain_uv_ar2_16bpc_422_neon:   4.13   3.96   4.42   3.92
gen_grain_uv_ar2_16bpc_444_neon:   4.69   4.33   4.84   3.25
gen_grain_uv_ar3_16bpc_420_neon:   5.05   5.39   5.42   4.74
gen_grain_uv_ar3_16bpc_422_neon:   5.25   5.68   5.57   4.67
gen_grain_uv_ar3_16bpc_444_neon:   6.02   6.33   6.35   4.38
gen_grain_y_ar0_16bpc_neon:        4.67   5.23   5.22  10.11
gen_grain_y_ar1_16bpc_neon:        3.32   3.03   3.28   2.24
gen_grain_y_ar2_16bpc_neon:        4.59   3.95   4.64   3.52
gen_grain_y_ar3_16bpc_neon:        5.89   5.93   6.36   4.79

Absolute numbers:

                                 Cortex A53       A72       A73    Apple M1
gen_grain_uv_ar0_16bpc_420_neon:    19797.2    9725.0    9234.0    29.7
gen_grain_uv_ar0_16bpc_422_neon:    34899.4   16875.3   17021.6    57.7
gen_grain_uv_ar0_16bpc_444_neon:    53776.6   28470.1   28773.1   107.8
gen_grain_uv_ar1_16bpc_420_neon:    37998.2   24631.2   24754.0    84.2
gen_grain_uv_ar1_16bpc_422_neon:    70817.5   44642.5   46323.1   166.3
gen_grain_uv_ar1_16bpc_444_neon:   123333.0   77316.4   83523.1   427.5
gen_grain_uv_ar2_16bpc_420_neon:    49115.8   33053.7   33249.9    93.6
gen_grain_uv_ar2_16bpc_422_neon:    92965.3   59663.8   64741.9   187.9
gen_grain_uv_ar2_16bpc_444_neon:   160899.7  108845.6  115422.4   441.8
gen_grain_uv_ar3_16bpc_420_neon:    65786.6   41924.3   45562.1   108.1
gen_grain_uv_ar3_16bpc_422_neon:   126232.3   78691.6   87351.5   217.6
gen_grain_uv_ar3_16bpc_444_neon:   218702.6  140197.8  151294.8   454.3
gen_grain_y_ar0_16bpc_neon:         35867.9   17653.6   20770.7   108.0
gen_grain_y_ar1_16bpc_neon:        118781.8   74777.1   81338.6   426.0
gen_grain_y_ar2_16bpc_neon:        155919.9  102145.8  109698.1   438.5
gen_grain_y_ar3_16bpc_neon:        213348.1  133054.8  144726.0   447.9

Corresponding numbers for 8bpc:

                                 Cortex A53       A72       A73    Apple M1
gen_grain_uv_ar0_8bpc_420_neon:     15086.1    8384.7    8556.6    29.4
gen_grain_uv_ar0_8bpc_422_neon:     26800.6   14354.4   15526.5    56.6
gen_grain_uv_ar0_8bpc_444_neon:     43749.6   22408.6   24627.9   108.3
gen_grain_uv_ar1_8bpc_420_neon:     33706.3   21892.6   22835.9    87.1
gen_grain_uv_ar1_8bpc_422_neon:     63897.0   41820.1   43468.9   171.8
gen_grain_uv_ar1_8bpc_444_neon:    117345.1   76372.5   79938.3   370.0
gen_grain_uv_ar2_8bpc_420_neon:     42808.8   28493.8   29932.8    92.2
gen_grain_uv_ar2_8bpc_422_neon:     82282.5   53969.4   58191.1   181.8
gen_grain_uv_ar2_8bpc_444_neon:    147641.4   98136.4  103157.6   430.2
gen_grain_uv_ar3_8bpc_420_neon:     56784.3   36342.0   40812.3   102.2
gen_grain_uv_ar3_8bpc_422_neon:    110249.7   70215.6   79716.0   200.5
gen_grain_uv_ar3_8bpc_444_neon:    196461.7  125802.8  141781.5   440.1
gen_grain_y_ar0_8bpc_neon:          36451.7   17794.4   19839.3   109.5
gen_grain_y_ar1_8bpc_neon:         113155.6   71811.9   77296.8   370.2
gen_grain_y_ar2_8bpc_neon:         142812.3   95042.4  100434.4   431.8
gen_grain_y_ar3_8bpc_neon:         191608.6  121199.5  136946.4   437.2
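
To put the 16 bpc and 8 bpc tables side by side, one can divide matching entries. The following is a minimal sketch, not part of the patch: it copies the four luma rows for the Cortex A53 from the tables above, and the helper name `cost_ratio` is a hypothetical one chosen for illustration. The units are whatever the benchmark reported; only the ratio matters here.

```python
# Cortex A53 timings copied from the "Absolute numbers" (16 bpc) and
# "Corresponding numbers for 8bpc" tables above.
neon_16bpc = {
    "gen_grain_y_ar0": 35867.9,
    "gen_grain_y_ar1": 118781.8,
    "gen_grain_y_ar2": 155919.9,
    "gen_grain_y_ar3": 213348.1,
}
neon_8bpc = {
    "gen_grain_y_ar0": 36451.7,
    "gen_grain_y_ar1": 113155.6,
    "gen_grain_y_ar2": 142812.3,
    "gen_grain_y_ar3": 191608.6,
}


def cost_ratio(name):
    """Return the 16 bpc timing divided by the 8 bpc timing (hypothetical helper)."""
    return neon_16bpc[name] / neon_8bpc[name]


for name in neon_16bpc:
    print(f"{name}: {cost_ratio(name):.3f}")
```

For these four rows the ratio stays between roughly 0.98 and 1.12, so on the A53 the 16 bpc luma functions cost about the same as their 8 bpc counterparts.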

The real-world speedup when decoding chimera 10 bpc seems to be small: from around 281 fps to around 283 fps on an Apple M1.
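
As a rough illustration of how small that end-to-end change is, the relative gain can be computed from the two frame rates quoted above. This is only a sketch of the arithmetic; the function name `relative_gain` is hypothetical, and the inputs are the approximate figures from the description, not new measurements.

```python
def relative_gain(before_fps, after_fps):
    """Fractional improvement in frame rate going from before_fps to after_fps."""
    return (after_fps - before_fps) / before_fps


# Approximate frame rates quoted in the description (Apple M1, chimera 10 bpc).
gain = relative_gain(281.0, 283.0)
print(f"{gain:.2%}")  # prints "0.71%"
```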