arm32: filmgrain: Add NEON implementation of gen_grain for 8 bpc

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-filmgrain-8bpc into master

Relative speedup over C code:

                             Cortex A7     A8     A9    A53    A72    A73
gen_grain_uv_ar0_8bpc_420_neon:   6.13   7.81   8.17   6.78   6.62  11.13
gen_grain_uv_ar0_8bpc_422_neon:   6.34   7.64   8.00   6.83   6.93  10.31 
gen_grain_uv_ar0_8bpc_444_neon:   7.09   8.29   8.55   7.95   7.89  11.05
gen_grain_uv_ar1_8bpc_420_neon:   3.39   2.26   3.06   4.13   3.41   4.95
gen_grain_uv_ar1_8bpc_422_neon:   3.40   2.23   3.02   4.18   3.36   4.73
gen_grain_uv_ar1_8bpc_444_neon:   3.46   2.18   2.95   4.46   3.57   4.91
gen_grain_uv_ar2_8bpc_420_neon:   3.88   3.00   3.32   4.74   3.57   5.31
gen_grain_uv_ar2_8bpc_422_neon:   3.92   3.04   3.36   4.82   3.57   5.06
gen_grain_uv_ar2_8bpc_444_neon:   4.32   3.14   3.62   5.56   3.90   5.43
gen_grain_uv_ar3_8bpc_420_neon:   4.35   3.53   4.05   5.35   4.44   5.56 
gen_grain_uv_ar3_8bpc_422_neon:   4.38   3.49   4.17   5.41   4.48   5.36
gen_grain_uv_ar3_8bpc_444_neon:   4.84   3.70   4.36   5.95   4.87   5.82
gen_grain_y_ar0_8bpc_neon:        5.18   5.57   7.65   5.93   7.13   9.01
gen_grain_y_ar1_8bpc_neon:        2.64   1.66   2.48   3.32   3.15   3.77
gen_grain_y_ar2_8bpc_neon:        3.57   2.64   3.21   4.59   3.68   4.64
gen_grain_y_ar3_8bpc_neon:        4.27   3.93   4.12   5.41   4.63   5.17

(The A73 is benchmarked against C code built with a different C compiler, which may explain why its numbers differ slightly.)

Absolute numbers:

                                 Cortex A7         A8         A9        A53       A72        A73
gen_grain_uv_ar0_8bpc_420_neon:    19614.6    13396.4    12320.4    15030.7    8288.1     8754.4
gen_grain_uv_ar0_8bpc_422_neon:    34660.9    24315.5    22225.3    26809.2   14549.8    15804.6
gen_grain_uv_ar0_8bpc_444_neon:    55625.6    39914.5    37100.2    44658.3   22917.3    27369.6
gen_grain_uv_ar1_8bpc_420_neon:    50049.5    63179.4    44793.1    36406.7   22690.3    25401.9
gen_grain_uv_ar1_8bpc_422_neon:    93289.5   117755.0    82815.4    67081.4   43133.1    46698.0
gen_grain_uv_ar1_8bpc_444_neon:   170880.0   223259.2   156241.5   122760.0   78655.6    85604.9
gen_grain_uv_ar2_8bpc_420_neon:    68185.5    78123.2    61457.3    47886.7   31526.2    36519.6
gen_grain_uv_ar2_8bpc_422_neon:   129195.2   148653.9   114133.2    89822.7   60242.6    70160.1
gen_grain_uv_ar2_8bpc_444_neon:   233133.7   272277.4   214108.7   161589.5  109069.3   127763.7
gen_grain_uv_ar3_8bpc_420_neon:    96374.4    94372.2    79663.8    70832.0   43065.3    50593.9
gen_grain_uv_ar3_8bpc_422_neon:   186324.8   184321.8   151490.1   136200.1   83758.0    98378.7
gen_grain_uv_ar3_8bpc_444_neon:   335596.6   336811.6   279755.5   247251.5  151657.2   178906.0
gen_grain_y_ar0_8bpc_neon:         46109.3    36022.2    28476.2    36478.5   18740.1    20660.4
gen_grain_y_ar1_8bpc_neon:        165054.2   217090.4   152578.9   118409.4   74357.2    83794.5
gen_grain_y_ar2_8bpc_neon:        226576.9   268320.3   210924.6   157829.4  105956.5   124293.2
gen_grain_y_ar3_8bpc_neon:        328337.2   330421.3   275110.1   242097.3  148538.7   177270.8 
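
As a rough cross-check, the two tables can be combined to estimate the timing of the C reference. This is only a sketch: it assumes the absolute numbers are the timings of the NEON functions and that each speedup is the C timing divided by the NEON timing, neither of which is stated explicitly here. The helper name `implied_c_time` is made up for illustration.

```python
# Values copied from the Cortex-A7 column of the tables above.
# Assumption: "absolute" is the NEON timing, "speedup" is C time / NEON time.
A7 = {
    "gen_grain_y_ar0_8bpc_neon": {"speedup": 5.18, "absolute": 46109.3},
    "gen_grain_y_ar3_8bpc_neon": {"speedup": 4.27, "absolute": 328337.2},
}


def implied_c_time(entry):
    """Estimate the C timing, in the same unit as the absolute table."""
    return entry["speedup"] * entry["absolute"]


for name, entry in A7.items():
    print(f"{name}: implied C timing ~ {implied_c_time(entry):.0f}")
```

Since the speedups are rounded to two decimals, the result is only accurate to about three significant digits.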

Corresponding numbers for the original arm64 version:

                                                                 Cortex A53       A72        A73
gen_grain_uv_ar0_8bpc_420_neon:                                     14874.7    7765.5     8536.0
gen_grain_uv_ar0_8bpc_422_neon:                                     26510.9   13685.3    15308.2
gen_grain_uv_ar0_8bpc_444_neon:                                     43189.6   21565.3    24312.0
gen_grain_uv_ar1_8bpc_420_neon:                                     33715.7   21669.8    22758.3
gen_grain_uv_ar1_8bpc_422_neon:                                     63955.3   41581.4    42852.5
gen_grain_uv_ar1_8bpc_444_neon:                                    117390.1   76503.5    78446.4
gen_grain_uv_ar2_8bpc_420_neon:                                     42779.0   27794.3    29677.9
gen_grain_uv_ar2_8bpc_422_neon:                                     82283.8   53446.7    58232.2
gen_grain_uv_ar2_8bpc_444_neon:                                    147773.8   98492.7   103754.1
gen_grain_uv_ar3_8bpc_420_neon:                                     56698.8   35697.1    40695.9
gen_grain_uv_ar3_8bpc_422_neon:                                    110132.4   69829.1    79196.8
gen_grain_uv_ar3_8bpc_444_neon:                                    196642.7  124174.9   141812.5
gen_grain_y_ar0_8bpc_neon:                                          36461.0   17782.0    19827.0
gen_grain_y_ar1_8bpc_neon:                                         113202.7   72457.7    75995.8
gen_grain_y_ar2_8bpc_neon:                                         142894.0   94450.9   100304.5
gen_grain_y_ar3_8bpc_neon:                                         191697.7  120674.9   137223.8

The arm64 version uses a lot of registers (21 different GPRs in total, 18 of them in the hot loop). arm32 has far fewer registers available, so making the same code work there adds some overhead.
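
To put a number on that overhead, the arm32 and arm64 timings for the same core can be divided directly. The sketch below does this for the luma functions on the Cortex-A53, using values copied from the two tables above. The dictionary and function names are made up for illustration, and the comparison assumes both builds were measured the same way.

```python
# Cortex-A53 timings copied from the arm32 and arm64 tables above.
ARM32_A53 = {
    "gen_grain_y_ar0_8bpc_neon": 36478.5,
    "gen_grain_y_ar1_8bpc_neon": 118409.4,
    "gen_grain_y_ar2_8bpc_neon": 157829.4,
    "gen_grain_y_ar3_8bpc_neon": 242097.3,
}
ARM64_A53 = {
    "gen_grain_y_ar0_8bpc_neon": 36461.0,
    "gen_grain_y_ar1_8bpc_neon": 113202.7,
    "gen_grain_y_ar2_8bpc_neon": 142894.0,
    "gen_grain_y_ar3_8bpc_neon": 191697.7,
}


def arm32_overhead(name):
    """Ratio of arm32 to arm64 timing; 1.0 means no overhead."""
    return ARM32_A53[name] / ARM64_A53[name]


for name in ARM32_A53:
    print(f"{name}: {arm32_overhead(name):.2f}x")
```

On these numbers the ar0 case is essentially level between the two architectures, while the ar3 case is roughly a quarter slower on arm32.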
