
arm32: ipred: NEON implementation of ipred functions for 16 bpc

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-ipred16 into master

This also includes the usual set of assorted cleanups and fixes noticed while working on the code.

Samples of some checkasm benchmarks (timings; lower is better):

                                 Cortex A7       A8      A53      A72      A73
cfl_ac_420_w4_16bpc_neon:            258.2    130.0    187.8     88.1     99.9
cfl_ac_420_w8_16bpc_neon:            396.3    192.3    278.0    134.1    148.1
cfl_ac_420_w16_16bpc_neon:           705.9    341.5    508.4    231.2    263.0
intra_pred_filter_w32_10bpc_neon:   3450.6   3279.7   1505.6   1716.8   1631.0
intra_pred_filter_w32_12bpc_neon:   5075.2   2467.3   2027.9   1605.7   1556.0
intra_pred_paeth_w64_16bpc_neon:    7850.6   4682.9   4538.4   4640.4   4952.4
intra_pred_smooth_w64_16bpc_neon:   6807.7   4044.0   4001.4   3001.9   3131.5
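For reference, the intra_pred_paeth rows measure the AV1 Paeth predictor: each output pixel picks whichever of the left, top and top-left neighbours is closest to `left + top - topleft`. The following is a minimal scalar sketch for 16 bpc (10/12-bit pixels stored in `uint16_t`). The function names and the separate `top`/`left`/`topleft` arguments are hypothetical and do not match dav1d's C reference or the NEON calling convention.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: pick the neighbour closest to base = left + top - topleft.
 * Ties prefer left, then top, as in the AV1 Paeth predictor. */
static uint16_t paeth_pixel(int left, int top, int topleft)
{
    const int base = left + top - topleft;
    const int ldiff = abs(base - left);     /* equals |top - topleft| */
    const int tdiff = abs(base - top);      /* equals |left - topleft| */
    const int tldiff = abs(base - topleft);
    if (ldiff <= tdiff && ldiff <= tldiff)
        return (uint16_t)left;
    if (tdiff <= tldiff)
        return (uint16_t)top;
    return (uint16_t)topleft;
}

/* Hypothetical scalar reference (not dav1d's code).
 * dst: w x h block, stride in elements; top: w pixels above the block;
 * left: h pixels to the left, left[0] being the topmost; topleft: corner pixel. */
void ipred_paeth_16bpc_ref(uint16_t *dst, ptrdiff_t stride,
                           const uint16_t *top, const uint16_t *left,
                           uint16_t topleft, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = paeth_pixel(left[y], top[x], topleft);
        dst += stride;
    }
}
```

The per-pixel compare-and-select structure is what makes Paeth a good fit for SIMD: all three differences and both comparisons can be computed for a whole row of 16-bit lanes at once.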

Corresponding numbers for arm64:

                                Cortex A53      A72      A73
cfl_ac_420_w4_16bpc_neon:            154.8     87.1     81.6
cfl_ac_420_w8_16bpc_neon:            235.6    124.8    133.0
cfl_ac_420_w16_16bpc_neon:           428.8    206.5    234.9
intra_pred_filter_w32_10bpc_neon:   1333.2   1485.9   1468.3
intra_pred_filter_w32_12bpc_neon:   1839.1   1429.0   1439.7
intra_pred_paeth_w64_16bpc_neon:    3691.1   3091.8   3289.7
intra_pred_smooth_w64_16bpc_neon:   3776.8   3124.4   2827.1
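The cfl_ac_420 rows in both tables time the chroma-from-luma "AC" extraction for 4:2:0: each chroma position takes the sum of its co-located 2x2 luma pixels, doubled so that it equals the luma average in units of 1/8, and the block's rounded mean is then subtracted. The sketch below assumes power-of-two chroma dimensions and skips the edge padding a real decoder performs; the function name and argument order are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified CfL AC extraction for 4:2:0 at 16 bpc (not dav1d's code).
 * w, h: chroma block size (powers of two); luma points at the top-left luma
 * pixel and covers 2w x 2h pixels; luma_stride is in elements.
 * Padding of partially visible blocks is intentionally omitted. */
void cfl_ac_420_16bpc_ref(int16_t *ac, const uint16_t *luma,
                          ptrdiff_t luma_stride, int w, int h)
{
    int log2sz = 0;
    while ((1 << log2sz) < w * h)   /* log2(w * h), valid for powers of two */
        log2sz++;

    int sum = 0;
    int16_t *row = ac;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint16_t *p = luma + 2 * y * luma_stride + 2 * x;
            const int s = p[0] + p[1] + p[luma_stride] + p[luma_stride + 1];
            row[x] = (int16_t)(s << 1);   /* 2x2 sum * 2 == average * 8 */
            sum += row[x];
        }
        row += w;
    }

    /* Rounded mean over the block, subtracted to leave only the AC part. */
    const int avg = (sum + ((1 << log2sz) >> 1)) >> log2sz;
    for (int i = 0; i < w * h; i++)
        ac[i] = (int16_t)(ac[i] - avg);
}
```

With 12-bit input the largest intermediate is 4095 * 4 * 2 = 32760, which still fits in `int16_t`; that headroom is why the output buffer can stay 16-bit even at the highest bit depth.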
