arm: cdef: Port the ARM64 CDEF NEON assembly to 32 bit arm

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-cdef into master

The relative speedup over the C code ranges from 2.5x to 3.9x for find_dir and is roughly 5x to 10x for filter.

The find_dir function is somewhat constrained by barely having enough registers, which leaves very few for temporaries. As a result, fewer things can be done in parallel, and many instructions end up depending on the result of the preceding instruction.

The ported functions end up only marginally slower than the corresponding ARM64 ones:

ARM64:                   Cortex A53     A72     A73
cdef_dir_8bpc_neon:           400.0   268.8   282.2
cdef_filter_4x4_8bpc_neon:    596.3   359.9   379.7
cdef_filter_4x8_8bpc_neon:   1091.0   670.4   698.5
cdef_filter_8x8_8bpc_neon:   1998.7  1207.2  1218.4
ARM32:
cdef_dir_8bpc_neon:           528.5   329.1   337.4
cdef_filter_4x4_8bpc_neon:    632.5   482.5   432.2
cdef_filter_4x8_8bpc_neon:   1107.2   854.8   782.3
cdef_filter_8x8_8bpc_neon:   1984.8  1381.0  1414.4
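
To put the two tables side by side, the per-core ARM32/ARM64 ratio can be computed directly from the numbers listed. This is only a small sketch for reading the tables, not part of the patch. The dictionary layout and the `slowdown` helper are made up for illustration, and it assumes the timings are in the same unit in both tables, with lower meaning faster.

```python
# Timings copied from the ARM64 and ARM32 tables above (lower is faster).
# Columns: Cortex-A53, A72, A73.
CORES = ("A53", "A72", "A73")

ARM64 = {
    "cdef_dir_8bpc_neon":        (400.0, 268.8, 282.2),
    "cdef_filter_4x4_8bpc_neon": (596.3, 359.9, 379.7),
    "cdef_filter_4x8_8bpc_neon": (1091.0, 670.4, 698.5),
    "cdef_filter_8x8_8bpc_neon": (1998.7, 1207.2, 1218.4),
}

ARM32 = {
    "cdef_dir_8bpc_neon":        (528.5, 329.1, 337.4),
    "cdef_filter_4x4_8bpc_neon": (632.5, 482.5, 432.2),
    "cdef_filter_4x8_8bpc_neon": (1107.2, 854.8, 782.3),
    "cdef_filter_8x8_8bpc_neon": (1984.8, 1381.0, 1414.4),
}


def slowdown(name):
    """Hypothetical helper: ARM32/ARM64 timing ratio per core (1.0 = equal)."""
    return tuple(a32 / a64 for a32, a64 in zip(ARM32[name], ARM64[name]))


if __name__ == "__main__":
    for name in ARM64:
        cells = "  ".join(
            f"{core}: {ratio:.2f}x" for core, ratio in zip(CORES, slowdown(name))
        )
        print(f"{name:<28}{cells}")
```

A ratio below 1.0 means the ARM32 version measured faster than the ARM64 one, as is the case for cdef_filter_8x8 on the Cortex A53 (1984.8 vs 1998.7).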

Relative speedup over C code:

                        Cortex A7     A8     A9    A53    A72    A73
cdef_dir_8bpc_neon:          2.92   2.54   2.67   3.87   3.37   3.83
cdef_filter_4x4_8bpc_neon:   5.09   7.61   6.10   6.85   4.94   7.41
cdef_filter_4x8_8bpc_neon:   5.53   8.23   6.77   7.67   5.60   8.01
cdef_filter_8x8_8bpc_neon:   6.26  10.14   8.49   8.54   6.94   4.27
