arm: cdef: Port the ARM64 CDEF NEON assembly to 32 bit arm
The relative speedup ranges from 2.5 to 3.8x for find_dir and around 5 to 10x for filter.
The find_dir function is somewhat constrained by barely having enough registers (32-bit NEON has 16 128-bit registers, half as many as ARM64). This leaves very few for temporaries, so fewer operations can be done in parallel and many instructions end up depending on the result of the preceding instruction.
The ported functions end up only marginally slower than the corresponding ARM64 ones:
ARM64:                  Cortex A53     A72     A73
cdef_dir_8bpc_neon:          400.0   268.8   282.2
cdef_filter_4x4_8bpc_neon:   596.3   359.9   379.7
cdef_filter_4x8_8bpc_neon:  1091.0   670.4   698.5
cdef_filter_8x8_8bpc_neon:  1998.7  1207.2  1218.4
ARM32:
cdef_dir_8bpc_neon:          528.5   329.1   337.4
cdef_filter_4x4_8bpc_neon:   632.5   482.5   432.2
cdef_filter_4x8_8bpc_neon:  1107.2   854.8   782.3
cdef_filter_8x8_8bpc_neon:  1984.8  1381.0  1414.4
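To put a number on the ARM32 versus ARM64 comparison, the per-core ratios can be derived from the two tables above. The following is a minimal sketch and not part of the patch; the names CORES, ARM64, ARM32 and ratios are made up for illustration, and the values are copied verbatim from the tables.

```python
# Illustrative helper, not part of the patch: names below are hypothetical.
# Benchmark numbers copied from the tables above (columns: Cortex A53, A72, A73).
CORES = ("A53", "A72", "A73")

ARM64 = {
    "cdef_dir_8bpc_neon": (400.0, 268.8, 282.2),
    "cdef_filter_4x4_8bpc_neon": (596.3, 359.9, 379.7),
    "cdef_filter_4x8_8bpc_neon": (1091.0, 670.4, 698.5),
    "cdef_filter_8x8_8bpc_neon": (1998.7, 1207.2, 1218.4),
}

ARM32 = {
    "cdef_dir_8bpc_neon": (528.5, 329.1, 337.4),
    "cdef_filter_4x4_8bpc_neon": (632.5, 482.5, 432.2),
    "cdef_filter_4x8_8bpc_neon": (1107.2, 854.8, 782.3),
    "cdef_filter_8x8_8bpc_neon": (1984.8, 1381.0, 1414.4),
}


def ratios(arm32, arm64):
    """Return {function: {core: arm32 / arm64}}.

    A value above 1.0 means the ARM32 version is slower on that core.
    """
    return {
        fn: {
            core: a32 / a64
            for core, a32, a64 in zip(CORES, arm32[fn], arm64[fn])
        }
        for fn in arm64
    }


if __name__ == "__main__":
    # Print one line per function with the ARM32/ARM64 ratio for each core.
    for fn, per_core in ratios(ARM32, ARM64).items():
        cols = "  ".join(f"{core}={r:.2f}" for core, r in per_core.items())
        print(f"{fn}: {cols}")
```

Running the script prints one ratio per function and core, which makes it easy to see where the port is on par with ARM64 and where the gap is larger.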
Relative speedup over C code:
                        Cortex A7     A8     A9    A53    A72    A73
cdef_dir_8bpc_neon:          2.92   2.54   2.67   3.87   3.37   3.83
cdef_filter_4x4_8bpc_neon:   5.09   7.61   6.10   6.85   4.94   7.41
cdef_filter_4x8_8bpc_neon:   5.53   8.23   6.77   7.67   5.60   8.01
cdef_filter_8x8_8bpc_neon:   6.26  10.14   8.49   8.54   6.94   4.27