arm32: cdef: Add NEON implementations of CDEF for 16 bpc

Use a shared template file for assembly functions that can be
templated into 8 and 16 bpc forms, just like in the arm64 version.

Checkasm benchmarks:
                           Cortex A7      A8     A53     A72     A73
cdef_dir_16bpc_neon:           975.9   853.2   555.2   378.7   386.9
cdef_filter_4x4_16bpc_neon:    746.9   521.7   481.2   333.0   340.8
cdef_filter_4x8_16bpc_neon:   1300.0   885.5   816.3   582.7   599.5
cdef_filter_8x8_16bpc_neon:   2282.5  1415.0  1417.6  1059.0  1076.3

Corresponding numbers for arm64, for comparison:
                          Cortex A53     A72     A73
cdef_dir_16bpc_neon:           418.0   306.7   310.7
cdef_filter_4x4_16bpc_neon:    453.4   282.9   297.4
cdef_filter_4x8_16bpc_neon:    807.5   514.2   533.8
cdef_filter_8x8_16bpc_neon:   1425.2   924.4   942.0