arm32: looprestoration: NEON implementation of wiener filter for 16 bpc

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-wiener16 into master

Checkasm benchmarks:

                           Cortex A7         A8        A53       A72       A73
wiener_chroma_10bpc_neon:   385312.5   165772.7   184308.2  122311.2  126050.2
wiener_chroma_12bpc_neon:   385296.7   165538.0   184438.2  122290.5  126205.3
wiener_luma_10bpc_neon:     385318.5   165985.3   184147.4  122311.1  126168.4 
wiener_luma_12bpc_neon:     385316.3   165819.1   184484.7  122304.4  125982.4

The corresponding numbers for arm64 for comparison:

                                                Cortex A53       A72       A73
wiener_chroma_10bpc_neon:                         176319.7  125992.1  128162.4
wiener_chroma_12bpc_neon:                         176386.2  125986.4  128343.8
wiener_luma_10bpc_neon:                           176174.0  126001.7  128227.8
wiener_luma_12bpc_neon:                           176176.5  125992.1  128204.8

The arm32 version actually seems to run marginally faster than the arm64 one on A72 and A73 (by roughly 2-3%), while being about 4-5% slower on A53. I believe this is because the arm64 code is tuned for A53 (which makes it a bit slower on other cores), whereas the arm32 code can't be tuned in exactly the same way, since fewer registers are available.
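
For reference, the relative differences can be recomputed from the tables above. This is only a small sketch: it uses the `wiener_chroma_10bpc_neon` row as a representative, and the helper name `relative_diff` is made up for illustration and is not part of dav1d or checkasm.

```python
# Checkasm timings copied from the wiener_chroma_10bpc_neon rows above
# (lower is faster). Only cores present in both tables are compared.
arm32 = {"A53": 184308.2, "A72": 122311.2, "A73": 126050.2}
arm64 = {"A53": 176319.7, "A72": 125992.1, "A73": 128162.4}


def relative_diff(a32, a64):
    """Percent by which arm32 is slower (+) or faster (-) than arm64."""
    return (a32 / a64 - 1.0) * 100.0


for core in arm32:
    print(f"{core}: {relative_diff(arm32[core], arm64[core]):+.1f}%")
```

A negative value means the arm32 code needs less time than the arm64 code on that core. The other three rows give nearly identical ratios, since the timings barely vary between the luma/chroma and 10/12 bpc variants.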
