arm32: looprestoration: Rewrite the wiener functions (!1776) · Merge requests · VideoLAN / dav1d

Martin Storsjö requested to merge mstorsjo/dav1d:arm32-wiener-rewrite into master Dec 20, 2024

Switch to the same cache-friendly algorithm as was done for arm64 in 2e73051c and for the reference C code in 8291a66e.

Unlike the arm64 implementation, this uses a main loop in C (very similar to the one in the main C implementation in 8291a66e) rather than assembly; this adds a bit of overhead on the call to each function, but it shouldn't affect the big picture much.
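
As a rough illustration of the row-at-a-time structure described above, here is a minimal sketch in plain C. It is not the dav1d code: the names (`filter_h`, `filter_v`, `filter_rolling`, `filter_two_pass`, `MAX_W`) are hypothetical, edges are handled by simple replication, and the rounding and clipping of the real Wiener filter are left out. The idea shown is only that the C main loop keeps a window of seven horizontally filtered rows, filters one new row per output row, and runs the vertical pass while those rows are still in cache, instead of filtering the whole block horizontally first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAPS 7
#define MAX_W 64 /* hypothetical limit on the row width for this sketch */

static int clampi(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }

/* Horizontal 7-tap pass over one row; left/right edges are replicated. */
static void filter_h(int32_t *dst, const uint8_t *src, int w,
                     const int16_t f[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += f[k] * src[clampi(x + k - 3, 0, w - 1)];
        dst[x] = sum;
    }
}

/* Vertical 7-tap pass: seven intermediate rows in, one output row out. */
static void filter_v(int32_t *dst, int32_t *const rows[TAPS], int w,
                     const int16_t f[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += f[k] * rows[k][x];
        dst[x] = sum;
    }
}

/* Row-at-a-time variant: only seven intermediate rows are kept. */
static void filter_rolling(int32_t *dst, const uint8_t *src, int w, int h,
                           const int16_t fh[TAPS], const int16_t fv[TAPS]) {
    assert(w <= MAX_W);
    int32_t ring[TAPS][MAX_W];
    int32_t *rows[TAPS];
    /* Prime the window with input rows -3..2 (top edge replicated). */
    for (int k = 0; k < TAPS - 1; k++) {
        rows[k] = ring[k];
        filter_h(rows[k], src + clampi(k - 3, 0, h - 1) * w, w, fh);
    }
    rows[TAPS - 1] = ring[TAPS - 1];
    for (int y = 0; y < h; y++) {
        /* One new horizontally filtered row per output row. */
        filter_h(rows[TAPS - 1], src + clampi(y + 3, 0, h - 1) * w, w, fh);
        filter_v(dst + y * w, rows, w, fv);
        /* Rotate: the oldest row buffer is reused for the next new row. */
        int32_t *oldest = rows[0];
        memmove(rows, rows + 1, (TAPS - 1) * sizeof(rows[0]));
        rows[TAPS - 1] = oldest;
    }
}

/* Two-pass variant for comparison: needs a full w*h intermediate buffer. */
static void filter_two_pass(int32_t *dst, int32_t *tmp, const uint8_t *src,
                            int w, int h, const int16_t fh[TAPS],
                            const int16_t fv[TAPS]) {
    for (int y = 0; y < h; y++)
        filter_h(tmp + y * w, src + y * w, w, fh);
    for (int y = 0; y < h; y++) {
        int32_t *rows[TAPS];
        for (int k = 0; k < TAPS; k++)
            rows[k] = tmp + clampi(y + k - 3, 0, h - 1) * w;
        filter_v(dst + y * w, rows, w, fv);
    }
}
```

Both variants produce the same output; the rolling one only changes the order of the work and the size of the intermediate storage. In this sketch the per-row calls to `filter_h` and `filter_v` stand in for the per-row function calls whose overhead is mentioned above.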

Performance-wise, this doesn't make much of a difference; it makes things a little bit faster on some cores, and a little bit slower on others:

Before:                 Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0
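
To read the table as relative changes, a small helper like the following can be used. This is only a sketch: `pct_change` and `print_wiener_8bpc_changes` are hypothetical names, and it assumes the figures are timings where lower is better (the description does not state the unit). The values are copied from the 8bpc rows above.

```c
#include <stdio.h>

/* Relative change in percent; a negative result means "after" is lower. */
static double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

/* Print the relative change for each core in the 8bpc rows of the table. */
static void print_wiener_8bpc_changes(void) {
    static const char *const cores[5] = { "A7", "A8", "A53", "A72", "A73" };
    static const double before[5] = { 269384.4, 147730.7, 140028.5, 92662.5, 92929.0 };
    static const double after[5]  = { 238328.0, 157274.1, 134588.6, 92200.3, 97619.6 };
    for (int i = 0; i < 5; i++)
        printf("Cortex %-3s %+6.1f%%\n", cores[i], pct_change(before[i], after[i]));
}
```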

This is mostly in line with the results on arm64 in 2e73051c. On arm64, there was a somewhat larger speedup for the 7tap case, mostly attributed to unrolling the vertical filter (and the new filter_hv function) to operate on 16 pixels at a time. On arm32, there aren't enough registers to do that, so we can't get such gains from unrolling. (Reducing the unrolling on the arm64 version to match the case on arm32 also gives performance numbers similar to those on arm32 here.)

In the arm64 version, we also added separate 5tap versions of all functions; that is not done for arm32 at this point.

This increases the binary size by 2 KB.

This doesn't have any immediate effect on how much stack space dav1d requires in total, since the largest stack users on arm currently are the 8tap_scaled functions.
