Commit 2ba57aa5 authored by Martin Storsjö

arm32: looprestoration: Rewrite the wiener functions

Switch to the same cache-friendly algorithm as was done for arm64
in 2e73051c and for the reference C code in 8291a66e.

Unlike the arm64 implementation, this uses a main loop in C
(very similar to the one in the main C implementation in
8291a66e) rather than in assembly. This adds a bit more overhead
on the call to each function, but it shouldn't affect the big
picture much.
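
As a rough illustration of the row-based structure described above, here
is a minimal sketch in plain C. It is not the dav1d code: the names
(sketch_wiener, sketch_filter_h, sketch_filter_v), the MAX_W limit, the
edge replication and the rounding are all simplifying assumptions made
for this example. The idea shown is that each source row is filtered
horizontally once into a small ring of intermediate rows, and the
vertical filter then reads the 7 rows that are still hot in the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 7
#define MAX_W 64 /* hypothetical row-width limit for this sketch */

/* Clamp an index into [0, n-1]; stands in for edge replication. */
static int clamp_idx(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Horizontal pass: filter one source row into one intermediate row. */
static void sketch_filter_h(int32_t *dst, const uint8_t *src, int w,
                            const int16_t fh[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fh[k] * src[clamp_idx(x + k - 3, w)];
        dst[x] = sum;
    }
}

/* Vertical pass: combine 7 intermediate rows into one output row. */
static void sketch_filter_v(uint8_t *dst, int32_t *const rows[TAPS], int w,
                            const int16_t fv[TAPS], int shift) {
    const int32_t rnd = 1 << (shift - 1);
    for (int x = 0; x < w; x++) {
        int32_t sum = rnd;
        for (int k = 0; k < TAPS; k++)
            sum += fv[k] * rows[k][x];
        /* Clamp negatives before shifting, then clamp to 8 bits. */
        int32_t v = sum < 0 ? 0 : sum >> shift;
        dst[x] = (uint8_t)(v > 255 ? 255 : v);
    }
}

/* Main loop in C: one new horizontal row per output row. */
static void sketch_wiener(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                          int w, int h, const int16_t fh[TAPS],
                          const int16_t fv[TAPS], int shift) {
    int32_t ring[TAPS][MAX_W];
    int32_t *rows[TAPS];

    assert(w > 0 && w <= MAX_W && h > 0 && shift > 0);

    /* Prime the ring with rows -3..+2 (clamped to the image). */
    for (int k = 0; k < TAPS - 1; k++) {
        rows[k] = ring[k];
        sketch_filter_h(rows[k], src + (ptrdiff_t)clamp_idx(k - 3, h) * stride,
                        w, fh);
    }
    rows[TAPS - 1] = ring[TAPS - 1];

    for (int y = 0; y < h; y++) {
        /* Only row y + 3 is new; the other six are reused. */
        sketch_filter_h(rows[TAPS - 1],
                        src + (ptrdiff_t)clamp_idx(y + 3, h) * stride, w, fh);
        sketch_filter_v(dst + (ptrdiff_t)y * stride, rows, w, fv, shift);

        /* Rotate the row pointers; the oldest buffer gets reused next. */
        int32_t *oldest = rows[0];
        for (int k = 0; k < TAPS - 1; k++)
            rows[k] = rows[k + 1];
        rows[TAPS - 1] = oldest;
    }
}
```

In this sketch only pointers are rotated, never pixel data, and the
per-row helpers are the parts one would expect to be replaced by NEON
assembly while the outer loop stays in C.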

Performance-wise, this doesn't make much of a difference - it makes
things a little bit faster on some cores, and a little bit slower
on others:

Before:                 Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0
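
To put the table above in relative terms, the following small C helper
computes the percentage change per core. It is only an arithmetic
sketch: the function and array names are made up for this example, and
it assumes that lower numbers mean faster, as the "faster"/"slower"
wording above implies.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_CORES 5

/* Percentage change from before to after; negative means faster,
 * under the assumption that lower numbers are better. */
static double relative_change_pct(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Print the per-core change for both benchmark rows in the table. */
static void print_changes(void) {
    static const char *const cores[NUM_CORES] = {
        "A7", "A8", "A53", "A72", "A73"
    };
    static const double before[2][NUM_CORES] = {
        { 269384.4, 147730.7, 140028.5,  92662.5,  92929.0 }, /* 8bpc */
        { 352690.2, 159970.2, 169427.8, 116614.9, 119371.1 }, /* 10bpc */
    };
    static const double after[2][NUM_CORES] = {
        { 238328.0, 157274.1, 134588.6,  92200.3,  97619.6 }, /* 8bpc */
        { 336369.3, 162182.0, 161954.4, 125521.2, 130634.0 }, /* 10bpc */
    };
    static const char *const names[2] = {
        "wiener_7tap_8bpc_neon", "wiener_7tap_10bpc_neon"
    };

    for (int r = 0; r < 2; r++) {
        printf("%s:\n", names[r]);
        for (int c = 0; c < NUM_CORES; c++)
            printf("  Cortex %-4s %+6.2f%%\n", cores[c],
                   relative_change_pct(before[r][c], after[r][c]));
    }
}
```

Calling print_changes() prints one signed percentage per core and
benchmark row.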

This is mostly in line with the results on arm64 in 2e73051c.
On arm64, there was a slightly larger speedup for the 7tap case,
mostly attributed to unrolling the vertical filter (and the new
filter_hv function) to operate on 16 pixels at a time. On arm32,
there aren't enough registers to do that, so we can't get such
gains from unrolling. (Reducing the unrolling in the arm64 version
to match arm32 also gives performance numbers similar to those on
arm32 here.)

In the arm64 version, we also added separate 5tap versions of all
functions; not doing that for arm32 at this point.

This increases the binary size by 2 KB.

This doesn't have any immediate effect on how much stack space
dav1d requires in total, since the largest stack users on arm
currently are the 8tap_scaled functions.
parent 8291a66e
1 merge request: !1776 arm32: looprestoration: Rewrite the wiener functions
Pipeline #547131 passed with stages in 36 minutes and 20 seconds