Commit 0282f6f3 authored by Martin Storsjö

arm64: loopfilter: Implement NEON loop filters

The speedup relative to the C code is hard to measure precisely, as it
depends on exactly how many blocks are skipped: the NEON version always
filters 16 pixels at a time, while the C code can skip processing
individual 4 pixel blocks.
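
A minimal sketch of why skipped blocks matter, counting pixels touched
per 16 pixel group. This is a hypothetical model, not dav1d code: the
`mask` layout (one bit per 4 pixel block) and both function names are
assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical model, not dav1d code: each bit of `mask` marks one
 * 4 pixel block of a 16 pixel group that needs filtering. */
enum { BLOCK_PIXELS = 4, BLOCKS_PER_GROUP = 4 };

/* Pixels a scalar loop touches when it skips blocks whose bit is clear. */
static int scalar_pixels_filtered(uint8_t mask)
{
    int pixels = 0;
    for (int b = 0; b < BLOCKS_PER_GROUP; b++)
        if (mask & (1u << b))
            pixels += BLOCK_PIXELS;
    return pixels;
}

/* Pixels a 16-wide vector pass touches: the whole group, whatever the
 * mask says (assumed here to match "always filters 16 pixels"). */
static int vector_pixels_filtered(uint8_t mask)
{
    (void)mask;
    return BLOCK_PIXELS * BLOCKS_PER_GROUP;
}
```

With only one block active (`mask == 0x01`) the scalar count is 4 and
the vector count is still 16, so the scalar path does a quarter of the
work and the measured speedup shrinks accordingly.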

Additionally, the checkasm benchmarking code runs the same function
repeatedly on the same buffer, which can make the filter take
different codepaths on each run, as the function updates the buffer
which will be used as input for the next run.
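
The effect can be illustrated with a toy in-place filter. This is a
sketch under loud assumptions: `toy_filter_edge` and its skip rule are
invented for illustration and are far simpler than a real loop filter.

```c
#include <stdlib.h>

/* Toy in-place edge filter (hypothetical, not the dav1d filter).
 * Returns 1 if it took the filtering path, 0 if it skipped the edge. */
static int toy_filter_edge(unsigned char *p0, unsigned char *q0, int limit)
{
    int diff = abs((int)*p0 - (int)*q0);
    if (diff == 0 || diff >= limit)
        return 0;               /* skip: already smooth, or a real edge */
    unsigned char avg = (unsigned char)((*p0 + *q0 + 1) >> 1);
    *p0 = avg;                  /* the output overwrites the input */
    *q0 = avg;
    return 1;
}
```

Calling it twice on the same pair with `limit` 8, starting from 100 and
104, filters on the first call and skips on the second, because the
first call already changed the data the second one reads.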

If the checkasm test data is tweaked to try to avoid skipped blocks,
the speedup relative to C is between 2x and 5x, while it is around
1x to 4x with the current checkasm test as is.

Benchmark numbers from a tweaked checkasm that avoids skipped
blocks:

                        Cortex A53     A72     A73
lpf_h_sb_uv_w4_8bpc_c:      2954.7  1399.3  1655.3
lpf_h_sb_uv_w4_8bpc_neon:    895.5   650.8   692.0
lpf_h_sb_uv_w6_8bpc_c:      3879.2  1917.2  2257.7
lpf_h_sb_uv_w6_8bpc_neon:   1125.6   759.5   838.4
lpf_h_sb_y_w4_8bpc_c:       6711.0  3275.5  3913.7
lpf_h_sb_y_w4_8bpc_neon:    1744.0  1342.1  1351.5
lpf_h_sb_y_w8_8bpc_c:      10695.7  6155.8  6638.9
lpf_h_sb_y_w8_8bpc_neon:    2146.5  1560.4  1609.1
lpf_h_sb_y_w16_8bpc_c:     11355.8  6292.0  6995.9
lpf_h_sb_y_w16_8bpc_neon:   2475.4  1949.6  1968.4
lpf_v_sb_uv_w4_8bpc_c:      2639.7  1204.8  1425.9
lpf_v_sb_uv_w4_8bpc_neon:    510.7   351.4   334.7
lpf_v_sb_uv_w6_8bpc_c:      3468.3  1757.1  2021.5
lpf_v_sb_uv_w6_8bpc_neon:    625.0   415.0   397.8
lpf_v_sb_y_w4_8bpc_c:       5428.7  2731.7  3068.5
lpf_v_sb_y_w4_8bpc_neon:    1172.6   792.1   768.0
lpf_v_sb_y_w8_8bpc_c:       8946.1  4412.8  5121.0
lpf_v_sb_y_w8_8bpc_neon:    1565.5  1063.6  1062.7
lpf_v_sb_y_w16_8bpc_c:      8978.9  4411.7  5112.0
lpf_v_sb_y_w16_8bpc_neon:   1775.0  1288.1  1236.7
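
For reading the table, the speedup of a row pair is the C timing
divided by the NEON timing. A small sketch that computes the range for
the Cortex A53 column of the lpf_h rows; the struct and function names
are hypothetical helpers, and the numbers are copied from the table
above.

```c
#include <stddef.h>

/* One C/NEON row pair from the table (timings in checkasm units). */
struct bench_pair { const char *name; double c, neon; };

/* Cortex A53 column, lpf_h rows, copied from the table above. */
static const struct bench_pair a53_lpf_h[] = {
    { "lpf_h_sb_uv_w4_8bpc",  2954.7,  895.5 },
    { "lpf_h_sb_uv_w6_8bpc",  3879.2, 1125.6 },
    { "lpf_h_sb_y_w4_8bpc",   6711.0, 1744.0 },
    { "lpf_h_sb_y_w8_8bpc",  10695.7, 2146.5 },
    { "lpf_h_sb_y_w16_8bpc", 11355.8, 2475.4 },
};

/* Smallest and largest C/NEON ratio over n pairs; n must be > 0. */
static void speedup_range(const struct bench_pair *rows, size_t n,
                          double *min, double *max)
{
    *min = *max = rows[0].c / rows[0].neon;
    for (size_t i = 1; i < n; i++) {
        double s = rows[i].c / rows[i].neon;
        if (s < *min) *min = s;
        if (s > *max) *max = s;
    }
}
```

For these five rows the ratios run from about 3.3x (uv_w4) to about
5.0x (y_w8).
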
parent 204bf211