arm64: looprestoration: Rewrite the wiener functions

Make them operate in a more cache-friendly manner, interleaving
horizontal and vertical filtering (reducing the amount of stack
used from 51 KB to 4 KB), similar to what was done for x86 in
78d27b7d.

This also adds separate 5tap versions of the filters and unrolls
the vertical filter a bit more (which probably could have been
done without the rewrite).

This does, however, increase the compiled code size by around
3.5 KB.

Before:                Cortex A53       A72       A73
wiener_5tap_8bpc_neon:   136855.6   91446.2   87363.6
wiener_7tap_8bpc_neon:   136861.6   91454.9   87374.5
wiener_5tap_10bpc_neon:  167685.3  114720.3  116522.1
wiener_5tap_12bpc_neon:  167677.5  114724.7  116511.9
wiener_7tap_10bpc_neon:  167681.6  114738.5  116567.0
wiener_7tap_12bpc_neon:  167673.8  114720.8  116515.4
After:
wiener_5tap_8bpc_neon:    87102.1   60460.6   66803.8
wiener_7tap_8bpc_neon:   110831.7   78489.0   82015.9
wiener_5tap_10bpc_neon:  109999.2   90259.0   89238.0
wiener_5tap_12bpc_neon:  109978.3   90255.7   89220.7
wiener_7tap_10bpc_neon:  137877.6  107578.5  103435.6
wiener_7tap_12bpc_neon:  137868.8  107568.9  103390.4