aarch64: optimize cabac asm

0.5% to 2% overall speedup on
`./x264 --threads X --profile high --preset veryfast --crf 15 -o /dev/null park_joy_420_720p50.y4m`
cabac accounts for roughly 1/6 of the CPU time in this test.
Branch mispredictions are reduced by 15% to 20%.

cortex-a53: 0.5% faster
cortex-a72: 2% faster
neoverse-n1: 0.9% faster