itx: Size optimizations for arm32, arm64 and riscv64

arm32 saves 1424 bytes
arm64 saves 2176 bytes
riscv64 saves 2918 bytes

arm32/itx: Reuse 4x16 epilog, saves 268 bytes
arm32/itx: Reuse 16x4 epilog, saves 220 bytes
arm32/itx: Reuse 8x16 epilog, saves 48 bytes
arm32/itx: Remove 16x8 variant, saves 528 bytes
arm32/itx: Reuse horz_16x4 epilog, saves 336 bytes
arm32/itx16: Reuse horz_16x2 epilog, saves 24 bytes
arm64/itx: Reuse 4x16 epilog, saves 312 bytes
arm64/itx: Reuse 16x4 epilog, saves 264 bytes
arm64/itx: Reuse 8x16 epilog, saves 424 bytes
arm64/itx: Reuse 16x8 epilog, saves 568 bytes
arm64/itx: Reuse horz_16x8 epilog, saves 512 bytes
arm64/itx16: Reuse horz_16x4 epilog, saves 96 bytes
riscv64/itx: Fix unrolled .irp loops, saves 12 bytes
riscv64/itx: Reuse 4x16 epilog, saves 642 bytes
riscv64/itx: Reuse 16x4 epilog, saves 354 bytes
riscv64/itx: Tail call vert_8x16, saves 1086 bytes
riscv64/itx: Reuse 8x16 epilog, saves 24 bytes
riscv64/itx: Reuse 16x8 epilog, saves 706 bytes
riscv64/itx: Reuse horz_16x8 epilog, saves 94 bytes

Edited by Nathan E. Egge