Draft: aarch64: Test implementing sgr_x_by_x[] with fdiv
The test implementation is in sgr_box5_vert_neon. It may be possible to tweak things a little further: it uses 32-bit vector elements throughout, and we could narrow the values first, as was done before, but the float steps still need 32-bit quantities. Overall, this does not seem to be beneficial compared to the current implementation; in the numbers below, every tested core gets slower, by roughly 6 to 13%.
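For readers unfamiliar with the idea, here is a minimal scalar sketch in C of the two approaches being compared: a table lookup versus a float division. It is not the NEON code from this change. The names `x_by_x_table`, `init_x_by_x_table`, `x_by_x_lookup`, `x_by_x_fdiv` and `count_mismatches` are hypothetical, and the formula used to fill the table (256 / (z + 1), rounded to nearest and clamped to 255) is an assumption made for illustration, not a copy of the real sgr_x_by_x[] contents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the lookup table. ASSUMPTION: entry z is
 * 256 / (z + 1), rounded to nearest and clamped to 255. The real table
 * may differ in individual entries. */
static uint8_t x_by_x_table[256];

static void init_x_by_x_table(void) {
    for (unsigned z = 0; z < 256; z++) {
        /* Integer round-to-nearest of 256 / (z + 1). */
        unsigned v = (256 + (z + 1) / 2) / (z + 1);
        x_by_x_table[z] = (uint8_t)(v > 255 ? 255 : v);
    }
}

/* Table path: clamp the index, then load. */
static unsigned x_by_x_lookup(unsigned z) {
    assert(x_by_x_table[0] == 255); /* table must be initialised first */
    return x_by_x_table[z > 255 ? 255 : z];
}

/* Float path: scalar analogue of doing the division with fdiv on
 * 32-bit float lanes, then converting back to an integer. */
static unsigned x_by_x_fdiv(unsigned z) {
    if (z > 255) z = 255;
    float q = 256.0f / (float)(z + 1);
    unsigned v = (unsigned)(q + 0.5f); /* round to nearest */
    return v > 255 ? 255 : v;
}

/* Count the indices where the two paths disagree. */
static int count_mismatches(void) {
    int n = 0;
    for (unsigned z = 0; z < 256; z++)
        n += x_by_x_lookup(z) != x_by_x_fdiv(z);
    return n;
}
```

Call `init_x_by_x_table()` once, then `count_mismatches()` reports how many of the 256 indices differ between the two paths. Under the assumed formula the float path needs an int-to-float conversion, the division and a conversion back, which is why the values have to be held as 32-bit quantities for those steps.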
Before:              Cortex A53       A55       A72       A73       A76  Apple M3
sgr_5x5_8bpc_neon:     258319.2  254398.7  195143.7  199321.0  117959.0     250.5
After:
sgr_5x5_8bpc_neon:     286970.0  275679.4  214980.5  224968.7  129278.1     266.8
Edited by Martin Storsjö