AArch64: New method for calculating sgr table

For the 3x3 part, double the width of the vertical loop. This is done to
provide more latency in the new sgr calculation.

Initial (master):
Cortex                    A53        A55        A72        A73        A76   Apple M1
sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1   185420.7      472.2
sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6   128311.3      332.9
sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8   281956.0      711.2

Current:
sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3   169614.4      432.7
sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8   120481.3      319.2
sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7   258977.7      659.3

Include a minor improvement that gets rid of a dup instruction.