Martin Storsjö authored
Use a post-increment with a register on the last increment, avoiding
a separate increment. Avoid processing the last 8 pixels in the w32
case when we only output 24 pixels.

Before:
ARM32                 Cortex A7       A8       A9      A53      A72      A73
blend_v_w4_8bpc_neon:     450.4    574.7    538.7    374.6    199.3    260.5
blend_v_w8_8bpc_neon:     559.6    351.3    552.5    357.6    214.8    204.3
blend_v_w16_8bpc_neon:    926.3    511.6    787.9    593.0    271.0    246.8
blend_v_w32_8bpc_neon:   1482.5    917.0   1149.5    991.9    354.0    368.9
ARM64
blend_v_w4_8bpc_neon:                                351.1    200.0    224.1
blend_v_w8_8bpc_neon:                                333.0    212.4    203.8
blend_v_w16_8bpc_neon:                               495.2    302.0    247.0
blend_v_w32_8bpc_neon:                               840.0    557.8    514.0

After:
ARM32
blend_v_w4_8bpc_neon:     435.5    575.8    537.6    356.2    198.3    259.5
blend_v_w8_8bpc_neon:     545.2    347.9    553.5    339.1    207.8    204.2
blend_v_w16_8bpc_neon:    913.7    511.0    788.1    573.7    275.4    243.3
blend_v_w32_8bpc_neon:   1445.3    951.2   1079.1    920.4    352.2    361.6
ARM64
blend_v_w4_8bpc_neon:                                333.0    191.3    225.9
blend_v_w8_8bpc_neon:                                314.9    199.3    203.5
blend_v_w16_8bpc_neon:                               476.9    301.3    241.1
blend_v_w32_8bpc_neon:                               766.9    432.8    416.9
52e9b435