The existing functions could easily be used by just calling them
twice - this would give the following cycle numbers from checkasm:

                   Cortex A7      A8      A9     A53
var2_8x8_c:            7302    5342    5050    4400
var2_8x8_neon:         2645    1612    1932    1715
var2_8x16_c:          14300   10528   10020    8637
var2_8x16_neon:        5127    2695    3217    2651

However, by merging both passes into the same function, we get the
following speedup:

var2_8x8_neon:         2312    1190    1389    1300
var2_8x16_neon:        4862    2130    2293    2422
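For illustration only, the difference between the two approaches can be sketched in plain C. This is not the code from this commit: the function names, the stride parameters, and the assumption that the second plane sits at a fixed column offset (`plane_offset`) within the same rows are all hypothetical. The sketch only shows why one merged loop returns the same result as two separate calls while walking the rows once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: variance-like score of (a - b) over one 8 x h plane. */
int var2_one_plane(const uint8_t *a, int a_stride,
                   const uint8_t *b, int b_stride, int h)
{
    assert(h > 0);
    int sum = 0, sqr = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int d = a[x] - b[x];
            sum += d;
            sqr += d * d;
        }
        a += a_stride;
        b += b_stride;
    }
    /* Sum of squares minus the squared sum divided by the pixel count. */
    return sqr - (int)(((int64_t)sum * sum) / (8 * h));
}

/* Two-pass variant: the single-plane function is called once per plane. */
int var2_two_calls(const uint8_t *a, int a_stride,
                   const uint8_t *b, int b_stride, int h, int plane_offset)
{
    return var2_one_plane(a, a_stride, b, b_stride, h)
         + var2_one_plane(a + plane_offset, a_stride,
                          b + plane_offset, b_stride, h);
}

/* Merged variant: both planes are accumulated in the same row loop.
 * plane_offset is an assumed layout detail, not taken from the commit. */
int var2_merged(const uint8_t *a, int a_stride,
                const uint8_t *b, int b_stride, int h, int plane_offset)
{
    assert(h > 0);
    int sum0 = 0, sqr0 = 0, sum1 = 0, sqr1 = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int d0 = a[x] - b[x];
            int d1 = a[x + plane_offset] - b[x + plane_offset];
            sum0 += d0;
            sqr0 += d0 * d0;
            sum1 += d1;
            sqr1 += d1 * d1;
        }
        a += a_stride;
        b += b_stride;
    }
    return sqr0 - (int)(((int64_t)sum0 * sum0) / (8 * h))
         + sqr1 - (int)(((int64_t)sum1 * sum1) / (8 * h));
}
```

In a SIMD version the merged form would presumably also share loop overhead and pointer updates between the two planes, which is one plausible source of the gain shown above.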
824802ad