Arpad Panyik authored
Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters, which fit well with the 4-element SDOT instruction.

Benchmarks show an FPS increase of up to 7-29%, depending on the input
video and the CPU used.

This patch will increase the .text by around 6.5 KiB.

Performance highly depends on the SDOT and MLA throughput ratio, as can
be seen in the vertical filter cases. Small cores are also affected by
the TBL execution latencies.

Relative performance to the C reference on some CPUs:

                         A76      A78       X1      A55
regular w4 hv neon:     5.52x    5.78x   10.75x    8.27x
regular w4 hv dotprod:  7.94x    8.49x   16.84x    8.09x
sharp   w4 hv neon:     5.27x    5.22x    9.06x    7.87x
sharp   w4 hv dotprod:  6.61x    6.73x   12.64x    6.89x
regular w8 hv neon:     1.95x    2.19x   ...
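
To illustrate why 4- and 8-tap filters map well onto the 4-element SDOT
instruction, here is a minimal scalar sketch in C. It is not the code from
this patch: the helper names (sdot_lane, filter8_sdot, filter8_ref) and the
coefficient values are hypothetical, and the -128 bias used to fit unsigned
pixels into signed SDOT operands is an assumption about one possible
approach. One SDOT lane accumulates four signed 8-bit products into a
32-bit sum, so an 8-tap filter needs two such steps per output pixel and a
4-tap filter needs one.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a single 32-bit lane of SDOT:
 * acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3], all signed 8-bit. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical 8-tap filter for one output pixel, built from two SDOT
 * steps. Pixels are unsigned, SDOT operands are signed, so each pixel is
 * biased by -128 and the accumulator starts at 128 * sum(coef) to
 * compensate (an assumption for this sketch). */
static int32_t filter8_sdot(const uint8_t src[8], const int8_t coef[8])
{
    int8_t s[8];
    int32_t coef_sum = 0;

    for (int i = 0; i < 8; i++) {
        s[i] = (int8_t)(src[i] - 128); /* range becomes -128..127 */
        coef_sum += coef[i];
    }

    int32_t acc = 128 * coef_sum;          /* undo the -128 bias */
    acc = sdot_lane(acc, s, coef);         /* taps 0..3 */
    acc = sdot_lane(acc, s + 4, coef + 4); /* taps 4..7 */
    return acc;
}

/* Plain multiply-accumulate reference, one product per tap. */
static int32_t filter8_ref(const uint8_t src[8], const int8_t coef[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)src[i] * (int32_t)coef[i];
    return acc;
}

/* Self-check: both paths must agree on an illustrative input. */
static void filter8_self_check(void)
{
    const uint8_t px[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    const int8_t  cf[8] = { -1, 3, -7, 127, 8, -3, 1, 0 }; /* made-up taps */
    assert(filter8_sdot(px, cf) == filter8_ref(px, cf));
}
```

A 6-tap filter does not fill two 4-element groups evenly, which is
consistent with only the HV vertical passes having 6-tap specialisations.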