Arpad Panyik authored
Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters, which fit well with the 4-element SDOT instruction.

Benchmarks show a 7-29% FPS increase, depending on the input video and
the CPU used.

This patch will increase the .text by around 6.5 KiB.

Performance depends heavily on the SDOT to MLA throughput ratio, which
can be seen in the vertical filter cases. Small cores are also affected
by the TBL execution latencies.

Relative performance to the C reference on some CPUs:

                             A76     A78      X1     A55
regular w4 hv neon:        5.52x   5.78x  10.75x   8.27x
regular w4 hv dotprod:     7.94x   8.49x  16.84x   8.09x
sharp w4 hv neon:          5.27x   5.22x   9.06x   7.87x
sharp w4 hv dotprod:       6.61x   6.73x  12.64x   6.89x

regular w8 hv neon:        1.95x   2.19x   2.56x   3.16x
regular w8 hv dotprod:     3.23x   2.81x   3.20x   3.26x
sharp w8 hv neon:          1.61x   1.79x   2.05x   2.72x
sharp w8 hv dotprod:       2.72x   2.29x   2.66x   2.76x

regular w16 hv neon:       1.63x   2.04x   2.16x   2.73x
regular w16 hv dotprod:    2.72x   2.57x   2.67x   2.80x
sharp w16 hv neon:         1.33x   1.67x   1.74x   2.34x
sharp w16 hv dotprod:      2.31x   2.14x   2.26x   2.39x

regular w32 hv neon:       1.48x   1.92x   1.94x   2.51x
regular w32 hv dotprod:    2.49x   2.40x   2.33x   2.58x
sharp w32 hv neon:         1.21x   1.56x   1.53x   2.14x
sharp w32 hv dotprod:      2.12x   2.02x   2.00x   2.22x

regular w64 hv neon:       1.42x   1.87x   1.85x   2.40x
regular w64 hv dotprod:    2.40x   2.32x   2.21x   2.46x
sharp w64 hv neon:         1.16x   1.52x   1.46x   2.04x
sharp w64 hv dotprod:      2.02x   1.96x   1.90x   2.11x

regular w128 hv neon:      1.39x   1.84x   1.80x   2.27x
regular w128 hv dotprod:   2.33x   2.28x   2.14x   2.35x
sharp w128 hv neon:        1.14x   1.50x   1.42x   1.94x
sharp w128 hv dotprod:     1.98x   1.93x   1.84x   2.03x

regular w8 h neon:         2.61x   3.20x   3.51x   3.55x
regular w8 h dotprod:      4.43x   5.17x   6.26x   4.30x
sharp w8 h neon:           2.01x   2.80x   2.89x   3.12x
sharp w8 h dotprod:        4.42x   5.16x   6.27x   4.28x

regular w16 h neon:        2.17x   3.13x   2.92x   3.35x
regular w16 h dotprod:     4.38x   4.27x   4.53x   3.90x
sharp w16 h neon:          1.74x   2.65x   2.48x   2.92x
sharp w16 h dotprod:       4.33x   4.27x   4.53x   3.91x

regular w64 h neon:        1.92x   2.82x   2.39x   2.96x
regular w64 h dotprod:     3.68x   3.60x   3.40x   3.18x
sharp w64 h neon:          1.47x   2.33x   2.05x   2.54x
sharp w64 h dotprod:       3.68x   3.60x   3.40x   3.17x

regular w4 v neon:         5.39x   7.38x  10.27x  11.41x
regular w4 v dotprod:      9.46x  14.15x  18.72x   9.84x
sharp w4 v neon:           4.51x   6.39x   8.17x  10.70x
sharp w4 v dotprod:        9.35x  14.20x  18.63x   9.78x

regular w16 v neon:        3.03x   4.03x   4.65x   6.28x
regular w16 v dotprod:     4.64x   3.75x   4.78x   3.89x
sharp w16 v neon:          2.29x   3.09x   3.44x   5.52x
sharp w16 v dotprod:       4.62x   3.74x   4.77x   3.89x

regular w64 v neon:        2.17x   3.14x   3.19x   4.46x
regular w64 v dotprod:     3.43x   3.00x   3.31x   2.74x
sharp w64 v neon:          1.61x   2.42x   2.34x   3.89x
sharp w64 v dotprod:       3.38x   3.00x   3.29x   2.73x
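
Why 4- and 8-tap filters fit the 4-element SDOT instruction can be
illustrated with a scalar C model. This is only a sketch, not the
assembly in this patch: the helper names (sdot_lane, filter8_sdot,
filter8_ref) and the tap values are made up for illustration. The
128 offset used to move unsigned pixels into the signed range that
SDOT expects is one common approach and is an assumption here, not
something this message states.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT: four signed 8-bit products
 * are summed into a 32-bit accumulator. */
int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical helper: one 8-tap output sample from two SDOT steps.
 * A 4-tap filter would need only a single step. */
int32_t filter8_sdot(const uint8_t src[8], const int8_t taps[8])
{
    int8_t s[8];
    int32_t tap_sum = 0;

    for (int i = 0; i < 8; i++) {
        /* Assumed range fix: map 0..255 to -128..127. */
        s[i] = (int8_t)(src[i] - 128);
        tap_sum += taps[i];
    }

    /* Start from 128 * sum(taps) to undo the offset applied above. */
    int32_t acc = 128 * tap_sum;
    acc = sdot_lane(acc, s, taps);         /* taps 0..3 */
    acc = sdot_lane(acc, s + 4, taps + 4); /* taps 4..7 */
    return acc;
}

/* Plain multiply-accumulate reference, one tap at a time. */
int32_t filter8_ref(const uint8_t src[8], const int8_t taps[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)src[i] * (int32_t)taps[i];
    return acc;
}

/* Usage sketch: both paths must agree on arbitrary pixel values. */
void self_check(void)
{
    /* Illustrative taps summing to 128, not taken from the codec. */
    const int8_t taps[8] = { -1, 3, -7, 69, 69, -7, 3, -1 };
    const uint8_t px[8] = { 0, 255, 17, 128, 200, 64, 255, 1 };
    assert(filter8_sdot(px, taps) == filter8_ref(px, taps));
}
```

In this model an 8-tap filter costs two dot-product steps per output
sample, where the reference loop performs eight separate
multiply-accumulates. A 6-tap filter would leave two of the eight
lanes unused.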