    AArch64: Add DotProd support for convolutions · 9d77b633
    Arpad Panyik authored
    Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
    Only horizontal-vertical (HV) convolutions have 6-tap specialisations
    of their vertical passes. All other convolutions are 4- or 8-tap
    filters which fit well with the 4-element SDOT instruction.
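
    The fit between 4- or 8-tap filters and the 4-element SDOT can be
    illustrated with a scalar C model. This is a minimal sketch, not the
    assembly from this patch: sdot_lane, filter8_sdot, filter8_ref and
    filter8_check are hypothetical names, the tap values are made up for
    the example, and the -128 bias used to bring unsigned pixels into
    SDOT's signed 8-bit input range is an assumption about one possible
    approach.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT: four signed 8-bit products
 * are summed into a 32-bit accumulator. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical 8-tap filter for one output sample using two SDOT steps.
 * Assumption: pixels are biased by -128 so they fit the signed inputs,
 * and the bias is undone by starting from 128 * sum(taps). */
static int32_t filter8_sdot(const uint8_t px[8], const int8_t taps[8])
{
    int8_t s[8];
    int32_t tap_sum = 0;
    for (int i = 0; i < 8; i++) {
        s[i] = (int8_t)(px[i] - 128);   /* 0..255 -> -128..127 */
        tap_sum += taps[i];
    }
    int32_t acc = 128 * tap_sum;            /* bias correction */
    acc = sdot_lane(acc, s, taps);          /* taps 0..3 */
    acc = sdot_lane(acc, s + 4, taps + 4);  /* taps 4..7 */
    return acc;
}

/* Plain multiply-accumulate reference, one product per tap. */
static int32_t filter8_ref(const uint8_t px[8], const int8_t taps[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)px[i] * (int32_t)taps[i];
    return acc;
}

/* Checks that both formulations agree on sample data. */
static void filter8_check(void)
{
    const uint8_t px[8]  = {0, 255, 17, 128, 200, 3, 99, 254};
    const int8_t taps[8] = {-1, 3, -7, 127, 8, -3, 1, 0}; /* made-up taps */
    assert(filter8_sdot(px, taps) == filter8_ref(px, taps));
}
```

    In this model an 8-tap filter needs two SDOT steps per output sample
    and a 4-tap filter needs one, whereas the reference loop needs one
    multiply-accumulate per tap.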
    
    Benchmarks show an FPS increase of 7-29%, depending on the input video
    and the CPU used.
    
    This patch increases the .text section by around 6.5 KiB.
    
    Performance depends heavily on the ratio of SDOT to MLA throughput;
    this is visible in the vertical filter cases. Small cores are also
    affected by TBL execution latencies.
    
    Performance relative to the C reference on some CPUs:
    
                              A76      A78       X1      A55
    regular w4 hv neon:      5.52x    5.78x   10.75x    8.27x
    regular w4 hv dotprod:   7.94x    8.49x   16.84x    8.09x
    sharp w4 hv neon:        5.27x    5.22x    9.06x    7.87x
    sharp w4 hv dotprod:     6.61x    6.73x   12.64x    6.89x
    
    regular w8 hv neon:      1.95x    2.19x    2.56x    3.16x
    regular w8 hv dotprod:   3.23x    2.81x    3.20x    3.26x
    sharp w8 hv neon:        1.61x    1.79x    2.05x    2.72x
    sharp w8 hv dotprod:     2.72x    2.29x    2.66x    2.76x
    
    regular w16 hv neon:     1.63x    2.04x    2.16x    2.73x
    regular w16 hv dotprod:  2.72x    2.57x    2.67x    2.80x
    sharp w16 hv neon:       1.33x    1.67x    1.74x    2.34x
    sharp w16 hv dotprod:    2.31x    2.14x    2.26x    2.39x
    
    regular w32 hv neon:     1.48x    1.92x    1.94x    2.51x
    regular w32 hv dotprod:  2.49x    2.40x    2.33x    2.58x
    sharp w32 hv neon:       1.21x    1.56x    1.53x    2.14x
    sharp w32 hv dotprod:    2.12x    2.02x    2.00x    2.22x
    
    regular w64 hv neon:     1.42x    1.87x    1.85x    2.40x
    regular w64 hv dotprod:  2.40x    2.32x    2.21x    2.46x
    sharp w64 hv neon:       1.16x    1.52x    1.46x    2.04x
    sharp w64 hv dotprod:    2.02x    1.96x    1.90x    2.11x
    
    regular w128 hv neon:    1.39x    1.84x    1.80x    2.27x
    regular w128 hv dotprod: 2.33x    2.28x    2.14x    2.35x
    sharp w128 hv neon:      1.14x    1.50x    1.42x    1.94x
    sharp w128 hv dotprod:   1.98x    1.93x    1.84x    2.03x
    
    regular w8 h neon:       2.61x    3.20x    3.51x    3.55x
    regular w8 h dotprod:    4.43x    5.17x    6.26x    4.30x
    sharp w8 h neon:         2.01x    2.80x    2.89x    3.12x
    sharp w8 h dotprod:      4.42x    5.16x    6.27x    4.28x
    
    regular w16 h neon:      2.17x    3.13x    2.92x    3.35x
    regular w16 h dotprod:   4.38x    4.27x    4.53x    3.90x
    sharp w16 h neon:        1.74x    2.65x    2.48x    2.92x
    sharp w16 h dotprod:     4.33x    4.27x    4.53x    3.91x
    
    regular w64 h neon:      1.92x    2.82x    2.39x    2.96x
    regular w64 h dotprod:   3.68x    3.60x    3.40x    3.18x
    sharp w64 h neon:        1.47x    2.33x    2.05x    2.54x
    sharp w64 h dotprod:     3.68x    3.60x    3.40x    3.17x
    
    regular w4 v neon:       5.39x    7.38x   10.27x   11.41x
    regular w4 v dotprod:    9.46x   14.15x   18.72x    9.84x
    sharp w4 v neon:         4.51x    6.39x    8.17x   10.70x
    sharp w4 v dotprod:      9.35x   14.20x   18.63x    9.78x
    
    regular w16 v neon:      3.03x    4.03x    4.65x    6.28x
    regular w16 v dotprod:   4.64x    3.75x    4.78x    3.89x
    sharp w16 v neon:        2.29x    3.09x    3.44x    5.52x
    sharp w16 v dotprod:     4.62x    3.74x    4.77x    3.89x
    
    regular w64 v neon:      2.17x    3.14x    3.19x    4.46x
    regular w64 v dotprod:   3.43x    3.00x    3.31x    2.74x
    sharp w64 v neon:        1.61x    2.42x    2.34x    3.89x
    sharp w64 v dotprod:     3.38x    3.00x    3.29x    2.73x