    AArch64: Add DotProd support for convolutions · 9d77b633
    Arpad Panyik authored
    Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
    Only horizontal-vertical (HV) convolutions have 6-tap specialisations
    of their vertical passes. All other convolutions are 4- or 8-tap
    filters which fit well with the 4-element SDOT instruction.
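
    As a rough illustration of why 4- and 8-tap filters map well onto a
    4-element dot product, here is a scalar C sketch. It is not the code
    from this patch: sdot_lane and filter_taps are hypothetical names, the
    inputs are assumed to be signed 8-bit already, and only a single
    32-bit lane is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT: accumulate the dot product
 * of four signed 8-bit pairs into a 32-bit accumulator. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical helper: one output sample of an n-tap FIR filter,
 * computed in groups of four taps. A 4-tap filter needs one
 * SDOT-style step per output sample, an 8-tap filter needs two. */
static int32_t filter_taps(const int8_t *src, const int8_t *coef, int ntaps)
{
    /* In this sketch a 6-tap filter would have to be padded to 8 taps. */
    assert(ntaps % 4 == 0);

    int32_t acc = 0;
    for (int t = 0; t < ntaps; t += 4)
        acc = sdot_lane(acc, src + t, coef + t);
    return acc;
}
```

    Calling filter_taps(src, coef, 8) performs two accumulation steps per
    output sample, and filter_taps(src, coef, 4) performs one.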
    
    Benchmarks show an FPS increase of 7-29%, depending on the input video
    and the CPU used.
    
    This patch increases the .text size by around 6.5 KiB.
    
    Performance depends heavily on the ratio of SDOT to MLA throughput, as
    can be seen in the vertical filter cases. Small cores are also
    affected by TBL execution latencies.
    
    Performance relative to the C reference on some CPUs:
    
                              A76      A78       X1      A55
    regular w4 hv neon:      5.52x    5.78x   10.75x    8.27x
    regular w4 hv dotprod:   7.94x    8.49x   16.84x    8.09x
    sharp w4 hv neon:        5.27x    5.22x    9.06x    7.87x
    sharp w4 hv dotprod:     6.61x    6.73x   12.64x    6.89x
    
    regular w8 hv neon:      1.95x    2.19x    ...