  • x86: Rewrite SGR AVX2 asm · fe2bb774
    Henrik Gramner authored
    The previous implementation did multiple passes in the horizontal
    and vertical directions, with the intermediate values being stored
    in buffers on the stack. This caused bad cache thrashing.
    
    Interleaving all the different passes and using a ring buffer that
    stores only a few rows at a time improves performance
    significantly.
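
    As a rough illustration of the ring-buffer idea (not the actual asm),
    the C sketch below computes a 3x3 box sum while keeping only three
    rows of horizontal sums alive. The names hsum3, box3_ring, RING_ROWS
    and MAX_W, the 3-tap kernel and the edge clamping are all assumptions
    made for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { RING_ROWS = 3, MAX_W = 64 }; /* hypothetical sizes for the sketch */

/* Horizontal pass: 3-tap sum of one row, with left/right edges clamped. */
static void hsum3(const uint8_t *src, int w, uint16_t *dst) {
    for (int x = 0; x < w; x++) {
        int l = x > 0 ? x - 1 : 0;
        int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = (uint16_t)(src[l] + src[x] + src[r]);
    }
}

/* 3x3 box sum with clamped edges. The horizontal and vertical passes are
 * interleaved row by row, so the intermediate data is three rows in a
 * ring buffer rather than a full h-row buffer. dst has a stride of w. */
void box3_ring(const uint8_t *src, int w, int h, int stride, uint16_t *dst) {
    uint16_t ring[RING_ROWS][MAX_W];
    assert(w > 0 && w <= MAX_W && h > 0);

    hsum3(src, w, ring[0]);
    int newest = 0; /* highest source row whose sums are in the ring */

    for (int y = 0; y < h; y++) {
        int prev = y > 0 ? y - 1 : 0;
        int next = y < h - 1 ? y + 1 : h - 1;

        /* Produce the next row of horizontal sums just before it is
         * needed; it overwrites row y - 2, which is no longer used. */
        if (next > newest) {
            hsum3(src + next * stride, w, ring[next % RING_ROWS]);
            newest = next;
        }

        const uint16_t *r0 = ring[prev % RING_ROWS];
        const uint16_t *r1 = ring[y % RING_ROWS];
        const uint16_t *r2 = ring[next % RING_ROWS];

        /* Vertical pass for this output row only. */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = (uint16_t)(r0[x] + r1[x] + r2[x]);
    }
}
```

    The working set here is three rows regardless of image height, which
    is the property that keeps the intermediate data in cache.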
    
    Also slightly speed up neighbor calculations by packing the a and b
    values into a single 32-bit unsigned integer which allows calculations
    on both values simultaneously.
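
    A minimal C sketch of the packing trick follows. The 16/16 bit split,
    the 3/4/3 weights and the helper names (pack_ab, unpack_a, unpack_b,
    weighted3_packed) are assumptions for illustration; the asm may use a
    different field layout and different weights. The trick is valid only
    while each field's weighted sum fits in its own bit range, so that no
    carry leaks from a into b.

```c
#include <assert.h>
#include <stdint.h>

enum { A_BITS = 16 }; /* hypothetical split: a in the low bits, b above */

/* Pack a and b into one 32-bit unsigned integer. */
static inline uint32_t pack_ab(uint32_t a, uint32_t b) {
    assert(a < (1u << A_BITS) && b < (1u << (32 - A_BITS)));
    return (b << A_BITS) | a;
}

static inline uint32_t unpack_a(uint32_t ab) { return ab & ((1u << A_BITS) - 1); }
static inline uint32_t unpack_b(uint32_t ab) { return ab >> A_BITS; }

/* Weighted sum of three neighboring packed values: 3*l + 4*c + 3*r.
 * One multiply-add sequence updates a and b at the same time, provided
 * neither field's result reaches 1 << 16. */
uint32_t weighted3_packed(uint32_t l, uint32_t c, uint32_t r) {
    return 3 * (l + r) + 4 * c;
}
```

    With 32-bit SIMD lanes, the same idea halves the number of neighbor
    additions, since each lane carries both values.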