x86: Rewrite SGR AVX2 asm
The previous implementation did multiple passes in the horizontal and vertical directions, with the intermediate values being stored in buffers on the stack. This caused bad cache thrashing.

By interleaving all the different passes, in combination with a ring buffer that stores only a few rows at a time, performance is improved significantly.

Also slightly speed up neighbor calculations by packing the a and b values into a single 32-bit unsigned integer, which allows calculations on both values simultaneously.
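The packing trick can be illustrated with a minimal scalar C sketch. It is not the code from this commit: the field layout (b in the upper 16 bits, a in the lower 16 bits), the value ranges, the 4/3 neighbor weights, and the names `pack_ab`, `neighbor_sum_packed`, `unpack_a` and `unpack_b` are all assumptions chosen for illustration. The only point is that one 32-bit addition or multiplication updates both fields at once, as long as neither field can overflow into the other.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: b in bits 16..31, a in bits 0..15.
 * Assumed ranges: a <= 255 and b <= 2047, so a weighted sum with a total
 * weight of 32 stays below 65536 in each field and never carries across. */
static inline uint32_t pack_ab(uint32_t a, uint32_t b) {
    assert(a <= 255 && b <= 2047);
    return (b << 16) | a;
}

/* Weighted 3x3 neighbor sum on packed values. Illustrative weights:
 * 4 for the center and the four edge neighbors, 3 for the four diagonal
 * neighbors (total weight 32). `ab` points at the center element and
 * `stride` is the row pitch in elements. */
static uint32_t neighbor_sum_packed(const uint32_t *ab, int stride) {
    const uint32_t cross = ab[0] + ab[-1] + ab[1] +
                           ab[-stride] + ab[stride];
    const uint32_t diag  = ab[-stride - 1] + ab[-stride + 1] +
                           ab[ stride - 1] + ab[ stride + 1];
    /* Each arithmetic operation here acts on a and b simultaneously. */
    return 4 * cross + 3 * diag;
}

/* Split the packed result back into its two sums. */
static inline uint32_t unpack_a(uint32_t s) { return s & 0xffff; }
static inline uint32_t unpack_b(uint32_t s) { return s >> 16; }
```

In a SIMD version the same idea would apply per 32-bit lane, so the number of additions and multiplications for the neighbor step is roughly halved compared with handling a and b in separate registers.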
parent
c290c02e
- mentioned in merge request !1545 (merged)
- mentioned in commit c121b831