x86: Rewrite wiener SSE2/SSSE3/AVX2 asm
The previous implementation did two separate passes in the horizontal and vertical directions, with the intermediate values being stored in a buffer on the stack. This caused bad cache thrashing. By interleaving the horizontal and vertical passes in combination with a ring buffer for storing only a few rows at a time the performance is improved by a significant amount. Also split the function into 7-tap and 5-tap versions. The latter is faster and fairly common (always for chroma, sometimes for luma).
Showing with 1732 additions and 867 deletions