Wiener optimizations

Improves overall decoding performance on AVX2-capable systems by around 1-3% depending on content.

wiener_7tap_8bpc_c:     203223.0
wiener_7tap_8bpc_sse2:   33425.1 (previously: 45781.5)
wiener_7tap_8bpc_ssse3:  21980.3 (previously: 30153.3)
wiener_7tap_8bpc_avx2:   12097.5 (previously: 17262.9)

wiener_5tap_8bpc_sse2:   26902.8
wiener_5tap_8bpc_ssse3:  19829.6
wiener_5tap_8bpc_avx2:   10592.6

Reduced cache thrashing also benefits the surrounding code, so the checkasm numbers don't paint the whole picture.
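
For reference, the relative gains implied by the 7-tap figures above can be worked out with a short script. This is only an illustrative sketch: the helper names (`speedup`, `reduction_percent`) are hypothetical and not part of the codebase, and it assumes the checkasm figures are lower-is-better timings.

```python
# checkasm figures copied from the 7-tap table above: (previous, current).
# Assumption: lower is better.
WIENER_7TAP_8BPC = {
    "sse2": (45781.5, 33425.1),
    "ssse3": (30153.3, 21980.3),
    "avx2": (17262.9, 12097.5),
}


def speedup(previous, current):
    """Return how many times faster the new code is than the old code."""
    return previous / current


def reduction_percent(previous, current):
    """Return the percentage by which the measured figure dropped."""
    return (1.0 - current / previous) * 100.0


if __name__ == "__main__":
    for isa, (prev, cur) in WIENER_7TAP_8BPC.items():
        print(f"{isa:<6} {speedup(prev, cur):.2f}x faster, "
              f"{reduction_percent(prev, cur):.1f}% lower")
```

This works out to roughly 1.37x for sse2 and ssse3 and roughly 1.43x for avx2. No previous figures are listed for the 5-tap functions, so they are left out.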

