Split MC blend

The mstride == 0, mstride == 1, and mstride == w cases are very different
from each other, and splitting them into separate functions makes it easier
to optimize them.

Also add some further optimizations to the AVX2 asm that became possible
after this change.
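
As a rough illustration of why the split helps, here is a minimal C sketch for
8-bit pixels. It is not the actual code from this change: the function names
(blend_generic, blend_2d, blend_cols, blend_rows) are hypothetical, and the
mask layout for each mstride value (a full w x h mask for mstride == w, one
mask row reused on every line for mstride == 0, one weight per line for
mstride == 1) as well as the 6-bit weight range are assumptions inferred from
the stride values alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Weighted average of a and b; m is assumed to be a 6-bit weight in [0, 64]. */
static inline uint8_t blend_px(int a, int b, int m) {
    assert(m >= 0 && m <= 64);
    return (uint8_t)((a * (64 - m) + b * m + 32) >> 6);
}

/* Generic form: mstride picks the mask layout at run time, so the
 * inner loop carries a branch that none of the three cases needs. */
static void blend_generic(uint8_t *dst, ptrdiff_t dst_stride,
                          const uint8_t *tmp, int w, int h,
                          const uint8_t *mask, ptrdiff_t mstride) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[mstride == 1 ? 0 : x]);
        dst += dst_stride;
        tmp += w;
        mask += mstride;
    }
}

/* mstride == w: a full 2-D mask, one weight per pixel. */
static void blend_2d(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                     int w, int h, const uint8_t *mask) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst += dst_stride;
        tmp += w;
        mask += w;
    }
}

/* mstride == 0: the same mask row on every line, so the weight depends
 * only on x and can stay in registers across rows. */
static void blend_cols(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                       int w, int h, const uint8_t *mask) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst += dst_stride;
        tmp += w;
    }
}

/* mstride == 1: a single weight per line, broadcast across the row. */
static void blend_rows(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                       int w, int h, const uint8_t *mask) {
    for (int y = 0; y < h; y++) {
        const int m = mask[y];
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], m);
        dst += dst_stride;
        tmp += w;
    }
}
```

Under these assumptions each specialized function produces the same output as
blend_generic called with the matching mstride, while exposing a loop shape
that is simpler to vectorize by hand.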

CI pipeline: 9 jobs for master in 3 minutes and 5 seconds (queued for 1 second)

Stage  Status  Job ID   Name                          Tags                   Duration
Build  passed  #229387  build-debian                  amd64 debian           00:00:31
Build  passed  #229391  build-debian-aarch64          debian aarch64         00:01:50
Build  passed  #229392  build-debian-aarch64-clang-5  debian clang5 aarch64  00:01:26
Build  passed  #229388  build-debian-static           amd64 debian           00:00:31
Build  passed  #229394  build-debian-werror           debian aarch64         00:01:00
Build  passed  #229393  build-macos                   macos                  00:00:25
Build  passed  #229389  build-win32                   win32                  00:00:28
Build  passed  #229390  build-win64                   win64                  00:00:33
Test   passed  #229395  test-debian                   amd64 debian           00:00:36