quant_4x4x4: quantize one 8x8 block at a time
This reduces call overhead and lets us use less branchy code for zigzag, dequant, decimate, and so on. Reorganize and optimize much of macroblock_encode using this new function: roughly 1-2% faster overall.

Includes NEON and x86 versions of the new function. Larger merged functions like this will also make wider SIMD, such as AVX2, more effective.
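As a rough scalar sketch of the idea: quantize all four 4x4 sub-blocks of an 8x8 region in one call and return a per-block nonzero mask, so the caller can skip zigzag/dequant/decimate for empty blocks without branching per coefficient. The types, return-mask convention, and bias/shift arithmetic below are illustrative assumptions, not the actual x264 implementation:

```c
#include <stdint.h>
#include <stdlib.h>

typedef int16_t  dctcoef;
typedef uint16_t udctcoef;

/* Hypothetical scalar model of quant_4x4x4: quantize four 4x4 DCT
 * blocks (one 8x8 region) at once.  Returns a 4-bit mask with bit b
 * set iff block b has any nonzero quantized coefficient, letting the
 * caller handle all four blocks with a single table-driven dispatch
 * instead of one branchy path per block. */
static int quant_4x4x4(dctcoef dct[4][16],
                       const udctcoef mf[16],
                       const udctcoef bias[16])
{
    int nz_mask = 0;
    for (int b = 0; b < 4; b++) {
        int nz = 0;
        for (int i = 0; i < 16; i++) {
            int coef = dct[b][i];
            /* assumed fixed-point quant: (|coef|*mf + bias) >> 16,
             * with the sign restored afterwards */
            int abs_q = (abs(coef) * mf[i] + bias[i]) >> 16;
            dct[b][i] = (dctcoef)(coef < 0 ? -abs_q : abs_q);
            nz |= abs_q;
        }
        nz_mask |= (nz != 0) << b;
    }
    return nz_mask;
}
```

Merging the four blocks into one function body like this is also what makes wider SIMD pay off: a 256-bit AVX2 lane can hold two whole 4x4 blocks of 16-bit coefficients at once.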