Reduce memory usage for 8-bit and 4:2:2/4:2:0/4:0:0
So, there are a couple of things here. First of all, in terms of buffers allocated for frame data, it's not that bad, since references are stacked, i.e. each frame thread can only add one new reference (itself). So, assuming 16 frame threads, you have 16+7=23 frame buffers, and 23x12.4MB=285MB.
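As a quick sanity check of those figures (my own sketch, not dav1d code), a 4K 4:2:0 8-bit frame is width x height x 1.5 bytes, which is where the 12.4MB comes from:

```python
# Hypothetical helpers to reproduce the numbers above; the names are mine.
def frame_buffer_mb(width=3840, height=2160):
    # 8-bit 4:2:0: 1 byte/pixel luma + 0.5 bytes/pixel chroma
    return width * height * 1.5 / 1e6

def total_frame_buffers_mb(frame_threads=16, max_refs=7):
    # references are stacked: each frame thread adds at most one new frame
    n_buffers = frame_threads + max_refs  # 16 + 7 = 23
    return n_buffers * frame_buffer_mb()

print(round(frame_buffer_mb(), 1))      # -> 12.4 (MB per frame buffer)
print(round(total_frame_buffers_mb()))  # -> 286 (~285 MB total)
```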
However, we also allocate internal buffers per frame thread. These are always allocated as if we were decoding 4:4:4 10-bit, and cover coefficients (4 bytes per pixel per plane, since for 10-bit the coef type is
int32_t), plus another buffer for palette indices (1 byte per pixel) - each per frame thread, so 13 bytes per pixel.
In addition, we have 16 bytes per block (4x4 pixels worst case) for block data, one byte per plane for transform type, 2 bytes per plane for eob, and 16 bytes for palette values (x2; chroma+luma) = 57 bytes per 4x4 block (worst case), plus some more (I can't recall exactly how much) for the postfilter, but let's assume approximately the same again.
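These per-frame-thread figures can be reproduced like this (again my own sketch; the 3840x2176 padded dimensions are an assumption that makes the numbers line up):

```python
# Assumed padded 4K dimensions (height rounded up); not taken from dav1d source.
W, H = 3840, 2176
pixels = W * H

coef_mb = pixels * 3 * 4 / 1e6  # 3 planes x 4 bytes (int32_t coefs for 10-bit)
pal_mb = pixels * 1 / 1e6       # palette indices, 1 byte per pixel
print(round(coef_mb + pal_mb, 1))  # -> 108.6 (MB, the 13 bytes/pixel buffers)

blocks = (W // 4) * (H // 4)    # worst case: everything split into 4x4 blocks
# 16 (block data) + 3x1 (txtp) + 3x2 (eob) + 2x16 (palette values) = 57
per_block = 16 + 3 * 1 + 3 * 2 + 2 * 16
print(round(blocks * per_block / 1e6, 2))  # -> 29.77 (MB of block-level data)
```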
So, for 4K, we're talking another 108.6+29.76+~30=~170MB per frame thread, so for 16 frame threads, that's 2.7GB or thereabouts, which is probably why it blows up.
These allocations can to some extent be simplified, e.g. there's no reason to allocate assuming 4:4:4 10-bit, and that could shrink the coefficient buffer by up to 4x, which would bring it into the neighbourhood of 80MB per frame thread, i.e. roughly halving overall memory consumption. If people are interested in that, it's not super-hard, just make sure not to introduce regressions in the fuzzing, i.e. we need to re-allocate when the bitdepth or subsampling changes.
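To illustrate the idea (a hypothetical sketch of the suggested fix, not dav1d's actual allocator): size the coefficient buffer for the actual bitdepth and subsampling, and key the allocation on those parameters so a change triggers a re-allocation.

```python
def coef_bytes_per_pixel(bitdepth, ss_hor, ss_ver):
    """Coefficient bytes per luma pixel for a given bitdepth/subsampling."""
    coef_size = 2 if bitdepth == 8 else 4       # int16_t vs int32_t coefs
    chroma = 2 / ((1 + ss_hor) * (1 + ss_ver))  # two subsampled chroma planes
    return coef_size * (1 + chroma)

# 4:4:4 10-bit (current worst-case allocation) vs 4:2:0 8-bit:
worst = coef_bytes_per_pixel(10, 0, 0)  # 12 bytes/pixel
best = coef_bytes_per_pixel(8, 1, 1)    # 3 bytes/pixel
print(worst / best)                     # -> 4.0, the "up to 4x" above

class CoefBuf:
    """Re-allocate only when the sequence parameters actually change."""
    def __init__(self):
        self.params = None
        self.buf = None

    def ensure(self, pixels, bitdepth, ss_hor, ss_ver):
        params = (pixels, bitdepth, ss_hor, ss_ver)
        if params != self.params:  # bitdepth or subsampling changed
            size = int(pixels * coef_bytes_per_pixel(bitdepth, ss_hor, ss_ver))
            self.buf = bytearray(size)
            self.params = params
        return self.buf
```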