Changes

Ronald S. Bultje · a271c127
--- a/task-list.md
+++ b/task-list.md
@@ -14,7 +14,7 @@ SIMD:
 Multi-threading:
 - film-grain threading (@psilokos; !1371);
 - in first-pass of frame threading with tile threading enabled, it may make sense (assuming no temporal interference from ref_mvs or seg_id) to first parse the tile marked as the one used to update the output CDF, since that would unblock the subsequent thread's pass 1. This is only true if use_ref_mvs=0 and segmentation.temporal_update=0;
-- by not adding invisible frames to `out_delayed[]` and/or growing it so it can be bigger than the number of frame threads (and thus making the indexing between `out_delayed[]` and the actual frame thread doing the decoding independent), we could grow concurrency and scalability on typical sequences with frame-multithreading enabled.
+- show_existing_frame will be placed in the frame output queue as something keeping a frame thread busy, meaning for such cases, the frame thread will momentarily stall. This is partially required to prevent overflows of the output queue, or growing it to possibly infinite size on garbage input. But for the regular use case, we can make the output buffer queue twice as big, so that each invisible frame can have one matching show_existing_frame, allowing all frame-threads to be active for the worst-"real"-case while still never overflowing on pathological conditions.

 Removing redundancies:
 - the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.
@@ -25,6 +25,5 @@ Other speed optimizations:
 Cleanups:
 - palette buffers are always 16-bit, even if content is 8-bit (remaining item in #257);
 - lfmask and l/a ctx zero can be done in tile instead of frame context for better distribution.
-- show_existing_frame will be placed in the frame output queue as something keeping a frame thread busy, meaning for such cases, the frame thread will momentarily stall. This is partially required to prevent overflows of the output queue, or growing it to possibly infinite size on garbage input. But for the regular use case, we can make the output buffer queue twice as big, so that each invisible frame can have one matching show_existing_frame, allowing all frame-threads to be active for the worst-"real"-case while still never overflowing on pathological conditions;
 - the output queue handling is duplicated in `decode.c`, `lib.c` and `obu.c`, so merge this in one common place.
 - The `looprestoration`, `mc`, `dav1d_apply_grain`, and `dav1d_init_wedge_masks` functions uses excessively large stack buffers. Rewrite them in a way that reduces the stack usage, for example by using ring buffers or windowed approaches (which we already use for MC/LR SIMD). This would allow us to reduce the thread stack size requirements.
\ No newline at end of file