Ronald S. Bultje · 28ad7a35
--- a/task-list.md
+++ b/task-list.md
-Loosely defined TODO list, intended both as a roadmap as well as an entry point for developers wishing to contribute but not knowing where to start.
-
-Missing software features:
-- tile_ext (`aomenc --large-scale-tile=`?).
-- error resilience (when dropping a frame which may be used as a reference, try to find a nearest-neighbour which can serve as a replacement reference so that other frames using this as a reference can be decoded at least with some accuracy; #307).
-
-SIMD:
-- simd for any function already in a ${anything}DSPContext, for any platform (AVX512: #316, AVX2: #78, Neon: #215, SSSE3: #216, VSX: #281, RVV: #435);
-- move dequant from `decode_coeffs()` to itx;
-- order_palette() to dsp for simd;
-- change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd.
-
-Multi-threading:
-- in first-pass of frame threading with tile threading enabled, it may make sense (assuming no temporal interference from ref_mvs or seg_id) to first parse the tile marked as the one used to update the output CDF, since that would unblock the subsequent thread's pass 1. This is only true if use_ref_mvs=0 and segmentation.temporal_update=0;
-- show_existing_frame will be placed in the frame output queue as something keeping a frame thread busy, meaning for such cases, the frame thread will momentarily stall. This is partially required to prevent overflows of the output queue, or growing it to possibly infinite size on garbage input. But for the regular use case, it makes sense to dis-associate the input and output queue so show-existing-frame does not affect how many frames are actively being processed.
-
-Algorithmic optimizations:
-- prevent per-plane `memcpy()` if some (but not all) planes have film grain. We tried this before but it had to be reverted (#426, !1522). This probably needs per-plane `allocator_data`.
-- the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.
-
-Cleanups:
-- lfmask and l/a ctx zero can be done in tile instead of frame context for better distribution.
-- the output queue handling is duplicated in `decode.c`, `lib.c` and `obu.c`, so merge this in one common place.
-- The `looprestoration`, `mc`, `dav1d_apply_grain`, and `dav1d_init_wedge_masks` functions use excessively large stack buffers. Rewrite them in a way that reduces the stack usage, for example by using ring buffers or windowed approaches (which we already use for MC/LR SIMD). This would allow us to reduce the thread stack size requirements.
-
-Memory usage reductions:
+Loosely defined TODO list, intended both as a roadmap as well as an entry point for developers wishing to contribute but not knowing where to start.
+
+Missing software features:
+- tile_ext (`aomenc --large-scale-tile=`?).
+- error resilience (when dropping a frame which may be used as a reference, try to find a nearest-neighbour which can serve as a replacement reference so that other frames using this as a reference can be decoded at least with some accuracy; #307).
+
+SIMD:
+- simd for any function already in a ${anything}DSPContext, for any platform (AVX512: #316, AVX2: #78, Neon: #215, SSSE3: #216, VSX: #281, RVV: #435);
+- move dequant from `decode_coeffs()` to itx;
+- order_palette() to dsp for simd;
+- change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd.
+
+Multi-threading:
+- in first-pass of frame threading with tile threading enabled, it may make sense (assuming no temporal interference from ref_mvs or seg_id) to first parse the tile marked as the one used to update the output CDF, since that would unblock the subsequent thread's pass 1. This is only true if use_ref_mvs=0 and segmentation.temporal_update=0;
+- show_existing_frame will be placed in the frame output queue as something keeping a frame thread busy, meaning for such cases, the frame thread will momentarily stall. This is partially required to prevent overflows of the output queue, or growing it to possibly infinite size on garbage input. But for the regular use case, it makes sense to dis-associate the input and output queue so show-existing-frame does not affect how many frames are actively being processed.
+
+Algorithmic optimizations:
+- prevent per-plane `memcpy()` if some (but not all) planes have film grain. We tried this before but it had to be reverted (#426, !1522). This probably needs per-plane `allocator_data`.
+- the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.
+
+Cleanups:
+- lfmask and l/a ctx zero can be done in tile instead of frame context for better distribution.
+- the output queue handling is duplicated in `decode.c`, `lib.c` and `obu.c`, so merge this in one common place.
+- The `looprestoration` (wiener-only; this is already fixed for SGR), `mc`, `dav1d_apply_grain`, and `dav1d_init_wedge_masks` functions use excessively large stack buffers. Rewrite them in a way that reduces the stack usage, for example by using ring buffers or windowed approaches (which we already use for MC/LR SIMD). This would allow us to reduce the thread stack size requirements.
+
+Memory usage reductions:
 - Pack the four (y/uv \* h/v) 6-bit lf mask values into a 24-bit value, which should save 1 KiB / sb128. Requires changes to the mask loading asm code.
\ No newline at end of file
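
Note on the "Memory usage reductions" item: the packing it describes (four 6-bit values, y/uv \* h/v, in one 24-bit value) could look roughly like the sketch below. This is only an illustration of the bit layout, not code from the decoder. The field order, the `lf_mask_*` function names and the little-endian 3-byte storage are all assumptions made for the example, and the asm mask-loading changes mentioned in the item are not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field order inside the packed value (6 bits each). */
enum { LF_Y_H = 0, LF_Y_V = 1, LF_UV_H = 2, LF_UV_V = 3 };

/* Pack four 6-bit values into the low 24 bits of a 32-bit word. */
static inline uint32_t lf_mask_pack(const uint8_t v[4]) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; i++) {
        assert(v[i] < 64); /* each value must fit in 6 bits */
        packed |= (uint32_t)v[i] << (6 * i);
    }
    return packed;
}

/* Extract one 6-bit field from a packed value. */
static inline unsigned lf_mask_get(uint32_t packed, int field) {
    return (packed >> (6 * field)) & 0x3f;
}

/* Store the 24 used bits in 3 bytes (little-endian, assumed layout). */
static inline void lf_mask_store24(uint8_t dst[3], uint32_t packed) {
    dst[0] = (uint8_t)(packed & 0xff);
    dst[1] = (uint8_t)((packed >> 8) & 0xff);
    dst[2] = (uint8_t)((packed >> 16) & 0xff);
}

/* Load a packed value back from its 3-byte representation. */
static inline uint32_t lf_mask_load24(const uint8_t src[3]) {
    return (uint32_t)src[0] | ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16);
}
```

Storing the result in 3 bytes instead of one byte per value is where the saving would come from; whether a 3-byte or a padded 4-byte layout is preferable depends on how the asm loads the masks.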