Ronald S. Bultje · 455dfa8a
--- a/task-list.md
+++ b/task-list.md
@@ -8,23 +8,30 @@ Missing software features:
 - error resilience (drop a frame but don't die);
 - export sequence header and frame header along with picture ([like this](https://code.videolan.org/rbultje/dav1d/commits/refcounted-headers)).

-Performance optimizations:
-- it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD. This may make the code both simpler *and* faster. Same is true for looprestoration also. In one far-fetched design, cdef/LR might be able to share the top buffers between the two, thus reducing the amount of `memcpy` between the two;
+SIMD:
 - simd for any function already in a ${anything}DSPContext, for any platform (see #78 for AVX2);
 - move film grain into dspcontext for simd;
 - move dequant from `decode_coeffs()` to itx;
 - order_palette() to dsp for simd;
 - change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd;
 - change multi-symbol coding `read_symbol()` symbol discovery loop and adaptivity to be simd'ed [Rostislav expressed interest in this];
-- project_motion_field in `ref_mvs.c` can be SIMD'ed;
-- `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this. Bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
+- project_motion_field in `ref_mvs.c` can maybe be SIMD'ed;
+- a specifically optimized version for `mc.put/prep_scaled()` for super_res, since then `my` is always 0, so there is only horizontal scaling, not vertical.
+
+Multi-threading:
 - postfilter and film-grain threading;
-- film-grain GL shader;
-- threading can become a generic worker queue (one tile_sbrow symbol parsing/recon, one sbrow postfilter(s)) and then use a generic single threadpool instead of separate tile/frame[/postfilter?] ones;
+- threading can become a generic worker queue (one tile_sbrow symbol parsing/recon, one sbrow postfilter(s)) and then use a generic single threadpool instead of separate tile/frame[/postfilter?] ones.
+
+Removing redundancies:
+- it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD. This may make the code both simpler *and* faster. Same is true for looprestoration also;
+- `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this;
+- bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
 - obmc blend masks have one quarter of zeroes at their tail, so would there be gains if we set height to be 0.75 of what it currently is (for mc and/or blend)? Does this impact SIMD design in some unwanted way?
 - for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);
-- faster seeking w/ frame-mt. Right now, a flush means pictures being decoded in frame-mt are marked as "discarded" but their decoding continues. It would be better to have a quick abort check in each sbrow or so, and then return early so the post-seek can continue earlier;
-- a specifically optimized version for `mc.put/prep_scaled_c()` for super_res, since then `my` is always 0, so there is only horizontal scaling, not vertical.
+
+Other speed optimizations:
+- film-grain GL shader;
+- faster seeking w/ frame-mt. Right now, a flush means pictures being decoded in frame-mt are marked as "discarded" but their decoding continues. It would be better to have a quick abort check in each sbrow or so, and then return early so the post-seek can continue earlier.

 Cleanups:
 - LR/MC intermediate 2d buffers in C dsp can be reduced by doing windowed like in SIMD;
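
Note on the "faster seeking w/ frame-mt" item in the hunk above: the per-sbrow abort check it proposes could look roughly like the C sketch below. This is only an illustration under stated assumptions, not dav1d's actual code. `FrameTask`, `frame_task_init()`, `request_flush()`, `decode_sbrow()` and `decode_frame()` are hypothetical names invented for this example, and the real decoder state is far larger.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-frame state; not a dav1d structure. */
typedef struct {
    atomic_bool discard;   /* set by the flushing thread */
    int n_sbrows;          /* number of superblock rows in the frame */
    int sbrows_decoded;    /* rows actually processed so far */
} FrameTask;

void frame_task_init(FrameTask *t, int n_sbrows) {
    atomic_init(&t->discard, false);
    t->n_sbrows = n_sbrows;
    t->sbrows_decoded = 0;
}

/* Called on flush/seek: mark the in-flight frame as discarded. */
void request_flush(FrameTask *t) {
    atomic_store_explicit(&t->discard, true, memory_order_release);
}

/* Stand-in for the real per-sbrow work (parsing, recon, postfilters). */
static void decode_sbrow(FrameTask *t, int sby) {
    (void)sby;
    t->sbrows_decoded++;
}

/* Returns 0 when the whole frame was decoded, -1 when it was abandoned. */
int decode_frame(FrameTask *t) {
    for (int sby = 0; sby < t->n_sbrows; sby++) {
        /* Quick abort check once per superblock row: if the frame has
         * been discarded, stop instead of finishing unused work. */
        if (atomic_load_explicit(&t->discard, memory_order_acquire))
            return -1;
        decode_sbrow(t, sby);
    }
    return 0;
}
```

One atomic load per superblock row is cheap next to the row's decoding cost, so a check at this granularity should cost little while bounding the post-flush delay to roughly one row of work per in-flight frame.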