Changes

Ronald S. Bultje · e0f8bbab
--- a/task-list.md
+++ b/task-list.md
 Loosely defined TODO list, intended both as a roadmap as well as an entry point for developers wishing to contribute but not knowing where to start.

-Missing API:
-- API to apply filmgrain to an image in software;
-- export filmgrain (maybe also sequence header [see #30] and frame header) along with picture.
-
 Missing bitstream features:
-- apply filmgrain in software (#34, !358);
-- 12-bits/component decoding (without massively increasing the binary size).
-
-Missing support for weird header bit features (please provide samples if we don't have one yet):
-- tile_ext (`aomenc --large-scale-tile=`?);
+- 12-bits/component decoding (without massively increasing the binary size);
+- tile_ext (`aomenc --large-scale-tile=`?).

 Missing software features:
 - error resilience (drop a frame but don't die);
+- export sequence header and frame header along with picture ([like this](https://code.videolan.org/rbultje/dav1d/commits/refcounted-headers)).

 Performance optimizations:
 - it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD. This may make the code both simpler *and* faster. Same is true for looprestoration also. In one far-fetched design, cdef/LR might be able to share the top buffers between the two, thus reducing the amount of `memcpy` between the two;
 - simd for any function already in a ${anything}DSPContext, for any platform (see #78 for AVX2);
+- move film grain into dspcontext for simd;
 - move dequant from `decode_coeffs()` to itx;
 - order_palette() to dsp for simd;
 - change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd;
 - change multi-symbol coding `read_symbol()` symbol discovery loop and adaptivity to be simd'ed [Rostislav expressed interest in this];
 - project_motion_field in `ref_mvs.c` can be SIMD'ed;
 - `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this. Bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
-- postfilter threading;
+- postfilter and film-grain threading;
+- film-grain GL shader;
 - threading can become a generic worker queue (one tile_sbrow symbol parsing/recon, one sbrow postfilter(s)) and then use a generic single threadpool instead of separate tile/frame[/postfilter?] ones;
 - obmc blend masks have one quarter of zeroes at their tail, so would there be gains if we set height to be 0.75 of what it currently is (for mc and/or blend)? Does this impact SIMD design in some unwanted way?
 - for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);