Changes

Ronald S. Bultje · 40664458
--- a/task-list.md
+++ b/task-list.md
@@ -23,7 +23,7 @@ Removing redundancies:
 - it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD (!657). This may make the code both simpler *and* faster. Same is true for looprestoration also;
 - `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this;
 - bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
- obmc blend masks have one quarter of zeroes at their tail, so would there be gains if we set height to be 0.75 of what it currently is (for mc and/or blend)? Does this impact SIMD design in some unwanted way?
+- obmc blend masks have one quarter of zeroes at their tail, so would there be gains if we set height to be 0.75 of what it currently is (for mc and/or blend)? Does this impact SIMD design in some unwanted way? (!705)
 - for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);
 - the identity_identity inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for identity^2, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special identity^2 untransposed zigzag coefficient table and remove the transpose from the assembly, which would make identity^2 inverse transforms slightly faster.
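
A note on the obmc item touched by this change: the sketch below illustrates why a zero tail in the mask could allow a shorter blend. It assumes a 6-bit blend of the form `(dst * (64 - m) + src * m + 32) >> 6`, under which a weight of 0 leaves `dst` unchanged. The mask values, buffer sizes and the names `blend_px`, `blend_rows` and `three_quarter_height_matches` are hypothetical and are not taken from the codebase.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_W = 32, MAX_H = 32 };

/* Assumed 6-bit blend: m is the weight of src in [0, 64].
 * With m == 0 the result is (dst * 64 + 32) >> 6 == dst. */
static inline uint8_t blend_px(uint8_t dst, uint8_t src, int m) {
    return (uint8_t)((dst * (64 - m) + src * m + 32) >> 6);
}

/* Blend the first `rows` rows of a w-wide block; mask[y] is the
 * per-row weight (one weight per row, as in a vertical-edge blend). */
static void blend_rows(uint8_t *dst, const uint8_t *src,
                       const uint8_t *mask, int w, int rows) {
    for (int y = 0; y < rows; y++)
        for (int x = 0; x < w; x++)
            dst[y * w + x] = blend_px(dst[y * w + x], src[y * w + x], mask[y]);
}

/* Returns 1 if blending only the first 3/4 of the rows gives the same
 * output as blending all h rows, for a made-up mask whose last quarter
 * is zero. */
static int three_quarter_height_matches(int w, int h) {
    assert(w > 0 && w <= MAX_W && h >= 4 && h <= MAX_H && h % 4 == 0);

    uint8_t mask[MAX_H], src[MAX_W * MAX_H];
    uint8_t full[MAX_W * MAX_H], part[MAX_W * MAX_H];
    const int live = h * 3 / 4;

    /* Hypothetical mask: non-zero, decreasing weights, then a zero tail. */
    for (int y = 0; y < h; y++)
        mask[y] = y < live ? (uint8_t)(32 * (live - y) / live) : 0;

    /* Deterministic test data; both outputs start from the same dst. */
    for (int i = 0; i < w * h; i++) {
        src[i]  = (uint8_t)((i * 7 + 3) & 255);
        full[i] = (uint8_t)((i * 13 + 5) & 255);
    }
    memcpy(part, full, (size_t)(w * h));

    blend_rows(full, src, mask, w, h);    /* current height */
    blend_rows(part, src, mask, w, live); /* 0.75 of the height */

    return memcmp(full, part, (size_t)(w * h)) == 0;
}
```

This only shows that the skipped rows are no-ops under the assumed formula. It says nothing about whether the shorter height is faster in practice or how it interacts with SIMD block sizes, which are the open questions in the task item.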