Changes

Ronald S. Bultje · cca75a1d
--- a/task-list.md
+++ b/task-list.md
@@ -20,7 +20,6 @@ Removing redundancies:
 - it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD (!657). This may make the code both simpler *and* faster. Same is true for looprestoration also;
 - `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this;
 - bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
- for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);
 - the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.

 Other speed optimizations: