
Last edited by Henrik Gramner Feb 06, 2021

task list

Loosely defined TODO list, intended both as a roadmap and as an entry point for developers who wish to contribute but do not know where to start.

Missing software features:

  • tile_ext (aomenc --large-scale-tile=?).
  • error resilience (drop a frame but don't die; #307).
  • conformance resilience: as per the discussion in !1063, do not require every header-level bitstream non-conformance to produce an error; instead, continue decoding whenever bitstream parsing allows it.

SIMD:

  • simd for any function already in a ${anything}DSPContext, for any platform (AVX512: #316, AVX2: #78, Neon: #215, SSSE3: #216, VSX: #281);
  • move dequant from decode_coeffs() to itx;
  • order_palette() to dsp for simd;
  • change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd;
  • _save_tmvs() and _load_tmvs() in refmvs.c can (maybe?) be SIMD'ed, along with all _splat_*() code in refmvs.h.

Multi-threading:

  • postfilter (!1086 (merged)) and film-grain threading;
  • in the first pass of frame threading with tile threading enabled, it may make sense (assuming no temporal interference from ref_mvs or seg_id) to first parse the tile marked as the one used to update the output CDF, since that would unblock the subsequent thread's pass 1. This only holds if use_ref_mvs=0 and segmentation.temporal_update=0;
  • threading can become a generic worker queue (one tile_sbrow symbol parsing/recon, one sbrow postfilter(s)) and then use a generic single threadpool instead of separate tile/frame[/postfilter?] ones (see also #206);
  • by not adding invisible frames to out_delayed[] and/or growing it so it can be bigger than the number of frame threads (and thus making the indexing between out_delayed[] and the actual frame thread doing the decoding independent), we could grow concurrency and scalability on typical sequences with frame-multithreading enabled.
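
The generic worker queue idea above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not dav1d's scheduler: the names `TaskType`, `Task`, `TaskQueue` and `run_pool` are hypothetical, the queue is a fixed-size FIFO guarded by a single mutex, and dependency tracking between parsing/reconstruction and postfilter tasks is left out entirely. Each "task" only increments a counter so that the example stays self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical task kinds; the names are illustrative, not dav1d's. */
enum TaskType { TASK_TILE_SBROW, TASK_POSTFILTER_SBROW };

typedef struct Task {
    enum TaskType type;
    int sbrow;              /* superblock row the task would operate on */
} Task;

#define QUEUE_CAP 64
#define MAX_WORKERS 16

typedef struct TaskQueue {
    Task buf[QUEUE_CAP];
    int head, count;        /* FIFO ring: read index and number of queued tasks */
    int closed;             /* set once no more tasks will be pushed */
    pthread_mutex_t lock;
    pthread_cond_t cond;
    atomic_int done[2];     /* completed tasks per type (demo bookkeeping only) */
} TaskQueue;

static void queue_init(TaskQueue *q) {
    q->head = q->count = q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    atomic_init(&q->done[0], 0);
    atomic_init(&q->done[1], 0);
}

static void queue_push(TaskQueue *q, Task t) {
    pthread_mutex_lock(&q->lock);
    assert(q->count < QUEUE_CAP);   /* sketch: no back-pressure handling */
    q->buf[(q->head + q->count++) % QUEUE_CAP] = t;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

static void queue_close(TaskQueue *q) {
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and fills *t, or 0 once the queue is closed and drained. */
static int queue_pop(TaskQueue *q, Task *t) {
    pthread_mutex_lock(&q->lock);
    while (!q->count && !q->closed)
        pthread_cond_wait(&q->cond, &q->lock);
    int ok = q->count > 0;
    if (ok) {
        *t = q->buf[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Every worker serves both task types from the same queue. */
static void *worker(void *arg) {
    TaskQueue *q = arg;
    Task t;
    while (queue_pop(q, &t))
        atomic_fetch_add(&q->done[t.type], 1);  /* stand-in for real work */
    return NULL;
}

/* Queues one task of each type per superblock row and waits for completion. */
static void run_pool(TaskQueue *q, int n_workers, int n_sbrows) {
    pthread_t th[MAX_WORKERS];
    assert(n_workers >= 1 && n_workers <= MAX_WORKERS);
    assert(2 * n_sbrows <= QUEUE_CAP);
    queue_init(q);
    for (int i = 0; i < n_workers; i++)
        pthread_create(&th[i], NULL, worker, q);
    for (int r = 0; r < n_sbrows; r++) {
        queue_push(q, (Task){ TASK_TILE_SBROW, r });
        queue_push(q, (Task){ TASK_POSTFILTER_SBROW, r });
    }
    queue_close(q);
    for (int i = 0; i < n_workers; i++)
        pthread_join(th[i], NULL);
}
```

Calling `run_pool(&q, 4, 8)` on a `TaskQueue q` processes eight tasks of each kind on four threads. A real design would additionally need ordering constraints, for example a postfilter task for a row only becoming runnable once reconstruction of that row has finished.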

Removing redundancies:

  • it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in uint16_t at a time so we don't need to extend buffers or add edge data inside the SIMD (!657 (closed)). This may make the code both simpler and faster. The same applies to looprestoration;
  • backup_lpf() in lr_apply_tmpl.c backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% of the copies. CDEF backup already does all of this;
  • bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
  • for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);
  • the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.
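
The "flippable index" mentioned for backup_lpf() can be illustrated with a small sketch. It assumes a simplified layout that is not dav1d's: a hypothetical `Backup` struct with two slots, each holding the last `LINES` lines of one superblock row, and an index that is toggled per row. The lines saved for the previous row then serve directly as the "top" lines of the current row, so no bottom-to-top copy is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROW_W 16   /* illustrative row width in pixels */
#define LINES 4    /* lines backed up per superblock row */

/* Hypothetical two-slot backup; names and layout are assumptions. */
typedef struct Backup {
    uint8_t slot[2][LINES * ROW_W];
    int cur;       /* slot receiving the current row's pre-filter bottom lines */
} Backup;

/* Save the last LINES lines of the current superblock row (sb_h lines tall)
 * before filtering overwrites them. */
static void backup_bottom(Backup *b, const uint8_t *sbrow, int sb_h) {
    assert(sb_h >= LINES);
    memcpy(b->slot[b->cur], sbrow + (sb_h - LINES) * ROW_W, LINES * ROW_W);
}

/* Pre-filter lines directly above the current row: whatever the previous
 * row saved, read from the other slot without any extra copy. */
static const uint8_t *top_lines(const Backup *b) {
    return b->slot[b->cur ^ 1];
}

/* Advance to the next superblock row by flipping the index. */
static void next_sbrow(Backup *b) {
    b->cur ^= 1;
}
```

In use, each superblock row calls `backup_bottom()`, filters using `top_lines()`, and then calls `next_sbrow()`; the slot written for one row becomes the top context of the next row simply because the index flipped.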

Other speed optimizations:

  • get rid of memset(0) of seq_hdr and frame_hdr after allocation.

Cleanups:

  • palette buffers are always 16-bit, even if content is 8-bit (remaining item in #257 (closed));
  • LR/MC intermediate 2d buffers in C dsp can be reduced by doing windowed like in SIMD;
  • cdef: noskip_mask resolution can be 8x8 (!1121 (merged));
  • lfmask and l/a ctx zero can be done in tile instead of frame context for better distribution.
  • show_existing_frame will be placed in the frame output queue as something keeping a frame thread busy, meaning for such cases, the frame thread will momentarily stall. This is partially required to prevent overflows of the output queue, or growing it to possibly infinite size on garbage input. But for the regular use case, we can make the output buffer queue twice as big, so that each invisible frame can have one matching show_existing_frame, allowing all frame-threads to be active for the worst-"real"-case while still never overflowing on pathological conditions;
  • the output queue handling is duplicated in decode.c, lib.c and obu.c, so merge it into one common place;
  • the looprestoration, mc, dav1d_apply_grain, and dav1d_init_wedge_masks functions use excessively large stack buffers. Rewrite them in a way that reduces the stack usage, for example by using ring buffers. This would allow us to reduce the thread stack size requirements.
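
The ring-buffer approach to reducing stack usage (which is also the "windowed" intermediate suggested for the LR/MC C code) can be sketched on a toy filter. The example below is not one of dav1d's filters: it is a separable 3x3 box sum with edge replication, and all names (`box3x3_full`, `box3x3_ring`, `MAX_W`, `MAX_H`) are hypothetical. The reference version keeps the whole horizontally filtered plane on the stack, while the windowed version keeps only the three rows the vertical pass needs at any time.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_W 64
#define MAX_H 64

static int clamp_idx(int v, int max) {
    return v < 0 ? 0 : v > max ? max : v;
}

/* Horizontal 3-tap sum of one row, replicating edge pixels. */
static void hsum_row(uint16_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++)
        dst[x] = (uint16_t)(src[clamp_idx(x - 1, w - 1)] + src[x] +
                            src[clamp_idx(x + 1, w - 1)]);
}

/* Reference: the full intermediate plane (MAX_H rows) lives on the stack. */
static void box3x3_full(uint16_t *dst, const uint8_t *src, int w, int h) {
    uint16_t mid[MAX_H][MAX_W];
    assert(w >= 1 && w <= MAX_W && h >= 1 && h <= MAX_H);
    for (int y = 0; y < h; y++)
        hsum_row(mid[y], src + y * w, w);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * w + x] = (uint16_t)(mid[clamp_idx(y - 1, h - 1)][x] +
                                        mid[y][x] +
                                        mid[clamp_idx(y + 1, h - 1)][x]);
}

/* Windowed: intermediate row y is stored in ring[y % 3], so only three
 * rows are ever held on the stack, regardless of the image height. */
static void box3x3_ring(uint16_t *dst, const uint8_t *src, int w, int h) {
    uint16_t ring[3][MAX_W];
    assert(w >= 1 && w <= MAX_W && h >= 1);
    hsum_row(ring[0], src, w);
    for (int y = 0; y < h; y++) {
        /* Produce row y + 1 just in time; it overwrites row y - 2. */
        if (y + 1 < h)
            hsum_row(ring[(y + 1) % 3], src + (y + 1) * w, w);
        const uint16_t *above = ring[clamp_idx(y - 1, h - 1) % 3];
        const uint16_t *cur   = ring[y % 3];
        const uint16_t *below = ring[clamp_idx(y + 1, h - 1) % 3];
        for (int x = 0; x < w; x++)
            dst[y * w + x] = (uint16_t)(above[x] + cur[x] + below[x]);
    }
}
```

Both functions produce identical output for the same input; the difference is that `mid` occupies `MAX_H * MAX_W` 16-bit entries while `ring` occupies only `3 * MAX_W`. The same pattern extends to filters with more vertical taps by enlarging the ring to the tap count.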