Performance optimizations:
- `cfl_ac` should take the block size (`w`/`h`) as function arguments rather than as function-LUT indices, so that only the subsampling mode (`420`, `422`, `444`) selects a LUT entry;
- `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and then copies the bottom lines to the top once per superblock (128 or 64 pixels). Most of this is unnecessary: a flippable index removes the need for the second copy, and running CDEF in 64-pixel units instead of per superblock (64 or 128 pixels), followed by LR (and then, optionally, a second CDEF and a second LR run), means only the pre-CDEF top pixels need to be copied, not the bottom ones, which halves the number of copies. The CDEF backup already does all of this. Bonus points for merging the CDEF and LR backups so that LR backs up nothing at all;
- postfilter threading;
- threading can become a generic worker queue (one job per tile_sbrow for symbol parsing/recon, one per sbrow for postfilter(s)) served by a single generic thread pool, instead of separate tile/frame[/postfilter?] pools;
- OBMC blend masks end with a quarter of zeroes, so would there be gains from reducing the height to 0.75 of its current value (for MC and/or blend)? Would this affect SIMD design in some unwanted way?

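For the `cfl_ac` item, a minimal sketch of the proposed table shape, assuming 8-bit pixels and a simplified AC computation that omits edge padding: every entry of the hypothetical `cfl_ac_lut` takes `w`/`h` as arguments, so the table needs one entry per subsampling layout rather than one per layout and block size. `cfl_ac_fn`, `cfl_ac_c`, the wrappers and the `SS_*` enum are illustrative names, not dav1d's actual prototypes. A caller would index by layout only, e.g. `cfl_ac_lut[SS_420](ac, luma, stride, w, h)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature: the chroma block size is a runtime argument. */
typedef void (*cfl_ac_fn)(int16_t *ac, const uint8_t *luma, ptrdiff_t stride,
                          int w, int h);

/* Shared body; ss_x/ss_y are the chroma subsampling shifts. Each output is
 * scaled to 8x the luma value, then the rounded block average is removed. */
static void cfl_ac_c(int16_t *ac, const uint8_t *luma, ptrdiff_t stride,
                     int w, int h, int ss_x, int ss_y)
{
    assert(w > 0 && h > 0);
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *p = luma + (y << ss_y) * stride + (x << ss_x);
            int v = p[0];
            if (ss_x) v += p[1];
            if (ss_y) v += p[stride];
            if (ss_x && ss_y) v += p[stride + 1];
            v <<= 3 - ss_x - ss_y; /* 420: 4 px << 1, 422: 2 px << 2, 444: 1 px << 3 */
            ac[y * w + x] = (int16_t)v;
            sum += v;
        }
    }
    const int n = w * h;
    const int avg = (sum + n / 2) / n;
    for (int i = 0; i < n; i++)
        ac[i] = (int16_t)(ac[i] - avg);
}

static void cfl_ac_420(int16_t *ac, const uint8_t *l, ptrdiff_t s, int w, int h)
{ cfl_ac_c(ac, l, s, w, h, 1, 1); }
static void cfl_ac_422(int16_t *ac, const uint8_t *l, ptrdiff_t s, int w, int h)
{ cfl_ac_c(ac, l, s, w, h, 1, 0); }
static void cfl_ac_444(int16_t *ac, const uint8_t *l, ptrdiff_t s, int w, int h)
{ cfl_ac_c(ac, l, s, w, h, 0, 0); }

/* One LUT entry per subsampling layout instead of per layout x block size. */
enum { SS_420, SS_422, SS_444, SS_COUNT };
static const cfl_ac_fn cfl_ac_lut[SS_COUNT] = { cfl_ac_420, cfl_ac_422, cfl_ac_444 };
```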
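For the `backup_lpf()` item, the flippable-index idea can be sketched as two slots of saved lines: the bottom lines of each sbrow go into the spare slot, and advancing to the next sbrow only flips which slot serves as "top", replacing the bottom-to-top copy. `LineBackup`, `backup_bottom()`, `next_sbrow()`, `top_line()` and the fixed `MAX_W` are hypothetical; this is not the buffer layout actually used by `lr_apply_tmpl.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LPF_LINES 4  /* lines kept per edge, as described in the item */
#define MAX_W 64     /* hypothetical maximum row width for this sketch */

typedef struct {
    uint8_t buf[2][LPF_LINES][MAX_W]; /* two slots of saved lines */
    int top;                          /* which slot currently serves as "top" */
} LineBackup;

/* Save the bottom LPF_LINES rows of the current sbrow into the spare slot. */
static void backup_bottom(LineBackup *lb, const uint8_t *rows,
                          ptrdiff_t stride, int w)
{
    assert(w > 0 && w <= MAX_W);
    for (int y = 0; y < LPF_LINES; y++)
        memcpy(lb->buf[lb->top ^ 1][y], rows + y * stride, (size_t)w);
}

/* Moving to the next sbrow: the saved bottom lines become the new top lines
 * by flipping the index, instead of being copied bottom-to-top. */
static void next_sbrow(LineBackup *lb) { lb->top ^= 1; }

static const uint8_t *top_line(const LineBackup *lb, int y)
{
    assert(y >= 0 && y < LPF_LINES);
    return lb->buf[lb->top][y];
}
```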
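For the threading item, a minimal sketch of one generic job queue served by a single pool of POSIX threads: each job carries its type (tile_sbrow parsing/recon or sbrow postfilter) and the workers dispatch on it. All names are hypothetical, the job bodies are empty placeholders, and the dependency tracking a real decoder needs (for example, postfiltering an sbrow only after it has been reconstructed) is deliberately omitted.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 64

enum JobType { JOB_TILE_SBROW, JOB_POSTFILTER_SBROW }; /* hypothetical job kinds */

typedef struct { enum JobType type; int sbrow; } Job;

typedef struct {
    Job jobs[QUEUE_CAP]; /* ring buffer of pending jobs */
    int head, tail, count;
    int shutdown;
    int done[2];         /* per-type completion counters, for illustration */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} JobQueue;

static void queue_init(JobQueue *q)
{
    q->head = q->tail = q->count = q->shutdown = 0;
    q->done[0] = q->done[1] = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
}

static void queue_push(JobQueue *q, Job j)
{
    pthread_mutex_lock(&q->lock);
    assert(q->count < QUEUE_CAP);
    q->jobs[q->tail] = j;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Workers exit once shutdown is requested and the queue is drained. */
static void queue_shutdown(JobQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->shutdown = 1;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Every thread in the single pool runs this loop for all job types. */
static void *worker(void *arg)
{
    JobQueue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->shutdown)
            pthread_cond_wait(&q->cond, &q->lock);
        if (q->count == 0) { /* shutdown requested and nothing left */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        Job j = q->jobs[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);

        switch (j.type) {
        case JOB_TILE_SBROW:       /* placeholder: symbol parsing + recon */
        case JOB_POSTFILTER_SBROW: /* placeholder: postfilter(s) for one sbrow */
            break;
        }

        pthread_mutex_lock(&q->lock);
        q->done[j.type]++;
        pthread_mutex_unlock(&q->lock);
    }
}
```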
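For the OBMC item, the question can be made concrete with a scalar blend, assuming the formula `(dst * (64 - m) + tmp * m + 32) >> 6`, where `m` is the mask weight given to the neighboring prediction. A row with `m == 0` is left unchanged, so if the last quarter of the mask is zero, stopping after `h * 3 / 4` rows gives the same result with a quarter less work. `obmc_blend_rows()` and `obmc_blend_trimmed()` are hypothetical names; whether the trimmed height suits the SIMD code remains the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend rows [0, rows) of dst with the neighbor prediction tmp (packed,
 * w pixels per row); mask[y] is the weight (0..64) given to tmp on row y. */
static void obmc_blend_rows(uint8_t *dst, ptrdiff_t stride, const uint8_t *tmp,
                            int w, int rows, const uint8_t *mask)
{
    for (int y = 0; y < rows; y++) {
        const int m = mask[y];
        assert(m <= 64);
        for (int x = 0; x < w; x++) {
            uint8_t *d = &dst[y * stride + x];
            /* m == 0 gives (*d * 64 + 32) >> 6 == *d, i.e. a no-op row */
            *d = (uint8_t)((*d * (64 - m) + tmp[y * w + x] * m + 32) >> 6);
        }
    }
}

/* Hypothetical trimmed variant: relies on the last quarter of the mask being
 * zero, so only the first h * 3 / 4 rows are processed. */
static void obmc_blend_trimmed(uint8_t *dst, ptrdiff_t stride, const uint8_t *tmp,
                               int w, int h, const uint8_t *mask)
{
    for (int y = h * 3 / 4; y < h; y++)
        assert(mask[y] == 0); /* precondition this shortcut depends on */
    obmc_blend_rows(dst, stride, tmp, w, h * 3 / 4, mask);
}
```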
Cleanups:
- the intermediate 2D buffers used by the C DSP code for LR/MC can be reduced by processing through a sliding window, as the SIMD code does;

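The sliding-window cleanup can be sketched with a toy separable filter, assuming a hypothetical 3-tap `{1, 2, 1}` kernel in both directions (the real MC and LR filters are longer and round differently). `filter_full()` keeps a whole `(h + TAPS - 1) x w` intermediate buffer, while `filter_windowed()` keeps only `TAPS` intermediate rows in a ring buffer and produces identical output. All names and sizes here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 3   /* hypothetical kernel length; real MC/LR filters are longer */
#define MAX_W 64
#define MAX_H 64

static const int kern[TAPS] = { 1, 2, 1 }; /* applied both ways: total gain 16 */

/* Horizontal pass for one row; src needs w + TAPS - 1 valid pixels. */
static void h_row(int *out, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++) {
        int s = 0;
        for (int k = 0; k < TAPS; k++)
            s += kern[k] * src[x + k];
        out[x] = s;
    }
}

/* Reference: the full intermediate image is stored before the vertical pass. */
static void filter_full(uint8_t *dst, const uint8_t *src, int stride, int w, int h)
{
    assert(w <= MAX_W && h <= MAX_H);
    int mid[(MAX_H + TAPS - 1) * MAX_W];
    for (int y = 0; y < h + TAPS - 1; y++)
        h_row(&mid[y * w], src + y * stride, w);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int s = 0;
            for (int k = 0; k < TAPS; k++)
                s += kern[k] * mid[(y + k) * w + x];
            dst[y * w + x] = (uint8_t)((s + 8) >> 4);
        }
}

/* Windowed: intermediate row r lives in win[r % TAPS], so only TAPS rows
 * are ever stored. */
static void filter_windowed(uint8_t *dst, const uint8_t *src, int stride, int w, int h)
{
    assert(w <= MAX_W && h <= MAX_H);
    int win[TAPS][MAX_W];
    for (int y = 0; y < TAPS - 1; y++)
        h_row(win[y], src + y * stride, w);
    for (int y = 0; y < h; y++) {
        /* the newest row overwrites row y - 1, which output row y no longer needs */
        h_row(win[(y + TAPS - 1) % TAPS], src + (y + TAPS - 1) * stride, w);
        for (int x = 0; x < w; x++) {
            int s = 0;
            for (int k = 0; k < TAPS; k++)
                s += kern[k] * win[(y + k) % TAPS][x];
            dst[y * w + x] = (uint8_t)((s + 8) >> 4);
        }
    }
}
```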