@@ -10,7 +10,7 @@ SIMD:
 - move dequant from `decode_coeffs()` to itx;
 - order_palette() to dsp for simd;
 - change coef contexting (hi/lo_ctx) to be diagonal-oriented for dsp/simd;
-- change multi-symbol coding `read_symbol()` symbol discovery loop and adaptivity to be simd'ed [!423];
+- change multi-symbol coding `read_symbol()` symbol discovery loop and adaptivity to be simd'ed [!665];
 - project_motion_field in `ref_mvs.c` can maybe be SIMD'ed;
 - a specifically optimized version for `mc.put/prep_scaled()` for super_res, since then `my` is always 0, so there is only horizontal scaling, not vertical.
 
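For the `read_symbol()` item in the SIMD hunk above, the following is a minimal sketch of why a symbol discovery loop lends itself to SIMD. It is not the project's code: `find_symbol_loop`, `find_symbol_count`, the threshold array `thr` and the 16-entry limit are all hypothetical names and assumptions made for illustration. The idea shown is that, when the per-symbol thresholds are non-increasing, an early-exit search gives the same answer as counting every threshold that exceeds the current value, and the counting form has no data-dependent branch, so each comparison could in principle run in its own vector lane.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential discovery: advance while the current value c is still
 * below the next threshold; the number of steps is the symbol. */
static unsigned find_symbol_loop(const uint16_t *thr, unsigned n, uint16_t c)
{
    unsigned s = 0;
    while (s < n && c < thr[s])
        s++;
    return s;
}

/* Branch-free discovery: count every threshold that exceeds c.
 * Matches find_symbol_loop() only if thr[] is non-increasing.
 * Each iteration is independent, which is what a vector
 * compare-and-count would exploit. */
static unsigned find_symbol_count(const uint16_t *thr, unsigned n, uint16_t c)
{
    assert(n <= 16); /* hypothetical limit: one register's worth of lanes */
    unsigned s = 0;
    for (unsigned i = 0; i < n; i++)
        s += (unsigned)(c < thr[i]);
    return s;
}
```

Both functions return the same symbol index for any non-increasing `thr`; the second trades a few redundant comparisons for a fixed, branch-free instruction sequence.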
@@ -21,7 +21,7 @@ Multi-threading:
 - by not adding invisible frames to `out_delayed[]` and/or growing it so it can be bigger than the number of frame threads (and thus making the indexing between `out_delayed[]` and the actual frame thread doing the decoding independent), we could grow concurrency and scalability on typical sequences with frame-multithreading enabled.
 
 Removing redundancies:
-- it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD. This may make the code both simpler *and* faster. Same is true for looprestoration also;
+- it may make sense to copy one row (8px+2x2px edges) of pre-cdef data in `uint16_t` at a time so we don't need to extend buffers or add edge data inside the SIMD (!657). This may make the code both simpler *and* faster. Same is true for looprestoration also;
 - `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this;
 - bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
 - obmc blend masks have one quarter of zeroes at their tail, so would there be gains if we set height to be 0.75 of what it currently is (for mc and/or blend)? Does this impact SIMD design in some unwanted way?
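The "flippable index" mentioned in the `backup_lpf()` item can be sketched as a two-slot buffer whose active slot toggles on every store, so the previous backup stays readable without being copied from a "bottom" area to a "top" area. This is an illustration under stated assumptions, not the layout used in `lr_apply_tmpl.c`: the `FlipBackup` type, `flip_store()`, `flip_prev_row()` and the sizes `BK_ROWS` and `BK_W` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BK_ROWS = 4, BK_W = 64 }; /* hypothetical: 4 lines of 64 pixels */

typedef struct {
    uint8_t rows[2][BK_ROWS][BK_W]; /* two slots, used alternately */
    int cur;                        /* slot written by the latest store */
} FlipBackup;

/* Back up BK_ROWS lines from src into the slot that is NOT current,
 * then make that slot current. The old slot is left untouched. */
static void flip_store(FlipBackup *b, const uint8_t *src, ptrdiff_t stride)
{
    b->cur ^= 1;
    for (int y = 0; y < BK_ROWS; y++)
        memcpy(b->rows[b->cur][y], src + y * stride, BK_W);
}

/* Read a line of the previous backup: no second copy is needed,
 * only the index is flipped. */
static const uint8_t *flip_prev_row(const FlipBackup *b, int y)
{
    assert(y >= 0 && y < BK_ROWS);
    return b->rows[b->cur ^ 1][y];
}
```

A caller would store the lines for the current row of blocks with `flip_store()` and read the lines saved for the row above through `flip_prev_row()`; the cost is one extra slot of memory in exchange for dropping the second copy.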
@@ -33,7 +33,7 @@ Other speed optimizations:
 - get rid of `memset(0)` of seq_hdr and frame_hdr after allocation.
 
 Cleanups:
-- internal buffers in `decode.c` are all allocated as if the stream is 10-bit 4:4:4, so changing these buffers could reduce memory usage somewhat, especially the palette and coef buffers;
+- internal buffers in `decode.c` are all allocated as if the stream is 10-bit 4:4:4, so changing these buffers could reduce memory usage somewhat, especially the palette and coef buffers (#257);
 - LR/MC intermediate 2d buffers in C dsp can be reduced by doing windowed like in SIMD;
 - cdef: noskip_mask resolution can be 8x8;
 - ref_mvs: non-cur frame MVs can be at 8x8 resolution, only direct neighbours need to be 4x4;
...
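To give a feel for the memory at stake in the `decode.c` buffer item above, here is a rough sketch that sizes a pixel-sized buffer from the actual chroma layout and bit depth instead of assuming 10-bit 4:4:4. It is only an illustration: the `Layout` enum, `samples_per_block()` and `pixel_buf_bytes()` are hypothetical names, and the real buffers (coefficients in particular) may use different element sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout enum for this sketch only. */
typedef enum { LAYOUT_I400, LAYOUT_I420, LAYOUT_I422, LAYOUT_I444 } Layout;

/* Luma plus chroma samples covering a w x h luma area. */
static size_t samples_per_block(Layout l, size_t w, size_t h)
{
    const size_t luma = w * h;
    switch (l) {
    case LAYOUT_I400: return luma;                  /* no chroma */
    case LAYOUT_I420: return luma + 2 * (luma / 4); /* chroma halved both ways */
    case LAYOUT_I422: return luma + 2 * (luma / 2); /* chroma halved horizontally */
    default:          return luma * 3;              /* 4:4:4, full-size chroma */
    }
}

/* Bytes needed if each sample is stored at pixel size:
 * 1 byte for 8-bit streams, 2 bytes for 10- and 12-bit streams. */
static size_t pixel_buf_bytes(Layout l, int bitdepth, size_t w, size_t h)
{
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    return samples_per_block(l, w, h) * (bitdepth > 8 ? 2 : 1);
}
```

Under these assumptions, a 64x64 area needs 24576 bytes when sized for 10-bit 4:4:4 but 6144 bytes for an 8-bit 4:2:0 stream, a factor of four.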