@@ -23,11 +23,12 @@ Removing redundancies:
 - `backup_lpf()` in `lr_apply_tmpl.c` backs up 4 lines per 64 pixels per plane, and copies bottom to top per superblock (each 128 or 64 pixels). Most of this is unnecessary. Using a flippable index means we don't need the second copy, and using 64-pixel instead of sb (64 or 128) pixel cdef runs (and then running LR, and then optionally the second cdef and second LR) means we only need to copy the pre-cdef top pixels, not the bottom ones, saving 50% copies. CDEF backup already does all of this;
 - bonus points for merging the CDEF backup and LR backup together so LR backs up nothing at all;
 - for inter, test if prediction at 128x128 should be done at 64x64 subblocks to improve cache efficiency (and also simplify SIMD);
-- the identity_identity inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for identity^2, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special identity^2 untransposed zigzag coefficient table and remove the transpose from the assembly, which would make identity^2 inverse transforms slightly faster.
+- the identity_* inverse transforms are stored transposed (as are all other coefficient tables). In all other cases, this saves a transpose in assembly, but for those, it actually means we have to transpose, even though in theory we wouldn't have to at all. Therefore, a potential optimization would be to have a special untransposed zigzag coefficient table and remove the transpose from the assembly, which would make those inverse transforms slightly faster.

 Other speed optimizations:
 - film-grain GL shader (like [placebo](https://github.com/haasn/libplacebo/blob/master/src/shaders/av1.c));
 - get rid of `memset(0)` of seq_hdr and frame_hdr after allocation.
+- Reuse per-frame buffers when possible instead of freeing and reallocating them all the time. Applicable to (in order of importance) `picture_alloc_with_edges()`, `dav1d_submit_frame()`, `dav1d_cdf_thread_alloc()`, and `dav1d_parse_obus()`.

 Cleanups:
 - palette buffers are always 16-bit, even if content is 8-bit (remaining item in #257);
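
The "flippable index" idea from the `backup_lpf()` item in the hunk above can be sketched roughly as follows. This is only an illustration under assumptions, not dav1d's code: `LineBackup`, `backup_save()`, `backup_cur()`, `backup_prev()`, `BACKUP_LINES` and `MAX_W` are hypothetical names, and the sketch handles a single 8-bit plane of at most 64 pixels width. Instead of copying the saved lines from a "bottom" slot to a "top" slot after every block row, two slots are kept and an index is toggled, so the previous row's lines stay readable without a second copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for the sketch: 4 saved lines, up to 64 pixels wide. */
enum { BACKUP_LINES = 4, MAX_W = 64 };

typedef struct {
    uint8_t rows[2][BACKUP_LINES][MAX_W]; /* two slots of saved lines */
    int cur;                              /* slot written most recently */
} LineBackup;

/* Save the last BACKUP_LINES lines of a block of height h into the spare
 * slot, then flip the index. The slot written one call earlier is left
 * untouched, so no bottom-to-top copy is needed. */
static void backup_save(LineBackup *b, const uint8_t *px, ptrdiff_t stride,
                        int h, int w)
{
    assert(h >= BACKUP_LINES && w <= MAX_W);
    const int next = b->cur ^ 1;
    for (int y = 0; y < BACKUP_LINES; y++)
        memcpy(b->rows[next][y], px + (h - BACKUP_LINES + y) * stride, (size_t)w);
    b->cur = next;
}

/* Lines saved by the most recent backup_save() call. */
static const uint8_t *backup_cur(const LineBackup *b, int y)
{
    return b->rows[b->cur][y];
}

/* Lines saved by the call before that (the block row above). */
static const uint8_t *backup_prev(const LineBackup *b, int y)
{
    return b->rows[b->cur ^ 1][y];
}
```

A caller would invoke `backup_save()` once per block row before filtering overwrites the pixels, then read the row above through `backup_prev()`. The cost per row is one copy of the saved lines plus a single XOR, which is the saving the item describes.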