
Draft: WIP Prototype of chunked mc

Luc Trudeau requested to merge chunked-mc into master

Could dav1d decode 128x128 superblocks faster?

When analyzing the impact of 128x128 superblocks on dav1d and gav1 over the Netflix 1080p test sequences, I get the following results:

| Decoder | SB size | Average decode time | Delta % (128 vs. 64) |
|---------|---------|---------------------|----------------------|
| dav1d   | 64      | 3.0729              |                      |
| dav1d   | 128     | 3.0066              | -2.16%               |
| gav1    | 64      | 4.3452              |                      |
| gav1    | 128     | 4.1528              | -4.43%               |

Results from a c5.xlarge instance

One way to interpret this data is that 128x128 superblocks benefit gav1 more than they benefit dav1d. This is similar to a previous finding related to loop restoration, where gav1 was found to be faster than dav1d; that finding led to substantial improvements to dav1d's loop restoration code. The objective of this work is to determine whether the same is true for 128x128 superblocks.

Prediction Buffers vs. In-Frame Prediction

One difference between dav1d and gav1 that could explain this data is how inter predictions are written to memory: gav1 predicts into scratch buffers, while dav1d predicts directly into the frame buffer.

In-frame prediction could result in more cache misses for bigger blocks (i.e. 128x128 superblocks). For example, a 128x128 block in a 1080p frame only requires a 128x128 scratch prediction buffer, whereas the same prediction written into the frame buffer is spread over a 1920x128 region. That wider footprint could cause more cache misses and extra trips to memory.
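
As a rough sanity check on the footprint argument, here is a small, self-contained C snippet (not dav1d code; it assumes 8-bit luma samples and a 1920-byte frame stride) that works out the address span each approach touches:

```c
#include <stdio.h>

/* Back-of-the-envelope footprint for the example above (8-bit luma,
 * 1080p). Both approaches write the same 16 KiB of samples, but the
 * in-frame writes are scattered across the full frame stride, so the
 * address range they span is ~15x wider. */
int main(void)
{
    const long frame_stride = 1920; /* bytes per 1080p luma row */
    const long blk = 128;           /* 128x128 superblock */

    const long scratch_bytes = blk * blk;          /* dense scratch buffer   */
    const long in_frame_span = frame_stride * blk; /* span of in-frame writes */

    printf("scratch buffer: %ld KiB\n", scratch_bytes / 1024); /* 16 KiB  */
    printf("in-frame span:  %ld KiB\n", in_frame_span / 1024); /* 240 KiB */
    return 0;
}
```

A 16 KiB block fits comfortably in a typical L1 data cache; writes scattered over a 240 KiB span do not, which is the suspected source of the extra misses on big blocks.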

A simple test

AV1 is very complicated; a simple way to verify this is to disable all inter coding tools except single-reference motion compensation (i.e. old-school motion compensation: no compound prediction, no OBMC, no inter-intra, no global motion, no warped motion).

Again using the Netflix 1080p test sequences, the relative speed-up from 128x128 superblocks is almost unchanged for gav1 (-4.43% -> -4.28%) and slightly smaller for dav1d (-2.16% -> -1.76%).

| Decoder | SB size | Average decode time | Delta % (128 vs. 64) |
|---------|---------|---------------------|----------------------|
| dav1d   | 64      | 2.9002              |                      |
| dav1d   | 128     | 2.8490              | -1.76%               |
| gav1    | 64      | 4.2104              |                      |
| gav1    | 128     | 4.0303              | -4.28%               |

Results from a c5.xlarge instance

Chunked Inter Prediction

While dav1d does use prediction buffers for compound inter prediction, they are mostly used for warping; the actual merging of the two predictors is done in-frame. That being said, replacing in-frame prediction with prediction buffers in dav1d would require a considerable amount of work and might not even be desirable. One disadvantage of prediction buffers is that they require an extra copy for skip blocks (memcpy is murder).
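
To illustrate the point about skips, here is a minimal sketch (hypothetical names, not dav1d's actual reconstruction code) of the copy that a scratch-buffer design cannot avoid:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* With in-frame prediction, a skip block is finished as soon as the
 * predictor is written. With a scratch buffer, the predictor still
 * has to be copied into the frame, even though no residual is added. */
static void reconstruct_skip(uint8_t *frame, ptrdiff_t stride,
                             const uint8_t *scratch, int w, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(&frame[y * stride], &scratch[y * w], w); /* the extra copy */
}
```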

Another test

To validate that what we are seeing is really caused by cache misses on 128x128 superblocks, a simple experiment is to perform chunked 64x64 inter prediction: instead of predicting the entire 128x128 superblock at once, we predict the first 64x64 chunk, reconstruct it, and only then move on to the next 64x64 chunk, as sketched below. This isn't a perfect solution, but it should considerably reduce cache misses.
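
In spirit, the prototype's loop looks something like the following minimal sketch; `SuperBlock`, `predict_blocks`, and `reconstruct_blocks` are hypothetical stand-ins, not the actual patch's structures:

```c
/* Hypothetical stand-ins; dav1d's real tile/task structures differ. */
typedef struct SuperBlock SuperBlock;
void predict_blocks(SuperBlock *sb, int x, int y, int w, int h);
void reconstruct_blocks(SuperBlock *sb, int x, int y, int w, int h);

/* Chunked MC: interleave prediction and reconstruction per 64x64
 * chunk instead of predicting the whole 128x128 superblock first,
 * so the prediction working set stays close to a 64x64 block. */
void decode_sb128_chunked(SuperBlock *sb)
{
    for (int cy = 0; cy < 128; cy += 64)        /* four 64x64 chunks */
        for (int cx = 0; cx < 128; cx += 64) {
            predict_blocks(sb, cx, cy, 64, 64);     /* inter prediction */
            reconstruct_blocks(sb, cx, cy, 64, 64); /* residual + recon */
        }
}
```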

| Decoder | SB size | Average decode time | Delta % (128 vs. 64) |
|---------|---------|---------------------|----------------------|
| dav1d   | 64      | 2.9008              |                      |
| dav1d   | 128     | 2.8676              | -1.15%               |
| gav1    | 64      | 4.2109              |                      |
| gav1    | 128     | 4.0376              | -4.12%               |

Results from a c5.xlarge instance

Inter Prediction Only

| Inter prediction (sb128) | dav1d    | gav1     | Delta % (dav1d vs. gav1) |
|--------------------------|----------|----------|--------------------------|
| 128x128                  | 0.218526 | 0.255277 | -14.40%                  |
| 64x64                    | 0.196064 | 0.22852  | -14.20%                  |
| 32x32                    | 0.242264 | 0.287912 | -15.85%                  |
| 16x16                    | 0.208807 | 0.268232 | -22.15%                  |
| 8x8                      | 0.100511 | 0.148878 | -32.49%                  |
| 4x4                      | 0.017288 | 0.025214 | -31.43%                  |

| Inter prediction (sb64) | dav1d    | gav1     | Delta % (dav1d vs. gav1) |
|-------------------------|----------|----------|--------------------------|
| 64x64                   | 0.432062 | 0.530057 | -18.49%                  |
| 32x32                   | 0.240078 | 0.287298 | -16.44%                  |
| 16x16                   | 0.209979 | 0.269536 | -22.10%                  |
| 8x8                     | 0.10139  | 0.149446 | -32.16%                  |
| 4x4                     | 0.018133 | 0.025904 | -30.00%                  |