Commit 84bb443d, authored by Fiona Glaser

Update some of the information in doc/

parent 49e5105a
A qualitative overview of x264's ratecontrol methods
By Loren Merritt
Historical note:
This document is outdated, but a significant part of it is still accurate. Here are some important ways ratecontrol has changed since the authoring of this document:
- By default, MB-tree is used instead of qcomp for weighting frame quality based on complexity. MB-tree is effectively a generalization of qcomp to the macroblock level. MB-tree also replaces the constant offsets for B-frame quantizers. The legacy algorithm is still available for low-latency applications.
- Adaptive quantization is now used to distribute quality within each frame; frames are no longer encoded at a constant quantizer, even if MB-tree is off.
- VBV runs per-row rather than per-frame to improve accuracy.
x264's ratecontrol is based on libavcodec's, and is mostly empirical. But I can retroactively propose the following theoretical points which underlie most of the algorithms:
...
The goal is the same as in 2pass, but here we don't have the benefit of a previous pass.
constant quantizer:
QPs are simply based on frame type.
all modes:
H.264 allows each macroblock to have a different QP. x264 does not do so. Ratecontrol returns one QP which is used for the whole frame.
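The frame-type offsets mentioned above are applied in the qscale (linear quantizer) domain, not the QP domain. A minimal sketch of the idea, assuming x264's default ipratio (1.4) and pbratio (1.3) and its QP-to-qscale mapping; the helper names here are illustrative, not x264's API:

```python
import math

def qp2qscale(qp):
    # x264's mapping between QP and a linear quantizer scale:
    # every 6 steps of QP doubles the quantizer scale.
    return 0.85 * 2.0 ** ((qp - 12.0) / 6.0)

def qscale2qp(qscale):
    return 12.0 + 6.0 * math.log2(qscale / 0.85)

def frame_qp(base_qp, frame_type, ip_factor=1.4, pb_factor=1.3):
    """Derive a per-frame QP from the P-frame QP and the frame type.
    The ratios are applied to qscale, then converted back to QP."""
    qscale = qp2qscale(base_qp)
    if frame_type == 'I':
        qscale /= ip_factor   # I frames get a finer quantizer
    elif frame_type == 'B':
        qscale *= pb_factor   # B frames get a coarser quantizer
    return qscale2qp(qscale)

for ftype in 'IPB':
    print(ftype, round(frame_qp(26, ftype), 2))
```

Because the ratios act on qscale, the QP offsets come out as roughly -2.9 for I frames and +2.3 for B frames at the defaults, independent of the base QP.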
...inherently caused by compression.
svn co svn://svn.videolan.org/x264/trunk x264
cd x264
./configure
make
cd ..
# Install and compile JM reference decoder :
wget http://iphome.hhi.de/suehring/tml/download/jm17.2.zip
unzip jm17.2.zip
cd JM
sh unixprep.sh
cd ldecod
make
cd ../..
./x264/x264 input.yuv --dump-yuv fdec.yuv -o output.h264
./JM/bin/ldecod.exe -i output.h264 -o ref.yuv
diff ref.yuv fdec.yuv
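When `diff` reports a mismatch, it doesn't say where. A rough helper to locate the first differing frame between ref.yuv and fdec.yuv; it assumes 8-bit 4:2:0 planar YUV, and the 1280x720 dimensions are only an example you should adjust to your input:

```python
# Locate the first mismatching frame between two raw YUV files.
# Assumes 8-bit 4:2:0 planar; set WIDTH/HEIGHT to match your input.
WIDTH, HEIGHT = 1280, 720
FRAME_SIZE = WIDTH * HEIGHT * 3 // 2  # Y plane + quarter-size U and V planes

def first_mismatch(path_a, path_b):
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        frame = 0
        while True:
            fa = a.read(FRAME_SIZE)
            fb = b.read(FRAME_SIZE)
            if not fa and not fb:
                return None           # files match completely
            if fa != fb:
                return frame          # index of first differing frame
            frame += 1

# e.g. first_mismatch('ref.yuv', 'fdec.yuv')
```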
Historical notes:
Slice-based threading was the original threading model of x264. It was replaced with frame-based threading in r607; this document was originally written at that time. Slice-based threading was brought back (as an optional mode) in r1364 for low-latency encoding. Frame-based threading itself was modified significantly in r1246 with the addition of a threaded lookahead.
Old threading method: slice-based
application calls x264
x264 runs B-adapt and ratecontrol (serial)
...
In x264cli, there is one additional thread to decode the input.
New threading method: frame-based
application calls x264
x264 requests a frame from lookahead, which runs B-adapt and ratecontrol parallel to the current thread, separated by a buffer of size sync-lookahead
spawn a thread for this frame
thread runs encode, deblock, hpel filter
meanwhile x264 waits for the oldest thread to finish
return to application, but the rest of the threads continue running in the background
No additional threads are needed to decode the input, unless decoding is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel.
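The pipeline above can be modeled as a toy: the caller keeps submitting frames, each frame encodes on its own thread, and once the pipeline is full the caller blocks only on the oldest in-flight frame. The thread count and the encode function here are illustrative stand-ins, not x264's API:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def encode_frame(frame):
    # stand-in for encode + deblock + hpel filter on one frame
    return f"encoded({frame})"

def encode_stream(frames, n_threads=4):
    results = []
    pending = deque()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for frame in frames:
            pending.append(pool.submit(encode_frame, frame))
            if len(pending) >= n_threads:
                # block on the oldest thread, like x264 waiting for the
                # oldest frame; newer frames keep running in the background
                results.append(pending.popleft().result())
        while pending:  # drain the pipeline at end of stream
            results.append(pending.popleft().result())
    return results

print(encode_stream(range(6)))
```

Output order stays in submission order because the caller always harvests the oldest future first, mirroring how x264 returns frames to the application in order.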
Penalties for slice-based threading:
Each slice adds some bitrate (or equivalently reduces quality), for a variety of reasons: the slice header costs some bits, cabac contexts are reset, mvs and intra samples can't be predicted across the slice boundary.
In CBR mode, we have to allocate bits between slices before encoding them, which may lead to uneven quality.
In CBR mode, multiple slices encode simultaneously, thus increasing the maximum misprediction possible with VBV.
Some parts of the encoder are serial, so it doesn't scale well with lots of cpus.
Some numbers on penalties for slicing:
Tested at 720p with 45 slices (one per mb row) to maximize the total cost for easy measurement. Averaged over 4 movies at crf20 and crf30. Total cost: +30% bitrate at constant psnr.
I enabled the various components of slicing one at a time, and measured the portion of that cost they contribute:
* 34% intra prediction
* 25% redundant slice headers, nal headers, and rounding to whole bytes
* 16% mv prediction
* 16% reset cabac contexts
* 6% deblocking between slices (you don't strictly have to turn this off just for standard compliance, but you do if you want to use slices for decoder multithreading)
* 2% cabac neighbors (cbp, skip, etc)
The proportional cost of redundant headers should certainly depend on bitrate (since the header size is constant and everything else depends on bitrate). Deblocking should too (due to varying deblock strength).
But none of the proportions should depend strongly on the number of slices: some are triggered per slice while some are triggered per macroblock-that's-on-the-edge-of-a-slice, but as long as there's no more than 1 slice per row, the relative frequency of those two conditions is determined solely by the image width.
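As a back-of-the-envelope check, the component shares can be converted into absolute bitrate overhead; the numbers below are just the measurements quoted above:

```python
# Split the measured +30% bitrate cost of one slice per mb row
# into per-component overhead, using the measured shares.
total_cost = 0.30
shares = {
    'intra prediction': 0.34,
    'slice/NAL headers + byte rounding': 0.25,
    'mv prediction': 0.16,
    'cabac context resets': 0.16,
    'deblocking between slices': 0.06,
    'cabac neighbors': 0.02,
}
for name, share in shares.items():
    print(f"{name}: +{share * total_cost:.1%} bitrate")
```

So at these settings, intra prediction alone accounts for roughly +10% bitrate of the +30% total.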
Penalties for frame-based threading:
To allow encoding of multiple frames in parallel, we have to ensure that any given macroblock uses motion vectors only from pieces of the reference frames that have been encoded already. This is usually not noticeable, but can matter for very fast upward motion.
We have to commit to one frame type before starting on the frame. Thus scenecut detection must run during the lowres pre-motion-estimation along with B-adapt, which makes it faster but less accurate than re-encoding the whole frame.
Ratecontrol gets delayed feedback, since it has to plan frame N before frame N-1 finishes.
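The motion-vector restriction can be sketched as follows; this is not x264's actual code, and the interpolation margin is a hypothetical constant, but it shows why fast upward motion (which wants vectors pointing down, toward rows the reference thread may not have finished) is the case that gets clamped:

```python
# Sketch of the frame-threading mv restriction: a macroblock may only
# reference parts of the reference frame that have already been encoded,
# minus a small margin for the interpolation filter taps.
INTERP_MARGIN = 4  # hypothetical margin for subpel filter taps, in pixels

def max_mv_y(mb_y, ref_rows_done, mb_height=16):
    """Largest downward vertical motion (in pixels) a macroblock in row
    mb_y may use against a reference whose first ref_rows_done mb rows
    are complete."""
    ref_ready_y = ref_rows_done * mb_height    # lowest encoded ref pixel
    cur_bottom_y = (mb_y + 1) * mb_height      # bottom edge of current MB
    return ref_ready_y - cur_bottom_y - INTERP_MARGIN

# If the reference thread is only 2 mb rows ahead, downward motion is
# limited to a handful of pixels:
print(max_mv_y(mb_y=10, ref_rows_done=12))
```

With more threads, each reference is fewer rows ahead, so the clamp tightens; this is the usually-invisible quality cost the paragraph above describes.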
NOTE: these benchmarks are from the original implementation of frame-based threads. They are likely not entirely accurate today, nor do the commandlines match up with modern x264. However, they still give a good idea of the relative performance of frame and slice-based threads.
Benchmarks:
cpu: 4x woodcrest 3GHz
...