PoC 0latency
VLC is designed to process packets and frames according to their timestamps (PTS). This implies that it must wait a certain duration (until a computed date) before demuxing, decoding and displaying. The purpose is to preserve the interval between frames as much as possible, so as to avoid stuttering when watching a movie, for example.
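To make the contrast with 0latency concrete, here is the general idea of PTS-based scheduling as a minimal sketch (made-up helper names, not VLC's actual code):

```c
/* General idea of PTS-based scheduling (illustrative sketch, not VLC code):
 * a frame is held back until its presentation date, computed from its PTS. */
#include <stdint.h>

typedef int64_t tick_t;            /* microseconds, like vlc_tick_t */

tick_t now(void);                  /* assumed: monotonic clock */
void wait_until(tick_t date);      /* assumed: sleep until the given date */
void render(const void *frame);    /* assumed: actually display the frame */

static void schedule_frame(const void *frame, tick_t playback_start,
                           tick_t first_pts, tick_t pts)
{
    /* The display date preserves the original interval between frames. */
    tick_t display_date = playback_start + (pts - first_pts);
    if (display_date > now())
        wait_until(display_date);  /* this wait is what 0latency removes */
    render(frame);
}
```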
Real-time mirroring
Before making any change, we must be able to test glass-to-glass latency easily. For that purpose, we can mirror an Android device screen to VLC.
Download the latest server file from scrcpy, plug an Android device, and execute:
adb push scrcpy-server-v1.25 /data/local/tmp/vlc-scrcpy-server.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
app_process / com.genymobile.scrcpy.Server 1.25 \
tunnel_forward=true control=false cleanup=false \
max_size=1920 raw_video_stream=true
(Adapt max_size=1920 to use another resolution; this impacts the latency.)
As soon as a client connects to localhost:1234 via TCP, mirroring starts and the device sends a raw H.264 video stream. It can be played with:
./vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234
By playing a test pattern video on the device, and taking a picture (with a good camera) of the device next to the VLC window, we can measure the glass-to-glass delay.
Note that this delay includes the encoding time on the mobile device, which may be larger than on the target hardware.
master
On VLC 4 master, without any change, the result is catastrophic (VLC is not designed to handle this use case):
The video is 30 fps, and each increment represents 1 frame, so 30 frames represent 1 second. At the end of this small capture, there is almost a 10-second delay.
PoC
To mirror and control a remote device in real-time, the critical objective is to minimize latency. Therefore, any unnecessary wait is a bug.
Concretely, all waits based on a timestamp must be removed. Therefore, in 0latency mode, clocks become useless and timestamps are irrelevant. Also, buffering must be removed as much as possible.
To that end, this PoC changes several parts of the VLC pipeline.
--0latency option
The first commit adds a new global option --0latency, which will be read by several VLC components. By default, it is disabled (of course).
To enable it, pass --0latency:
./vlc -Idummy --0latency --demux=h264 --network-caching=0 tcp://localhost:1234
Picture buffer
In VLC, when a picture is decoded, it is pushed by the decoder to a fifo queue, which is consumed by the video output.
For 0latency, at any time, we want to display the latest possible frame, so we don't want any fifo queue.
This PoC introduces a "picture buffer" (vlc_pic_buf; yes, this is a poor name), which is a buffer of exactly 1 picture:
- the producer can push a new picture (overwriting any previous pending picture);
- the consumer can pop the latest picture, which is a blocking operation if no pending picture is available.
The producer is the decoder. The consumer is the video output.
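As an illustration, here is a minimal sketch of such a 1-slot buffer, using pthreads and hypothetical names (not the actual vlc_pic_buf code):

```c
/* Sketch of a 1-slot picture buffer: the producer overwrites any pending
 * picture, the consumer blocks until one is available (or the buffer is
 * stopped). Hypothetical types/names; VLC uses its own threading primitives. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct picture picture_t;      /* opaque picture type (assumed) */
void picture_release(picture_t *pic);  /* assumed release function */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t wait;
    picture_t *pending;                /* at most one pending picture */
    bool stopped;
} pic_buf_t;

/* Producer (decoder): replace any pending picture with the new one. */
void pic_buf_push(pic_buf_t *b, picture_t *pic)
{
    pthread_mutex_lock(&b->lock);
    if (b->pending)
        picture_release(b->pending);   /* drop the frame that was never shown */
    b->pending = pic;
    pthread_cond_signal(&b->wait);
    pthread_mutex_unlock(&b->lock);
}

/* Consumer (vout): block until a picture is pending (NULL when stopped). */
picture_t *pic_buf_pop(pic_buf_t *b)
{
    pthread_mutex_lock(&b->lock);
    while (!b->pending && !b->stopped)
        pthread_cond_wait(&b->wait, &b->lock);
    picture_t *pic = b->pending;
    b->pending = NULL;
    pthread_mutex_unlock(&b->lock);
    return pic;
}
```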
Video output
In VLC, the video output attempts to display a picture at the expected date, so it waits for a specific timestamp. This is exactly what we want to avoid for 0latency.
If 0latency is enabled, this PoC replaces the vout thread function (which does a lot of complicated things) by a very simple loop (Thread0Latency()):
- pop a picture from the picture buffer;
- call vout prepare();
- call vout display().
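For illustration, the simplified loop could look like this (a sketch with hypothetical helper names, reusing the 1-slot buffer sketched above; the actual Thread0Latency() also deals with locking and shutdown):

```c
/* Sketch of a 0latency vout loop: display each picture as soon as it is
 * popped, without ever waiting for a PTS. Hypothetical names. */
typedef struct {
    pic_buf_t buf;        /* the 1-slot buffer sketched above */
    /* ... display state ... */
} vout_ctx_t;

void vout_prepare(vout_ctx_t *vout, picture_t *pic);   /* assumed */
void vout_display(vout_ctx_t *vout, picture_t *pic);   /* assumed */

static void zero_latency_vout_loop(vout_ctx_t *vout)
{
    for (;;) {
        picture_t *pic = pic_buf_pop(&vout->buf);  /* blocks until a frame */
        if (!pic)
            break;                                 /* buffer stopped */
        vout_prepare(vout, pic);                   /* convert/upload picture */
        vout_display(vout, pic);                   /* present it immediately */
        picture_release(pic);
    }
}
```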
The function vout_PutPicture() is also adapted to push the frame to our new picture buffer instead of the existing picture fifo.
Note that in this PoC, the picture is not redrawn on resize, so the content will be black until the next frame on resize. That could be improved later.
Input/demux
In VLC, the input MainLoop() calls the demuxer to demux when necessary, but explicitly waits for a deadline between successive calls. We don't want to wait.
Therefore, this PoC provides an alternative MainLoop0Latency(), which is called if 0latency is enabled. This function basically calls demux->pf_demux() in a loop without ever waiting.
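As an illustration, the loop boils down to something like this (a simplified sketch against VLC's demux API, ignoring the controls and error handling of the real input loop):

```c
/* Sketch of an input loop without deadlines: call the demuxer repeatedly
 * and let it block on the network socket instead of on a clock.
 * demux_t and pf_demux are VLC types; the loop itself is a simplification. */
static void zero_latency_input_loop(demux_t *demux)
{
    for (;;) {
        int ret = demux->pf_demux(demux);  /* read data, send ES downstream */
        if (ret <= 0)
            break;                         /* end of stream or error */
        /* no deadline computation, no wait: demux again immediately */
    }
}
```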
Some code in the es_out (on control ES_OUT_SET_PCR) based on the clock (for handling jitter) is also totally bypassed.
Decoder
When the decoder implementation has a frame, it submits it to the vout via decoder_QueueVideo(). The queue implementation is provided by the decoder owner in the core, which handles preroll and may wait.
This PoC replaces this implementation by a simple call to vout_PutPicture(), to directly push the picture to our new picture buffer in the vout. If the vout was waiting for a picture, it is unblocked and will immediately prepare() and display().
On the module side, the avcodec decoder was adapted to disable dropping frames based on the clock (if a frame is "late"), and to enable the same options as if --low-delay was passed.
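For reference, the usual libavcodec low-delay hints look roughly like this (an illustrative sketch; the PoC reuses VLC's existing --low-delay code path rather than this exact snippet):

```c
/* Typical libavcodec low-delay hints (illustrative). */
#include <libavcodec/avcodec.h>

static void configure_low_delay(AVCodecContext *ctx)
{
    /* Ask the decoder to output frames as soon as possible. */
    ctx->flags |= AV_CODEC_FLAG_LOW_DELAY;

    /* Frame-based threading buffers several frames; prefer slice threading. */
    ctx->thread_type = FF_THREAD_SLICE;
}
```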
H.264 AnnexB 1-frame latency
The input is a raw H.264 stream in AnnexB format (this is what Android MediaCodec produces). This raw H.264 is sent over TCP.
The format is:
(00 00 00 01 NALU) | ( 00 00 00 01 NALU) | …
The length of each NAL unit is not present in the stream. Therefore, on the receiving side, the parser detects the end of a NAL unit when it detects the following start code 00 00 00 01.
However, this start code is sent as the prefix of the next frame, so the received packet will not be submitted to the decoder before the next frame is received, which adds 1 frame of latency.
But the length of the packet is known in advance on the device side. Therefore, a simple solution is to prefix the packet with its length (see Reduce latency by 1 frame).
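To illustrate the principle (this is not the exact scrcpy frame-meta header; the header layout and helpers below are assumptions):

```c
/* Read one length-prefixed H.264 packet and submit it immediately, so the
 * demuxer never has to wait for the next start code. Assumed helpers. */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

bool read_exact(int fd, void *buf, size_t len);          /* assumed */
void submit_to_decoder(const uint8_t *data, size_t len); /* assumed */

static bool read_one_packet(int fd)
{
    uint8_t header[4];
    if (!read_exact(fd, header, sizeof(header)))
        return false;

    /* 32-bit big-endian packet length sent by the device (assumed layout) */
    uint32_t len = (uint32_t) header[0] << 24 | (uint32_t) header[1] << 16
                 | (uint32_t) header[2] << 8  | (uint32_t) header[3];

    uint8_t *packet = malloc(len);
    if (!packet || !read_exact(fd, packet, len)) {
        free(packet);
        return false;
    }

    submit_to_decoder(packet, len);  /* no need to wait for the next frame */
    free(packet);
    return true;
}
```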
For simplicity, for now I reused the scrcpy format, by requesting the server to send frame meta:
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
app_process / com.genymobile.scrcpy.Server 1.25 \
tunnel_forward=true control=false cleanup=false max_size=1920 \
send_device_meta=false send_frame_meta=true send_dummy_byte=false
# ^^^^^^^^^^^^^^^^^^^^
I wrote a specific demuxer to handle it: h264_0latency. To use it, replace the --demux= argument:
./vlc -Idummy --0latency --demux=h264_0latency --network-caching=0 tcp://localhost:1234
To make the difference obvious, I suggest playing a 1-fps video.
With all these changes, the latency is reduced to 1~2 frames (30 fps) glass-to-glass:
(the device is on the left, VLC is in the middle, scrcpy is on the right)
Protocol discussions
For this PoC, the video stream is received over TCP from an Android device connected via USB (or via wifi on a local network), using a custom protocol.
Packet loss is non-existent over USB and very low on a good local wifi network. However, packet loss would add an unacceptable latency over the Internet with a protocol taking care of packet retransmission (like TCP).
The following are some random thoughts.
Ideally, I think that:
- we want to never decode a non-I-frame packet when the previous (referenced) packets have not been received/decoded (this would produce corrupted frames);
- we want to skip any previous packets (possibly lost) whenever an I-frame arrives.
Concretely, the device sends:
[I] P P P P P P P P P P P P P P P [I] P P P P P P P P P P P …
If a packet is not received:
[I] P P P P P _ P P P P P P P P P [I] P P P P P P P P P P P …
^
lost
then one possible solution:
- the receiver does not decode further P-frames until the missing packet is received;
- if a more recent I-frame is received, it starts decoding it immediately and forgets/ignores all previous packets.
As a drawback, this forces the use of a small GOP (i.e. frequent I-frames).
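A sketch of such a receiver policy (hypothetical sequence numbers and flags, not part of the PoC), as a decision function:

```c
/* Decide whether a received packet should be decoded: skip non-I-frames
 * while a reference is missing, resynchronize on the next I-frame. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t expected_seq;  /* next sequence number we expect */
    bool waiting_for_idr;   /* true while the reference chain is broken */
} rx_state_t;

static bool should_decode(rx_state_t *s, uint32_t seq, bool is_idr)
{
    if (is_idr) {
        /* An I-frame restores a valid reference: decode it and resync,
         * forgetting any previous (possibly lost) packets. */
        s->waiting_for_idr = false;
        s->expected_seq = seq + 1;
        return true;
    }

    if (seq != s->expected_seq)
        s->waiting_for_idr = true;  /* a referenced packet was lost */

    if (s->waiting_for_idr)
        return false;               /* decoding would produce corrupted frames */

    s->expected_seq = seq + 1;
    return true;
}
```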
To be continued…
As an alternative, we could capture the screen (on X11) and stream it over RTP:
ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234
On branch 0latency.15, we can play:
# without HW decoding
./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
# with HW decoding
./vlc -Idummy --0latency --network-caching=0 ../video.sdp
To measure the time spent on the VLC side, I added tracking of the reception time of the RTP packet, from its reception in a block_t to the picture display (in practice, I stream to localhost, so there is no packet loss/reorder).
I traced two durations:
- the time from packet reception to the vout (just before prepare/display)
- the time to prepare+display in the vout
Here is a graph (stacked bar chart) without HW decoding (Y-axis unit is µs):
And here is a graph with HW decoding (Y-axis unit is µs):
With hardware decoding, we gain a few milliseconds on the decoding itself, but the most important gain made possible by hardware decoding is the OpenGL interop, which avoids transferring the decoded picture to/from main memory.
The previous result with hardware decoding was very good.
However, if the input stream is the raw H.264 (+header) received from Android as explained in the first post, even with hardware decoding, it takes a lot more time both to decode and display:
At first, it seems surprising, since in theory the player does not care how the video stream is captured.
VSync
This stacked bar chart does not highlight the underlying issue, so let's trace the same graph without stacking decoding and display duration:
On this graph, we see that displaying almost always takes 1/60 second. This suggests a vsync issue.
Here is the difference between the ffmpeg/x11/RTP capture and the Android capture: MediaCodec produces a new frame whenever the input surface is damaged, and my device screen runs at 90 Hz. If I swipe quickly to animate the home screen, it is common to reach 90 fps, while VLC runs on my laptop with a 60 Hz screen.
In that case, several OpenGL renderings will be requested during a single vsync period, so the second one is (AFAIU) forced to block until the next vsync.
Note that an input framerate greater than the output framerate does not add any latency in itself: some frames will just be dropped (by design) by the vlc_pic_buf, as explained in the first post (the decoder will overwrite the previous frame if it has not been consumed by the vout).
The fact that it also increases decoding times probably stems from the decoder and the display sharing resources (but the reason is not clear to me).
To confirm, let's limit the video stream capture to 30 fps, by adding max_fps=30 as a scrcpy-server parameter (see first post):
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
app_process / com.genymobile.scrcpy.Server 1.22 \
tunnel_forward=true control=false max_size=1920 max_fps=30 \
send_device_meta=false send_frame_meta=true send_dummy_byte=false
The result is very good (stacked bar chart):
Limiting to 60 fps (max_fps=60) is better than nothing, but not perfect:
At 30 fps but without hardware decoding, decoding takes more time (of course), but the display time is quite OK (it includes the texture upload from main memory to the GPU):
Without vsync
So it seems that VSync is the cause of the problem. But in scrcpy, even at 90fps, by default there is no such problem:
Note that scrcpy does not support HW decoding (yet?).
However, indeed, if I force VSync:
diff --git a/app/src/scrcpy.c b/app/src/scrcpy.c
index 8c4920d6..aa87e2bf 100644
--- a/app/src/scrcpy.c
+++ b/app/src/scrcpy.c
@@ -126,6 +126,8 @@ sdl_set_hints(const char *render_driver) {
     if (!SDL_SetHint(SDL_HINT_VIDEO_MINIMIZE_ON_FOCUS_LOSS, "0")) {
         LOGW("Could not disable minimize on focus loss");
     }
+
+    SDL_SetHint(SDL_HINT_RENDER_VSYNC, "1");
 }
 
 static void
Then I can reproduce the problem (let's use non-stacked bars directly):
I attempted to disable VSync in VLC, but I failed (the assertion fails):
diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
index 147afe802c..94e2c93c1f 100644
--- a/modules/video_output/opengl/egl.c
+++ b/modules/video_output/opengl/egl.c
@@ -367,6 +367,10 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
     gl->get_proc_address = GetSymbol;
     gl->destroy = Close;
 
+    // Disable VSync
+    bool ok = eglSwapInterval(sys->display, 0);
+    assert(ok);
+
     return VLC_SUCCESS;
 
 error:
To be continued…
VSync latency
VSync (often) adds 1 frame of latency: once a frame rendering is requested, it prevents any further frame that arrives before the next VSync tick from being rendered immediately.
For example, if frames A and B arrive before the first VSync tick:
vsync slots     -------------|-------------|-------
recv frames      A    B      ^      C      ^
vsync enabled    A---------->[A]B--------->[B]C-----   (very regular 16ms to display)
vsync disabled   A    B [B]         C [C]
Then:
- with VSync enabled: A then B will be displayed
- with VSync disabled: B then C will be displayed
I think there are basically 2 solutions to this problem:
- disable VSync (remove the problem)
- estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)
                start rendering just before the vsync tick
                          v              v
vsync slots     ----------+--|----------+--|-------
recv frames      A    B   ^^^^     C    ^^^^
render frames             B             C
vsync enabled             B-[B]          C-[C]
> estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)
I attempted a hack to start rendering 15ms after the last display time (so it should be approximately 1.66ms before the next vsync):
diff --git a/include/vlc_picture.h b/include/vlc_picture.h
index b41dd26578..a032105885 100644
--- a/include/vlc_picture.h
+++ b/include/vlc_picture.h
@@ -161,6 +161,7 @@ struct picture_t
     vlc_atomic_rc_t refs;
 
     vlc_tick_t recv_ts;
+    vlc_tick_t decoded_ts;
 };
 
 static inline vlc_video_context* picture_GetVideoContext(picture_t *pic)
diff --git a/src/input/decoder.c b/src/input/decoder.c
index d0b9f1883d..365309e2cb 100644
--- a/src/input/decoder.c
+++ b/src/input/decoder.c
@@ -1137,6 +1137,7 @@ static void ModuleThread_QueueVideo( decoder_t *p_dec, picture_t *p_pic )
 
     if (p_owner->zero_latency)
     {
+        p_pic->decoded_ts = vlc_tick_now();
         vout_PutPicture(p_owner->p_vout, p_pic);
         return;
     }
diff --git a/src/video_output/video_output.c b/src/video_output/video_output.c
index 07ae0eb579..2b48bf6803 100644
--- a/src/video_output/video_output.c
+++ b/src/video_output/video_output.c
@@ -1724,8 +1724,16 @@ static void Thread0Latency(vout_thread_sys_t *sys)
 {
     vout_display_t *vd = sys->display;
 
+    vlc_tick_t last_display_ts = VLC_TICK_INVALID;
+
     for (;;)
     {
+        if (last_display_ts != VLC_TICK_INVALID)
+        {
+            vlc_tick_t render_min_ts = last_display_ts + VLC_TICK_FROM_MS(15);
+            vlc_tick_wait(render_min_ts);
+        }
+
         picture_t *pic = vlc_pic_buf_Pop(&sys->pic_buf);
 
         if (!pic) /* stopped */
@@ -1739,9 +1747,11 @@
         if (vd->ops->display)
             vd->ops->display(vd, pic);
+
+        last_display_ts = vlc_tick_now();
         vlc_mutex_unlock(&sys->display_lock);
 
-        fprintf(stderr, "=== vout: %ld;%ld\n", t - pic->recv_ts, vlc_tick_now() - t);
+        fprintf(stderr, "=== vout: %ld;%ld\n", pic->decoded_ts - pic->recv_ts, vlc_tick_now() - t);
 
         picture_Release(pic);
     }
It clearly improves the display duration… but it increases the decoding time (the decoder probably depends on the GPU or VSync in some way; old graph):
I finally managed to disable VSync in VLC (the EGL context needed to be current):
diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
index 147afe802c..c8b4e9fb2d 100644
--- a/modules/video_output/opengl/egl.c
+++ b/modules/video_output/opengl/egl.c
@@ -359,6 +359,15 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
     }
     sys->context = ctx;
 
+    int ret = MakeCurrent(gl);
+    assert(ret == VLC_SUCCESS);
+
+    // Disable VSync
+    EGLBoolean ok = eglSwapInterval(sys->display, 0);
+    assert(ok);
+
+    ReleaseCurrent(gl);
+
     /* Initialize OpenGL callbacks */
     gl->make_current = MakeCurrent;
     gl->release_current = ReleaseCurrent;
EDIT: or just export vblank_mode=0.
Without VSync, the latency is better, but there are still irregular "spikes" that I don't explain:
If I redraw the previous graph for scrcpy at 90 fps, but with non-stacked bars:
The display duration does not produce these spikes (but maybe they are avoided only because the decoding time is higher).
In any case, I think we should disable VSync when possible, to get the lowest possible latency.
Android has a VideoFrameScheduler that tries to dispatch frames in the middle of two vsyncs, to make playback smooth.
mentioned in issue videolan/vlc#26411 (closed)
Here is a small demo of "audio mirroring":
The Android device is connected over wifi (USB would be cheating). The audio is captured on Android (using sndcpy), then transmitted in raw format (2 channels, 48kHz) over a TCP socket.
On the client side, VLC from branch 0latency.28 is called this way:
./vlc -Idummy --demux rawaud -Aalsa --0latency --network-caching=0 --play-and-exit tcp://localhost:$PORT
To be more concrete, to reproduce the demo, it is run exactly like this:
$ cat myvlc
#!/bin/bash
~/projects/vlc/buildasan/vlc --0latency -Aalsa "$@"
$ VLC=./myvlc ./sndcpy
This branch contains everything done for video 0latency (in particular the input 0latency, to avoid additional buffering), plus some very minor changes at the end (surprisingly) for the audio part (specific to ALSA for the aout module part). I don't even understand how setting AOUT_0LATENCY_TIME to 0 (the last commit) does not make the sound glitch (probably because of the ~20ms "natural" buffering on the ALSA side, as reported by time_get(), but I'm not sure).
What you observe in the video:
- the Audio/Video Sync Test is played on the device (via VLC for Android);
- the device is mirrored on the computer to observe full mirroring (video + audio);
- at the beginning, the sound is played only on the computer;
- after 3~4 seconds, I increase the volume on the device, so the sound is played both on the device and the computer (we can hear the delay)
That PoC is very interesting.
We are using VLC for "live" captioning here. We use a low-latency hardware streamer that creates an HEVC stream (approx. 2 Mbit/s), but it can also create an H.264 stream (10 Mbit/s). The captioner listens to the video stream (in fact, the audio… but looks at the video at the same time).
The CC encoder is placed before the streamer, so when the captioner adds CC with his live captioning software (Caption Maker), he can see the result "live"… The normal delay we usually get with other players is approx. 1-1.5 seconds, but a few years ago we measured 363 ms between the insertion of the live caption and the result on screen.
On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)
For now, we only notice that captions disappear when using the --0latency option.
> On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)
Basically, you should test the commands from this comment as an example: !20 (comment 307154)
> As an alternative, we could capture the screen (on X11) and stream it over RTP:
> ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234
> On branch 0latency.31 (previously 0latency.15), we can play:
> # without HW decoding
> ./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
> # with HW decoding
> ./vlc -Idummy --0latency --network-caching=0 ../video.sdp
(check here to build VLC)
> For now, we only notice that captions disappear when using the --0latency option.
Yes, SPU/subtitles are totally skipped on this PoC.