
PoC 0latency

Open Romain Vimont requested to merge 0latency into master

0latency

VLC is designed to process packets and frames according to their timestamp (PTS). This implies that it needs to wait a certain duration (until a computed date) before demuxing, decoding and displaying. The purpose is to preserve the interval between frames as much as possible, so as to avoid stuttering when watching a movie, for example.

Real-time mirroring

Before making any change, we must be able to measure glass-to-glass latency easily. For that purpose, we can mirror an Android device screen to VLC.

Download the latest server file from scrcpy, plug in an Android device, and execute:

adb push scrcpy-server-v1.25 /data/local/tmp/vlc-scrcpy-server.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false \
    max_size=1920 raw_video_stream=true

(Adapt max_size=1920 to use another resolution; this impacts the latency.)

As soon as a client connects to localhost:1234 via TCP, mirroring starts and the device sends a raw H.264 video stream.

It can be played with:

./vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234

By playing a test video showing a frame counter (a mire) on the device, and taking a picture (with a good camera) of the device next to the VLC window, we can measure the glass-to-glass delay.

Note that this delay includes the encoding time on the mobile device, which may be larger than on the target hardware.

On master

On VLC4 master without any change, the result is catastrophic (VLC is not designed to handle this use case):

before

The video is 30 fps, and each increment represents 1 frame, so 30 frames represent 1 second. At the end of this short capture, there is almost a 10-second delay.

PoC

To mirror and control a remote device in real-time, the critical objective is to minimize latency. Therefore, any unnecessary wait is a bug.

Concretely, all waits based on a timestamp must be removed. Therefore, in 0latency mode, clocks become useless and timestamps are irrelevant. Also, buffering must be removed as much as possible.

To that end, this PoC changes several parts of the VLC pipeline.

Global --0latency option

The first commit adds a new global option --0latency, that will be read by several VLC components. By default, it is disabled (of course).

To enable it, pass --0latency:

./vlc -Idummy --0latency --demux=h264 --network-caching=0 tcp://localhost:1234

Picture buffer

In VLC, when a picture is decoded, it is pushed by the decoder to a fifo queue, which is consumed by the video output.

For 0latency, at any time, we want to display the latest possible frame, so we don't want any fifo queue.

This PoC introduces a "picture buffer" (vlc_pic_buf; yes, this is a poor name), which is a buffer of exactly 1 picture:

  • the producer can push a new picture (overwriting any previous pending picture);
  • the consumer can pop the latest picture, which is a blocking operation if no pending picture is available.

The producer is the decoder. The consumer is the video output.
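
For illustration, here is a minimal sketch of what such a single-slot buffer could look like, using the VLC threading primitives. The structure name follows the PoC description, but the layout and functions below are a simplified reconstruction, not the actual code:

/* Illustrative sketch only: a single-slot picture buffer. */
#include <vlc_common.h>
#include <vlc_picture.h>

typedef struct vlc_pic_buf {
    vlc_mutex_t lock;
    vlc_cond_t wait;
    picture_t *pending; /* at most one picture */
    bool stopped;
} vlc_pic_buf;

/* Producer (decoder): overwrite any pending picture with the newest one. */
static void vlc_pic_buf_Push(vlc_pic_buf *buf, picture_t *pic)
{
    vlc_mutex_lock(&buf->lock);
    if (buf->pending != NULL)
        picture_Release(buf->pending); /* the older frame is simply dropped */
    buf->pending = pic;
    vlc_cond_signal(&buf->wait);
    vlc_mutex_unlock(&buf->lock);
}

/* Consumer (vout): block until a picture is available (or the buffer is stopped). */
static picture_t *vlc_pic_buf_Pop(vlc_pic_buf *buf)
{
    vlc_mutex_lock(&buf->lock);
    while (buf->pending == NULL && !buf->stopped)
        vlc_cond_wait(&buf->wait, &buf->lock);
    picture_t *pic = buf->pending; /* NULL if stopped */
    buf->pending = NULL;
    vlc_mutex_unlock(&buf->lock);
    return pic;
}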

Video output

In VLC, the video output attempts to display a picture at the expected date, so it waits for a specific timestamp. This is exactly what we want to avoid for 0latency.

If 0latency is enabled, this PoC replaces the vout thread function, which does a lot of complicated things, with a very simple loop (Thread0Latency(), sketched below):

  1. pop picture from the picture buffer;
  2. call vout prepare();
  3. call vout display().

The function vout_PutPicture() is also adapted to push the frame to our new picture buffer instead of the existing picture fifo.

Note that in this PoC, the picture is not redrawn on resize, so the window content stays black until the next frame arrives. That could be improved later.
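
For reference, the simplified loop looks roughly like the following sketch (condensed: the prepare() call signature and the locking are assumptions based on the diffs quoted later in this thread, not the exact PoC code):

/* Condensed sketch of the simplified vout loop (Thread0Latency). */
static void Thread0Latency(vout_thread_sys_t *sys)
{
    vout_display_t *vd = sys->display;

    for (;;)
    {
        /* 1. block until the decoder pushes a new picture */
        picture_t *pic = vlc_pic_buf_Pop(&sys->pic_buf);
        if (pic == NULL)
            break; /* vout stopped */

        vlc_mutex_lock(&sys->display_lock);

        /* 2. prepare and 3. display it right away, ignoring its PTS */
        if (vd->ops->prepare != NULL)
            vd->ops->prepare(vd, pic, NULL, vlc_tick_now()); /* signature assumed */
        if (vd->ops->display != NULL)
            vd->ops->display(vd, pic);

        vlc_mutex_unlock(&sys->display_lock);
        picture_Release(pic);
    }
}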

Input/demux

In VLC, the input MainLoop() calls the demuxer to demux when necessary, but explicitly waits for a deadline between successive calls. We don't want to wait.

Therefore, this PoC provides an alternative MainLoop0Latency(), which is called if 0latency is enabled. This function basically calls demux->pf_demux() in a loop without ever waiting.

Some clock-based code in the es_out (on the ES_OUT_SET_PCR control), used for handling jitter, is also completely bypassed.
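
As a rough sketch (the real MainLoop0Latency() still has to process input controls and end of stream, which is omitted here; only the "demux as fast as possible" idea is shown):

/* Rough sketch of the alternative input loop: no deadline, no clock. */
static void MainLoop0Latency(demux_t *demux)
{
    for (;;)
    {
        int ret = demux->pf_demux(demux);
        if (ret <= 0)
            break; /* end of stream or error */
    }
}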

Decoder

When the decoder implementation has a frame, it submits it to the vout via decoder_QueueVideo(). The queue implementation is provided by the decoder owner in the core, which handles preroll and may wait.

This PoC replaces this implementation with a simple call to vout_PutPicture(), to directly push the picture to our new picture buffer in the vout. If the vout was waiting for a picture, it is unblocked and will immediately prepare() and display().

On the module side, the avcodec decoder was adapted to disable dropping frames based on the clock (if a frame is "late"), and to enable the same options as if --low-delay was passed.
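
A condensed sketch of the 0latency path in the owner's queue callback (it only mirrors the idea; see the actual diff quoted later in the discussion):

/* Simplified: push the decoded picture straight to the vout's single-slot
 * buffer, bypassing preroll and clock handling. */
static void ModuleThread_QueueVideo(decoder_t *p_dec, picture_t *p_pic)
{
    struct decoder_owner *p_owner = dec_get_owner(p_dec);

    if (p_owner->zero_latency)
    {
        vout_PutPicture(p_owner->p_vout, p_pic);
        return;
    }

    /* ... normal timestamp-based queueing ... */
}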

H.264 AnnexB 1-frame latency

The input is a raw H.264 stream in AnnexB format (this is what Android MediaCodec produces). This raw H.264 is sent over TCP.

The format is:

(00 00 00 01 NALU) | (00 00 00 01 NALU) | …

The length of each NAL unit is not present in the stream. Therefore, on the receiving side, the parser detects the end of a NAL unit only when it encounters the next start code 00 00 00 01.

However, this start code is sent as the prefix of the next frame, so the received packet will not be submitted to the decoder before the next frame is received, which adds 1-frame latency.

On the other hand, the length of the packet is known in advance on the device side, so a simple solution is to prefix the packet with its length (see Reduce latency by 1 frame).
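
To illustrate the idea (this is not the scrcpy frame-meta layout actually used below, just a hypothetical 4-byte big-endian length prefix), the receiving side can then forward a complete frame to the decoder without waiting for the next start code:

/* Hypothetical framing sketch: each packet is prefixed with its byte length. */
static block_t *ReadFrame(demux_t *p_demux)
{
    uint8_t header[4];
    if (vlc_stream_Read(p_demux->s, header, sizeof(header)) != sizeof(header))
        return NULL;

    uint32_t len = GetDWBE(header); /* big-endian 32-bit length */

    block_t *frame = block_Alloc(len);
    if (frame == NULL)
        return NULL;
    if (vlc_stream_Read(p_demux->s, frame->p_buffer, len) != (ssize_t) len)
    {
        block_Release(frame);
        return NULL;
    }
    return frame; /* can be submitted to the decoder immediately */
}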

For simplicity, for now I reused the scrcpy format, by requesting the server to send frame meta:

adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false max_size=1920 \
    send_device_meta=false send_frame_meta=true send_dummy_byte=false
#                          ^^^^^^^^^^^^^^^^^^^^

I wrote a specific demuxer to handle it: h264_0latency

To use it, replace the --demux= argument:

./vlc -Idummy --0latency --demux=h264_0latency --network-caching=0 tcp://localhost:1234

To make the difference obvious, I suggest playing a 1-fps video.

With all these changes, the latency is reduced to 1~2 frames (30 fps) glass-to-glass:

0latency_poc2

(the device is on the left, VLC is in the middle, scrcpy is on the right)

Protocol discussions

For this PoC, the video stream is received over TCP from an Android device connected via USB (or via wifi on a local network), using a custom protocol.

Packet loss is non-existent over USB and very low on a good local wifi network. Over the Internet, however, packet loss would add unacceptable latency with a protocol that handles retransmission (like TCP).

The following are some random thoughts.

Ideally, I think that:

  • we want to never decode a non-I-frame packet when the previous (referenced) packets have not been received/decoded (this would produce corrupted frames);
  • we want to skip any previous packets (possibly lost) whenever an I-frame arrives.

Concretely, the device sends:

 [I] P P P P P P P P P P P P P P P [I] P P P P P P P P P P P …

If a packet is not received:

 [I] P P P P P _ P P P P P P P P P [I] P P P P P P P P P P P …
               ^
             lost

then one possible solution:

  • the receiver does not decode further P-frames until the missing packet is received;
  • if a more recent I-frame is received, it starts decoding it immediately and forgets/ignores all previous packets.

As a drawback, this forces the use of small GOPs (i.e. frequent I-frames).
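
Purely as an illustration of this policy (no such code exists in the PoC; the packet fields, the receiver structure and the decode hook are hypothetical, and retransmission handling is omitted):

/* Hypothetical receive-side policy: drop undecodable P-frames, resume on the
 * next I-frame. */
struct rx_packet { uint64_t seq; bool is_keyframe; block_t *data; };
struct receiver  { uint64_t next_seq; bool waiting_for_keyframe; };

static void decode_and_display(block_t *data); /* hypothetical pipeline hook */

static void on_packet(struct receiver *rx, struct rx_packet *pkt)
{
    if (pkt->is_keyframe)
    {
        /* A new I-frame makes all older (possibly lost) packets irrelevant. */
        rx->waiting_for_keyframe = false;
    }
    else if (pkt->seq != rx->next_seq || rx->waiting_for_keyframe)
    {
        /* A reference is missing: decoding this P-frame would produce a
         * corrupted picture, so drop it and wait for the next I-frame. */
        rx->waiting_for_keyframe = true;
        block_Release(pkt->data);
        return;
    }

    decode_and_display(pkt->data);
    rx->next_seq = pkt->seq + 1;
}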

To be continued…



Activity


  • Romain Vimont added 4 commits


    • 313e90f6 - decoder: 0latency
    • bdf3a461 - avcodec: include low-delay settings for 0latency
    • 57b548e6 - avcodec: never drop blocks on low delay
    • 8e7071b5 - demux: h264_0latency



  • As an alternative, we could capture the screen (on X11) and stream it over RTP:

    ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234

    On branch 0latency.15, we can play:

    # without HW decoding
    ./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
    # with HW decoding
    ./vlc -Idummy --0latency --network-caching=0 ../video.sdp

    To measure the time spent on the VLC side, I added tracking of the RTP packet reception time, from its arrival in a block_t to the picture display (in practice, I stream to localhost, so there is no packet loss/reordering).

    I traced two durations:

    • the time from packet reception to the vout (just before prepare/display)
    • the time to prepare+display in the vout

    Here is a graph (stacked bar chart) without HW decoding (Y-axis unit is µs):

    0latency_nohwdec

    And here is a graph with HW decoding (Y-axis unit is µs):

    0latency_hwdec

    With hardware decoding, we gain a few milliseconds on the decoding itself, but the most important gain made possible by hardware decoding is the OpenGL interop, which avoids transferring the decoded picture to/from main memory.

  • The previous result with hardware decoding was very good.

    However, if the input stream is the raw H.264 (+header) received from Android as explained in the first post, even with hardware decoding, it takes a lot more time both to decode and display:

    0latency_hwdec_from_android

    At first, this seems surprising, since in theory the player does not care how the video stream is captured.

    VSync

    This stacked bar chart does not highlight the underlying issue, so let's trace the same graph without stacking the decoding and display durations:

    0latency_hwdec_from_android_non_stacked

    On this graph, we see that displaying almost always takes 1/60 second. This suggests a vsync issue.

    Here is the difference between the ffmpeg/x11/RTP capture and the Android capture: MediaCodec produces a new frame whenever the input surface is damaged, and my device screen runs at 90Hz. If I swipe quickly to animate the home screen, it is common to reach 90fps, while VLC runs on my laptop with a 60Hz screen.

    In that case, several OpenGL renderings will be requested during a single vsync period, so the second one is (AFAIU) forced to block until the next vsync.

    Note that an input framerate greater than the output framerate does not add any latency in itself: some frames will just be lost (by design) by the vlc_pic_buf as explained in the first post (the decoder will overwrite the previous frame if it has not been consumed by the vout).

    The fact that it also increases decoding times probably stems from the fact that the decoder and display share resources (but the reason is not clear to me).

    To confirm, let's limit the video stream capture to 30fps, by adding max_fps=30 as a scrcpy-server parameter (see first post):

    adb forward tcp:1234 localabstract:scrcpy
    adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
        app_process / com.genymobile.scrcpy.Server 1.22 \
        tunnel_forward=true control=false max_size=1920 max_fps=30 \
        send_device_meta=false send_frame_meta=true send_dummy_byte=false

    The result is very good (stacked bar chart):

    0latency_30fps

    Limiting to 60fps (max_fps=60) is better than nothing, but not perfect:

    0latency_60fps

    At 30 fps but without hardware decoding, decoding takes more time (of course), but display time is quite ok (it includes texture upload from main memory to GPU):

    0latency_30fps_nohwdec

    Without vsync

    So it seems that VSync is the cause of the problem. But in scrcpy, even at 90fps, by default there is no such problem:

    scrcpy_90fps

    Note that scrcpy does not support HW decoding (yet?).

    Indeed, if I force VSync in scrcpy:

    diff --git a/app/src/scrcpy.c b/app/src/scrcpy.c
    index 8c4920d6..aa87e2bf 100644
    --- a/app/src/scrcpy.c
    +++ b/app/src/scrcpy.c
    @@ -126,6 +126,8 @@ sdl_set_hints(const char *render_driver) {
         if (!SDL_SetHint(SDL_HINT_VIDEO_MINIMIZE_ON_FOCUS_LOSS, "0")) {
             LOGW("Could not disable minimize on focus loss");
         }
    +
    +    SDL_SetHint(SDL_HINT_RENDER_VSYNC, "1");
     }
     
     static void

    Then I can reproduce the problem (let's use non-stacked bars directly):

    scrcpy_90fps_vsync

    I attempted to disable VSync in VLC, but I failed (the assertion fails):

    diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
    index 147afe802c..94e2c93c1f 100644
    --- a/modules/video_output/opengl/egl.c
    +++ b/modules/video_output/opengl/egl.c
    @@ -367,6 +367,10 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
         gl->get_proc_address = GetSymbol;
         gl->destroy = Close;
     
    +    // Disable VSync
    +    bool ok = eglSwapInterval(sys->display, 0);
    +    assert(ok);
    +
         return VLC_SUCCESS;
     
     error:

    To be continued…

    VSync latency

    VSync (often) adds 1 frame of latency, because once a frame rendering is requested, it prevents any further frames that arrive before the VSync tick from being rendered immediately.

    For example, if frames A and B arrive before the first VSync tick:

    vsync slots      -------------|-------------|-------
    recv frames      A      B     ^         C   ^
    vsync enabled    A---------->[A]B--------->[B]C----- (very regular 16ms to display)
    vsync disabled   A      B    [B]        C  [C]

    Then:

    • with VSync enabled: A then B will be displayed
    • with VSync disabled: B then C will be displayed

    I think there are basically 2 solutions to this problem:

    1. disable VSync (remove the problem)
    2. estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)
                           start rendering just before the vsync tick
                               v             v
    vsync slots      ----------+--|----------+--|-------
    recv frames      A      B  ^^^^         C^^^^
    render frames              B             C
    vsync enabled              B-[B]         C-[C]
    • estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)

      I attempted a hack to start rendering 15ms after the last display time (so it should be approximately 1.66ms before the next vsync):

      diff --git a/include/vlc_picture.h b/include/vlc_picture.h
      index b41dd26578..a032105885 100644
      --- a/include/vlc_picture.h
      +++ b/include/vlc_picture.h
      @@ -161,6 +161,7 @@ struct picture_t
           vlc_atomic_rc_t refs;
       
           vlc_tick_t recv_ts;
      +    vlc_tick_t decoded_ts;
       };
       
       static inline vlc_video_context* picture_GetVideoContext(picture_t *pic)
      diff --git a/src/input/decoder.c b/src/input/decoder.c
      index d0b9f1883d..365309e2cb 100644
      --- a/src/input/decoder.c
      +++ b/src/input/decoder.c
      @@ -1137,6 +1137,7 @@ static void ModuleThread_QueueVideo( decoder_t *p_dec, picture_t *p_pic )
       
           if (p_owner->zero_latency)
           {
      +        p_pic->decoded_ts = vlc_tick_now();
               vout_PutPicture(p_owner->p_vout, p_pic);
               return;
           }
      diff --git a/src/video_output/video_output.c b/src/video_output/video_output.c
      index 07ae0eb579..2b48bf6803 100644
      --- a/src/video_output/video_output.c
      +++ b/src/video_output/video_output.c
      @@ -1724,8 +1724,16 @@ static void Thread0Latency(vout_thread_sys_t *sys)
       {
           vout_display_t *vd = sys->display;
       
      +    vlc_tick_t last_display_ts = VLC_TICK_INVALID;
      +
           for (;;)
           {
      +        if (last_display_ts != VLC_TICK_INVALID)
      +        {
      +            vlc_tick_t render_min_ts = last_display_ts + VLC_TICK_FROM_MS(15);
      +            vlc_tick_wait(render_min_ts);
      +        }
      +
               picture_t *pic = vlc_pic_buf_Pop(&sys->pic_buf);
               if (!pic)
                   /* stopped */
      @@ -1739,9 +1747,11 @@ static void Thread0Latency(vout_thread_sys_t *sys)
       
               if (vd->ops->display)
                   vd->ops->display(vd, pic);
      +
      +        last_display_ts = vlc_tick_now();
               vlc_mutex_unlock(&sys->display_lock);
       
      -        fprintf(stderr, "=== vout: %ld;%ld\n", t - pic->recv_ts, vlc_tick_now() - t);
      +        fprintf(stderr, "=== vout: %ld;%ld\n", pic->decoded_ts - pic->recv_ts, vlc_tick_now() - t);
       
               picture_Release(pic);
           }

      It clearly improves the display duration… but it increases decoder time (the decoder probably depends on the GPU or VSync in some way: old graph):

      0latency_vsync_hack2

      I finally managed to disable VSync in VLC (the EGL context needed to be current):

      diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
      index 147afe802c..c8b4e9fb2d 100644
      --- a/modules/video_output/opengl/egl.c
      +++ b/modules/video_output/opengl/egl.c
      @@ -359,6 +359,15 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
           }
           sys->context = ctx;
       
      +    int ret = MakeCurrent(gl);
      +    assert(ret == VLC_SUCCESS);
      +
      +    // Disable VSync
      +    EGLBoolean ok = eglSwapInterval(sys->display, 0);
      +    assert(ok);
      +
      +    ReleaseCurrent(gl);
      +
           /* Initialize OpenGL callbacks */
           gl->make_current = MakeCurrent;
           gl->release_current = ReleaseCurrent;

      EDIT: or just export vblank_mode=0.

      Without VSync, the latency is better, but there are still irregular "spikes" that I can't explain:

      0latency_novsync_90fps

      If I redraw the previous graph for scrcpy at 90fps, but with non-stacked bars:

      scrcpy_90fps_non_stacked_bars

      The display duration does not produce these spikes (but maybe they are avoided only because the decoding time is higher).

      In any case, I think we should disable VSync whenever possible, to get the lowest possible latency.

    • Android has a VideoFrameScheduler that tries to dispatch frames in the middle of two vsyncs, to make playback smooth.

      https://android.googlesource.com/platform/frameworks/av/+/master/media/libstagefright/include/media/stagefright/VideoFrameSchedulerBase.h

  • As a drawback, this forces the use of small GOPs (i.e. frequent I-frames).

    x264 --intra-refresh may help in this case.

  • Here is a small demo of "audio mirroring":

    0latency_audio

    The Android device is connected over wifi (USB would be cheating). The audio is captured on Android (using sndcpy), then transmitted in raw format (2 channels, 48kHz) over a TCP socket.

    On the client side, VLC from branch 0latency.28 is called like this:

    ./vlc -Idummy --demux rawaud -Aalsa --0latency --network-caching=0 --play-and-exit tcp://localhost:$PORT

    To be more concrete, to reproduce the demo, it is run exactly like this:

    $ cat myvlc
    #!/bin/bash
    ~/projects/vlc/buildasan/vlc --0latency -Aalsa "$@"
    $ VLC=./myvlc ./sndcpy

    This branch contains everything done for video 0latency (in particular the input 0latency, to avoid additional buffering), plus, surprisingly, only some very minor changes at the end for the audio part (specific to ALSA for the aout module). I don't even understand how setting AOUT_0LATENCY_TIME to 0 (the last commit) does not make the sound glitch (probably because of the ~20ms of "natural" buffering on the ALSA side, as reported by time_get(), but I'm not sure).

    What you observe on the video:

    • the Audio/Video Sync Test is played on the device (via VLC for Android);
    • the device is mirrored on the computer to observe full mirroring (video + audio);
    • at the beginning, the sound is played only on the computer;
    • after 3~4 seconds, I increase the volume on the device, so the sound is played both on the device and on the computer (we can hear the delay).
    • That PoC is very interesting.

      We are using VLC in a "live" captioning setup here. We use a low-latency hardware streamer that creates an HEVC stream (approx. 2 Mbit), but it can also create an H.264 stream (10 Mbit). The "captioner" listens to the video stream (in fact, to the audio, but looks at the video at the same time).

      The CC encoder is placed before the streamer, so when the captioner adds CC with his live captioning software (Caption Maker), he can see the result "live"... The normal delay we usually get with other players is approx. 1-1.5 sec, but a few years ago we measured 363 ms between the insertion of the live caption and the result on screen.

      On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)

      For now, we only notice that captions disappear when using the --0latency option.

    • On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)

      Basically, you should test the commands from this comment as an example: !20 (comment 307154)

      As an alternative, we could capture the screen (on X11) and stream it over RTP:

      ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234

      On branch 0latency.31 (previously 0latency.15), we can play:

      # without HW decoding
      ./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
      # with HW decoding
      ./vlc -Idummy --0latency --network-caching=0 ../video.sdp

      (check here to build VLC)

      For now, we only notice that captions disappear when using the --0latency option.

      Yes, SPU/subtitles are totally skipped on this PoC.
