
PoC 0latency

Open Romain Vimont requested to merge 0latency into master

0latency

VLC is designed to process packets and frames according to their timestamp (PTS). This implies that it needs to wait a certain duration (until a computed date) before demuxing, decoding and displaying. The purpose is to preserve the interval between frames as much as possible, so as to avoid stuttering when watching a movie, for example.

Real-time mirroring

Before making any change, we must be able to measure glass-to-glass latency easily. For that purpose, we can mirror an Android device screen to VLC.

Download the latest server file from scrcpy, plug in an Android device, and execute:

adb push scrcpy-server-v1.25 /data/local/tmp/vlc-scrcpy-server.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false \
    max_size=1920 raw_video_stream=true

(Adapt max_size=1920 to use another resolution; this impacts the latency.)

As soon as a client connects to localhost:1234 via TCP, mirroring starts and the device sends a raw H.264 video stream.

It can be played with:

./vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234

By playing a test video showing a frame counter (a mire) on the device, and taking a picture (with a good camera) of the device next to the VLC window, we can measure the glass-to-glass delay.

Note that this delay includes the encoding time on the mobile device, which may be larger than on the target hardware.

On master

On VLC4 master without any change, the result is catastrophic (VLC is not designed to handle this use case):

before

The video is 30 fps, and each increment represents 1 frame, so 30 frames represent 1 second. At the end of this short capture, there is almost a 10-second delay.

PoC

To mirror and control a remote device in real-time, the critical objective is to minimize latency. Therefore, any unnecessary wait is a bug.

Concretely, all waits based on a timestamp must be removed. Therefore, in 0latency mode, clocks become useless and timestamps are irrelevant. Also, buffering must be removed as much as possible.

To that end, this PoC changes several parts of the VLC pipeline.

Global --0latency option

The first commit adds a new global option --0latency, that will be read by several VLC components. By default, it is disabled (of course).

To enable it, pass --0latency:

./vlc -Idummy --0latency --demux=h264 --network-caching=0 tcp://localhost:1234

Picture buffer

In VLC, when a picture is decoded, it is pushed by the decoder to a fifo queue, which is consumed by the video output.

For 0latency, at any time, we want to display the latest possible frame, so we don't want any fifo queue.

This PoC introduces a "picture buffer" (vlc_pic_buf; yes, this is a poor name), which is a buffer of exactly 1 picture:

  • the producer can push a new picture (overwriting any previous pending picture);
  • the consumer can pop the latest picture, which is a blocking operation if no pending picture is available.

The producer is the decoder. The consumer is the video output.
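
For illustration, here is a minimal sketch of what such a single-slot buffer could look like, using the VLC threading primitives. The structure name follows the PoC description, but the layout and functions below are a simplified reconstruction, not the actual code:

/* Illustrative sketch only: a single-slot picture buffer. */
#include <vlc_common.h>
#include <vlc_picture.h>

typedef struct vlc_pic_buf {
    vlc_mutex_t lock;
    vlc_cond_t wait;
    picture_t *pending; /* at most one picture */
    bool stopped;
} vlc_pic_buf;

/* Producer (decoder): overwrite any pending picture with the newest one. */
static void vlc_pic_buf_Push(vlc_pic_buf *buf, picture_t *pic)
{
    vlc_mutex_lock(&buf->lock);
    if (buf->pending != NULL)
        picture_Release(buf->pending); /* the older frame is simply dropped */
    buf->pending = pic;
    vlc_cond_signal(&buf->wait);
    vlc_mutex_unlock(&buf->lock);
}

/* Consumer (vout): block until a picture is available (or the buffer is stopped). */
static picture_t *vlc_pic_buf_Pop(vlc_pic_buf *buf)
{
    vlc_mutex_lock(&buf->lock);
    while (buf->pending == NULL && !buf->stopped)
        vlc_cond_wait(&buf->wait, &buf->lock);
    picture_t *pic = buf->pending; /* NULL if stopped */
    buf->pending = NULL;
    vlc_mutex_unlock(&buf->lock);
    return pic;
}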

Video output

In VLC, the video output attempts to display a picture at the expected date, so it waits for a specific timestamp. This is exactly what we want to avoid for 0latency.

If 0latency is enabled, this PoC replaces the vout thread function, which does a lot of complicated things, with a very simple loop (Thread0Latency(), sketched below):

  1. pop picture from the picture buffer;
  2. call vout prepare();
  3. call vout display().

The function vout_PutPicture() is also adapted to push the frame to our new picture buffer instead of the existing picture fifo.

Note that in this PoC, the picture is not redrawn on resize, so the window content stays black until the next frame arrives. That could be improved later.
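
For reference, the simplified loop looks roughly like the following sketch (condensed: the prepare() call signature and the locking are assumptions based on the diffs quoted later in this thread, not the exact PoC code):

/* Condensed sketch of the simplified vout loop (Thread0Latency). */
static void Thread0Latency(vout_thread_sys_t *sys)
{
    vout_display_t *vd = sys->display;

    for (;;)
    {
        /* 1. block until the decoder pushes a new picture */
        picture_t *pic = vlc_pic_buf_Pop(&sys->pic_buf);
        if (pic == NULL)
            break; /* vout stopped */

        vlc_mutex_lock(&sys->display_lock);

        /* 2. prepare and 3. display it right away, ignoring its PTS */
        if (vd->ops->prepare != NULL)
            vd->ops->prepare(vd, pic, NULL, vlc_tick_now()); /* signature assumed */
        if (vd->ops->display != NULL)
            vd->ops->display(vd, pic);

        vlc_mutex_unlock(&sys->display_lock);
        picture_Release(pic);
    }
}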

Input/demux

In VLC, the input MainLoop() calls the demuxer to demux when necessary, but explicitly waits for a deadline between successive calls. We don't want to wait.

Therefore, this PoC provides an alternative MainLoop0Latency(), which is called if 0latency is enabled. This function basically calls demux->pf_demux() in a loop without ever waiting.

Some clock-based code in the es_out (on the ES_OUT_SET_PCR control), used for handling jitter, is also completely bypassed.
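
As a rough sketch (the real MainLoop0Latency() still has to process input controls and end of stream, which is omitted here; only the "demux as fast as possible" idea is shown):

/* Rough sketch of the alternative input loop: no deadline, no clock. */
static void MainLoop0Latency(demux_t *demux)
{
    for (;;)
    {
        int ret = demux->pf_demux(demux);
        if (ret <= 0)
            break; /* end of stream or error */
    }
}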

Decoder

When the decoder implementation has a frame, it submits it to the vout via decoder_QueueVideo(). The queue implementation is provided by the decoder owner in the core, which handles preroll and may wait.

This PoC replaces this implementation with a simple call to vout_PutPicture(), to directly push the picture to our new picture buffer in the vout. If the vout was waiting for a picture, it is unblocked and will immediately prepare() and display().

On the module side, the avcodec decoder was adapted to disable dropping frames based on the clock (if a frame is "late"), and to enable the same options as if --low-delay was passed.
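
A condensed sketch of the 0latency path in the owner's queue callback (it only mirrors the idea; see the actual diff quoted later in the discussion):

/* Simplified: push the decoded picture straight to the vout's single-slot
 * buffer, bypassing preroll and clock handling. */
static void ModuleThread_QueueVideo(decoder_t *p_dec, picture_t *p_pic)
{
    struct decoder_owner *p_owner = dec_get_owner(p_dec);

    if (p_owner->zero_latency)
    {
        vout_PutPicture(p_owner->p_vout, p_pic);
        return;
    }

    /* ... normal timestamp-based queueing ... */
}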

H.264 AnnexB 1-frame latency

The input is a raw H.264 stream in AnnexB format (this is what Android MediaCodec produces). This raw H.264 is sent over TCP.

The format is:

(00 00 00 01 NALU) | (00 00 00 01 NALU) | …

The length of each NAL unit is not present in the stream. Therefore, on the receiving side, the parser detects the end of a NAL unit only when it encounters the next start code 00 00 00 01.

However, this start code is sent as the prefix of the next frame, so the received packet will not be submitted to the decoder before the next frame is received, which adds 1-frame latency.

On the other hand, the length of the packet is known in advance on the device side, so a simple solution is to prefix the packet with its length (see Reduce latency by 1 frame).
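
To illustrate the idea (this is not the scrcpy frame-meta layout actually used below, just a hypothetical 4-byte big-endian length prefix), the receiving side can then forward a complete frame to the decoder without waiting for the next start code:

/* Hypothetical framing sketch: each packet is prefixed with its byte length. */
static block_t *ReadFrame(demux_t *p_demux)
{
    uint8_t header[4];
    if (vlc_stream_Read(p_demux->s, header, sizeof(header)) != sizeof(header))
        return NULL;

    uint32_t len = GetDWBE(header); /* big-endian 32-bit length */

    block_t *frame = block_Alloc(len);
    if (frame == NULL)
        return NULL;
    if (vlc_stream_Read(p_demux->s, frame->p_buffer, len) != (ssize_t) len)
    {
        block_Release(frame);
        return NULL;
    }
    return frame; /* can be submitted to the decoder immediately */
}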

For simplicity, for now I reused the scrcpy format, by requesting the server to send frame meta:

adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false max_size=1920 \
    send_device_meta=false send_frame_meta=true send_dummy_byte=false
#                          ^^^^^^^^^^^^^^^^^^^^

I wrote a specific demuxer to handle it: h264_0latency

To use it, replace the --demux= argument:

./vlc -Idummy --0latency --demux=h264_0latency --network-caching=0 tcp://localhost:1234

To make the difference obvious, I suggest playing a 1-fps video.

With all these changes, the latency is reduced to 1~2 frames (30 fps) glass-to-glass:

0latency_poc2

(the device is on the left, VLC is in the middle, scrcpy is on the right)

Protocol discussions

For this PoC, the video stream is received over TCP from an Android device connected via USB (or via wifi on a local network), using a custom protocol.

Packet loss is non-existent over USB and very low on a good local wifi network. Over the Internet, however, packet loss would add unacceptable latency with a protocol that handles retransmission (like TCP).

The following are some random thoughts.

Ideally, I think that:

  • we want to never decode a non-I-frame packet when the previous (referenced) packets have not been received/decoded (this would produce corrupted frames);
  • we want to skip any previous packets (possibly lost) whenever an I-frame arrives.

Concretely, the device sends:

 [I] P P P P P P P P P P P P P P P [I] P P P P P P P P P P P …

If a packet is not received:

 [I] P P P P P _ P P P P P P P P P [I] P P P P P P P P P P P …
               ^
             lost

then one possible solution:

  • the receiver does not decode further P-frames until the missing packet is received;
  • if a more recent I-frame is received, it starts decoding it immediately and forgets/ignores all previous packets.

As a drawback, this forces the use of small GOPs (i.e. frequent I-frames).
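
Purely as an illustration of this policy (no such code exists in the PoC; the packet fields, the receiver structure and the decode hook are hypothetical, and retransmission handling is omitted):

/* Hypothetical receive-side policy: drop undecodable P-frames, resume on the
 * next I-frame. */
struct rx_packet { uint64_t seq; bool is_keyframe; block_t *data; };
struct receiver  { uint64_t next_seq; bool waiting_for_keyframe; };

static void decode_and_display(block_t *data); /* hypothetical pipeline hook */

static void on_packet(struct receiver *rx, struct rx_packet *pkt)
{
    if (pkt->is_keyframe)
    {
        /* A new I-frame makes all older (possibly lost) packets irrelevant. */
        rx->waiting_for_keyframe = false;
    }
    else if (pkt->seq != rx->next_seq || rx->waiting_for_keyframe)
    {
        /* A reference is missing: decoding this P-frame would produce a
         * corrupted picture, so drop it and wait for the next I-frame. */
        rx->waiting_for_keyframe = true;
        block_Release(pkt->data);
        return;
    }

    decode_and_display(pkt->data);
    rx->next_seq = pkt->seq + 1;
}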

To be continued…



Activity


  • Romain Vimont added 4 commits


    • 313e90f6 - decoder: 0latency
    • bdf3a461 - avcodec: include low-delay settings for 0latency
    • 57b548e6 - avcodec: never drop blocks on low delay
    • 8e7071b5 - demux: h264_0latency



  • As an alternative, we could capture the screen (on X11) and stream it over RTP:

    ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234

    On branch 0latency.15, we can play:

    # without HW decoding
    ./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
    # with HW decoding
    ./vlc -Idummy --0latency --network-caching=0 ../video.sdp

    To measure the time spent on the VLC side, I added tracking of the RTP packet reception time, from its arrival in a block_t to the picture display (in practice, I stream to localhost, so there is no packet loss/reordering).

    I traced two durations:

    • the time from packet reception to the vout (just before prepare/display)
    • the time to prepare+display in the vout

    Here is a graph (stacked bar chart) without HW decoding (Y-axis unit is µs):

    0latency_nohwdec

    And here is a graph with HW decoding (Y-axis unit is µs):

    0latency_hwdec

    With hardware decoding, we gain a few milliseconds on the decoding itself, but the most important gain made possible by hardware decoding is the OpenGL interop, which avoids transferring the decoded picture to/from main memory.

  • The previous result with hardware decoding was very good.

    However, if the input stream is the raw H.264 (+header) received from Android as explained in the first post, even with hardware decoding, it takes a lot more time both to decode and display:

    0latency_hwdec_from_android

    At first, this seems surprising, since in theory the player does not care how the video stream is captured.

    VSync

    This stacked bar chart does not highlight the underlying issue, so let's trace the same graph without stacking the decoding and display durations:

    0latency_hwdec_from_android_non_stacked

    On this graph, we see that displaying almost always takes 1/60 second. This suggests a vsync issue.

    Here is the difference between the ffmpeg/x11/RTP capture and the Android capture: MediaCodec produces a new frame whenever the input surface is damaged, and my device screen runs at 90Hz. If I swipe quickly to animate the home screen, it is common to reach 90fps, while VLC runs on my laptop with a 60Hz screen.

    In that case, several OpenGL renderings will be requested during a single vsync period, so the second one is (AFAIU) forced to block until the next vsync.

    Note that an input framerate greater than the output framerate does not add any latency in itself: some frames will just be lost (by design) by the vlc_pic_buf as explained in the first post (the decoder will overwrite the previous frame if it has not been consumed by the vout).

    The fact that it also increases decoding times probably stems from the fact that the decoder and display share resources (but the reason is not clear to me).

    To confirm, let's limit the video stream capture to 30fps, by adding max_fps=30 as a scrcpy-server parameter (see first post):

    adb forward tcp:1234 localabstract:scrcpy
    adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
        app_process / com.genymobile.scrcpy.Server 1.22 \
        tunnel_forward=true control=false max_size=1920 max_fps=30 \
        send_device_meta=false send_frame_meta=true send_dummy_byte=false

    The result is very good (stacked bar chart):

    0latency_30fps

    Limiting to 60fps (max_fps=60) is better than nothing, but not perfect:

    0latency_60fps

    At 30 fps but without hardware decoding, decoding takes more time (of course), but display time is quite ok (it includes texture upload from main memory to GPU):

    0latency_30fps_nohwdec

    Without vsync

    So it seems that VSync is the cause of the problem. But in scrcpy, even at 90fps, by default there is no such problem:

    scrcpy_90fps

    Note that scrcpy does not support HW decoding (yet?).

    Indeed, if I force VSync in scrcpy:

    diff --git a/app/src/scrcpy.c b/app/src/scrcpy.c
    index 8c4920d6..aa87e2bf 100644
    --- a/app/src/scrcpy.c
    +++ b/app/src/scrcpy.c
    @@ -126,6 +126,8 @@ sdl_set_hints(const char *render_driver) {
         if (!SDL_SetHint(SDL_HINT_VIDEO_MINIMIZE_ON_FOCUS_LOSS, "0")) {
             LOGW("Could not disable minimize on focus loss");
         }
    +
    +    SDL_SetHint(SDL_HINT_RENDER_VSYNC, "1");
     }
     
     static void

    Then I can reproduce the problem (let's use non-stacked bars directly):

    scrcpy_90fps_vsync

    I attempted to disable VSync in VLC, but I failed (the assertion fails):

    diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
    index 147afe802c..94e2c93c1f 100644
    --- a/modules/video_output/opengl/egl.c
    +++ b/modules/video_output/opengl/egl.c
    @@ -367,6 +367,10 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
         gl->get_proc_address = GetSymbol;
         gl->destroy = Close;
     
    +    // Disable VSync
    +    bool ok = eglSwapInterval(sys->display, 0);
    +    assert(ok);
    +
         return VLC_SUCCESS;
     
     error:

    To be continued…

    VSync latency

    VSync (often) adds 1 frame of latency, because once a frame rendering is requested, it prevents any further frames that arrive before the VSync tick from being rendered immediately.

    For example, if frames A and B arrive before the first VSync tick:

    vsync slots      -------------|-------------|-------
    recv frames      A      B     ^         C   ^
    vsync enabled    A---------->[A]B--------->[B]C----- (very regular 16ms to display)
    vsync disabled   A      B    [B]        C  [C]

    Then:

    • with VSync enabled: A then B will be displayed
    • with VSync disabled: B then C will be displayed

    I think there are basically 2 solutions to this problem:

    1. disable VSync (remove the problem)
    2. estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)
                           start rendering just before the vsync tick
                               v             v
    vsync slots      ----------+--|----------+--|-------
    recv frames      A      B  ^^^^         C^^^^
    render frames              B             C
    vsync enabled              B-[B]         C-[C]
    • estimate rendering time and vsync ticks to submit a frame as late as possible (but this is more complicated and heuristic)

      I attempted a hack to start rendering 15ms after the last display time (so it should be approximately 1.66ms before the next vsync):

      diff --git a/include/vlc_picture.h b/include/vlc_picture.h
      index b41dd26578..a032105885 100644
      --- a/include/vlc_picture.h
      +++ b/include/vlc_picture.h
      @@ -161,6 +161,7 @@ struct picture_t
           vlc_atomic_rc_t refs;
       
           vlc_tick_t recv_ts;
      +    vlc_tick_t decoded_ts;
       };
       
       static inline vlc_video_context* picture_GetVideoContext(picture_t *pic)
      diff --git a/src/input/decoder.c b/src/input/decoder.c
      index d0b9f1883d..365309e2cb 100644
      --- a/src/input/decoder.c
      +++ b/src/input/decoder.c
      @@ -1137,6 +1137,7 @@ static void ModuleThread_QueueVideo( decoder_t *p_dec, picture_t *p_pic )
       
           if (p_owner->zero_latency)
           {
      +        p_pic->decoded_ts = vlc_tick_now();
               vout_PutPicture(p_owner->p_vout, p_pic);
               return;
           }
      diff --git a/src/video_output/video_output.c b/src/video_output/video_output.c
      index 07ae0eb579..2b48bf6803 100644
      --- a/src/video_output/video_output.c
      +++ b/src/video_output/video_output.c
      @@ -1724,8 +1724,16 @@ static void Thread0Latency(vout_thread_sys_t *sys)
       {
           vout_display_t *vd = sys->display;
       
      +    vlc_tick_t last_display_ts = VLC_TICK_INVALID;
      +
           for (;;)
           {
      +        if (last_display_ts != VLC_TICK_INVALID)
      +        {
      +            vlc_tick_t render_min_ts = last_display_ts + VLC_TICK_FROM_MS(15);
      +            vlc_tick_wait(render_min_ts);
      +        }
      +
               picture_t *pic = vlc_pic_buf_Pop(&sys->pic_buf);
               if (!pic)
                   /* stopped */
      @@ -1739,9 +1747,11 @@ static void Thread0Latency(vout_thread_sys_t *sys)
       
               if (vd->ops->display)
                   vd->ops->display(vd, pic);
      +
      +        last_display_ts = vlc_tick_now();
               vlc_mutex_unlock(&sys->display_lock);
       
      -        fprintf(stderr, "=== vout: %ld;%ld\n", t - pic->recv_ts, vlc_tick_now() - t);
      +        fprintf(stderr, "=== vout: %ld;%ld\n", pic->decoded_ts - pic->recv_ts, vlc_tick_now() - t);
       
               picture_Release(pic);
           }

      It clearly improves the display duration… but it increases decoder time (the decoder probably depends on the GPU or VSync in some way: old graph):

      0latency_vsync_hack2

      I finally managed to disable VSync in VLC (the EGL context needed to be current):

      diff --git a/modules/video_output/opengl/egl.c b/modules/video_output/opengl/egl.c
      index 147afe802c..c8b4e9fb2d 100644
      --- a/modules/video_output/opengl/egl.c
      +++ b/modules/video_output/opengl/egl.c
      @@ -359,6 +359,15 @@ static int Open(vlc_gl_t *gl, const struct gl_api *api,
           }
           sys->context = ctx;
       
      +    int ret = MakeCurrent(gl);
      +    assert(ret == VLC_SUCCESS);
      +
      +    // Disable VSync
      +    EGLBoolean ok = eglSwapInterval(sys->display, 0);
      +    assert(ok);
      +
      +    ReleaseCurrent(gl);
      +
           /* Initialize OpenGL callbacks */
           gl->make_current = MakeCurrent;
           gl->release_current = ReleaseCurrent;

      EDIT: or just export vblank_mode=0.

      Without VSync, the latency is better, but there are still irregular "spikes" that I can't explain:

      0latency_novsync_90fps

      If I redraw the previous graph for scrcpy at 90fps, but with non-stacked bars:

      scrcpy_90fps_non_stacked_bars

      The display duration does not produce these spikes (but maybe they are avoided only because the decoding time is higher).

      In any case, I think we should disable VSync whenever possible, to get the lowest possible latency.

    • Android has a VideoFrameScheduler that tries to dispatch frames in the middle of two vsyncs, to make playback smooth.

      https://android.googlesource.com/platform/frameworks/av/+/master/media/libstagefright/include/media/stagefright/VideoFrameSchedulerBase.h

  • As a drawback, this forces the use of small GOPs (i.e. frequent I-frames).

    x264 --intra-refresh may help in this case.

  • Here is a small demo of "audio mirroring":

    0latency_audio

    The Android device is connected over wifi (USB would be cheating). The audio is captured on Android (using sndcpy), then transmitted in raw format (2 channels, 48kHz) over a TCP socket.

    On the client side, VLC from branch 0latency.28 is called like this:

    ./vlc -Idummy --demux rawaud -Aalsa --0latency --network-caching=0 --play-and-exit tcp://localhost:$PORT

    To be more concrete, to reproduce the demo, it is run exactly like this:

    $ cat myvlc
    #!/bin/bash
    ~/projects/vlc/buildasan/vlc --0latency -Aalsa "$@"
    $ VLC=./myvlc ./sndcpy

    This branch contains everything done for video 0latency (in particular the input 0latency, to avoid additional buffering), plus, surprisingly, only some very minor changes at the end for the audio part (specific to ALSA for the aout module). I don't even understand how setting AOUT_0LATENCY_TIME to 0 (the last commit) does not make the sound glitch (probably because of the ~20ms of "natural" buffering on the ALSA side, as reported by time_get(), but I'm not sure).

    What you observe on the video:

    • the Audio/Video Sync Test is played on the device (via VLC for Android);
    • the device is mirrored on the computer to observe full mirroring (video + audio);
    • at the beginning, the sound is played only on the computer;
    • after 3~4 seconds, I increase the volume on the device, so the sound is played both on the device and on the computer (we can hear the delay).
    • That PoC is very interesting.

      We are using VLC in a "live" captioning setup here. We use a low-latency hardware streamer that creates an HEVC stream (approx. 2 Mbit), but it can also create an H.264 stream (10 Mbit). The "captioner" listens to the video stream (in fact, to the audio, but looks at the video at the same time).

      The CC encoder is placed before the streamer, so when the captioner adds CC with his live captioning software (Caption Maker), he can see the result "live"... The normal delay we usually get with other players is approx. 1-1.5 sec, but a few years ago we measured 363 ms between the insertion of the live caption and the result on screen.

      On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)

      For now, we only notice that captions disappear when using the --0latency option.

    • On this PoC, is there something I can do to test a build, with the proper command line to start a UDP or RTP stream? (What should I enter?)

      Basically, you should test the commands from this comment as an example: !20 (comment 307154)

      As an alternative, we could capture the screen (on X11) and stream it over RTP:

      ffmpeg -re -video_size 1920x1080 -f x11grab -draw_mouse 0 -i :0.0 -c:v libx264 -pix_fmt yuv420p -preset veryfast -tune zerolatency -flags +global_header -f rtp -sdp_file video.sdp rtp://127.0.0.1:1234

      On branch 0latency.31 (previously 0latency.15), we can play:

      # without HW decoding
      ./vlc -Idummy --no-hw-dec --0latency --network-caching=0 ../video.sdp
      # with HW decoding
      ./vlc -Idummy --0latency --network-caching=0 ../video.sdp

      (check here to build VLC)

      For now, we only notice that captions disappear when using the --0latency option.

      Yes, SPU/subtitles are totally skipped on this PoC.
