dav1d get new picture buffer via NewPicture(). It turns out that the error is triggered by decoder_NewPicture(), which in turn is triggered by picture_pool_Cancel() in vlc_input_decoder_Flush().
@tguillem add picture_pool_Cancel() to vlc_input_decoder_Flush() to solve the slow flush issue:
commit 9602cf7e738c5d3cc4f0b0156017110c844cdfb3Author: Thomas Guillem <thomas@gllm.fr>Date: Fri Nov 27 16:18:44 2020 +0100 decoder: fix slow video flush Video flush was taking between 1 seconds to 10 seconds because the flush request was not processed by the DecoderThread that was stuck in pf_decode() callbacks waiting for new pictures.
Looks like it doesn't work well with decoder_NewPicture():
decoder_NewPicture() return NULL on error, we don't know if it's malloc failure, or cancel on purpose.
Most decoder can continue after decoder_NewPicture() return NULL, dav1d is an exception.
Continue after decoder_NewPicture() failure is more like a bug, unless decoder_NewPicture() can notify error code.
Even if decoder_NewPicture() can notify something like EAGAIN, third party decoders may not able to handle that and recover, since picture buffer callback is more like malloc(), either success or critical failure.
Maybe we need another method to solve the slow flush issue.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
Causing decoder_NewPicture() to fail immediately is the whole point of picture_pool_Cancel(). It's doing exactly what it's supposed to.
The underlying assumption is that decoder picture allocation started right before or during a decoder flush are for pre-flush data. I doubt that that is always true, as nothing says that a decoder cannot pre-allocate pictures.
So this is the case of a work-around for a bug for poorly implemented threaded decoders, that ends up introducing bugs elsewhere.
In principle, it should never be needed at all. Depending on the decoder, it may be an useful optimisation, a useless no-op, or a harmful kludge.
Threaded decoders really should manage their picture pool directly. I don't think that it is possible to accommodate all models with the common picture pool.
Technically, currently, every decoder needs it as long as the core cannot unblock every pictures sent to the output in a reliable way, including those that will be output by the decoder as soon as some pictures are released to release the backpressure. But then it might be considered that the bug is elsewhere.
Is NewPicture() called during the flush ? In that case it's normal not to provide pictures. We don't want to use them at all. If the problem is that dav1d don't continue, maybe the bug is in our flush handling. We may need some extra flag, locking, etc in this code:
Paraphrasing my own self, but there is no rule that says pictures cannot be allocated during flush. For that matter there is no rule that pictures allocated before flush cannot be used and queued after flush.
It is an implementation detail of the decoder library whether they wait for "data" (pf_decode) to allocate corresponding pictures or not.
That's why I think cancelling the pool is a bug in the general case. That is not to deny that some decoders rely on it as of today, unfortunately.
Actually when the error occurs, the flush is not called yet. The picture pool is canceled but the decoder doesn't know about it. I think the picture cancel is called to early. It should be canceled in the decoder thread, not in the input thread.
Though it's a workaround, the whole point of the picture pool cancellation is to cancel from the input thread, and avoid the wait on other callbacks before flushing. Same "typical" issue as audio_output time get, vout display prepare/display, etc...
I don't think it's the same issue really, at least yet.
For audio outputs, the fix exists: make play non-blocking. For video outputs, it might be hard to implement but it's also a matter of making callbacks non-blocking.
For decoders (also filters) that option doesn't exist, as there are no other ways to pace the input as of yet, than making pf_decode block.
I don't think it's the same issue really, at least yet.
Well, you can interpret it as the same solutions or not, though I'm not sure what "yet" means here and it might have been a source of confusion for me with your answer. But I might not have been clear with what I meant by "issue" also.
In any of those case, pacing is done by blocking a module callback. You can solve it by solving the symptoms (pictures quickly and correctly flushed to unblock the callback) or by solving the architecture (never pace from a module callback), but both would make sense (and it would even make sense to do both too anyway).
In the end, it's exactly the same kind of discussion for me, only the situation changes and might lead to a different answer. The model is exactly the same.
Here, the main situational difference is that we can control the component after this one, and so interact with the backpressure in a more easy way. With outputs, the backpressure is mostly handled by the underlying system (for instance, with VSYNC or audio buffer callback, and time).
Yet as in 4.0. 5.0 is supposed to have a redesign of buffering and flow control.
Point being output modules have the option not to block, because pacing will take place elsewhere anyhow. If they do block to the point of causing issues, it's a bug in the module.
But filter modules (in a general sense including packetisers and decoders) do not currently have the option not to block because that is the only way to pace.
I tried to add a decoder_IsFlushing() so the decoder (dav1d) can tell when the decoder_NewPicture is returning NULL in normal usage. But it seems dav1d is not designed to handle EAGAIN in that case. It doesn't stop the decoder, but it seems to be in an unstable state (sometimes it works, most of the time it doesn't).
That would be racy (ToCToU), since the state could change between the failure and the reason check (unless the flag were returned by NewPicture itself).
Well, we can say it can only be called from the decoder thread (and assert when it's not). This is the case for dav1d and pretty much all decoders (not lavc which can request frames from different threads IIRC).
They all suffer from the same issue: decoder_NewPicture() returns NULL just because there was a flush call that wants to used the blocked decoder thread.
Maybe we need a cleaner way to stop waiting decoder threads (I proposed one in the past). And that could involve telling why the picture/buffer wait is canceled.