Spurious decoder_NewPicture failure leading to decoder error

Causing decoder_NewPicture() to fail immediately is the whole point of picture_pool_Cancel(). It's doing exactly what it's supposed to.

The underlying assumption is that decoder picture allocations started right before or during a decoder flush are for pre-flush data. I doubt that this is always true, as nothing says that a decoder cannot pre-allocate pictures.

So this is a case of a work-around for a bug in poorly implemented threaded decoders that ends up introducing bugs elsewhere.

Causing decoder_NewPicture() to fail immediately is the whole point of picture_pool_Cancel(). It's doing exactly what it's supposed to.

So picture_pool_Cancel() can only be used when releasing a decoder?

In principle, it should never be needed at all. Depending on the decoder, it may be a useful optimisation, a useless no-op, or a harmful kludge.

Threaded decoders really should manage their picture pool directly. I don't think that it is possible to accommodate all models with the common picture pool.

mentioned in merge request !1888 (closed)

mentioned in merge request !1889 (closed)

We could trigger the picture_pool_Cancel workaround only if the decoder needs it (adding some quirks_flag in decoder_t filled from Open()).

I know that avcodec and one other need it (mediacodec?).

And yes, ideally, it should return -EAGAIN.
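To make the quirk-flag suggestion concrete, here is a minimal sketch. Everything in it is hypothetical: `sketch_decoder_t`, `SKETCH_QUIRK_POOL_CANCEL` and the `Sketch*` functions are stand-ins invented for illustration, and the real decoder_t has no such field. It only shows the shape of the idea: the module declares the quirk in its Open(), and the core cancels the pool ahead of a flush only for modules that asked for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk bit: the module relies on pool cancellation to get
 * unblocked before a flush. Not a real VLC flag. */
#define SKETCH_QUIRK_POOL_CANCEL (UINT32_C(1) << 0)

/* Stand-in for the decoder object. */
typedef struct sketch_decoder
{
    uint32_t quirks;        /* set once by the module's Open() */
    bool     pool_canceled; /* stand-in for the picture pool state */
} sketch_decoder_t;

/* Open() of a decoder that needs the workaround. */
int SketchOpenThreaded(sketch_decoder_t *dec)
{
    assert(dec != NULL);
    dec->quirks = SKETCH_QUIRK_POOL_CANCEL;
    dec->pool_canceled = false;
    return 0;
}

/* Open() of a decoder that does not need the workaround. */
int SketchOpenSimple(sketch_decoder_t *dec)
{
    assert(dec != NULL);
    dec->quirks = 0;
    dec->pool_canceled = false;
    return 0;
}

/* Core side: cancel the pool ahead of a flush only if the module asked. */
void SketchRequestFlush(sketch_decoder_t *dec)
{
    assert(dec != NULL);
    if (dec->quirks & SKETCH_QUIRK_POOL_CANCEL)
        dec->pool_canceled = true;
}
```

With this shape, a decoder that never sets the bit would not see its allocations fail because of a pending flush.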

Technically, as things stand, every decoder needs it as long as the core cannot reliably unblock every picture sent to the output, including those that the decoder will output as soon as some pictures are released to relieve the backpressure. But then it might be considered that the bug is elsewhere.

Is NewPicture() called during the flush? In that case it's normal not to provide pictures. We don't want to use them at all. If the problem is that dav1d doesn't continue, maybe the bug is in our flush handling. We may need some extra flag, locking, etc. in this code:

```c
static void FlushDecoder(decoder_t *dec)
{
    decoder_sys_t *p_sys = dec->p_sys;
    dav1d_flush(p_sys->c);
    cc_Flush(&p_sys->cc);
}
```

Note that dav1d has a frame delay, so we may still get delayed pictures after the flush (which would not be correct).

To paraphrase myself: there is no rule that says pictures cannot be allocated during flush. For that matter, there is no rule that pictures allocated before flush cannot be used and queued after flush.

It is an implementation detail of the decoder library whether it waits for "data" (pf_decode) to allocate the corresponding pictures or not.

That's why I think cancelling the pool is a bug in the general case. That is not to deny that some decoders rely on it as of today, unfortunately.

Actually, when the error occurs, the flush has not been called yet. The picture pool is canceled but the decoder doesn't know about it. I think the picture pool cancel is called too early. The pool should be canceled in the decoder thread, not in the input thread.

Though it's a workaround, the whole point of the picture pool cancellation is to cancel from the input thread, and avoid the wait on other callbacks before flushing. Same "typical" issue as audio_output time get, vout display prepare/display, etc...
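For readers less familiar with the mechanism being argued about, the following is a rough sketch of a blocking pool whose waiters can be woken from another thread. It is not the VLC picture pool: `sketch_pool_t` and its functions are hypothetical names, the pool only counts free slots, and the choice to fail every request while canceled (returning `-ECANCELED`) is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool: it counts free pictures instead of storing them. */
typedef struct sketch_pool
{
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned        available; /* number of free pictures */
    bool            canceled;  /* set from the input thread */
} sketch_pool_t;

void sketch_pool_Init(sketch_pool_t *pool, unsigned count)
{
    assert(pool != NULL);
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->wait, NULL);
    pool->available = count;
    pool->canceled = false;
}

/* Decoder thread: blocks until a picture is free (returns 0) or the pool
 * is canceled (returns -ECANCELED). Blocking here is what paces the decoder. */
int sketch_pool_Wait(sketch_pool_t *pool)
{
    int ret = 0;

    pthread_mutex_lock(&pool->lock);
    while (pool->available == 0 && !pool->canceled)
        pthread_cond_wait(&pool->wait, &pool->lock);
    if (pool->canceled)
        ret = -ECANCELED;
    else
        pool->available--;
    pthread_mutex_unlock(&pool->lock);
    return ret;
}

/* Output side: gives a picture back and wakes one waiter. */
void sketch_pool_Release(sketch_pool_t *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->available++;
    pthread_cond_signal(&pool->wait);
    pthread_mutex_unlock(&pool->lock);
}

/* Input thread: wakes every waiter without going through the decoder. */
void sketch_pool_Cancel(sketch_pool_t *pool, bool canceled)
{
    pthread_mutex_lock(&pool->lock);
    pool->canceled = canceled;
    if (canceled)
        pthread_cond_broadcast(&pool->wait);
    pthread_mutex_unlock(&pool->lock);
}
```

The sketch also shows the drawback raised in this thread: the waiter learns that its request failed, but the pool tells it nothing about a flush that the decoder itself has not yet been asked to perform.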

I don't think it's the same issue really, at least yet.

For audio outputs, the fix exists: make play non-blocking. For video outputs, it might be hard to implement but it's also a matter of making callbacks non-blocking.

For decoders (and filters) that option doesn't exist, as there is currently no way to pace the input other than making pf_decode block.

I don't think it's the same issue really, at least yet.

Well, you can interpret it as the same solutions or not, though I'm not sure what "yet" means here and it might have been a source of confusion for me with your answer. But I might not have been clear with what I meant by "issue" also.

In any of those cases, pacing is done by blocking a module callback. You can solve it by treating the symptoms (pictures quickly and correctly flushed to unblock the callback) or by fixing the architecture (never pace from a module callback). Both would make sense, and it would even make sense to do both.

In the end, it's exactly the same kind of discussion for me, only the situation changes and might lead to a different answer. The model is exactly the same.

Here, the main situational difference is that we can control the component after this one, and so interact with the backpressure more easily. With outputs, the backpressure is mostly handled by the underlying system (for instance, with VSYNC or audio buffer callback, and time).

"Yet" as in 4.0. 5.0 is supposed to have a redesign of buffering and flow control.

The point being that output modules have the option not to block, because pacing will take place elsewhere anyhow. If they do block to the point of causing issues, it's a bug in the module. But filter modules (in a general sense, including packetisers and decoders) do not currently have the option not to block, because that is the only way to pace.

mentioned in merge request !1892 (closed)

mentioned in merge request !1893 (closed)

I tried to add a decoder_IsFlushing() so the decoder (dav1d) can tell when the decoder_NewPicture is returning NULL in normal usage. But it seems dav1d is not designed to handle EAGAIN in that case. It doesn't stop the decoder, but it seems to be in an unstable state (sometimes it works, most of the time it doesn't).

I think mixing it with !1889 (closed) should work. See !1893 (closed).

That would be racy (ToCToU), since the state could change between the failure and the reason check (unless the flag were returned by NewPicture itself).

Well, we can say it can only be called from the decoder thread (and assert when it's not). This is the case for dav1d and pretty much all decoders (not lavc which can request frames from different threads IIRC).
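A small sketch of the "assert when it's not the decoder thread" part, using hypothetical names (`sketch_owner_t`, `sketch_IsFlushing` and friends are not VLC API). It only illustrates the calling-thread check. Whether that restriction is enough to close the race described above is a separate question that this sketch does not settle.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state: the thread that owns the decoder, plus a flag that
 * another thread (e.g. the input thread) may set. */
typedef struct sketch_owner
{
    pthread_t   thread;   /* the decoder thread */
    atomic_bool flushing; /* written by another thread, hence atomic */
} sketch_owner_t;

/* Called once from the decoder thread to record its identity. */
void sketch_owner_Bind(sketch_owner_t *owner)
{
    owner->thread = pthread_self();
    atomic_init(&owner->flushing, false);
}

/* True if the caller is the thread recorded by sketch_owner_Bind(). */
bool sketch_owner_IsCurrent(sketch_owner_t *owner)
{
    return pthread_equal(owner->thread, pthread_self()) != 0;
}

/* May be called from any thread. */
void sketch_SetFlushing(sketch_owner_t *owner, bool flushing)
{
    atomic_store(&owner->flushing, flushing);
}

/* Restricted to the decoder thread; misuse trips the assertion. */
bool sketch_IsFlushing(sketch_owner_t *owner)
{
    assert(sketch_owner_IsCurrent(owner));
    return atomic_load(&owner->flushing);
}
```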

They all suffer from the same issue: decoder_NewPicture() returns NULL just because there was a flush call that wants to use the blocked decoder thread.

Maybe we need a cleaner way to stop waiting decoder threads (I proposed one in the past). And that could involve telling why the picture/buffer wait is canceled.

I think it's better to change decoder_NewPicture() arguments/return to int decoder_NewPicture(decoder_t *dec, picture_t **pic);

Or, if you don't feel like patching all decoder modules, create a new decoder_NewPicture function.

I agree with Thomas. We should rather change the callback prototype, and add a new helper for it.

The existing inline helper doesn't need to change, so it won't break every existing decoder.
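As an illustration of that proposal, here is a minimal sketch of an error-returning call with the old pointer-returning helper layered on top. The types and names (`sketch_dec_t`, `sketch_picture_t`, `sketch_NewPictureEx`) are hypothetical, and the choice of error codes (`-EAGAIN` for an allocation interrupted by a flush, `-ENOMEM` for a real failure) is an assumption based on the -EAGAIN mentioned earlier in this thread, not a description of what any merge request implements.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for picture_t and decoder_t. */
typedef struct sketch_picture { int dummy; } sketch_picture_t;

typedef struct sketch_dec
{
    bool     flushing;  /* stand-in for "the pool was canceled for a flush" */
    unsigned available; /* free pictures left */
} sketch_dec_t;

/* New-style call: 0 on success, -EAGAIN if the request was interrupted by
 * a flush, -ENOMEM if no picture could be obtained at all. Because the
 * reason is part of the return value, there is no separate query that
 * could race with the failure. */
int sketch_NewPictureEx(sketch_dec_t *dec, sketch_picture_t **pic)
{
    assert(dec != NULL && pic != NULL);
    *pic = NULL;

    if (dec->flushing)
        return -EAGAIN;
    if (dec->available == 0)
        return -ENOMEM;

    *pic = malloc(sizeof (**pic));
    if (*pic == NULL)
        return -ENOMEM;
    dec->available--;
    return 0;
}

/* Legacy-style helper: same signature as before, so existing callers
 * keep compiling and still see NULL for any failure. */
static inline sketch_picture_t *sketch_NewPicture(sketch_dec_t *dec)
{
    sketch_picture_t *pic;

    return sketch_NewPictureEx(dec, &pic) == 0 ? pic : NULL;
}
```

A decoder that can tell the two failures apart can treat `-EAGAIN` as "stop and wait for the flush" rather than as a fatal decoder error, while modules that keep using the old helper behave as they do today.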

See !1899 (closed)

mentioned in merge request !1899 (closed)

BTW, it's probably the same issue as #25406 that I was planning to work on "someday".

mentioned in merge request !2144 (merged)

mentioned in merge request !2504 (merged)

closed with commit e9eb73ea

closed with merge request !2504 (merged)

added Status::fixed label

changed milestone to %4.0
