Re: [PATCH 1/1] drm/lima: implement the file flush callback

From: Qiang Yu
Date: Tue Apr 15 2025 - 21:53:41 EST


On Mon, Apr 14, 2025 at 9:19 PM Erico Nunes <nunes.erico@xxxxxxxxx> wrote:
>
> On Thu, Apr 10, 2025 at 4:04 PM Christian König
> <christian.koenig@xxxxxxx> wrote:
> >
> > Am 10.04.25 um 15:56 schrieb Qiang Yu:
> > >>>> This prevents applications with multiple contexts from running into a
> > >>>> race condition between running tasks and context destroy when
> > >>>> terminating.
> > >>>>
> > > If implementing flush callback fix the problem, it must not be when SIGKILL.
> >
> > SIGKILL also calls flush (again eventually), but we can detect that in the scheduler code.
> >
> > See the code and comment in drm_sched_entity_flush(). We could potentially improve the comment cause it's not 100% correct, but it should work under all cases.
> >
> > > Could you describe the problem and how this fix solves it? As I don't understand
> > > how the above difference could fix a race bug.
> >
> > That was the point I wasn't sure about either. It should *not* fix any race, it's just gives a nicer SIGKILL experience.
>
> Thanks for this feedback. So as mentioned in the initial letter, I'm
> also trying to understand if this is the correct fix.
>
> This problem is unfortunately not trivial to reproduce, there is only
> one application which can reproduce this so far and it is a
> complicated one with multiple contexts and relatively lenghty jobs.
> What I observed so far is that as it is right now, a context might be
> destroyed while a job is running (e.g. by killing the application at
> the right time), and a context destroy appears to release buffers that
> are still being used by the running job which is what causes the read
> faults.

Free buffer in context destroy when job running is the bug, we should
dig into it how it happens. And if multiple context play a key role.

> This added flush made it so that the jobs always finish before the
> context destroy gets called, which prevents the issue for this
> application in my attempts. But I suppose it might also just be that
> it had some more time to finish now. If this is not the correct fix
> then there might be some more missing synchronization between running
> job and context destroy in lima?
>
I'm afraid this patch does not fix the root cause, we should find out
the root cause first. This patch could be added as SIGKILL improvement
later.

> Thanks
>
> Erico