Re: [PATCH v3] drm/panthor: Make the timeout per-queue instead of per-job
From: Ashley Smith
Date: Mon Apr 14 2025 - 12:45:11 EST
On Fri, 11 Apr 2025 16:51:52 +0100 Steven Price wrote:
> Hi Ashley,
>
> On 10/04/2025 13:57, Ashley Smith wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be when we call drm_sched_suspend_timeout() from
> > both queue_run_job() and tick_work() which could lead to races due to
> > drm_sched_suspend_timeout() not having a lock. Another issue seems to
> > be in queue_run_job() if the group is not scheduled, we suspend the
> > timeout again which undoes what drm_sched_job_begin() did when calling
> > drm_sched_start_timeout(). So the timeout does not reset when a job
> > is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Tested-by: Daniel Stone <daniels@xxxxxxxxxxxxx>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > Signed-off-by: Ashley Smith <ashley.smith@xxxxxxxxxxxxx>
> > ---
> > drivers/gpu/drm/panthor/panthor_sched.c | 244 +++++++++++++++++-------
> > 1 file changed, 177 insertions(+), 67 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 446ec780eb4a..32f5a75bc4f6 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>
> [...]
>
> > @@ -2727,8 +2784,17 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> > * automatically terminate all active groups, so let's
> > * force the state to halted here.
> > */
> > - if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED)
> > + if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED) {
> > csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
> > +
> > + /* Reset the queue slots manually if the termination
> > + * request failed.
> > + */
> > + for (i = 0; i < group->queue_count; i++) {
> > + if (group->queues[i])
> > + cs_slot_reset_locked(ptdev, csg_id, i);
> > + }
> > + }
> > slot_mask &= ~BIT(csg_id);
> > }
> > }
>
> So this seems to be the only change from v2 (a changelog can be
> helpful!). And I'm not convinced it belongs in this patch? It's not just
> "[making] the timeout per-queue instead of per-job".
>
> I haven't dug through the details, but I think this belongs in a
> separate patch.
>
> Thanks,
> Steve
>
Hi Steve,
Apologies, my changelog did not make it in:

v2: Suspend the timeout before setting _TERMINATED so that if a
terminate request times out (due to a firmware hang) we don't get a
spurious timeout.

This was a bug fix related to the scheduling patch that we found, and I
was thinking to include it here. However, I'm happy to move it to a
separate patch.
Thanks,
Ash