Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend

From: Rob Clark
Date: Fri Mar 18 2022 - 12:23:13 EST


On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:
>
>
> On 2022-03-17 16:35, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> > <andrey.grodzovsky@xxxxxxx> wrote:
> >>
> >> On 2022-03-17 14:25, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>> <andrey.grodzovsky@xxxxxxx> wrote:
> >>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>> [SNIP]
> >>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>>>> more jobs at us until it is empty.
> >>>>>>>>
> >>>>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>>>> deadlocks during suspend.
> >>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>> It is at least not a problem when vram is not involved.
> >>>>>> No, it's much wider than that.
> >>>>>>
> >>>>>> See what can happen is that the memory management shrinkers want to wait
> >>>>>> for a dma_fence during suspend.
> >>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>> that are already ready. Actually, waiting on fences in shrinker path
> >>>>> sounds like a pretty bad idea.
> >>>>>
> >>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>
> >>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>>>> created.
> >>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>> more portable approach as far as generic solution for suspend.
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>> I am not sure how this drains the scheduler ? Suppose we done the
> >>>> waiting in drm_sched_entity_flush,
> >>>> what prevents someone to push right away another job into the same
> >>>> entity's queue right after that ?
> >>>> Shouldn't we first disable further pushing of jobs into entity before we
> >>>> wait for sched->job_scheduled ?
> >>>>
> >>> In the system suspend path, userspace processes will have already been
> >>> frozen, so there should be no way to push more jobs to the scheduler,
> >>> unless they are pushed from the kernel itself.
> >>> amdgpu_device_suspend
> >>
> >> It was my suspicion but I wasn't sure about it.
> >>
> >>
> >>> We don't do that in
> >>> drm/msm, but maybe you need to to move things btwn vram and system
> >>> memory?
> >>
> >> Exactly, that was my main concern - if we use this method we have to use
> >> it in a point in
> >> suspend sequence when all the in kernel job submissions activity already
> >> suspended
> >>
> >>> But even in that case, if the # of jobs you push is bounded I
> >>> guess that is ok?
> >> Submissions to scheduler entities are using unbounded queue, the bounded
> >> part is when
> >> you extract next job from entity to submit to HW ring and it rejects if
> >> submission limit reached (drm_sched_ready)
> >>
> >> In general - It looks to me at least that what we what we want her is
> >> more of a drain operation then flush (i.e.
> >> we first want to disable any further job submission to entity's queue
> >> and then flush all in flight ones). As example
> >> for this i was looking at flush_workqueue vs. drain_workqueue
> > Would it be possible for amdgpu to, in the system suspend task,
> >
> > 1) first queue up all the jobs needed to migrate bos out of vram, and
> > whatever other housekeeping jobs are needed
> > 2) then drain gpu scheduler's queues
> > 3) and then finally wait for jobs executing on GPU to complete
>
>
> We already do most of it in amdgpu_device_suspend,
> amdgpu_device_ip_suspend_phase1
> followed by amdgpu_device_evict_resources followed by
> amdgpu_fence_driver_hw_fini is
> exactly steps 1 + 3. What we are missing is step 2). For this step I
> suggest adding a function
> called drm_sched_entity_drain which basically sets entity->stopped =
> true and then calls drm_sched_entity_flush.
> This will both reject any new insertions into entity's job queue and
> will flush all pending job submissions to HW from that entity.
> One point is we need to make make drm_sched_entity_push_job return value
> so the caller knows about job enqueue
> rejection.

Hmm, seems like job enqueue that is rejected because we are in the
process of suspending should be more of a WARN_ON() sort of thing?
Not sure if there is something sensible to do for the caller at that
point?

>
> What about runtime suspend ? I guess same issue with scheduler racing
> against HW susppend is relevant there ?

Runtime suspend should be ok, as long as the driver holds a runpm
reference whenever the hw needs to be awake. The problem with system
suspend (at least if you are using pm_runtime_force_suspend() or doing
something equivalent) is that it bypasses the runpm reference.
(Which, IMO, seems like a bad design..)

> Also, could you point to a particular buggy scenario where the race
> between SW shceduler and suspend is causing a problem ?

I wrote a piglit test[1] to try to trigger this scenario.. it isn't
really that easy to hit

BR,
-R

[1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

> Andrey
>
>
> >
> > BR,
> > -R
> >
> >> Andrey
> >>
> >>
> >>> BR,
> >>> -R