Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend

From: Andrey Grodzovsky
Date: Fri Mar 18 2022 - 16:15:05 EST



On 2022-03-18 13:22, Rob Clark wrote:
On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:

On 2022-03-18 12:20, Rob Clark wrote:
On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:
On 2022-03-17 16:35, Rob Clark wrote:
On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:
On 2022-03-17 14:25, Rob Clark wrote:
On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:
On 2022-03-17 13:35, Rob Clark wrote:
On Thu, Mar 17, 2022 at 9:45 AM Christian König
<christian.koenig@xxxxxxx> wrote:
Am 17.03.22 um 17:18 schrieb Rob Clark:
On Thu, Mar 17, 2022 at 9:04 AM Christian König
<christian.koenig@xxxxxxx> wrote:
Am 17.03.22 um 16:10 schrieb Rob Clark:
[SNIP]
userspace frozen != kthread frozen .. that is what this patch is
trying to address, so we aren't racing between shutting down the hw
and the scheduler shoveling more jobs at us.
Well, exactly, that's the problem. The scheduler is supposed to keep shoveling
more jobs at us until it is empty.

Thinking more about it, we will then keep some dma_fence instance
unsignaled, and that is an extremely bad idea since it can lead to
deadlocks during suspend.
Hmm, perhaps that is true if you need to migrate things out of vram?
It is at least not a problem when vram is not involved.
No, it's much wider than that.

See what can happen is that the memory management shrinkers want to wait
for a dma_fence during suspend.
we don't wait on fences in the shrinker, we only purge or evict things
that are already ready. Actually, waiting on fences in the shrinker path
sounds like a pretty bad idea.

And if you stop the scheduler they will just wait forever.

What you need to do instead is to drain the scheduler, e.g. call
drm_sched_entity_flush() with a proper timeout for each entity you have
created.
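For example, something along these lines for each entity (just a sketch;
the timeout value is arbitrary, and the driver would loop over however it
tracks its own entities):

	/* Wait, with a timeout, for everything already queued on this
	 * entity to be handed off to the HW ring.
	 */
	drm_sched_entity_flush(entity, msecs_to_jiffies(1000));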
yeah, it would work to drain the scheduler.. I guess that might be the
more portable approach as far as a generic solution for suspend goes.

BR,
-R
I am not sure how this drains the scheduler. Suppose we do the waiting
in drm_sched_entity_flush; what prevents someone from pushing another
job into the same entity's queue right after that? Shouldn't we first
disable further pushing of jobs into the entity before we wait for
sched->job_scheduled?

In the system suspend path, userspace processes will have already been
frozen, so there should be no way to push more jobs to the scheduler,
unless they are pushed from the kernel itself.
amdgpu_device_suspend
It was my suspicion but I wasn't sure about it.


We don't do that in
drm/msm, but maybe you need to, to move things between vram and system
memory?
Exactly, that was my main concern - if we use this method we have to use
it at a point in the suspend sequence when all in-kernel job submission
activity has already been suspended.

But even in that case, if the # of jobs you push is bounded I
guess that is ok?
Submissions to scheduler entities use an unbounded queue; the bounded
part is when you extract the next job from the entity to submit to the
HW ring, and that is rejected if the submission limit has been reached
(drm_sched_ready).

In general, it looks to me at least that what we want here is more of a
drain operation than a flush (i.e. we first want to disable any further
job submission to the entity's queue and then flush all in-flight ones).
As an example of this I was looking at flush_workqueue vs. drain_workqueue.
Would it be possible for amdgpu to, in the system suspend task,

1) first queue up all the jobs needed to migrate bos out of vram, and
whatever other housekeeping jobs are needed
2) then drain the gpu scheduler's queues
3) and then finally wait for the jobs executing on the GPU to complete
We already do most of it in amdgpu_device_suspend:
amdgpu_device_ip_suspend_phase1 followed by amdgpu_device_evict_resources
followed by amdgpu_fence_driver_hw_fini is exactly steps 1 + 3. What we
are missing is step 2. For this step I suggest adding a function called
drm_sched_entity_drain which basically sets entity->stopped = true and
then calls drm_sched_entity_flush. This will both reject any new
insertions into the entity's job queue and flush all pending job
submissions to HW from that entity. One point is that we need to make
drm_sched_entity_push_job return a value so the caller knows about the
job enqueue rejection.
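Roughly what I have in mind, just as a sketch (drm_sched_entity_drain does
not exist today and the error return from the push path is hypothetical;
entity->stopped and drm_sched_entity_flush() are the existing pieces, and
the locking here is only illustrative):

/* Hypothetical helper: stop the entity so nothing new can be queued on
 * it, then wait for what is already queued to reach the HW ring.
 */
long drm_sched_entity_drain(struct drm_sched_entity *entity, long timeout)
{
	/* Make drm_sched_entity_push_job() reject further submissions. */
	spin_lock(&entity->rq_lock);
	entity->stopped = true;
	spin_unlock(&entity->rq_lock);

	/* Flush the jobs that were queued before we stopped the entity. */
	return drm_sched_entity_flush(entity, timeout);
}

and in drm_sched_entity_push_job(), something like:

	/* Hypothetical: return an error (code is just a placeholder)
	 * instead of void, so the caller notices the job was not queued.
	 */
	if (entity->stopped)
		return -ENOENT;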
Hmm, it seems like a job enqueue that is rejected because we are in the
process of suspending should be more of a WARN_ON() sort of thing?
Not sure if there is something sensible to do for the caller at that
point?

What about the job's fence the caller is waiting on? If we rejected the
job submission, the caller must know about it so it does not get stuck
waiting on that fence.

Hmm, perhaps I'm not being imaginative enough, but this sort of
scenario seems like it should only arise from a bug in the driver's
suspend path, i.e. not doing all the job submission before shutting
down the scheduler. I don't think anything good is going to result
either way, which is why I was thinking you'd want a WARN_ON() to help
debug/fix that case.


Yes, I just wanted the code to not allow such bugs to go through unnoticed. I guess
a WARN_ON should give a loud enough warning anyway.

Andrey


What about runtime suspend? I guess the same issue with the scheduler racing
against HW suspend is relevant there?
Runtime suspend should be ok, as long as the driver holds a runpm
reference whenever the hw needs to be awake. The problem with system
suspend (at least if you are using pm_runtime_force_suspend() or doing
something equivalent) is that it bypasses the runpm reference.
(Which, IMO, seems like a bad design..)

I am not totally clear yet - can you expand a bit on why one case is ok
but the other is problematic?

Sure, normally pm_runtime_get/put increment a reference count; as long
as there have been more gets than puts, the device won't runtime
suspend. So, for example, msm's run_job function does a
pm_runtime_get_sync(), and retire_submit(), which runs after the job
completes on the GPU, does a pm_runtime_put_autosuspend().
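I.e. roughly this pattern (a simplified sketch, not the actual msm code;
the struct and function names here are made up):

static struct dma_fence *example_run_job(struct drm_sched_job *sched_job)
{
	struct example_gpu *gpu = to_example_gpu(sched_job);

	/* Hold a runpm reference while the job is on the HW ring, so the
	 * device cannot runtime suspend underneath it.
	 */
	pm_runtime_get_sync(gpu->dev);

	return example_submit_to_ring(gpu, sched_job);
}

static void example_retire_submit(struct example_gpu *gpu)
{
	/* The job has completed on the GPU; drop the reference so the
	 * device can runtime suspend again once it goes idle.
	 */
	pm_runtime_mark_last_busy(gpu->dev);
	pm_runtime_put_autosuspend(gpu->dev);
}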

System suspend, OTOH, bypasses this reference counting. Which is why
extra care is needed.

BR,
-R


Andrey


Also, could you point to a particular buggy scenario where the race
between the SW scheduler and suspend is causing a problem?
I wrote a piglit test[1] to try to trigger this scenario.. it isn't
really that easy to hit

BR,
-R

[1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

Andrey


BR,
-R

Andrey


BR,
-R