[PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers

From: Lukas Wunner
Date: Sun Feb 11 2018 - 04:38:55 EST


Fix a deadlock on hybrid graphics laptops that's been present since 2013:

DRM drivers poll connectors in 10 sec intervals. The poll worker is
stopped on ->runtime_suspend with cancel_delayed_work_sync(). However
the poll worker invokes the DRM drivers' ->detect callbacks, which call
pm_runtime_get_sync(). If the poll worker starts after runtime suspend
has begun, pm_runtime_get_sync() will wait for runtime suspend to finish
with the intention of runtime resuming the device afterwards. The result
is a circular wait between poll worker and autosuspend worker.

I'm seeing this deadlock so often it's not funny anymore. I've also found
3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and
[4/5].)

One theoretical solution would be to add a flag to the ->detect callback
to tell it that it's running in the poll worker's context. In that case
it doesn't need to call pm_runtime_get_sync() because the poll worker is
only enabled while runtime active and we know that ->runtime_suspend
waits for it to finish. But this approach would require touching every
single DRM driver's ->detect hook. Moreover the ->detect hook is called
from numerous other places, both in the DRM library as well as in each
driver, making it necessary to trace every possible call chain and check
if it's coming from the poll worker or not. I've tried to do that for
nouveau (see the call sites listed in the commit message of patch [3/5])
and concluded that this approach is a nightmare to implement.

Another reason for the infeasibility of this approach is that ->detect
is called from within the DRM library without driver involvement, e.g.
via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would
have to be called by the DRM library, but the DRM library is supposed to
stay generic, wheras runtime PM is driver-specific.

So instead, I've come up with this much simpler solution which gleans
from the current task's flags if it's a workqueue worker, retrieves the
work_struct and checks if it's the poll worker. All that's needed is
one small helper in the workqueue code and another small helper in the
DRM library. This solution lends itself nicely to stable-backporting.

The patches for radeon and amdgpu are compile-tested only, I only have a
MacBook Pro with an Nvidia GK107 to test. To test the patches, add an
"msleep(12*1000);" at the top of the driver's ->runtime_suspend hook.
This ensures that the poll worker runs after ->runtime_suspend has begun.
Wait 12 sec after the GPU has begun runtime suspend, then check
/sys/bus/pci/devices/0000:01:00.0/power/runtime_status. Without this
series, the status will be stuck at "suspending" and you'll get hung task
errors in dmesg after a few minutes.

i915, malidp and msm "solved" this issue by not stopping the poll worker
on runtime suspend. But this results in the GPU bouncing back and forth
between D0 and D3 continuously. That's a total no-go for GPUs which
runtime suspend to D3cold since every suspend/resume cycle costs a
significant amount of time and energy. (i915 and malidp do not seem
to acquire a runtime PM ref in the ->detect callbacks, which seems
questionable. msm however does and would also deadlock if it disabled
the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx)

Please review. Thanks,

Lukas

Lukas Wunner (5):
workqueue: Allow retrieval of current task's work struct
drm: Allow determining if current task is output poll worker
drm/nouveau: Fix deadlock on runtime suspend
drm/radeon: Fix deadlock on runtime suspend
drm/amdgpu: Fix deadlock on runtime suspend

drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +++++++++++++-------
drivers/gpu/drm/drm_probe_helper.c | 14 +++++
drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +++++--
drivers/gpu/drm/radeon/radeon_connectors.c | 74 +++++++++++++++++---------
include/drm/drm_crtc_helper.h | 1 +
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 16 ++++++
7 files changed, 132 insertions(+), 50 deletions(-)

--
2.15.1