Re: [PATCH v2] drm/msm: Make .remove and .shutdown HW shutdown consistent
From: Dmitry Baryshkov
Date: Sun Jul 24 2022 - 14:36:38 EST
On Sun, 24 Jul 2022 at 14:13, Javier Martinez Canillas
<javierm@xxxxxxxxxx> wrote:
>
> Drivers' .remove and .shutdown callbacks are executed on different code
> paths. The former is called when a device is removed from the bus, while
> the latter is called at system shutdown time to quiesce the device.
>
> This means that some overlap exists between the two, because both have to
> take care of properly shutting down the hardware. But currently the logic
> used in these two callbacks isn't consistent in msm drivers, which could
> lead to kernel oops.
>
> For example, on .remove the component is deleted and its .unbind callback
> leads to the hardware being shut down, but only if the DRM device has been
> marked as registered.
>
> That check doesn't exist in the .shutdown logic and this can lead to the
> driver calling drm_atomic_helper_shutdown() for a DRM device that hasn't
> been properly initialized.
>
> A situation like this can happen if drivers for expected sub-devices fail
> to probe, since the .bind callback will never be executed. If that is the
> case, drm_atomic_helper_shutdown() will attempt to take mutexes that are
> only initialized if drm_mode_config_init() is called during a device bind.
>
> Commit 623f279c7781 ("drm/msm: fix shutdown hook in case GPU components
> failed to bind") attempted to fix this bug, but unfortunately it still
> happens in some cases, such as the one mentioned above:
>
> [ 169.495897] systemd-shutdown[1]: Powering off.
> [ 169.500466] kvm: exiting hardware virtualization
> [ 169.554787] platform wifi-firmware.0: Removing from iommu group 12
> [ 169.610238] platform video-firmware.0: Removing from iommu group 10
> [ 169.682164] ------------[ cut here ]------------
> [ 169.686909] WARNING: CPU: 6 PID: 1 at drivers/gpu/drm/drm_modeset_lock.c:317 drm_modeset_lock_all_ctx+0x3c4/0x3d0
> ...
> [ 169.775691] Hardware name: Google CoachZ (rev3+) (DT)
> [ 169.780874] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 169.788021] pc : drm_modeset_lock_all_ctx+0x3c4/0x3d0
> [ 169.793205] lr : drm_modeset_lock_all_ctx+0x48/0x3d0
> [ 169.798299] sp : ffff80000805bb80
> [ 169.801701] x29: ffff80000805bb80 x28: ffff327c00128000 x27: 0000000000000000
> [ 169.809025] x26: 0000000000000000 x25: 0000000000000001 x24: ffffc95d820ec030
> [ 169.816349] x23: ffff327c00bbd090 x22: ffffc95d8215eca0 x21: ffff327c039c5800
> [ 169.823674] x20: ffff327c039c5988 x19: ffff80000805bbe8 x18: 0000000000000034
> [ 169.830998] x17: 000000040044ffff x16: ffffc95d80cac920 x15: 0000000000000000
> [ 169.838322] x14: 0000000000000315 x13: 0000000000000315 x12: 0000000000000000
> [ 169.845646] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> [ 169.852971] x8 : ffff80000805bc28 x7 : 0000000000000000 x6 : 0000000000000000
> [ 169.860295] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 169.867619] x2 : ffff327c00128000 x1 : 0000000000000000 x0 : ffff327c039c59b0
> [ 169.874944] Call trace:
> [ 169.877467] drm_modeset_lock_all_ctx+0x3c4/0x3d0
> [ 169.882297] drm_atomic_helper_shutdown+0x70/0x134
> [ 169.887217] msm_drv_shutdown+0x30/0x40
> [ 169.891159] platform_shutdown+0x28/0x40
> [ 169.895191] device_shutdown+0x148/0x350
> [ 169.899221] kernel_power_off+0x38/0x80
> [ 169.903163] __do_sys_reboot+0x288/0x2c0
> [ 169.907192] __arm64_sys_reboot+0x28/0x34
> [ 169.911309] invoke_syscall+0x48/0x114
> [ 169.915162] el0_svc_common.constprop.0+0x44/0xec
> [ 169.919992] do_el0_svc+0x2c/0xc0
> [ 169.923394] el0_svc+0x2c/0x84
> [ 169.926535] el0t_64_sync_handler+0x11c/0x150
> [ 169.931013] el0t_64_sync+0x18c/0x190
> [ 169.934777] ---[ end trace 0000000000000000 ]---
> [ 169.939557] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
> [ 169.948574] Mem abort info:
> [ 169.951452] ESR = 0x0000000096000004
> [ 169.955307] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 169.960765] SET = 0, FnV = 0
> [ 169.963901] EA = 0, S1PTW = 0
> [ 169.967127] FSC = 0x04: level 0 translation fault
> [ 169.972136] Data abort info:
> [ 169.975093] ISV = 0, ISS = 0x00000004
> [ 169.979037] CM = 0, WnR = 0
> [ 169.982083] user pgtable: 4k pages, 48-bit VAs, pgdp=000000010eab1000
> [ 169.988697] [0000000000000018] pgd=0000000000000000, p4d=0000000000000000
> [ 169.995669] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> ...
> [ 170.079614] Hardware name: Google CoachZ (rev3+) (DT)
> [ 170.084801] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 170.091941] pc : ww_mutex_lock+0x28/0x32c
> [ 170.096064] lr : drm_modeset_lock_all_ctx+0x1b0/0x3d0
> [ 170.101254] sp : ffff80000805bb50
> [ 170.104658] x29: ffff80000805bb50 x28: ffff327c00128000 x27: 0000000000000000
> [ 170.111977] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000018
> [ 170.119296] x23: ffff80000805bc10 x22: ffff327c039c5ad8 x21: ffff327c039c5800
> [ 170.126615] x20: ffff80000805bbe8 x19: 0000000000000018 x18: 0000000000000034
> [ 170.133933] x17: 000000040044ffff x16: ffffc95d80cac920 x15: 0000000000000000
> [ 170.141252] x14: 0000000000000315 x13: 0000000000000315 x12: 0000000000000000
> [ 170.148571] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> [ 170.155890] x8 : ffff80000805bc28 x7 : 0000000000000000 x6 : 0000000000000000
> [ 170.163209] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 170.170528] x2 : ffff327c00128000 x1 : 0000000000000000 x0 : 0000000000000018
> [ 170.177847] Call trace:
> [ 170.180364] ww_mutex_lock+0x28/0x32c
> [ 170.184127] drm_modeset_lock_all_ctx+0x1b0/0x3d0
> [ 170.188957] drm_atomic_helper_shutdown+0x70/0x134
> [ 170.193876] msm_drv_shutdown+0x30/0x40
> [ 170.197820] platform_shutdown+0x28/0x40
> [ 170.201854] device_shutdown+0x148/0x350
> [ 170.205888] kernel_power_off+0x38/0x80
> [ 170.209832] __do_sys_reboot+0x288/0x2c0
> [ 170.213866] __arm64_sys_reboot+0x28/0x34
> [ 170.217990] invoke_syscall+0x48/0x114
> [ 170.221843] el0_svc_common.constprop.0+0x44/0xec
> [ 170.226672] do_el0_svc+0x2c/0xc0
> [ 170.230079] el0_svc+0x2c/0x84
> [ 170.233215] el0t_64_sync_handler+0x11c/0x150
> [ 170.237686] el0t_64_sync+0x18c/0x190
> [ 170.241451] Code: aa0103f4 d503201f d2800001 aa0103e3 (c8e37c02)
> [ 170.247704] ---[ end trace 0000000000000000 ]---
> [ 170.252457] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 170.260654] Kernel Offset: 0x495d77c00000 from 0xffff800008000000
> [ 170.266910] PHYS_OFFSET: 0xffffcd8500000000
> [ 170.271212] CPU features: 0x800,00c2a015,19801c82
> [ 170.276042] Memory Limit: none
> [ 170.279183] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
> Fixes: 623f279c7781 ("drm/msm: fix shutdown hook in case GPU components failed to bind")
> Signed-off-by: Javier Martinez Canillas <javierm@xxxxxxxxxx>
> ---
>
> Changes in v2:
> - Take the registered check out of msm_shutdown_hw() and make callers check instead.
> - Make msm_shutdown_hw() an inline function.
> - Add a Fixes: tag.
>
> drivers/gpu/drm/msm/msm_drv.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 1ed4cd09dbf8..6deecb13a31c 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -190,6 +190,20 @@ static int vblank_ctrl_queue_work(struct msm_drm_private *priv,
>  	return 0;
> }
>
> +/*
> + * Shutdown the hw if we're far enough along where things might be on.
> + * If we run this too early, we'll end up panicking in any variety of
> + * places. Since we don't register the drm device until late in
> + * msm_drm_init, drm_dev->registered is used as an indicator that the
> + * shutdown will be successful.
> + *
> + * This function must only be called if drm_dev->registered is true.
> + */
> +static inline void msm_shutdown_hw(struct drm_device *dev)
> +{
> +	drm_atomic_helper_shutdown(dev);
> +}

Now that the check has moved to the callers, there is no point in keeping
this as a separate function. Could you please drop the wrapper and call
drm_atomic_helper_shutdown() directly?

> +
> static int msm_drm_uninit(struct device *dev)
> {
>  	struct platform_device *pdev = to_platform_device(dev);
> @@ -198,16 +212,9 @@ static int msm_drm_uninit(struct device *dev)
>  	struct msm_kms *kms = priv->kms;
>  	int i;
>
> -	/*
> -	 * Shutdown the hw if we're far enough along where things might be on.
> -	 * If we run this too early, we'll end up panicking in any variety of
> -	 * places. Since we don't register the drm device until late in
> -	 * msm_drm_init, drm_dev->registered is used as an indicator that the
> -	 * shutdown will be successful.
> -	 */
>  	if (ddev->registered) {
>  		drm_dev_unregister(ddev);
> -		drm_atomic_helper_shutdown(ddev);
> +		msm_shutdown_hw(ddev);
>  	}
>
>  	/* We must cancel and cleanup any pending vblank enable/disable
> @@ -1242,10 +1249,8 @@ void msm_drv_shutdown(struct platform_device *pdev)
>  	struct msm_drm_private *priv = platform_get_drvdata(pdev);
>  	struct drm_device *drm = priv ? priv->dev : NULL;
>
> -	if (!priv || !priv->kms)
> -		return;
> -
> -	drm_atomic_helper_shutdown(drm);

It might be worth repeating the comment that explains the ->registered
check here.

> +	if (drm && drm->registered)
> +		drm_atomic_helper_shutdown(drm);
> }
>
> static struct platform_driver msm_platform_driver = {
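
As a rough illustration of the two suggestions above (call
drm_atomic_helper_shutdown() directly, and keep the explanatory comment
next to the check), here is a minimal user-space sketch in C. It is not
kernel code and it is untested against the driver: the struct layouts, the
mode_config_ready and shutdown_calls fields, the stubbed
drm_atomic_helper_shutdown() and the msm_drv_shutdown_model() name are all
assumptions made so that the snippet compiles on its own.

```c
/*
 * User-space model only. Every type and stub below is a hypothetical
 * stand-in for illustration, not the kernel's definition.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct drm_device {
	bool registered;        /* set late in init, after bind succeeded */
	bool mode_config_ready; /* models drm_mode_config_init() having run */
	int shutdown_calls;     /* counts helper calls, for demonstration */
};

struct msm_drm_private {
	struct drm_device *dev;
};

/*
 * Stub. The real helper takes modeset locks that only exist once mode
 * config has been initialized; the assert models the oops in the trace.
 */
static void drm_atomic_helper_shutdown(struct drm_device *dev)
{
	assert(dev->mode_config_ready);
	dev->shutdown_calls++;
}

/* Model of the .shutdown path with the wrapper function dropped. */
static void msm_drv_shutdown_model(struct msm_drm_private *priv)
{
	struct drm_device *drm = priv ? priv->dev : NULL;

	/*
	 * Shutdown the hw if we're far enough along where things might be
	 * on. drm->registered is used as an indicator that the shutdown
	 * will be successful.
	 */
	if (drm && drm->registered)
		drm_atomic_helper_shutdown(drm);
}
```

In this model a device whose bind never ran (registered is false) is
skipped, so the helper never touches uninitialized state, while a fully
registered device is shut down exactly once per call.
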
--
With best wishes
Dmitry