Re: [PATCH v3] driver core: Add timeout for device shutdown

From: Khasnis Soumya
Date: Fri Jun 07 2024 - 07:38:06 EST


On Thu, Jun 06, 2024 at 05:23:19PM +0200, Daniel Lezcano wrote:
> On 06/06/2024 10:50, Soumya Khasnis wrote:
> > The device shutdown callbacks invoked during shutdown/reboot
> > are prone to errors depending on the device state or mishandling
> > by one or more driver. In order to prevent a device hang in such
> > scenarios, we bail out after a timeout while dumping a meaningful
> > call trace of the shutdown callback to kernel logs, which blocks
> > the shutdown or reboot process.
>
> Is that not somehow already achieved by the watchdog mechanism ?
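The hard/soft lockup watchdog enabled by CONFIG_LOCKUP_DETECTOR cannot detect
the case where a shutdown callback is stalled in an I/O wait
(wait_for_completion()/wait_for_completion_io()), because the task is asleep
rather than spinning on a CPU.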
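
To illustrate the distinction, here is a rough user-space analogy (not kernel
code). It compares the CPU time consumed by a stall that sleeps with one that
spins. The helper names (cpu_seconds, stall_sleeping, stall_spinning) are made
up for this sketch, and nanosleep() merely stands in for a blocking wait such
as wait_for_completion(). A detector that only looks for a CPU being hogged
would notice the spinning case, but has nothing to see in the sleeping case.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Monotonic wall-clock time, in seconds. */
double wall_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/*
 * Stall for roughly 'ms' milliseconds by sleeping (analogous to a task
 * blocked in an I/O wait). Returns the CPU seconds consumed meanwhile.
 */
double stall_sleeping(long ms)
{
	struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
	double before = cpu_seconds();

	nanosleep(&req, NULL);
	return cpu_seconds() - before;
}

/*
 * Stall for roughly 'ms' milliseconds by busy-waiting (analogous to a
 * CPU stuck in a loop). Returns the CPU seconds consumed meanwhile.
 */
double stall_spinning(long ms)
{
	double before = cpu_seconds();
	double end = wall_seconds() + ms / 1000.0;

	while (wall_seconds() < end)
		;	/* burn CPU until the deadline passes */
	return cpu_seconds() - before;
}
```

On an otherwise idle machine, stall_sleeping(200) should report close to zero
CPU time while stall_spinning(200) should report close to 0.2 seconds, even
though both calls hold up the caller for the same wall-clock time.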

>
> > Signed-off-by: Soumya Khasnis <soumya.khasnis@xxxxxxxx>
> > Signed-off-by: Srinavasa Nagaraju <Srinavasa.Nagaraju@xxxxxxxx>
> > ---
> > Changes in v3:
> > -fix review comments
> > -updated commit message
> >
> > drivers/base/Kconfig | 18 ++++++++++++++++++
> > drivers/base/base.h | 8 ++++++++
> > drivers/base/core.c | 40 ++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 66 insertions(+)
> >
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 2b8fd6bb7da0..342d3f87a404 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -243,3 +243,21 @@ config FW_DEVLINK_SYNC_STATE_TIMEOUT
> > work on.
> >
> > endmenu
> > +
> > +config DEVICE_SHUTDOWN_TIMEOUT
> > + bool "device shutdown timeout"
> > + default y
> > + help
> > + Enable timeout for device shutdown. In case of device shutdown is
> > + broken or device is not responding, system shutdown or restart may hang.
> > + This timeout handles such situation and triggers emergency_restart or
> > + machine_power_off. Also dumps call trace of shutdown process.
> > +
> > +
> > +config DEVICE_SHUTDOWN_TIMEOUT_SEC
> > + int "device shutdown timeout in seconds"
> > + range 10 60
> > + default 10
>
> How do you know the shutdown time is between this range?
>
> What about large systems ?
Agreed, it is difficult to choose a single timeout that suits every device.
I based this range on consumer devices, where the response time cannot be longer.
Still, as you point out, we cannot enable this option by default ("y") with a
fixed range. I will change the patch so that the option defaults to "n", as
before, and will also remove the range.

>
> > + depends on DEVICE_SHUTDOWN_TIMEOUT
> > + help
> > + sets time for device shutdown timeout in seconds
> > diff --git a/drivers/base/base.h b/drivers/base/base.h
> > index 0738ccad08b2..97eea57a8868 100644
> > --- a/drivers/base/base.h
> > +++ b/drivers/base/base.h
> > @@ -243,3 +243,11 @@ static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
> >
> > void software_node_notify(struct device *dev);
> > void software_node_notify_remove(struct device *dev);
> > +
> > +#ifdef CONFIG_DEVICE_SHUTDOWN_TIMEOUT
> > +struct device_shutdown_timeout {
> > + struct timer_list timer;
> > + struct task_struct *task;
> > +};
> > +#define SHUTDOWN_TIMEOUT CONFIG_DEVICE_SHUTDOWN_TIMEOUT_SEC
> > +#endif
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index b93f3c5716ae..dab455054a80 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -35,6 +35,12 @@
> > #include "base.h"
> > #include "physical_location.h"
> > #include "power/power.h"
> > +#include <linux/sched/debug.h>
> > +#include <linux/reboot.h>
> > +
> > +#ifdef CONFIG_DEVICE_SHUTDOWN_TIMEOUT
> > +struct device_shutdown_timeout devs_shutdown;
> > +#endif
> >
> > /* Device links support. */
> > static LIST_HEAD(deferred_sync);
> > @@ -4799,6 +4805,38 @@ int device_change_owner(struct device *dev, kuid_t kuid, kgid_t kgid)
> > }
> > EXPORT_SYMBOL_GPL(device_change_owner);
> >
> > +#ifdef CONFIG_DEVICE_SHUTDOWN_TIMEOUT
> > +static void device_shutdown_timeout_handler(struct timer_list *t)
> > +{
> > + pr_emerg("**** device shutdown timeout ****\n");
> > + show_stack(devs_shutdown.task, NULL, KERN_EMERG);
> > + if (system_state == SYSTEM_RESTART)
> > + emergency_restart();
> > + else
> > + machine_power_off();
> > +}
>
> So if one device is misbehaving, all the others shutdown callbacks are
> skipped with emergency halt/reboot ? That is prone to break the system, no?
Skipping the other callbacks may not break the system, and an emergency
shutdown or reboot is better than leaving the system hung. That is the main
purpose of this patch.
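
For illustration only, the arm/run/disarm pattern can be sketched in user space
as below. This is an analogy under stated assumptions, not the patch itself:
alarm() stands in for the kernel timer, _exit() stands in for
emergency_restart()/machine_power_off(), and the names run_shutdown, cb_ok,
cb_hang and EXIT_FORCED are hypothetical. The callbacks run in a child process
so that the forced exit can be observed by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*shutdown_cb)(void);

#define EXIT_FORCED 42	/* hypothetical status meaning "the timer fired" */

/* Stands in for device_shutdown_timeout_handler(): report, then bail out. */
void watchdog_fired(int sig)
{
	static const char msg[] = "**** shutdown timeout ****\n";
	ssize_t ret;

	(void)sig;
	/* write() and _exit() are async-signal-safe; printf() is not. */
	ret = write(STDERR_FILENO, msg, sizeof(msg) - 1);
	(void)ret;
	_exit(EXIT_FORCED);
}

/* Illustrative callbacks: one returns at once, one never returns. */
void cb_ok(void) { }
void cb_hang(void) { for (;;) pause(); }

/*
 * Run all callbacks in a child guarded by a single overall timer.
 * Returns 0 if every callback finished, EXIT_FORCED if the timer fired
 * first, or -1 if fork()/waitpid() failed or the child died abnormally.
 */
int run_shutdown(shutdown_cb *cbs, int n, unsigned int timeout_sec)
{
	int status;
	pid_t pid;

	assert(cbs != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		signal(SIGALRM, watchdog_fired);
		alarm(timeout_sec);	/* like device_shutdown_timer_set() */
		for (int i = 0; i < n; i++)
			cbs[i]();
		alarm(0);		/* like device_shutdown_timer_clr() */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the callback list { cb_ok, cb_hang, cb_ok } and a one-second timeout,
run_shutdown() returns EXIT_FORCED and the third callback never runs. That is
the trade-off discussed above: the remaining callbacks are skipped, but the
caller is not left waiting forever.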
>
> > +static void device_shutdown_timer_set(void)
> > +{
> > + devs_shutdown.task = current;
> > + timer_setup(&devs_shutdown.timer, device_shutdown_timeout_handler, 0);
> > + devs_shutdown.timer.expires = jiffies + SHUTDOWN_TIMEOUT * HZ;
> > + add_timer(&devs_shutdown.timer);
> > +}
> > +
> > +static void device_shutdown_timer_clr(void)
> > +{
> > + del_timer(&devs_shutdown.timer);
> > +}
> > +#else
> > +static inline void device_shutdown_timer_set(void)
> > +{
> > +}
> > +static inline void device_shutdown_timer_clr(void)
> > +{
> > +}
> > +#endif
> > +
> > /**
> > * device_shutdown - call ->shutdown() on each device to shutdown.
> > */
> > @@ -4810,6 +4848,7 @@ void device_shutdown(void)
> > device_block_probing();
> >
> > cpufreq_suspend();
> > + device_shutdown_timer_set();
> >
> > spin_lock(&devices_kset->list_lock);
> > /*
> > @@ -4869,6 +4908,7 @@ void device_shutdown(void)
> > spin_lock(&devices_kset->list_lock);
> > }
> > spin_unlock(&devices_kset->list_lock);
> > + device_shutdown_timer_clr();
> > }
> >
> > /*
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>