Re: [PATCH 00/11] reboot: support runtime configuration of emergency hw_protection action

From: Ahmad Fatoum
Date: Sun Dec 22 2024 - 04:59:41 EST


Hello Matti,

On 22.12.24 10:38, Matti Vaittinen wrote:
> On 19/12/2024 09:31, Ahmad Fatoum wrote:
>> We currently leave the decision of whether to shutdown or reboot to
>> protect hardware in an emergency situation to the individual drivers.
>>
>> This works out in some cases, where the driver detecting the critical
>> failure has inside knowledge: It binds to the system management controller
>> for example or is guided by hardware description that defines what to do.
>>
>> This is inadequate in the general case though as a driver reporting e.g.
>> an imminent power failure can't know whether a shutdown or a reboot would
>> be more appropriate for a given hardware platform.
>
> Sometimes it can. There are platforms where the hardware is such we know that poweroff or reboot are the way to go. In such case the driver should get the information from the hardware description (like device-tree).

Sure, the thermal framework for example has a device tree property that
tells it what the critical action should be. This continues to work as
before, but this series adjust only the default behavior when that device
tree property is missing.

>> To address this, this series adds a hw_protection kernel parameter and
>> sysfs toggle that can be used to change the action from the shutdown
>> default to reboot. A new hw_protection_trigger API then makes use of
>> this default action.
>>
>> My particular use case is unattended embedded systems that don't
>> have support for shutdown and that power on automatically when power is
>> supplied:
>>
>>    - A brief power cycle gets detected by the driver
>>    - The kernel powers down the system and SoC goes into shutdown mode
>>    - Power is restored
>>    - The system remains oblivious to the restored power
>
> This sounds like a consequence of a hardware design as restoring the power doesn't wake up the SoC(?)

There are two thresholds involved: One when the regulator first reports imminent
voltage loss and the kernel does something about it and one when the capacitor
have been depleted enough that the CPU is powered off.

With short voltage glitches (or big enough capacitors), I run into the situation,
where the kernel gets a voltage loss interrupt, goes into shutdown, but voltage
is restored before the CPU is actually powered off. As the system isn't designed
for shutdown anyway, there is no mechanism to wake up from it and we need to
hard power cycle the device.

>>    - System needs to be manually power cycled for a duration long enough
>>      to drain the capacitors
>>
>> With this series, such systems can configure the kernel with
>> hw_protection=reboot to have the boot firmware worry about critical
>> conditions.
>
> I am not against the change though. Just wondering if this is still a consequence of the hardware design, and if the device-tree would be proper place to indicate that poweroff shouldn't be used.

I considered this initially: add a device tree property for regulators, like there's
already for thermal zones. But I concluded that that this is a system-wide decision
and should be decided on at a system-wide level. We already have a
reboot=[warm,cold,...etc.] parameter, so this fits right in and it allows configuring
this also for non-DT platforms.

> I'm about to leave my computer behind for holidays, so I am probably not able to do a proper review until the next year. Thus this quick comment :) Also, no strong opinion so I'm not expecting anyone to hold back waiting for me!

Thanks and wishing you happy holidays as well.

Cheers,
Ahmad

>
> Good luck and happy holidays!
> -- Matti
>
>> ---
>> Ahmad Fatoum (11):
>>        reboot: replace __hw_protection_shutdown bool action parameter with an enum
>>        reboot: reboot, not shutdown, on hw_protection_reboot timeout
>>        docs: thermal: sync hardware protection doc with code
>>        reboot: rename now misleading hw_protection symbols
>>        reboot: indicate whether it is a HARDWARE PROTECTION reboot or shutdown
>>        reboot: add support for configuring emergency hardware protection action
>>        regulator: allow user configuration of hardware protection action
>>        platform/chrome: cros_ec_lpc: prepare for hw_protection_shutdown removal
>>        dt-bindings: thermal: give OS some leeway in absence of critical-action
>>        thermal: core: allow user configuration of hardware protection action
>>        reboot: retire hw_protection_reboot and hw_protection_shutdown helpers
>>
>>   Documentation/ABI/testing/sysfs-kernel-reboot      |   8 ++
>>   Documentation/admin-guide/kernel-parameters.txt    |   6 +
>>   .../devicetree/bindings/thermal/thermal-zones.yaml |   5 +-
>>   Documentation/driver-api/thermal/sysfs-api.rst     |  25 +++--
>>   drivers/platform/chrome/cros_ec_lpc.c              |   2 +-
>>   drivers/regulator/core.c                           |   4 +-
>>   drivers/regulator/irq_helpers.c                    |  16 +--
>>   drivers/thermal/thermal_core.c                     |  17 +--
>>   drivers/thermal/thermal_core.h                     |   1 +
>>   drivers/thermal/thermal_of.c                       |   7 +-
>>   include/linux/reboot.h                             |  25 +++--
>>   include/uapi/linux/capability.h                    |   1 +
>>   kernel/reboot.c                                    | 122 ++++++++++++++++-----
>>   13 files changed, 173 insertions(+), 66 deletions(-)
>> ---
>> base-commit: 78d4f34e2115b517bcbfe7ec0d018bbbb6f9b0b8
>> change-id: 20241218-hw_protection-reboot-96953493726a
>>
>> Best regards,
>
>


--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |