Re: [PATCH] thermal: core: Add a back up thermal shutdown mechanism

From: Keerthy
Date: Wed Apr 12 2017 - 03:56:37 EST




On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> Keerthy,
>
> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>>>
>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>>>
>>>>>> off).
>
> <cut>
>
>>>>> OK... This seams to me, still a corner case supposed to be fixed at
>>>>> orderly_power_off, not at thermal. But..
>>>>>
>
> ^^^ Then again, this must be fixed not at thermal core. And re-reading
> the history of this thread, this seams to be really something broken at
> OMAP/DRA7, as mentioned in previous messages. That is probably why you
> are pushing for pm_power_off(), which seams to be the one that works for
> you, pulling the plug correctly (DRA7).

Zhang/Eduardo,

OMAP5/DRA7 is one case.

I believe i this is the root cause of this failure.

thermal_zone_device_check --> thermal_zone_device_update -->
handle_thermal_trip --> handle_critical_trips --> orderly_poweroff

The above sequence happens every 250/500 mS based on the configuration.
The orderly_poweroff function is getting called every 250/500 mS and i
see with a full fledged nfs file system it takes at least 5-10 Seconds
to shutdown and during that time we bombard with orderly_poweroff calls
multiple times due to the thermal_zone_device_check triggering periodically.

To confirm that i made sure that handle_critical_trips calls
orderly_poweroff only once and i no longer see the failure on DRA72-EVM
board.

So IMHO once we get to handle_critical_trips case where we do
orderly_poweroff we need to do the following:

1) Make sure that orderly_poweroff is called only once.
2) Cancel all the scheduled work queues to monitor the temperature as
we have already reached a point of shutting down the system.

Let me know your thoughts on this.

Best Regards,
Keerthy
>
>>>>>>
>>>>>>
>>>>>> However, there is no clean way of detecting such failure of
>>>>>> userspace
>>>>>> powering off the system. In such scenarios, it is necessary for a
>>>>>> backup
>>>>>> workqueue to be able to force a shutdown of the system when
>>>>>> orderly
>>>>>> shutdown is not successful after a configurable time period.
>>>>>>
>>>>> Given that system running hot is a thermal issue, I guess we care
>>>>> more
>>>>> on this matter then..
>>>> Yes!
>>>>
>>> I just read this thread again https://patchwork.kernel.org/patch/802458
>>> 1/ to recall the previous discussion.
>>>
>>> https://patchwork.kernel.org/patch/8149891/
>>> https://patchwork.kernel.org/patch/8149861/
>>> should be the solution made based on Ingo' suggestion, right?
>>>
>>> And to me, this sounds like the right direction to go, thermal does not
>>> need a back up shutdown solution, it just needs a kernel function call
>>> which guarantees the system can be shutdown/reboot immediately.
>>>
>>> is there any reason that patch 1/2 is not accepted?
>>
>> Zhang,
>>
>> http://www.serverphorums.com/read.php?12,1400964
>>
>> I got a NAK from Alan and was given this direction on a thermal_poweroff
>> which is more or less what is done in this patch.
>>
>
>
> Actually, Alan's suggestion is more for you to define a
> thermal_poweroff() that can be defined per architecture.
>
> Also, please, keep track of your patch versions and also do copy
> everybody who has stated their opinion on previous discussions. These
> patches must have Ingo, Alan, and RMK copied too. In this way we avoid
> loosing track of what has been suggested and we also converge faster to
> something everybody (or most of us) agree. Next version, please, fix
> that.
>
>
> To me, thermal core needs a function that simply powers off the system.
> No timeouts, delayed works, backups, etc. Simple and straight.
>
> The idea of having a per architecture implementation, as per Alan's
> suggestion, makes sense to me too. Having something different from
> pm_power_off(), specific to thermal, might also give the opportunity to
> save the power off reason.
>
> BR,
>
> Eduardo Valentin
>