Re: [PATCH] thermal: core: Add a back up thermal shutdown mechanism

From: Keerthy
Date: Wed Apr 12 2017 - 04:37:23 EST




On Wednesday 12 April 2017 01:56 PM, Zhang Rui wrote:
> On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote:
>>
>> On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
>>>
>>> Keerthy,
>>>
>>> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
>>>>
>>>>
>>>>
>>>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
>>>>>
>>>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>>>>>
>>>>>>
>>>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> off).
>>> <cut>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> OK... This seams to me, still a corner case supposed to be
>>>>>>> fixed at
>>>>>>> orderly_power_off, not at thermal. But..
>>>>>>>
>>> ^^^ Then again, this must be fixed not at thermal core. And re-
>>> reading
>>> the history of this thread, this seams to be really something
>>> broken at
>>> OMAP/DRA7, as mentioned in previous messages. That is probably why
>>> you
>>> are pushing for pm_power_off(), which seams to be the one that
>>> works for
>>> you, pulling the plug correctly (DRA7).
>> Zhang/Eduardo,
>>
>> OMAP5/DRA7 is one case.
>>
>> I believe i this is the root cause of this failure.
>>
>> thermal_zone_device_check --> thermal_zone_device_update -->
>> handle_thermal_trip --> handle_critical_trips --> orderly_poweroff
>>
>> The above sequence happens every 250/500 mS based on the
>> configuration.
>> The orderly_poweroff function is getting called every 250/500 mS and
>> i
>> see with a full fledged nfs file system it takes at least 5-10
>> Seconds
>> to shutdown and during that time we bombard with orderly_poweroff
>> calls
>> multiple times due to the thermal_zone_device_check triggering
>> periodically.
>>
>> To confirm that i made sure that handle_critical_trips calls
>> orderly_poweroff only once and i no longer see the failure on DRA72-
>> EVM
>> board.
>>
> Nice catch!

Thanks.

>
>> So IMHO once we get to handle_critical_trips case where we do
>> orderly_poweroff we need to do the following:
>>
>> 1) Make sure that orderly_poweroff is called only once.
>
> agreed.
>
>> 2) Cancel all the scheduled work queues to monitor the temperature as
>> we have already reached a point of shutting down the system.
>>
> agreed.
>
> now I think we've found the root cause of the problem.
> orderly_poweroff() is not reenterable and it does not have to be.
> If we're using orderly_poweroff() for emergency power off, we have to
> use it correctly.
>
> will you generate a patch to do this?

Sure. I will generate a patch to take care of 1) To make sure that
orderly_poweroff is called only once right away. I have already tested.

for 2) Cancel all the scheduled work queues to monitor the temperature.
I will take some more time to make it and test.

Is that okay? Or you want me to send both together?

Regards,
Keerthy

>
> thanks,
> rui
>
>> Let me know your thoughts on this.
>>
>> Best Regards,
>> Keerthy
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> However, there is no clean way of detecting such failure
>>>>>>>> of
>>>>>>>> userspace
>>>>>>>> powering off the system. In such scenarios, it is
>>>>>>>> necessary for a
>>>>>>>> backup
>>>>>>>> workqueue to be able to force a shutdown of the system
>>>>>>>> when
>>>>>>>> orderly
>>>>>>>> shutdown is not successful after a configurable time
>>>>>>>> period.
>>>>>>>>
>>>>>>> Given that system running hot is a thermal issue, I guess
>>>>>>> we care
>>>>>>> more
>>>>>>> on this matter then..
>>>>>> Yes!
>>>>>>
>>>>> I just read this thread again https://patchwork.kernel.org/patc
>>>>> h/802458
>>>>> 1/ to recall the previous discussion.
>>>>>
>>>>> https://patchwork.kernel.org/patch/8149891/
>>>>> https://patchwork.kernel.org/patch/8149861/
>>>>> should be the solution made based on Ingo' suggestion, right?
>>>>>
>>>>> And to me, this sounds like the right direction to go, thermal
>>>>> does not
>>>>> need a back up shutdown solution, it just needs a kernel
>>>>> function call
>>>>> which guarantees the system can be shutdown/reboot immediately.
>>>>>
>>>>> is there any reason that patch 1/2 is not accepted?
>>>> Zhang,
>>>>
>>>> http://www.serverphorums.com/read.php?12,1400964
>>>>
>>>> I got a NAK from Alan and was given this direction on a
>>>> thermal_poweroff
>>>> which is more or less what is done in this patch.
>>>>
>>>
>>> Actually, Alan's suggestion is more for you to define a
>>> thermal_poweroff() that can be defined per architecture.
>>>
>>> Also, please, keep track of your patch versions and also do copy
>>> everybody who has stated their opinion on previous discussions.
>>> These
>>> patches must have Ingo, Alan, and RMK copied too. In this way we
>>> avoid
>>> loosing track of what has been suggested and we also converge
>>> faster to
>>> something everybody (or most of us) agree. Next version, please,
>>> fix
>>> that.
>>>
>>>
>>> To me, thermal core needs a function that simply powers off the
>>> system.
>>> No timeouts, delayed works, backups, etc. Simple and straight.
>>>
>>> The idea of having a per architecture implementation, as per Alan's
>>> suggestion, makes sense to me too. Having something different from
>>> pm_power_off(), specific to thermal, might also give the
>>> opportunity to
>>> save the power off reason.
>>>
>>> BR,
>>>
>>> Eduardo Valentin
>>>