Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

From: Borislav Petkov
Date: Thu Oct 17 2019 - 17:44:57 EST


On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> That sounds like the right short term action.
>
> Depending on what we end up with from Srinivas ... we may want
> to reconsider the severity. The basic premise of Srinivas' patch
> is to avoid printing anything for short excursions above temperature
> threshold. But the effect of that is that when we find the core/package
> staying above temperature for an extended period of time, we are
> in a serious situation where some action may be needed. E.g.
> move the laptop off the soft surface that is blocking the air vents.

I don't think having a critical severity message is nearly enough.
There are cases where the users simply won't see that message: no shell
open, nothing scanning dmesg, nothing popping up on the desktop to show
KERN_CRIT messages, etc.

If we really wanna handle this case then we must be much more reliable:

* we throttle the machine from within the kernel - whatever that may mean
* if that doesn't help, we stop scheduling !root tasks
* if that doesn't help, we halt
* ...

These are purely hypothetical things to do but I'm pointing them out as
an example that in a high temperature situation we should be actively
doing something rather than waiting for the user to do it.
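
To make the escalation idea concrete, here is a minimal userspace sketch
of the steps listed above as a small state machine. Everything in it is
hypothetical: the enum, the names and therm_next_action() are made up
for illustration and are not existing kernel interfaces. It stops at
halt and does not model the open-ended tail of the list.

```c
#include <stdbool.h>

/* Hypothetical escalation steps, in the order given in the list above. */
enum thermal_action {
	THERM_ACT_NONE,		/* back under the threshold, nothing to do */
	THERM_ACT_THROTTLE,	/* throttle the machine from within the kernel */
	THERM_ACT_STOP_NONROOT,	/* stop scheduling !root tasks */
	THERM_ACT_HALT,		/* halt the machine */
};

/*
 * Pick the next step from the current one. A step is only escalated
 * when the previous one did not help, i.e. we are still above the
 * temperature threshold. Once we cool down, we drop back to NONE.
 */
static enum thermal_action therm_next_action(enum thermal_action cur,
					     bool still_hot)
{
	if (!still_hot)
		return THERM_ACT_NONE;

	if (cur < THERM_ACT_HALT)
		return (enum thermal_action)(cur + 1);

	return THERM_ACT_HALT;
}
```

Called once per evaluation period with the action currently in effect,
this walks throttle -> stop !root tasks -> halt for as long as the
temperature stays high, and resets as soon as it does not.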

Come to think of it, one can apply the same type of logic here and split
the temp severity into action-required events and action-optional events
and then depending on the type, we do things.
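
A rough sketch of such a split, again in plain userspace C. It assumes
that the time spent above the threshold is what separates the two
classes, which follows the short-excursion vs. extended-period
distinction quoted above. The names and the 5 second cutoff are invented
for illustration only; a real boundary would depend on the platform.

```c
#include <stdbool.h>

/* Hypothetical event classes for the split described above. */
enum therm_event_class {
	THERM_EVT_ACTION_OPTIONAL,	/* short excursion: nothing to do */
	THERM_EVT_ACTION_REQUIRED,	/* sustained: something must act */
};

/* Made-up cutoff, not a value taken from any spec or from the patch. */
#define THERM_SUSTAINED_MS	5000u

/* Classify an event by how long we have been above the threshold. */
static enum therm_event_class therm_classify(unsigned int ms_above_threshold)
{
	if (ms_above_threshold >= THERM_SUSTAINED_MS)
		return THERM_EVT_ACTION_REQUIRED;

	return THERM_EVT_ACTION_OPTIONAL;
}

/* Depending on the type, we do things: only REQUIRED events trigger action. */
static bool therm_must_act(enum therm_event_class cls)
{
	return cls == THERM_EVT_ACTION_REQUIRED;
}
```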

Now, what those things are should be determined by the severity of the
events, which means we'd need to know how severe those events are.
And since this is left in the hands of the OEMs, good luck to us. ;-\

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette