Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

From: Srinivas Pandruvada
Date: Fri Oct 18 2019 - 08:26:41 EST


On Thu, 2019-10-17 at 23:44 +0200, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> > That sounds like the right short term action.
> >
> > Depending on what we end up with from Srinivas ... we may want
> > to reconsider the severity. The basic premise of Srinivas' patch
> > is to avoid printing anything for short excursions above
> > temperature threshold. But the effect of that is that when we find
> > the core/package staying above temperature for an extended period
> > of time, we are in a serious situation where some action may be
> > needed. E.g. move the laptop off the soft surface that is blocking
> > the air vents.
>
> I don't think having a critical severity message is nearly enough.
> There are cases where the users simply won't see that message, no
> shell opened, nothing scanning dmesg, nothing pops up on the desktop
> to show KERN_CRIT messages, etc.
>
> If we really wanna handle this case then we must be much more
> reliable:
>
> * we throttle the machine from within the kernel - whatever that
>   may mean
There are actions associated with high temperature in the ACPI
thermal subsystem. The problem with tying them directly to this
warning is that the threshold temperature is set too low at power-up
on some recent laptops.

Servers and desktops generally rely on the embedded controller for fan
control, over which the kernel has no control. For them this warning
helps to either bring in additional cooling or fix the existing cooling.

If something needs to force throttling from the kernel, then we should
use some offset from the max temperature (aka TJMax) instead of this
warning threshold. Then we can use idle injection or change the duty
cycle of the CPU clocks.
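
To illustrate the offset idea, here is a rough sketch, not part of
this patch. It assumes the usual MSR layouts (TJMax in bits 23:16 of
MSR_IA32_TEMPERATURE_TARGET; digital readout, in degrees below TJMax,
in bits 22:16 of IA32_THERM_STATUS, with bit 31 as "reading valid").
The function names and the idea of passing in raw MSR values are
hypothetical, chosen only to keep the example self-contained:

```c
#include <stdbool.h>
#include <stdint.h>

/* TJMax in degrees C: bits 23:16 of a raw MSR_IA32_TEMPERATURE_TARGET value. */
static unsigned int tjmax_from_msr(uint64_t temp_target)
{
	return (unsigned int)((temp_target >> 16) & 0xff);
}

/* Digital readout: bits 22:16 of IA32_THERM_STATUS, in degrees below TJMax. */
static unsigned int degrees_below_tjmax(uint64_t therm_status)
{
	return (unsigned int)((therm_status >> 16) & 0x7f);
}

/*
 * Hypothetical policy helper: request forced throttling only when the
 * reading is valid (bit 31) and the core is within offset_c degrees of
 * TJMax, rather than keying off the warning threshold.
 */
static bool should_force_throttle(uint64_t therm_status, unsigned int offset_c)
{
	if (!(therm_status & (1ULL << 31)))
		return false;	/* reading not valid, do nothing */

	return degrees_below_tjmax(therm_status) <= offset_c;
}
```

What to do once this returns true (idle injection, clock duty cycle)
and what offset to pick are separate policy questions that the sketch
does not try to answer.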

Thanks,
Srinivas

> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...
>
> These are purely hypothetical things to do but I'm pointing them out
> as an example that in a high temperature situation we should be
> actively doing something and not wait for the user to do that.
>
> Come to think of it, one can apply the same type of logic here and
> split the temp severity into action-required events and
> action-optional events and then depending on the type, we do things.
>
> Now what those things are, should be determined by the severity of
> the events. Which would mean, we'd need to know how severe those
> events are. And since this is left in the hands of the OEMs, good
> luck to us. ;-\
>