Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

From: Peter Zijlstra
Date: Fri Oct 18 2019 - 03:17:17 EST


On Thu, Oct 17, 2019 at 11:44:45PM +0200, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> > That sounds like the right short term action.
> >
> > Depending on what we end up with from Srinivas ... we may want
> > to reconsider the severity. The basic premise of Srinivas' patch
> > is to avoid printing anything for short excursions above temperature
> > threshold. But the effect of that is that when we find the core/package
> > staying above temperature for an extended period of time, we are
> > in a serious situation where some action may be needed. E.g.
> > move the laptop off the soft surface that is blocking the air vents.
>
> I don't think having a critical severity message is nearly enough.
> There are cases where the users simply won't see that message, no shell
> opened, nothing scanning dmesg, nothing pops up on the desktop to show
> KERN_CRIT messages, etc.
>
> If we really wanna handle this case then we must be much more reliable:
>
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...

We have forced idle injection; that should be able to reduce the system
to a barely functional but non-cooker state.
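
For illustration only, the escalation steps quoted above could be sketched
as a small state machine. This is a userspace sketch, not kernel code: the
enum, its values and next_action() are hypothetical names made up for this
example, and it assumes one escalation step per polling interval while the
core/package stays above threshold, with idle injection as the first
"throttle" step.

```c
#include <stdbool.h>

/* Hypothetical escalation levels, mirroring the quoted list. */
enum thermal_action {
	ACT_NONE,		/* temperature is fine, do nothing */
	ACT_THROTTLE,		/* throttle in-kernel, e.g. forced idle injection */
	ACT_STOP_NONROOT,	/* stop scheduling !root tasks */
	ACT_HALT,		/* last resort: halt the machine */
};

/*
 * Pick the next action given the one currently in force.
 *
 * Assumption: this is called once per polling interval. While the
 * core/package is still over threshold we escalate one step, saturating
 * at ACT_HALT; as soon as it cools down we drop back to ACT_NONE.
 */
static enum thermal_action next_action(enum thermal_action cur,
				       bool over_threshold)
{
	if (!over_threshold)
		return ACT_NONE;

	if (cur < ACT_HALT)
		return (enum thermal_action)(cur + 1);

	return ACT_HALT;
}
```

Whether de-escalation should be immediate, as here, or go through some
hysteresis is a policy choice the sketch does not try to settle.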