Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

From: Borislav Petkov
Date: Sat Oct 19 2019 - 04:11:31 EST


On Fri, Oct 18, 2019 at 01:38:32PM -0700, Luck, Tony wrote:
> Sorry to have caused confusion.

Ditto. But us causing confusion is fine - this way we can talk about
what we really wanna do!

:-)))

> The thoughts behind that statement are that we currently have an issue
> with too many noisy high severity messages. The interim solution we
> are going with is to downgrade the severity. But if we apply a time
> based filter to remove most of the noise by not printing at all, maybe
> what we have left is a very small number of high severity messages.
>
> But that's completely up for debate.

Well, I think those messages being pr_warn are fine if one wants to
inspect dmesg for signs of overheating and the platform is hitting some
thermal limits.

And if the time-based filter is not too accurate, that's fine too, I
guess, as long as we don't flood dmesg.

What I don't like is the command line parameter and us putting the onus
on the user to decide although we have all that info in the kernel
already and we can do that decision automatically.

> I agree it is a good thing to look at. I'm not so sure we will find
> a good enough method that works all the way from tablet to server,
> so we might end up with "#define MAX_THERM_TIME 8000" ... but some
> study of options would either turn up a good heuristic, or provide
> evidence for why that is either hard, or no better than a constant.

Yeah, I still think a simple avg filter which starts from a sufficiently
high value and improves it over time, should be good enough.

Hell, even the trivial formula we use in the CMCI interrupt for polling,
might work, where we either double the interval or halve it, depending
on recent history.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette