Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

From: Arnd Bergmann
Date: Thu Jun 02 2022 - 16:42:33 EST


On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada
<srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
> On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> >
> > I wonder how common this problem is. Would it help to add a driver
> > workaround like this?
> This issue affects only certain SKUs. The others are already working as
> expected. These are important log bits for debug, so we don't want to
> clear them in this path. Printing a warning for the CLX stepping is fine
> without clearing the unrelated bits 13 and 15.
> Read-modify-update should always work where we only update the bits of
> interest. Writing 1s to this register should be a NOP.

The patch I suggested doesn't change the behavior unless the initial
write causes an exception. As long as only buggy microcode rejects the
write, the second write just serves to clear the state that causes the
repeated stack dumps.
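
To make the two cases concrete, here is a rough userspace model of that
control flow. It is only a sketch under assumptions: fake_msr,
fake_wrmsr_safe(), clear_prochot_log() and run_model() are made-up names,
the bit positions are what I believe therm_throt.c uses, and "buggy
microcode" is modelled simply as rejecting any write that has a 1 in
bit 13 or 15. Log bits are treated as write-0-to-clear, so writing a 1
leaves a bit unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)       (1ULL << (n))
/* Assumed to mirror THERM_STATUS_PROCHOT_LOG / THERM_STATUS_CLEAR_CORE_MASK. */
#define PROCHOT_LOG  BIT(1)
#define LOG_MASK     (BIT(1) | BIT(3) | BIT(5) | BIT(7) | \
                      BIT(9) | BIT(11) | BIT(13) | BIT(15))

/* Hypothetical stand-in for IA32_THERM_STATUS on one CPU. */
struct fake_msr {
    uint64_t val;
    bool rejects_13_15;  /* models microcode that faults on these bits */
};

/* Returns nonzero where the real wrmsrl_safe() would report a fault. */
static int fake_wrmsr_safe(struct fake_msr *m, uint64_t v)
{
    if (m->rejects_13_15 && (v & (BIT(13) | BIT(15))))
        return -1;
    /* Write-0-to-clear: a log bit survives only if it is written as 1. */
    m->val &= ~LOG_MASK | v;
    return 0;
}

/* Same shape as the proposed hunk; returns 1 if the fallback write ran. */
static int clear_prochot_log(struct fake_msr *m)
{
    uint64_t v = (m->val & LOG_MASK) & ~PROCHOT_LOG;

    if (fake_wrmsr_safe(m, v)) {
        /* Second attempt also writes 0 to bits 13 and 15. */
        fake_wrmsr_safe(m, v & ~BIT(13) & ~BIT(15));
        return 1;
    }
    return 0;
}

/* Start with PROCHOT log and bits 13/15 set, then clear the PROCHOT log. */
uint64_t run_model(bool buggy, int *used_fallback)
{
    struct fake_msr m = {
        .val = PROCHOT_LOG | BIT(13) | BIT(15),
        .rejects_13_15 = buggy,
    };

    *used_fallback = clear_prochot_log(&m);
    return m.val;
}
```

In this model run_model(false, ...) leaves bits 13 and 15 set and never
takes the fallback, while run_model(true, ...) takes it once and ends up
with those two bits cleared as well.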

Arnd

> > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> >
> >  	rdmsrl(msr, msr_val);
> >  	msr_val &= mask;
> > -	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> > +	if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> > +		/* work around Cascade Lake SKZ57 erratum */
> > +		printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
> > +			    "please upgrade microcode\n");
> > +		wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> > +		       ~BIT(13) & ~BIT(15));
> > +	}
> >  }
> >