Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?
From: srinivas pandruvada
Date: Thu Jun 02 2022 - 17:13:17 EST
On Thu, 2022-06-02 at 22:42 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada
> <srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
> > On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> > >
> > > I wonder how common this problem is. Would it help to add a driver
> > > workaround like this?
> > This issue affects only certain SKUs. The others already work as
> > expected. These are important log bits for debug, so we don't want to
> > clear them in this path. Printing a warning for the CLX stepping is
> > fine, as long as the unrelated bits 13 and 15 are not cleared.
> > Read-modify-update should always work when we only update the bits of
> > interest. Writing 1s to this register should be a NOP.
>
> The patch I suggested doesn't change the behavior unless the initial
> write causes an exception. As long as only buggy microcode rejects the
> write, the second write just serves to clear the state that causes the
> repeated stack dumps.
But it will clear bits 13 and 15 in this case. So at least print the
current MSR value in the warning message so that we don't lose the bit
13 and bit 15 values, in case we need them for debug.
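
A rough userspace sketch of that idea, not kernel code and untested on
real hardware: the sim_* names and the simulated register are
hypothetical stand-ins for rdmsrl()/wrmsrl_safe(), the mask value is
assumed to cover the odd log bits 1..15, and the faulty microcode is
modelled (as an assumption) as rejecting any write that has bit 13 or
bit 15 set. The point is only that the value read from the MSR goes into
the warning before the fallback write clears those bits.

```c
#include <stdint.h>
#include <stdio.h>

#define BIT(n)                   (1ULL << (n))
#define THERM_STATUS_PROCHOT_LOG BIT(1)
/* Assumed mask of the log bits: 1, 3, 5, 7, 9, 11, 13, 15. */
#define CLEAR_CORE_MASK          0xAAAAULL

/* Simulated MSR; hypothetical stand-in for IA32_THERM_STATUS. */
static uint64_t sim_msr;

/*
 * Stand-in for wrmsrl_safe(). ASSUMPTION: the buggy microcode is
 * modelled as faulting on any write with bit 13 or bit 15 set.
 */
static int sim_wrmsrl_safe(uint64_t val)
{
	if (val & (BIT(13) | BIT(15)))
		return -1;
	sim_msr = val;
	return 0;
}

/*
 * Clear the PROCHOT log bit. If the first write faults, put the value
 * that was read into the warning (so bits 13 and 15 are not lost), then
 * do the fallback write that also clears them.
 * Returns 1 if the fallback path was taken, 0 otherwise.
 */
static int sim_clear_therm_status_log(char *warn, size_t len)
{
	uint64_t msr_val = sim_msr & CLEAR_CORE_MASK;

	if (!sim_wrmsrl_safe(msr_val & ~THERM_STATUS_PROCHOT_LOG))
		return 0;

	snprintf(warn, len,
		 "Failed to update IA32_THERM_STATUS (0x%llx), please upgrade microcode",
		 (unsigned long long)msr_val);
	sim_wrmsrl_safe(msr_val & ~THERM_STATUS_PROCHOT_LOG &
			~BIT(13) & ~BIT(15));
	return 1;
}
```

In the real driver this would just mean adding the MSR value as a format
argument to the existing printk_once() call.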
Thanks,
Srinivas
>
> Arnd
>
> > > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> > >
> > >  	rdmsrl(msr, msr_val);
> > >  	msr_val &= mask;
> > > -	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> > > +	if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> > > +		/* work around Cascade Lake SKZ57 erratum */
> > > +		printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
> > > +			    "please upgrade microcode\n");
> > > +		wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> > > +		       ~BIT(13) & ~BIT(15));
> > > +	}
> > >  }
> > >