Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

From: srinivas pandruvada
Date: Thu Jun 02 2022 - 16:10:33 EST


On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada
> <srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
> > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> > > <srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that
> > > > > started
> > > > > constantly spewing messages from the therm_throt driver after
> > > > > one
> > > > > core overheated:
> > > > >
> > > > I think this is a Cascade Lake system. Have you tried the
> > > > latest
> > > > micro-
> > > > code?
> > >
> > > Thanks for your quick reply. I have installed the latest
> > > microcode
> > > 0x5003302
> > > now (manually, because the version provided by the distro was
> > > still
> > > using
> > > version 0x5003102).
> > >
> > > After that, I tried writing the value 0x2a80 from userspace, and
> > > that did not cause a trap, so I assume that fixed it.
> > >
> > Thanks for reporting.
> > I am aware of this issue and should be fixed by microcode update.
>
> I wonder how common this problem it is. Would it help to add a driver
> workaround
> like this?
This issue affects only certain skews. The others already working as
expected. These are important log bits for debug, we don't want to
clear in this path. Printing warning for CLX stepping is fine without
clearing unrelated bits 13 and 15. 
Read-modify-update should always work where we only update the bits of
interest. Writing 1s to this register should be NOP.

Thanks,
Srinivas

>
> diff --git a/drivers/thermal/intel/therm_throt.c
> b/drivers/thermal/intel/therm_throt.c
> index 8352083b87c7..acb402e56796 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
>
>         rdmsrl(msr, msr_val);
>         msr_val &= mask;
> -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> +       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> +               /* work around Cascade Lake SKZ57 erratum */
> +               printk_once(KERN_WARNING "Failed to update
> IA32_THERM_STATUS, "
> +                                       "please upgrade
> microcode\n");
> +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> +                       ~BIT(13) & ~BIT(15));
> +       }
>  }
>
>  static void get_therm_status(int level, bool *proc_hot, u8 *temp)
>
>         Arnd