Re: [PATCH] x86: handle MSR exception when setting energy perf bias

From: Alan Cox
Date: Thu Oct 12 2017 - 06:03:29 EST


On Thu, 12 Oct 2017 01:30:07 -0300
Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> wrote:

> On very rare occasions, immediately after a suspend, one of our
> SandyBridge CI boxes hits the exception below on CPU0 while trying to
> reconfigure the energy bias register. As far as I can tell, this is not
> likely a race in the kernel, since we have only one cpu online, no
> preempt and irqs_disabled, and it can only be reproduced in this
> specific SNB-2600 on rare occasions. It looks more of a faulty hardware
> thing to me.
>
> Still, we can handle this exception more gracefully to silence the CI,
> by using the safe version of the msrl_read/write wrapper.

Which means we would silently fail to discover any real problems (like
this one) on systems with a bug. Your system appears to have a problem -
whether it's firmware or Linux I don't know but we should not be covering
it up silently and hoping ignoring it makes it go away - especially when
it will hide other bugs in the future.

At the point it occurs, dump bit 3 of ECX from CPUID leaf 6 on that
logical cpu. If that bit is set you should have IA32_ENERGY_PERF_BIAS; if
it's clear you don't. If it's clear here and set somewhere else (e.g. at
boot) then you've got some hints as to what may be going on. If you get
the mismatch, see if it is different per core or something insane like
that.
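
For reference, a rough userspace sketch of that check, not a patch. It
assumes GCC or clang's <cpuid.h> and its __get_cpuid_count() helper; the
function names ecx_has_epb(), read_cpuid6_ecx() and report_epb() are
made up for illustration. It only reports the CPU it happens to run on,
so pin it (for example with taskset) to compare logical cpus. In the
resume path itself you would want the in-kernel equivalent, something
along the lines of printing cpuid_ecx(6) & (1 << 3) just before the MSR
access.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/clang helper header, assumed available */
#endif

#define CPUID_LEAF_THERMAL_POWER	6u	/* CPUID leaf 6 */
#define CPUID6_ECX_EPB_BIT		3u	/* ECX bit 3: IA32_ENERGY_PERF_BIAS */

/* Return 1 if bit 3 of a CPUID.06H:ECX value is set, 0 otherwise. */
int ecx_has_epb(uint32_t ecx)
{
	return (int)((ecx >> CPUID6_ECX_EPB_BIT) & 1u);
}

/*
 * Read CPUID.06H:ECX on the calling cpu.
 * Returns 0 on success, -1 if the leaf is unsupported or this is not x86.
 */
int read_cpuid6_ecx(uint32_t *ecx_out)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(CPUID_LEAF_THERMAL_POWER, 0,
			       &eax, &ebx, &ecx, &edx))
		return -1;
	*ecx_out = ecx;
	return 0;
#else
	(void)ecx_out;
	return -1;
#endif
}

/* Print the bit for the calling cpu, tagged with where it was sampled. */
void report_epb(const char *where)
{
	uint32_t ecx;

	if (read_cpuid6_ecx(&ecx) != 0) {
		printf("%s: CPUID leaf 6 not available\n", where);
		return;
	}
	printf("%s: CPUID.06H:ECX=0x%08x EPB bit=%d\n",
	       where, (unsigned int)ecx, ecx_has_epb(ecx));
}
```

Calling report_epb("boot") once per pinned cpu gives a baseline to
compare against whatever the kernel-side dump shows at the failing
point.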

Alan