Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update
From: Ashok Raj
Date: Wed Aug 17 2022 - 08:31:21 EST
On Wed, Aug 17, 2022 at 02:10:51PM +0200, Borislav Petkov wrote:
> On Wed, Aug 17, 2022 at 11:57:37AM +0000, Ashok Raj wrote:
> > You have this reversed. if you get an MCE and MCIP=1 -> shutdown
>
> Yeah yeah.
>
> > When MCE's happen during the update they are always fatal errors.
>
> How did you decide that?
>
> Because all CPUs are executing the loop and thus no user process?
Correct. There can be a fatal IOMCA which is asynchronous and can fire
anytime. We removed any CPU initiated async errors like patrol-scrub and
cache errors observed during eviction (EWB) into CMCI's when we developed
LMCE.
>
> > What we do here by setting MCIP=1, we promote to a more severe shutdown.
>
> It probably should say somewhere that a shutdown is possible. Because if
> the shutdown really happens, you get a black screen and no info as to
> why...
You will find out when system returns after reboot and hopefully wasn't
promoted to a cold-boot which will loose MCE banks.
I can check with the HW team and get back..
>
> > Ideally I would rather let the fallout happen since its observable vs a
> > blind shutdown is what we are promoting to.
>
> What fallout do you mean exactly?
Meaning deal with the effect of a really rare MCE. Rather than trying to
avoid it. Taking the MCE is more important than finishing the update,
and loosing what the error signaled was trying to convey.
>
> > Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.
>
> Because all CPUs are executing the loop? Or how do you decide this?
Fatal errors signaled with PCC=1 in the MCAx.STATUS is *ALWAYS*
broadcast[1] to all CPUs in the system. So MCIP is set at the source of
the CPU and if any other CPU is also attempting they go down. You can't
even collect data.
[1] Broadcast is true for Intel CPUs, its legacy behavior
LMCE is the only one that isn't broadcast and truly local to the logical
thread.
>
> > Exception is the Local MCE which is recoverable, but only to user space.
> >
> > If you get an error in the atomic we are polling, its a fatal error since
> > SW can't recover and we shutdown.
>
> Aha, I think you mean that: the MCE is fatal because if it happens on
> any CPU, it will be in kernel mode.
Yep!
>
> > Overthinking :-).. If there is concensus, if Boris feels comfortable
> > enough, i would drop this patch.
>
> This is what we're doing right now - thinking about the consensus. And
> Boris will feel comfortable once we've arrived at a sensible decision.
>
> :-)
>
I'm waiting for the results. :-). And if you feel we can merge the
- Patch1 - bug fix
- Patch2 - min-rev id
I do have the comments from Ingo captured, but I'll wait for other comments
before i resend just those 2 and we can leave the NMI handling to get more
testing and review before we consider.
Cheers,
Ashok