Re: [patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup
From: Andrew Cooper
Date: Mon Apr 17 2023 - 07:03:49 EST
On 17/04/2023 11:30 am, Peter Zijlstra wrote:
> On Sat, Apr 15, 2023 at 01:44:13AM +0200, Thomas Gleixner wrote:
>
>> Background
>> ----------
>>
>> The reason why people are interested in parallel bringup is to shorten
>> the (kexec) reboot time of cloud servers to reduce the downtime of the
>> VM tenants. There are obviously other interesting use cases for this
>> like VM startup time, embedded devices...
> ...
>
>> There are two issues there:
>>
>> a) The death by MCE broadcast problem
>>
>> Quite some (contemporary) x86 CPU generations are affected by
>> this:
>>
>> - MCE can be broadcasted to all CPUs and not only issued locally
>> to the CPU which triggered it.
>>
>> - Any CPU which has CR4.MCE == 0, even if it sits in a wait
>> for INIT/SIPI state, will cause an immediate shutdown of the
>> machine if a broadcasted MCE is delivered.
> When doing kexec, CR4.MCE should already have been set to 1 by the prior
> kernel, no?
No(ish). Purgatory can't take #MC, or NMIs for that matter.
It's cleaner to explicitly disable CR4.MCE and let the system reset
(with all the MC banks properly preserved), than it is to take #MC while
the IDT isn't in sync with the handlers, and wander off into the weeds.
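
For illustration only (this is not the kernel's or purgatory's actual code):
CR4.MCE is bit 6 of CR4, so "explicitly disable CR4.MCE" amounts to clearing
that one bit and leaving the rest alone. A minimal sketch of the bit
manipulation, with hypothetical helper names that operate on a plain integer
rather than the real register, might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4.MCE is bit 6 of CR4 (Linux names this bit X86_CR4_MCE). */
#define CR4_MCE (UINT64_C(1) << 6)

/* Hypothetical helper: does this CR4 value have #MC delivery enabled? */
static inline bool cr4_mce_enabled(uint64_t cr4)
{
	return (cr4 & CR4_MCE) != 0;
}

/* Hypothetical helper: same CR4 value with MCE cleared, other bits untouched. */
static inline uint64_t cr4_without_mce(uint64_t cr4)
{
	return cr4 & ~CR4_MCE;
}
```

Real code would have to read and write CR4 itself in ring 0; the sketch only
shows which bit is being discussed.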
~Andrew