Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
From: Borislav Petkov
Date: Tue Jun 20 2023 - 05:23:54 EST
On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.
LOL. You don't need the rest of the commit message - this says it all.
:-P
> When kexec() is executed on a system with "offline" CPUs, which are parked
I'd say simply "with offlined CPUs"
> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
>
> The reason is that kexec() eventually overwrites the previous kernels text,
kernel's
> page tables, data and stack, If it writes to the cache line which is
... and stack. If it ...
> monitored by an previously offlined CPU, MWAIT resumes execution and ends
... by a previously...
> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.
Lovely.
> Cure this by bringing the offline CPUs out of MWAIT into HLT.
... offlined CPUs... :
> Write to the monitored cache line of each offline CPU, which makes MWAIT
ditto.
> resume execution. The written control word tells the offline CPUs to issue
ditto and so on.
> HLT, which does not have the MWAIT problem.
>
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
>
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.
...
> @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
> (highest_subcstate - 1);
> }
>
> + /* Set up state for the kexec() hack below */
> + md->status = CPUDEAD_MWAIT_WAIT;
> + md->control = CPUDEAD_MWAIT_WAIT;
> +
> wbinvd();
>
> while (1) {
> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
JFYI: that last hunk has some conflicts applying to latest tip/master.
Might need merge resolving...
> mb();
> __mwait(eax, 0);
>
> + if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> + /*
> + * Kexec is about to happen. Don't go back into mwait() as
> + * the kexec kernel might overwrite text and data including
> + * page tables and stack. So mwait() would resume when the
> + * monitor cache line is written to and then the CPU goes
> + * south due to overwritten text, page tables and stack.
> + *
> + * Note: This does _NOT_ protect against a stray MCE, NMI,
> + * SMI. They will resume execution at the instruction
> + * following the HLT instruction and run into the problem
> + * which this is trying to prevent.
> + */
> + WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> + while(1)
> + native_halt();
> + }
> +
> cond_wakeup_cpu0();
> }
> }
>
> +/*
> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> + * mwait_play_dead().
> + */
> +void smp_kick_mwait_play_dead(void)
> +{
> + u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
Do you even need this newstate thing?
> + struct mwait_cpu_dead *md;
> + unsigned int cpu, i;
> +
> + for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
> + md = per_cpu_ptr(&mwait_cpu_dead, cpu);
> +
> + /* Does it sit in mwait_play_dead() ? */
> + if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
> + continue;
> +
> + /* Wait maximal 5ms */
/* Wait 5ms maximum */
> + for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
> + /* Bring it out of mwait */
> + WRITE_ONCE(md->control, newstate);
> + udelay(5);
> + }
> +
> + if (READ_ONCE(md->status) != newstate)
> + pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
Shouldn't this be a pr_err_once thing so that it doesn't flood the
console unnecessarily?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette