Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
From: Huang, Kai
Date: Wed Mar 19 2025 - 20:03:57 EST
On Tue, 2025-03-18 at 03:41 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-03-17 at 10:11 +0000, Huang, Kai wrote:
> > On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > > TL;DR:
> > > >
> > > > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > > > cover kexec support for both AMD SME and Intel TDX. Previously there
> > > > _was_ some issue preventing from doing so but now it has been fixed.
> > > ^ Adding "the kernel" here would read a little
> > > cleaner to me.
> >
> > OK.
> >
> > >
> > >
> > > When I read "some issue" I start assuming it wasn't fully debugged and it is
> > > "some issue" that no one knows. But below it sounds like it was root caused.
> >
> > I am not sure what's wrong with "some issue". I used "_was_" to convey this
> > issue was fixed (thus root caused). Perhaps the "root caused" part isn't clear?
> > I can explicitly call it out.
> >
> > Previously there _was_ some issue preventing the kernel from doing so but
> > now it has been root caused and fixed.
>
> The problem is the phrase "some issue" sounds like the issue is not understood.
> You could just change it to "an issue". I don't even think you need the "_"
> around "_was_". Just "Previously there was an issue preventing the kernel..." is
> more reassuring.
(sorry for getting back late).
Sure.
>
>
> [snip]
>
> > >
> > > Instead of a play-by-play, it might be more informative to summarize the edges
> > > covered in this history:
> >
> > I think the above is also informative. Boris suggested to keep the discussion
> > around those relevant commits in the changelog:
> >
> > https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
>
> Summary: Kirill says it's too verbose, can be dropped.
>
> > https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/
>
> Summary: Boris says not to drop it and it's ok to be verbose.
>
> But what I'm saying is that the section doesn't summarize the issue well. It
> just links to a bunch of commits for the reviewer to go figure it out
> themselves. So I think explaining the takeaways instead of only linking to
> threads wouldn't be objectionable. You don't need to drop the commit references,
> just don't leave so much homework for the reader.
I'll refine per discussion with you.
>
> >
> > > - Don't do anything that writes memory between wbinvd and native_halt(). This
> > > includes function calls that touch the stack.
> >
> > This is a requirement to SME, but changing to unconditional WBINVD on bare-metal
> > doesn't change this behaviour. What's the purpose of mentioning here?
>
> The purpose is to help the reviewer understand the delicate design of this
> function so that they can evaluate whether the changes upset it.
As discussed, I will mention the SME problem that you discovered and the code
change will not make any worse for SME.
>
> >
> > > - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> > > too expensive.
> >
> > Boris suggested to do unconditional WBINVD. The test by Dave Young also
> > confirms there was no issue on the once-problematic platforms:
> >
> > https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@xxxxxxxxxxxxxx/
>
> I'm not sure what your point is here. That there is no issue?
>
I meant "too expensive" doesn't mean there will be issue, and indeed no issue
was discovered at least on the once-problematic platforms as tested by Dave
Young.
> This log points to
> a commit that contradicts the assertions it is making. Especially since to
> understand this part of the log, the reviewer is going to have to read those
> referenced commits, don't you think it's a problem? It is opening new doubts.
Sorry I am not able to follow where's the "contradiction".
Anyway I have spent some time to read the code around stop_other_cpus() and
stop_this_cpu() and it is now clear (at least to me) doing wbinvd for all bare-
metal is safe. Please see below.
>
> >
> > > - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> > > is a race still.
> > >
> > > Hmm, on the last one tglx says:
> > > The cpumask cannot plug all holes either, but it's better than a raw
> > > counter and allows to restrict the NMI fallback IPI to be sent only the
> > > CPUs which have not reported within the timeout window
> > >
> > > Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> > > Can we be clarify that nothing bad happens if we lose the race? (and is it
> > > true?)
> >
> > IIUC in most cases the REBOOT vector will just do the job, stop_other_cpus()
> > won't need to send NMIs thus I believe in most cases this shouldn't increase
> > race.
> >
> > I don't know what kinda platform will need NMI to stop remote CPUs. For
> > instance, AFAICT my SPR machine with 192 CPUs never needed to send NMI in my
> > testings.
> >
> > Dave Yong tested on those once-problematic platforms and doing unconditional
> > wbinvd was fine. This patchset (albeit not upstreamed) has also been tested by
> > customers in their environment. I believe doing unconditional WBINVD is fine.
>
> Sounds too much like: "Someone tested it on a platform that used to have a
> problem and at least that one is fixed, but we really don't understand what the
> issue is".
Not exactly. I think commit 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more
robust") has actually fixed this issue.
AFAICT the root cause is due to below race mentioned in the changelog:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
So the key is to make sure the "proceed..." part can only happen after all
wbinvd()s are done in the remote CPUs.
To achieve this, the above commit introduced two rounds of sending-requests-and-
waiting:
1) The first round sends REBOOT IPI to remote CPUs and waits 1 seconds for
all remote CPUs to stop;
2) If the first round times out, the second round sends NMI to remote CPUs
and wait for 10ms, or forever (which is requested by the @wait parameter to
to the smp_ops.stop_other_cpus()).
For the kexec case, the kernel will just wait until all remote CPUs are stopped,
see:
kernel_kexec() ->
machine_shutdown() ->
native_machine_shutdown() ->
stop_other_cpus() ->
smp_ops->stop_other_cpus(1)
>
> I haven't tried to understand what race tglx was seeing, or what the consequence
> is. I think we should understand and explain why it's harmless, especially since
> this bring it up by linking the log that references it.
>
As talked to you offline, changing to doing wbinvd for bare-metal only increases
the time to stop remote CPUs thus the potential impact is the first round of
waiting may time out on some large systems (albeit it didn't happen at least on
my 192 CPUs machine), but this is fine since the second round of waiting will
actually wait until all wbinvd have been done and all remote CPUs have actually
been stopped.
In other words, AFAICT the "race" you mentioned (presumably it is the one I
mentioned above) cannot happen in kexec() after changing to do wbinvd for bare-
metal.