Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

From: Huang, Kai
Date: Thu Apr 11 2024 - 17:55:48 EST




On 12/04/2024 2:13 am, Tom Lendacky wrote:
On 4/10/24 17:26, Huang, Kai wrote:
On 11/04/2024 4:14 am, Tom Lendacky wrote:
On 4/10/24 11:08, Tom Lendacky wrote:
On 4/7/24 07:44, Kai Huang wrote:

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b8441147eb5e..5ba8a9c1e47a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
      mcheck_cpu_clear(c);
      /*
-     * Use wbinvd on processors that support SME. This provides
support
-     * for performing a successful kexec when going from SME inactive
-     * to SME active (or vice-versa). The cache must be cleared so
that
-     * if there are entries with the same physical address, both
with and
-     * without the encryption bit, they don't race each other when
flushed
-     * and potentially end up with the wrong entry being committed to
-     * memory.
+     * The kernel could leave caches in incoherent state on SME/TDX
+     * capable platforms.  Flush cache to avoid silent memory
+     * corruption for these platforms.
       *
-     * Test the CPUID bit directly because the machine might've
cleared
-     * X86_FEATURE_SME due to cmdline options.
+     * stop_this_cpu() is not a fast path, just do unconditional
+     * WBINVD for simplicity.  But only do WBINVD for bare-metal
+     * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
+     * (and unnecessary) #VE and may unable to handle.

In addition to Kirill's comment on #VE...

This last part of the comment reads a bit odd since you say
unconditional and then say only do WBINVD for bare-metal. Maybe
something like this makes it a bit clearer?:

For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception
(#VE or #VC). However, all exception handling has been torn down at
this point, so this would cause the guest to crash. Since memory
within these types of guests is coherent only issue the WBINVD on
bare-metal.

Hmmm... actually it was the other WBINVD in patch #2 that caused the
crash, so what I wrote above isn't accurate. You might want to re-word
as appropriate.

Yeah that's why I used "may unable to handle" in the comment, as I
thought there's no need to be that specific?

Yes, makes sense.


I tend not to mention "memory within these types of guests is coherent".
 I mean the current upstream kernel _ONLY_ does WBINVD for SME, that
means for all non-SME environment there's no concern to do WBINVD here.

Here we only extend to do WBINVD on more environments, so as long as
there's no harm to do WBINVD for them it's OK.

How about below?

    /*
     * The kernel could leave caches in incoherent state on SME/TDX
     * capable platforms.  Flush cache to avoid silent memory
     * corruption for these platforms.
     *
     * For TDX and SEV-ES/SEV-SNP guests, a WBINVD causes an
     * exception (#VE or #VC), and the kernel may not be able
     * to handle such exception (e.g., TDX guest panics if it
     * sees #VE).  Since stop_this_cpu() isn't a fast path, just
     * issue the WBINVD on bare-metal instead of sprinkling
     * around vendor-specific checks.
     */

I think that's ok, but maybe just adding that the WBINVD is not
necessary for TDX and SEV-ES/SEV-SNP would make it clearer. Just my
opinion, though.


Yeah can do that.

Will add "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests" before starting the "Since stop_this_cpu() ...".

Thanks for the feedback.