Re: [PATCH v4 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

From: Huang, Kai
Date: Mon Jun 03 2024 - 20:58:11 EST




On 1/06/2024 8:45 am, Tom Lendacky wrote:
On 5/22/24 21:49, Huang, Kai wrote:
On 18/04/2024 11:48 pm, Kai Huang wrote:
TL;DR:

Change to do unconditional WBINVD in stop_this_cpu() for bare metal
to cover kexec support for both AMD SME and Intel TDX, despite there
_was_ some issue preventing from doing so but now has it got fixed.

Long version:

Both AMD SME and Intel TDX can leave caches in an incoherent state due
to memory encryption, which can lead to silent memory corruption during
kexec.  To address this issue, it is necessary to flush the caches
before jumping to the second kernel.

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware.  To support TDX, instead of adding one more
vendor-specific check, it is proposed to perform unconditional WBINVD.
Kexec() is a slow path, and the additional WBINVD is acceptable for the
sake of simplicity and maintainability.


Hi Tom,

May I ask how does SME work with kdump in crash_kexec().  Looking at the code, AFAICT the crash_kexec() path doesn't use stop_this_cpu() to stop all other cpus.  Instead, kdump_nmi_shootdown_cpus() is called to send NMI to remote cpus and crash_nmi_callback() is invoked to stop them.

But the crash_nmi_callback() doesn't invoke WBINVD for SME AFAICT.  It does call the kdump_nmi_callback() callback where a WBINVD is performed for the SNP host:

void kdump_sev_callback(void)
{
         /*
          * Do wbinvd() on remote CPUs when SNP is enabled in order to
          * safely do SNP_SHUTDOWN on the local CPU.
          */
         if (cc_platform_has(CC_ATTR_HOST_SEV_SNP))
                 wbinvd();
}

So if I read correctly, what's the reason the WBINVD is skipped for SME in case of crash_kexec()?

The system is rebooted after a crash and doesn't continue directly on into a new kernel.


How about the kdump kernel itself? Would the stale cachelines potentially corrupt it?

And how about /proc/vmcore, which reflects the system RAM used by the first, crashed, kernel? Is it OK to have stale cachelines for it?