Re: [PATCH v2] x86/sev: Fix host kdump support for SNP

From: Sean Christopherson
Date: Tue Sep 03 2024 - 15:53:07 EST


On Tue, Sep 03, 2024, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@xxxxxxx>
> [ 1.671804] AMD-Vi: Using global IVHD EFR:0x841f77e022094ace, EFR2:0x0
> [ 1.679835] AMD-Vi: Translation is already enabled - trying to copy translation structures
> [ 1.689363] AMD-Vi: Copied DEV table from previous kernel.
> [ 1.864369] AMD-Vi: Completion-Wait loop timed out
> [ 2.038289] AMD-Vi: Completion-Wait loop timed out
> [ 2.212215] AMD-Vi: Completion-Wait loop timed out
> [ 2.386141] AMD-Vi: Completion-Wait loop timed out
> [ 2.560068] AMD-Vi: Completion-Wait loop timed out
> [ 2.733997] AMD-Vi: Completion-Wait loop timed out
> [ 2.907927] AMD-Vi: Completion-Wait loop timed out
> [ 3.081855] AMD-Vi: Completion-Wait loop timed out
> [ 3.225500] AMD-Vi: Completion-Wait loop timed out
> [ 3.231083] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 3.579592] AMD-Vi: Completion-Wait loop timed out
> [ 3.753164] AMD-Vi: Completion-Wait loop timed out
> [ 3.815762] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
> [ 3.825347] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240813-snp-host-f2a41ff576cc-dirty #61
> [ 3.837188] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM100AB 10/17/2022
> [ 3.846215] Call Trace:
> [ 3.848939] <TASK>
> [ 3.851277] dump_stack_lvl+0x2b/0x90
> [ 3.855354] dump_stack+0x14/0x20
> [ 3.859050] panic+0x3b9/0x400
> [ 3.862454] panic_if_irq_remap+0x21/0x30
> [ 3.866925] setup_IO_APIC+0x8aa/0xa50
> [ 3.871106] ? __pfx_amd_iommu_enable_faulting+0x10/0x10
> [ 3.877032] ? __cpuhp_setup_state+0x5e/0xd0
> [ 3.881793] apic_intr_mode_init+0x6a/0xf0
> [ 3.886360] x86_late_time_init+0x28/0x40
> [ 3.890832] start_kernel+0x6a8/0xb50
> [ 3.894914] x86_64_start_reservations+0x1c/0x30
> [ 3.900064] x86_64_start_kernel+0xbf/0x110
> [ 3.904729] ? setup_ghcb+0x12/0x130
> [ 3.908716] common_startup_64+0x13e/0x141
> [ 3.913283] </TASK>
> [ 3.915715] in panic
> [ 3.918149] in panic_other_cpus_shutdown
> [ 3.922523] ---[ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---
>
> This happens because SNP_SHUTDOWN_EX fails

Exactly what happens? I.e. why does the PSP (sorry, ASP) need to be fully shut
down in order to get a kdump? The changelogs for the original SNP panic/kdump support
are frustratingly unhelpful. They all describe what the patch does, but say
nothing about _why_.

> when SNP VMs are active as the firmware checks every encryption-capable ASID
> to verify that it is not in use by a guest and a DF_FLUSH is not required. If
> a DF_FLUSH is required, the firmware returns DFFLUSH_REQUIRED.
>
> To fix this, support was added to SNP_DECOMMISSION all active SNP VMs in
> the panic notifier before doing SNP_SHUTDOWN_EX. However, SNP_DECOMMISSION
> tags all CPUs on which the guest has been activated as needing a WBINVD,
> which causes the SNP_DF_FLUSH command to fail in the following flow:
> SNP_DECOMMISSION -> SNP_SHUTDOWN_EX -> SNP_DF_FLUSH -> failure with
> WBINVD_REQUIRED.
>
> When the panic notifier is invoked, all other CPUs have already been shut
> down, so it is not possible to do a wbinvd_on_all_cpus() after
> SNP_DECOMMISSION has been executed. This eventually causes SNP_SHUTDOWN_EX
> to fail after SNP_DECOMMISSION.
>
> Fix this by doing SNP_DECOMMISSION and the subsequent WBINVD on all CPUs
> during the NMI shutdown of CPUs, as part of disabling virtualization on
> all CPUs via cpu_emergency_disable_virtualization() ->
> svm_emergency_disable().
>
> SNP_DECOMMISSION unbinds the ASID from the SNP context, marks the ASID as
> unusable, and then transitions the SNP guest context page to a firmware
> page. SNP_SHUTDOWN_EX transitions all pages associated with the IOMMU to
> the reclaim state, which the hypervisor then transitions to the hypervisor
> state. All of these page state changes are tracked in the RMP table, so
> there is no loss of guest data as such, and the complete host memory is
> captured with the crashkernel boot. No processes are killed, and
> host/guest memory is not altered or modified in any way.
>
> This fixes and enables crashkernel/kdump on SNP host.

...

> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 714c517dd4b7..30f286a3afb0 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -16,6 +16,7 @@
> #include <linux/psp-sev.h>
> #include <linux/pagemap.h>
> #include <linux/swap.h>
> +#include <linux/delay.h>
> #include <linux/misc_cgroup.h>
> #include <linux/processor.h>
> #include <linux/trace_events.h>
> @@ -89,6 +90,8 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> +static DEFINE_SPINLOCK(snp_decommission_lock);
> +static void **snp_asid_to_gctx_pages_map;
> static int snp_decommission_context(struct kvm *kvm);
>
> struct enc_region {
> @@ -2248,6 +2251,9 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> goto e_free_context;
> }
>
> + if (snp_asid_to_gctx_pages_map)
> + snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = sev->snp_context;
> +
> return 0;
>
> e_free_context:
> @@ -2884,9 +2890,126 @@ static int snp_decommission_context(struct kvm *kvm)
> snp_free_firmware_page(sev->snp_context);
> sev->snp_context = NULL;
>
> + if (snp_asid_to_gctx_pages_map)
> + snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = NULL;
> +
> return 0;
> }
>
> +static void __snp_decommission_all(void)
> +{
> + struct sev_data_snp_addr data = {};
> + int ret, asid;
> +
> + if (!snp_asid_to_gctx_pages_map)
> + return;
> +
> + for (asid = 1; asid < min_sev_asid; asid++) {
> + if (snp_asid_to_gctx_pages_map[asid]) {
> + data.address = __sme_pa(snp_asid_to_gctx_pages_map[asid]);

NULL pointer deref if this races with snp_decommission_context() from task
context.

> + ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> + if (!ret) {

And what happens if SEV_CMD_SNP_DECOMMISSION fails?

> + snp_free_firmware_page(snp_asid_to_gctx_pages_map[asid]);
> + snp_asid_to_gctx_pages_map[asid] = NULL;
> + }
> + }
> + }
> +}
> +
> +/*
> + * NOTE: called in NMI context from svm_emergency_disable().
> + */
> +void sev_emergency_disable(void)
> +{
> + static atomic_t waiting_for_cpus_synchronized;
> + static bool synchronize_cpus_initiated;
> + static bool snp_decommission_handled;
> + static atomic_t cpus_synchronized;
> +
> + if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> + return;
> +
> + /*
> + * SNP_SHUTDOWN_EX fails when SNP VMs are active as the firmware checks

Define "active".

> + * every encryption-capable ASID to verify that it is not in use by a
> + * guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
> + * the firmware returns DFFLUSH_REQUIRED. To address this, SNP_DECOMMISSION
> + * is required to shutdown all active SNP VMs, but SNP_DECOMMISSION tags all
> + * CPUs that guest was activated on to do a WBINVD. When panic notifier
> + * is invoked all other CPUs have already been shutdown, so it is not
> + * possible to do a wbinvd_on_all_cpus() after SNP_DECOMMISSION has been
> + * executed. This eventually causes SNP_SHUTDOWN_EX to fail after
> + * SNP_DECOMMISSION. To fix this, do SNP_DECOMMISSION and subsequent WBINVD
> + * on all CPUs during NMI shutdown of CPUs as part of disabling
> + * virtualization on all CPUs via cpu_emergency_disable_virtualization().
> + */
> +
> + spin_lock(&snp_decommission_lock);
> +
> + /*
> + * exit early for call from native_machine_crash_shutdown()
> + * as SNP_DECOMMISSION has already been done as part of
> + * NMI shutdown of the CPUs.
> + */
> + if (snp_decommission_handled) {
> + spin_unlock(&snp_decommission_lock);
> + return;
> + }
> +
> + /*
> + * Synchronize all CPUs handling NMI before issuing
> + * SNP_DECOMMISSION.
> + */
> + if (!synchronize_cpus_initiated) {
> + /*
> + * one CPU handling panic, the other CPU is initiator for
> + * CPU synchronization.
> + */
> + atomic_set(&waiting_for_cpus_synchronized, num_online_cpus() - 2);

And what happens when num_online_cpus() == 1?

> + synchronize_cpus_initiated = true;
> + /*
> + * Ensure CPU synchronization parameters are setup before dropping
> + * the lock to let other CPUs continue to reach synchronization.
> + */
> + wmb();
> +
> + spin_unlock(&snp_decommission_lock);
> +
> + /*
> + * This will not cause system to hang forever as the CPU
> + * handling panic waits for maximum one second for
> + * other CPUs to stop in nmi_shootdown_cpus().
> + */
> + while (atomic_read(&waiting_for_cpus_synchronized) > 0)
> + mdelay(1);
> +
> + /* Reacquire the lock once CPUs are synchronized */
> + spin_lock(&snp_decommission_lock);
> +
> + atomic_set(&cpus_synchronized, 1);
> + } else {
> + atomic_dec(&waiting_for_cpus_synchronized);
> + /*
> + * drop the lock to let other CPUs continue to reach
> + * synchronization.
> + */
> + spin_unlock(&snp_decommission_lock);
> +
> + while (atomic_read(&cpus_synchronized) == 0)
> + mdelay(1);
> +
> + /* Try to re-acquire lock after CPUs are synchronized */
> + spin_lock(&snp_decommission_lock);
> + }

Yeah, no. This is horrific. If the panic path doesn't provide the necessary
infrastructure to ensure the necessary ordering between the initiating CPU and
responding CPUs, then rework the panic path.