Re: [PATCH v3] PCI: hv: Fix hibernation in case interrupts are not re-created

From: Lorenzo Pieralisi
Date: Fri Oct 02 2020 - 07:38:24 EST


On Fri, Oct 02, 2020 at 01:51:58AM -0700, Dexuan Cui wrote:
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> via MMIO. On a physical machine, this works perfectly; for a Linux VM
> running on a hypervisor, which typically enables IOMMU interrupt remapping,
> the hypervisor usually should trap and emulate the MMIO accesses in order
> to re-create the necessary interrupt remapping table entries in the IOMMU,
> otherwise the interrupts can not work in the VM after hibernation.
>
> Hyper-V is different from other hypervisors in that it does not trap and
> emulate the MMIO accesses, and instead it uses a para-virtualized method,
> which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
> of the info that would be passed to the hypervisor in the case of the
> trap-and-emulate method. This is not an issue to a lot of PCI device
> drivers, which destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some PCI device
> drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
> proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
> across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
> otherwise the PCI device drivers can no longer receive interrupts after
> the VM resumes from hibernation.
>
> Hyper-V is also different in that chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
> can not fail because unmasking an MSI/MSI-X register just means an MMIO
> write): during hibernation, when a CPU is offlined, the kernel tries
> to move the interrupt to the remaining CPUs that haven't been offlined
> yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
> because the vmbus channel has been closed: here the early "return" in
> hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
> desc->masked remains "true", so later after hibernation, the MSI interrupt
> always remains masked, which is incorrect. Refer to cpu_disable_common()
> -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
>
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
> if (maskchip && chip->irq_mask)
> chip->irq_mask(d);
> ...
> err = irq_do_set_affinity(d, affinity, false);
> ...
> if (maskchip && chip->irq_unmask)
> chip->irq_unmask(d);
>
> Fix the issue by calling pci_msi_unmask_irq() unconditionally in
> hv_irq_unmask(). Also suppress the error message for hibernation because
> the hypercall failure during hibernation does not matter (at this time
> all the devices have been frozen). Note: the correct affinity info is
> still updated into the irqdata data structure in migrate_one_irq() ->
> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> resumes, hv_pci_restore_msi_state() is able to correctly restore
> the interrupt with the correct affinity.
>
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Reviewed-by: Jake Oshins <jakeo@xxxxxxxxxxxxx>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> ---
>
> Changes in v2:
> Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> Added Jake's Reviewed-by.
>
> Changes in v3:
> Improved the commit log.
> Improved the comments.
> Improved the change in hv_irq_unmask(): removed the early "return"
> and call pci_msi_unmask_irq() unconditionally.
>
> drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
> 1 file changed, 47 insertions(+), 3 deletions(-)

Applied to pci/hv, thanks !

Lorenzo

> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index fc4c3a15e570..a9df492fbffa 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1276,11 +1276,25 @@ static void hv_irq_unmask(struct irq_data *data)
> exit_unlock:
> spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
>
> - if (res) {
> + /*
> + * During hibernation, when a CPU is offlined, the kernel tries
> + * to move the interrupt to the remaining CPUs that haven't
> + * been offlined yet. In this case, the below hv_do_hypercall()
> + * always fails since the vmbus channel has been closed:
> + * refer to cpu_disable_common() -> fixup_irqs() ->
> + * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> + *
> + * Suppress the error message for hibernation because the failure
> + * during hibernation does not matter (at this time all the devices
> + * have been frozen). Note: the correct affinity info is still updated
> + * into the irqdata data structure in migrate_one_irq() ->
> + * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> + * resumes, hv_pci_restore_msi_state() is able to correctly restore
> + * the interrupt with the correct affinity.
> + */
> + if (res && hbus->state != hv_pcibus_removing)
> dev_err(&hbus->hdev->device,
> "%s() failed: %#llx", __func__, res);
> - return;
> - }
>
> pci_msi_unmask_irq(data);
> }
> @@ -3372,6 +3386,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
> return 0;
> }
>
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> + struct msi_desc *entry;
> + struct irq_data *irq_data;
> +
> + for_each_pci_msi_entry(entry, pdev) {
> + irq_data = irq_get_irq_data(entry->irq);
> + if (WARN_ON_ONCE(!irq_data))
> + return -EINVAL;
> +
> + hv_compose_msi_msg(irq_data, &entry->msg);
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
> + * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> + * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> + * Table entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> + pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
> static int hv_pci_resume(struct hv_device *hdev)
> {
> struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>
> prepopulate_bars(hbus);
>
> + hv_pci_restore_msi_state(hbus);
> +
> hbus->state = hv_pcibus_installed;
> return 0;
> out:
> --
> 2.19.1
>