Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest

From: Yu Zhang

Date: Fri May 15 2026 - 10:11:42 EST

On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@xxxxxxxxxxxxxxxxxxx> Sent: Monday, May 11, 2026 9:24 AM
> >
> > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> > This driver implements stage-1 IO translation within the guest OS.
> > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> > for:
> > - Capability discovery
> > - Domain allocation, configuration, and deallocation
> > - Device attachment and detachment
> > - IOTLB invalidation
> >
> > The driver constructs x86-compatible stage-1 IO page tables in the
> > guest memory using consolidated IO page table helpers. This allows
> > the guest to manage stage-1 translations independently of vendor-
> > specific drivers (like Intel VT-d or AMD IOMMU).
> >
> > Hyper-V consumes this stage-1 IO page table when a device domain is
> > created and configured, and nests it with the host's stage-2 IO page
> > tables, therefore eliminating the VM exits for guest IOMMU mapping
> > operations. For unmapping operations, VM exits to perform the IOTLB
> > flush are still unavoidable.
> >
> > Hyper-V identifies each PCI pass-thru device by a logical device ID
> > in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> > per-bus portion of this ID with the pvIOMMU driver during bus probe.
> > The pvIOMMU driver stores this mapping and combines it with the function
> > number of the endpoint PCI device to form the complete ID for hypercalls.
>
> As you are probably aware, Mukesh's patch series to support PCI
> pass-thru devices also needs to get the logical device ID. Maybe the
> registration mechanism needs to move somewhere that can be shared
> with his code.
>

Thank you so much for the review, Michael!

Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
in pci-hyperv.c has the same dev_instance byte manipulation. We do need
a common registration mechanism.

Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
natural place, but the register/lookup functions are currently only
meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
code also needs them, we might need a new shared Kconfig option that
both can select. Open to better ideas.

Adding Mukesh to the loop. :)

> >
> > Co-developed-by: Wei Liu <wei.liu@xxxxxxxxxx>
> > Signed-off-by: Wei Liu <wei.liu@xxxxxxxxxx>
> > Co-developed-by: Easwar Hariharan <easwar.hariharan@xxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Easwar Hariharan <easwar.hariharan@xxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Yu Zhang <zhangyu1@xxxxxxxxxxxxxxxxxxx>
> > ---
> > arch/x86/hyperv/hv_init.c | 4 +
> > arch/x86/include/asm/mshyperv.h | 4 +
> > drivers/iommu/hyperv/Kconfig | 17 +
> > drivers/iommu/hyperv/Makefile | 1 +
> > drivers/iommu/hyperv/iommu.c | 705 ++++++++++++++++++++++++++++
> > drivers/iommu/hyperv/iommu.h | 54 +++
> > drivers/pci/controller/pci-hyperv.c | 19 +-
> > include/asm-generic/mshyperv.h | 12 +
> > 8 files changed, 815 insertions(+), 1 deletion(-)
> > create mode 100644 drivers/iommu/hyperv/iommu.c
> > create mode 100644 drivers/iommu/hyperv/iommu.h
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 323adc93f2dc..2c8ff8e06249 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -578,6 +578,10 @@ void __init hyperv_init(void)
> > old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
> > x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
> >
> > +#ifdef CONFIG_HYPERV_PVIOMMU
> > + x86_init.iommu.iommu_init = hv_iommu_init;
> > +#endif
> > +
> > hv_apic_init();
> >
> > x86_init.pci.arch_init = hv_pci_init;
> > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > index f64393e853ee..20d947c2c758 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -313,6 +313,10 @@ static inline void mshv_vtl_return_hypercall(void) {}
> > static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> > #endif
> >
> > +#ifdef CONFIG_HYPERV_PVIOMMU
> > +int __init hv_iommu_init(void);
> > +#endif
> > +
> > #include <asm-generic/mshyperv.h>
> >
> > #endif
> > diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
> > index 30f40d867036..9e658d5c9a77 100644
> > --- a/drivers/iommu/hyperv/Kconfig
> > +++ b/drivers/iommu/hyperv/Kconfig
> > @@ -8,3 +8,20 @@ config HYPERV_IOMMU
> > help
> > Stub IOMMU driver to handle IRQs to support Hyper-V Linux
> > guest and root partitions.
> > +
> > +if HYPERV_IOMMU
> > +config HYPERV_PVIOMMU
> > + bool "Microsoft Hypervisor para-virtualized IOMMU support"
> > + depends on X86 && HYPERV
>
> What is the intent w.r.t. 32-bit builds? Using X86 instead of X86_64
> allows it. I did a 32-bit build and didn't get any build failures, which is
> good. But I can't run it to see if the pvIOMMU actually works in a
> 32-bit build. I don't know how building X86_64 generic PT entries
> would fare.
>

Sorry, no intent to support 32-bit. I'll change to `depends on X86_64 && HYPERV`

[...]

> > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_flush_device_domain *input;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + memset(input, 0, sizeof(*input));
> > + input->device_domain = hv_domain->device_domain;
>
> The previous version of this patch had code to set several other fields in
> the input. I wanted to confirm that not setting them in this version is
> intentional. Were they not needed?
>

Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
individually. In this version, I just simplified it to a struct assignment.
No functional change.

> > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
> > +
> > + local_irq_restore(flags);
> > +
> > + if (!hv_result_success(status))
> > + pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status %lld\n", status);
> > +}
> > +
> > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_detach_device_domain *input;
> > + struct pci_dev *pdev;
> > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > + /* See the attach function, only PCI devices for now */
> > + if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> > + return;
>
> Are these sanity checks necessary? The only caller is hv_iommu_attach_dev()
> and it has already done the checks.

You're right, they're redundant.

> > +
> > + pdev = to_pci_dev(dev);
> > +
> > + dev_dbg(dev, "detaching from domain %d\n", hv_domain->device_domain.domain_id.id);
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + memset(input, 0, sizeof(*input));
> > + input->partition_id = HV_PARTITION_ID_SELF;
> > + if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
>
> As Sashiko and Jacob Pan pointed out, doing the lookup while interrupts are disabled
> is problematic. My suggestion would be to just do the lookup into a local variable
> before disabling interrupts (rather than using a raw spin lock as Jacob suggested).
>
> Same situation occurs in hv_iommu_attach_dev() and
> hv_iommu_get_logical_device_property().
>

Thanks! I would also prefer to look up before disabling interrupts.

> > + local_irq_restore(flags);
> > + dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on detach\n");
> > + return;
> > + }
> > + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> > +
> > + local_irq_restore(flags);
> > +
> > + if (!hv_result_success(status))
> > + pr_err("HVCALL_DETACH_DEVICE_DOMAIN failed, status %lld\n", status);
> > +
> > + hv_flush_device_domain(hv_domain);
> > +
> > + vdev->hv_domain = NULL;
> > +}
> > +

[...]

> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > + u32 code,
> > + struct hv_output_get_logical_device_property *property)
> > +{
> > + u64 status, lid;
> > + unsigned long flags;
> > + int ret;
> > + struct hv_input_get_logical_device_property *input;
> > + struct hv_output_get_logical_device_property *output;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input);
>
> Nit: The other way to set output is:
>
> output = input + 1;
>
> I think this produces slightly better code because of not needing to
> reference the per-cpu variable hyperv_pcpu_input_arg a 2nd time.
>
>

Indeed! It's more elegant. :)

> > + memset(input, 0, sizeof(*input));
> > + input->partition_id = HV_PARTITION_ID_SELF;
> > + ret = hv_iommu_lookup_logical_dev_id(to_pci_dev(dev), &lid);
> > + if (ret) {
> > + local_irq_restore(flags);
> > + return ret;
> > + }
> > + input->logical_device_id = lid;
> > + input->code = code;
> > + status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
> > + *property = *output;
> > +
> > + local_irq_restore(flags);
> > +
> > + if (!hv_result_success(status))
> > + pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed, status %lld\n", status);
> > +
> > + return hv_result_to_errno(status);
> > +}
> > +

[...]

> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > + struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > + if (pdev->ats_enabled)
> > + pci_disable_ats(pdev);
> > +
> > + dev_iommu_priv_set(dev, NULL);
> > + set_dma_ops(dev, NULL);
>
> Previous versions of this function did hv_iommu_detach_dev(). With that call
> removed from here, hv_iommu_detach_dev() is only called when attaching a
> domain to a device that already has a domain attached. Is it the case that
> Hyper-V doesn't require the detach as a cleanup step?
>

The IOMMU core attaches the device to release_domain (our blocking domain)
before calling release_device(), so I believe the explicit detach in the RFC
was redundant. I simply didn't realize that at the time.

Sorry I forgot to mention this in the changelog.

[...]

> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> > +{
> > + int ret;
> > + struct hv_iommu_domain *hv_domain;
> > + struct pt_iommu_x86_64_cfg cfg = {};
> > +
> > + hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> > + if (!hv_domain)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > + if (ret) {
> > + kfree(hv_domain);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + hv_domain->domain.geometry = hv_iommu_device->geometry;
> > + hv_domain->pt_iommu.nid = dev_to_node(dev);
> > +
> > + cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> > + cfg.common.hw_max_oasz_lg2 = 52;
> > + cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
> > +
> > + ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
> > + if (ret) {
> > + hv_delete_device_domain(hv_domain);
> > + kfree(hv_domain);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + /* Constrain to page sizes the hypervisor supports */
> > + hv_domain->domain.pgsize_bitmap &= hv_iommu_device->pgsize_bitmap;
> > +
> > + hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> > +
> > + ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
> > + if (ret) {
> > + pt_iommu_deinit(&hv_domain->pt_iommu);
> > + hv_delete_device_domain(hv_domain);
> > + kfree(hv_domain);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + return &hv_domain->domain;
>
> I think this function would be better if the error paths did "goto"
> a cascading set of error labels. That's the typical pattern, and it's what you
> use in hv_iommu_init(), for example.
>

Good point. Will restructure to use goto-based error labels

> > +}
> > +
> > +static struct iommu_ops hv_iommu_ops = {
> > + .capable = hv_iommu_capable,
> > + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> > + .probe_device = hv_iommu_probe_device,
> > + .release_device = hv_iommu_release_device,
> > + .device_group = hv_iommu_device_group,
> > + .get_resv_regions = hv_iommu_get_resv_regions,
> > + .owner = THIS_MODULE,
> > + .identity_domain = &hv_identity_domain.domain,
> > + .blocked_domain = &hv_blocking_domain.domain,
> > + .release_domain = &hv_blocking_domain.domain,
> > +};
> > +
> > +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_get_iommu_capabilities *input;
> > + struct hv_output_get_iommu_capabilities *output;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input);
>
> Potentially use "output = input + 1" here as well.
>

Yes. Thanks!

[...]

> > @@ -3857,13 +3858,25 @@ static int hv_pci_probe(struct hv_device *hdev,
> >
> > hbus->state = hv_pcibus_probed;
> >
> > - ret = create_root_hv_pci_bus(hbus);
> > + /* Notify pvIOMMU before any device on the bus is scanned. */
> > + prefix = (hdev->dev_instance.b[5] << 24) |
> > + (hdev->dev_instance.b[4] << 16) |
> > + (hdev->dev_instance.b[7] << 8) |
> > + (hdev->dev_instance.b[6] & 0xf8);
>
> This assembling of the logical device id prefix duplicates the
> code in hv_irq_retarget_interrupt(). Could this code save the
> prefix in struct hv_pcibus_device, and then have
> hv_irq_retarget_interrupt() use it? Then it would be clear
> that HVCALL_RETARGET_INTERRUPT is using exactly the same
> logical device id as the IOMMU hypercalls.
>

Good point. I think we can do it. :)

> > +
> > + ret = hv_iommu_register_pci_bus(dom, prefix);
> > if (ret)
> > goto free_windows;
> >
> > + ret = create_root_hv_pci_bus(hbus);
> > + if (ret)
> > + goto unregister_pviommu;
> > +
> > mutex_unlock(&hbus->state_lock);
> > return 0;
> >
> > +unregister_pviommu:
> > + hv_iommu_unregister_pci_bus(dom);
> > free_windows:
> > hv_pci_free_bridge_windows(hbus);
> > exit_d0:
> > @@ -3974,8 +3987,10 @@ static int hv_pci_bus_exit(struct hv_device *hdev, bool
> > keep_devs)
> > static void hv_pci_remove(struct hv_device *hdev)
> > {
> > struct hv_pcibus_device *hbus;
> > + int dom;
> >
> > hbus = hv_get_drvdata(hdev);
> > + dom = hbus->bridge->domain_nr;
>
> Nit: Setting "dom" here feels a little weird because the value is only needed
> under the "if" statement. The value must be read before the root bus is
> removed, but even so moving it under the "if" statement would make more
> sense to me.
>

Sure. Thanks!

> > if (hbus->state == hv_pcibus_installed) {
> > tasklet_disable(&hdev->channel->callback_event);
> > hbus->state = hv_pcibus_removing;
> > @@ -3994,6 +4009,8 @@ static void hv_pci_remove(struct hv_device *hdev)
> > hv_pci_remove_slots(hbus);
> > pci_remove_root_bus(hbus->bridge->bus);
> > pci_unlock_rescan_remove();
> > +
> > + hv_iommu_unregister_pci_bus(dom);
> > }
> >
> > hv_pci_bus_exit(hdev, false);

B.R.
Yu