Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
From: Jacob Pan
Date: Wed Feb 05 2025 - 17:49:17 EST
Hi Nicolin,
On Fri, 10 Jan 2025 19:32:16 -0800
Nicolin Chen <nicolinc@xxxxxxxxxx> wrote:
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage
> translation is enabled, IOVA will be still used to program the MSI
> address, though the mappings will be in two stages: IOVA (0xFFFF0000)
> ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for
> Intermediate Physical Address).
>
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by kernel. However, with virtual
> machine enabling a nested IOMMU translation (2-stage), a guest kernel
> directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
> mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space
> (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level
> IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few
> RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> and VMM to agree on the IPA.
Should this RMR be in a separate range than MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no?
e.g. # cat
/sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
0x0000000008000000 0x00000000080fffff msi